R for Marketing Research and Analytics
========================================================
Author: Chris Chapman and Elea McDonnell Feit
Date: January 2016
css: ../chapman-feit-slides.css
width: 1024
height: 768
**Chapter 6: Statistics to Compare Groups**
Website for all data files:
[http://r-marketing.r-forge.r-project.org/data.html](http://r-marketing.r-forge.r-project.org/data.html)
Load the data (same as Chapter 5)
=====
As always, see the book for details on the data simulation:
```{r}
seg.df <- read.csv("http://goo.gl/qw303p")
summary(seg.df)
```
Chi-square test
=====
Tests equality of marginal counts in groups. *Important*: compile a
**table** first (don't use raw data). Then use **`chisq.test()`**.
Let's see this for simple, fake data first:
```{r}
tmp.tab <- table(rep(c(1:4), times=c(25,25,25,20)))
tmp.tab
chisq.test(tmp.tab)
```
chisq.test "significant" and "not significant"
=====
```{r}
tmp.tab <- table(rep(c(1:4), times=c(25,25,25,20)))
chisq.test(tmp.tab)
tmp.tab <- table(rep(c(1:4), times=c(25,25,25,10)))
tmp.tab
chisq.test(tmp.tab)
```
chisq.test with segment data
====
Are the segments the same size? (Not a very interesting question, perhaps.)
```{r}
table(seg.df$Segment)
chisq.test(table(seg.df$Segment))
```
chisq.test with segment data
====
Do they have the same rate of subscription by ownership?
```{r}
table(seg.df$subscribe, seg.df$ownHome)
chisq.test(table(seg.df$subscribe, seg.df$ownHome))
```
Without correction (matches traditional formula):
```{r}
chisq.test(table(seg.df$subscribe, seg.df$ownHome), correct=FALSE)
```
Proportions: binomial test
=====
**`binom.test(x=successes, n=trials, p=proportion)`** tests whether the count
of *successes* in a certain number of *trials* matches an expected *proportion*:
```{r}
binom.test(12, 20, p=0.5)
```
Proportions: binomial test continued
=====
The same proportion with higher N can be significant:
```{r}
# binom.test(12, 20, p=0.5)
binom.test(120, 200, p=0.5)
```
See book for Agresti-Coull and other methods if your data has small N or is
near 0 or 1 proportion.
t-tests
=====
Does income differ for home owners in our data? A t-test compares the means of
two groups, relative to the variance. First let's visualize it:
```{r}
library(lattice)
bwplot(income ~ ownHome, data=seg.df)
```
t.test()
=====
Use formula syntax **`t.test(outcomeVar ~ groupingVar)`**:
```{r}
t.test(income ~ ownHome, data=seg.df)
```
Mean income is higher among home owners in our data, p < .01.
t.test() for a subset() of data
=====
**`subset()`** is an easy way to select portions of a data set. Here's the same
t-test but only for the Travelers segment:
```{r}
t.test(income ~ ownHome, data=subset(seg.df, Segment=="Travelers"))
```
Mean income is *not* significantly different between home owners and non-owners
in the Travelers segment.
ANOVA basics
=====
ANOVA (analysis of variance) compares the difference in means among two or more
groups, relative to their variances.
Recommended procedure: (1) fit a model with **`aov()`**. (2) use **`anova()`**
on the model object to obtain a typical ANOVA table.
For two groups it is effectively the same as a t-test:
```{r}
seg.aov.own <- aov(income ~ ownHome, data=seg.df)
anova(seg.aov.own)
```
ANOVA: Multiple groups
=====
The same model works for multiple groups. Make sure that the independent
variable is a **factor** (use **`factor()`** to convert if necessary).
```{r}
aggregate(income ~ Segment, mean, data=seg.df)
seg.aov.seg <- aov(income ~ Segment, data=seg.df)
anova(seg.aov.seg)
```
Multiple Effect Models: Basic Formulas
=====
Use formula syntax to add variables
Symbol | Meaning
-------- | --------
~ | RESPONSE ~ PREDICTOR(s)
+ | Additional main effect
: | Interaction without main effect
* | All main effects and their interactions
. | Shortcut for "all other variables"
**Examples:**
Model | Formula
--------------- | -----------------
income by ownership & segment | income ~ own + segment
income by interaaction of ownership by segment | income ~ own:segment
income with all effects of ownership with segment | income ~ own + segment + own:segment
... or the same but not as clear ... | income ~ own*segment
income as a response to all other variables | income ~ .
ANOVA with Segment + Ownership
=====
```{r}
anova(aov(income ~ Segment + ownHome, data=seg.df))
```
Mean income differs by *Segment*, but not by *ownership* after Segment is
controlled. Model by ownership alone might be misleading:
```{r}
anova(aov(income ~ ownHome, data=seg.df))
```
ANOVA with interaction
=====
```{r}
anova(aov(income ~ Segment * ownHome, data=seg.df))
```
Mean income differs by *segment*, not by *ownership*, and not by the
*interaction* of ownership with segment.
Recommended: instead of using * , specify all effects directly:
```{r, eval=FALSE}
anova(aov(income ~ Segment + ownHome + Segment:ownHome, data=seg.df))
```
Model Comparison
=====
The **`anova()`** command will also compare the fit of models:
```{r}
anova(aov(income ~ Segment, data=seg.df),
aov(income ~ Segment + ownHome, data=seg.df))
```
In this case, once we model income by segment, adding ownership does not
improve the model fit.
Visualization: ANOVA Group means
=====
Use **glht()** in **multcomp** package as an easy way to get mean and CI:
```{r}
# install.packages("multcomp") # if needed
library(multcomp)
seg.aov <- aov(income ~ -1 + Segment, data=seg.df) # model w/o int.
by.seg <- glht(seg.aov) # means and CIs
plot(by.seg, xlab="Income", main="Mean Income by Segment (95% CI)")
```
Exercises (Basic)
=====
Access the `Salaries` data set:
```{r}
library(car) # install.packages("car") if needed
data(Salaries)
```
1. Does the proportion of women differ by discipline?
2. In a one-way ANOVA, are salaries different for men and women?
3. Visualize the mean salary for men and women, with 95% confidence intervals.
Answers (1)
=====
Does the proportion of women differ by discipline?
```{r}
with(Salaries, prop.table(table(discipline, sex), margin=1))
with(Salaries, chisq.test(table(discipline, sex)))
```
Answers (2)
=====
In a one-way ANOVA, are salaries different for men and women?
```{r}
aggregate(salary ~ sex, data=Salaries, mean)
anova(aov(salary ~ sex, data=Salaries))
```
Answers (3)
=====
Visualize the mean salary for men and women, with 95% confidence intervals.
```{r}
# install.packages("multcomp") # if needed
library(multcomp)
salary.aov <- aov(salary ~ -1 + sex, data=Salaries)
by.sex <- glht(salary.aov)
plot(by.sex, xlab="Salary", main="Mean Salary with 95% CI")
```
Extra Slides
=====
type: section
- Stepwise ANOVA
- Bayesian ANOVA
- Advanced Exercises
Optional: Stepwise ANOVA
=====
Use **step()** to do forward or (default) backward stepwise model fit. This
fits models and successively drops (or adds) variables to see if fit improves.
Returing to the segmentation data:
```{r}
seg.aov.step <- step(aov(income ~ ., data=seg.df))
```
Stepwise ANOVA: Result
=====
```{r, eval=FALSE}
seg.aov.step <- step(aov(income ~ ., data=seg.df))
```
After **step()** tests all main effect variables in the model (**income ~ .**),
the best-fitting model balancing fit and complexity is:
```{r}
anova(seg.aov.step)
```
More Advanced: Bayesian ANOVA
=====
There are many R packages that support Bayesian inference. Simple example for ANOVA. Fit two models:
```{r}
# install.packages("BayesFactor") # if needed
library(BayesFactor)
set.seed(96761) # optional, for replication
seg.bf1 <- lmBF(income ~ Segment, data=seg.df)
seg.bf2 <- lmBF(income ~ Segment + ownHome, data=seg.df)
```
And compare them for which has more evidence in the data:
```{r}
seg.bf1 / seg.bf2
```
Model 1 has "6x" as much evidence, considering just these two models.
Bayesian ANOVA: Under the hood
=====
The model works by drawing 1000s of likely estimates for the model parameters
(overall mean and segment means). We can examine:
```{r}
seg.bf.chain <- posterior(seg.bf1, 1, iterations = 10000)
head(seg.bf.chain[, 1:4])
```
Segment estimates are deviations from the overall mean estimate.
Bayesian ANOVA: Plotting the Draws
=====
```{r}
plot(seg.bf.chain[, 1:2]) # overall mean + first segment
```
Bayesian ANOVA: Segment Estimates
=====
Segment mean estimate = overall mean + segment deviation
```{r}
seg.bf.chain[1:4, 1:4]
seg.bf.chain[1:4, 2:4] + seg.bf.chain[1:4, 1]
```
Bayesian ANOVA: Segment CIs
=====
First get the segment estimates for each of 10000 draws:
```{r}
seg.bf.chain.total <- seg.bf.chain[, 2:5] + seg.bf.chain[, 1]
seg.bf.chain.total[1:4, 1:3]
```
Then get the 95% credible intervals that we observe there:
```{r}
seg.bf.ci <- t(apply(seg.bf.chain.total, 2,
quantile, pr=c(0.025, 0.5, 0.975)))
seg.bf.ci
```
Bayesian ANOVA: Plot the CIs
=====
Make a data frame of the CI results:
```{r}
seg.bf.df <- data.frame(seg.bf.ci)
seg.bf.df$Segment <- rownames(seg.bf.df)
```
And use that to build the plot:
```{r}
library(ggplot2)
# basic plot object with CIs on Y axis by Segment on X
p <- ggplot(seg.bf.df, aes(x=Segment,
y=X50., ymax=X97.5., ymin=X2.5.))
# add points for the Y var and error bars for ymax, ymin
p <- p + geom_point(size=4) + geom_errorbar(width=0.2)
# add a title and rotate the plot to horizontal
p <- p +
ggtitle("95% CI for Mean Income by Segment") + coord_flip()
```
So what happened? We built a plot object! Now ...
Plot it
=====
```{r}
p
```
Exercises (Advanced)
=====
Access the `Salaries` data set:
```{r}
library(car) # install.packages("car") if needed
data(Salaries)
```
1. Using a stepwise ANOVA, which predictors are associated with salary?
2. Using the predictors identified by stepwise ANOVA, do a Bayesian ANOVA model.
What are the 95% credible intervals for the main effects?
3. Is the Bayesian model improved if sex is included as a linear predictor?
4. How does that answer compare to comparison of traditional ANOVA models?
5. Plot the credible intervals for salary by rank in the Bayesian model (ignoring other effects).
Answers (Advanced, 1)
=====
Using a stepwise ANOVA, which predictors are associated with salary?
```{r, results="hide"}
salary.step <- step(aov(salary ~ ., data=Salaries)) # output hidden
```
```{r}
anova(salary.step)
```
Answers (Advanced, 2)
=====
Using the predictors identified by stepwise ANOVA, do a Bayesian ANOVA model.
What are the 95% credible intervals for the main effects?
```{r, results="hidden"}
library(BayesFactor)
set.seed(96761) # optional for replication
salary.b <- lmBF(salary ~ rank + discipline + yrs.service,
data=Salaries)
salary.mc <- posterior(salary.b, 1, iterations=10000)
t(apply(salary.mc[, 1:7], 2, quantile, pr=c(0.025, 0.5, 0.975)))
```
Answers (Advanced, 3)
=====
Is the Bayesian model improved if sex is included as a linear predictor?
```{r, results="hidden"}
salary.b2 <- lmBF(salary ~ rank + discipline + yrs.service + sex,
data=Salaries)
salary.b2 / salary.b
```
Answers (Advanced, 4)
=====
How does that answer compare to comparison of traditional ANOVA models?
```{r}
aov1 <- aov(salary ~ rank + discipline + yrs.service,
data=Salaries)
aov2 <- aov(salary ~ rank + discipline + yrs.service + sex,
data=Salaries)
anova(aov1, aov2)
```
Answers (Advanced, 5)
=====
Plot the credible intervals for mean salary by rank in the Bayesian model (ignoring other effects).
```{r}
salary.cidf <- data.frame(t(apply(salary.mc[, 2:4] + salary.mc[ , 1], 2,
quantile, pr=c(0.025, 0.5, 0.975))))
salary.cidf$rank <- rownames(salary.cidf)
library(ggplot2)
p <- ggplot(salary.cidf, aes(x=rank,
y=X50., ymax=X97.5., ymin=X2.5.))
p <- p + geom_point(size=4) + geom_errorbar(width=0.2)
p + ggtitle("95% Credible Intervals for Mean Salary by Rank") +
coord_flip()
```
Notes
========
This presentation is based on Chapter 6 of Chapman and Feit, *R for Marketing Research and Analytics* © 2015 Springer. http://r-marketing.r-forge.r-project.org/
Exercises here use the `Salaries` data set from the `car` package, John Fox and Sanford Weisberg (2011). *An R Companion to Applied Regression*, Second Edition. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion
All code in the presentation is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.