R for Marketing Research and Analytics

Chris Chapman and Elea McDonnell Feit
January 2016

Chapter 6: Statistics to Compare Groups

Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html

Load the data (same as Chapter 5)

As always, see the book for details on the data simulation:

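# stringsAsFactors=TRUE keeps the text columns as factors, which the
# summary below relies on (R 4.0 and later no longer do this by default)
seg.df <- read.csv("http://goo.gl/qw303p", stringsAsFactors=TRUE)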
summary(seg.df)
      age           gender        income            kids        ownHome   
 Min.   :19.26   Female:157   Min.   : -5183   Min.   :0.00   ownNo :159  
 1st Qu.:33.01   Male  :143   1st Qu.: 39656   1st Qu.:0.00   ownYes:141  
 Median :39.49                Median : 52014   Median :1.00               
 Mean   :41.20                Mean   : 50937   Mean   :1.27               
 3rd Qu.:47.90                3rd Qu.: 61403   3rd Qu.:2.00               
 Max.   :80.49                Max.   :114278   Max.   :7.00               
  subscribe         Segment   
 subNo :260   Moving up : 70  
 subYes: 40   Suburb mix:100  
              Travelers : 80  
              Urban hip : 50  


Chi-square test

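Tests whether counts are equal across groups (for a one-way table, the default null hypothesis is equal counts in every cell). Important: compile a table of counts first with table(); don't pass the raw data. Then use chisq.test().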

Let's see this for simple, fake data first:

tmp.tab <- table(rep(c(1:4), times=c(25,25,25,20)))
tmp.tab

 1  2  3  4 
25 25 25 20 
chisq.test(tmp.tab)

    Chi-squared test for given probabilities

data:  tmp.tab
X-squared = 0.78947, df = 3, p-value = 0.852
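
To make the output above less of a black box, here is a minimal sketch that recomputes the same statistic by hand. The variable names (observed, expected, x.sq, deg.free) are ours, not the book's, and it assumes the default null hypothesis of equal counts in every cell:

```r
# Observed counts from the toy table above
observed <- c(25, 25, 25, 20)

# Under the default null hypothesis every cell has the same expected count
expected <- rep(sum(observed) / length(observed), length(observed))

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
x.sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom for a one-way table: number of cells minus 1
deg.free <- length(observed) - 1

# p-value: upper-tail probability of the chi-square distribution
p.value <- pchisq(x.sq, df=deg.free, lower.tail=FALSE)

x.sq      # about 0.789, matching X-squared above
p.value   # about 0.852, matching the p-value above
```

Each expected count is 95 / 4 = 23.75, so the statistic is (3 * 1.25^2 + 3.75^2) / 23.75 = 18.75 / 23.75.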

chisq.test "significant" and "not significant"

tmp.tab <- table(rep(c(1:4), times=c(25,25,25,20)))
chisq.test(tmp.tab)

    Chi-squared test for given probabilities

data:  tmp.tab
X-squared = 0.78947, df = 3, p-value = 0.852
tmp.tab <- table(rep(c(1:4), times=c(25,25,25,10)))
tmp.tab

 1  2  3  4 
25 25 25 10 
chisq.test(tmp.tab)

    Chi-squared test for given probabilities

data:  tmp.tab
X-squared = 7.9412, df = 3, p-value = 0.04724
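
A significant result says that the counts differ, but not where. As a rough sketch (the object name tmp.test is ours), the value returned by chisq.test() can be inspected to see which cell departs most from its expected count:

```r
# Same "significant" toy table as above
tmp.tab  <- table(rep(c(1:4), times=c(25,25,25,10)))
tmp.test <- chisq.test(tmp.tab)

# Expected count per cell under equal proportions: 85 / 4 = 21.25
tmp.test$expected

# Pearson residuals, (observed - expected) / sqrt(expected);
# the cell with the largest absolute residual contributes most to X-squared
tmp.test$residuals

# The p-value is also available directly, e.g. for a decision rule at 0.05
tmp.test$p.value < 0.05
```

Here the fourth group, with only 10 observations, has the largest absolute residual.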

Answers (Advanced, 4)

How does that answer compare to a comparison of traditional ANOVA models?

aov1 <- aov(salary ~ rank + discipline + yrs.service,       
            data=Salaries)
aov2 <- aov(salary ~ rank + discipline + yrs.service + sex, 
            data=Salaries)
anova(aov1, aov2)
Analysis of Variance Table

Model 1: salary ~ rank + discipline + yrs.service
Model 2: salary ~ rank + discipline + yrs.service + sex
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1    392 2.0140e+11                           
2    391 2.0062e+11  1 776686259 1.5137 0.2193
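
To see where the F statistic in that table comes from, the following sketch recomputes it from the printed values. The variable names are ours, and because the residual sum of squares is copied from rounded output, the result only matches to the precision shown:

```r
# Values copied from the ANOVA table above (RSS is rounded in the printout)
rss2    <- 2.0062e+11   # residual sum of squares, Model 2
res.df2 <- 391          # residual degrees of freedom, Model 2
ss.diff <- 776686259    # reduction in RSS from adding sex
df.diff <- 1            # extra parameters in Model 2

# F = (improvement per added parameter) / (residual variance of larger model)
f.stat <- (ss.diff / df.diff) / (rss2 / res.df2)

# p-value: upper tail of the F distribution with (df.diff, res.df2) df
p.value <- pf(f.stat, df.diff, res.df2, lower.tail=FALSE)

f.stat    # about 1.51
p.value   # about 0.22
```

With p well above 0.05, adding sex does not significantly improve the fit over Model 1.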

Answers (Advanced, 5)

Plot the credible intervals for mean salary by rank in the Bayesian model (ignoring other effects).

salary.cidf <- data.frame(t(apply(salary.mc[, 2:4] + salary.mc[ , 1], 2, 
                                  quantile, probs=c(0.025, 0.5, 0.975))))
salary.cidf$rank <- rownames(salary.cidf)
library(ggplot2)
p <- ggplot(salary.cidf, aes(x=rank, 
                             y=X50., ymax=X97.5., ymin=X2.5.))
p <- p + geom_point(size=4) + geom_errorbar(width=0.2)
p + ggtitle("95% Credible Intervals for Mean Salary by Rank") + 
    coord_flip()

[Plot: 95% credible intervals for mean salary by rank, one point and error bar per rank]
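
The apply() and quantile() step above can be hard to read at a glance. The sketch below runs the same pattern on made-up draws; toy.mc and its column names are hypothetical stand-ins for salary.mc, not output from the Bayesian model:

```r
# Hypothetical posterior draws: an overall mean plus three group offsets
set.seed(1)
toy.mc <- cbind(mu = rnorm(1000, mean=100, sd=2),
                a  = rnorm(1000, mean=-10, sd=1),
                b  = rnorm(1000, mean=0,   sd=1),
                c  = rnorm(1000, mean=10,  sd=1))

# Add the overall mean to each offset draw by draw, then take the
# 2.5%, 50%, and 97.5% quantiles of every column
toy.ci <- data.frame(t(apply(toy.mc[, 2:4] + toy.mc[, 1], 2,
                             quantile, probs=c(0.025, 0.5, 0.975))))
toy.ci$group <- rownames(toy.ci)

# data.frame() rewrites the quantile labels "2.5%", "50%", "97.5%" into
# valid column names, which is why the plot code refers to X2.5. and so on
names(toy.ci)
```

The result has one row per group and the columns X2.5., X50., X97.5., and group.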

Notes

This presentation is based on Chapter 6 of Chapman and Feit, R for Marketing Research and Analytics © 2015 Springer. http://r-marketing.r-forge.r-project.org/

Exercises here use the Salaries data set from the car package, John Fox and Sanford Weisberg (2011). An R Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion

All code in the presentation is licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.