R for Marketing Research and Analytics

Chris Chapman and Elea McDonnell Feit
January 2016

Chapter 3: Describing Data (Descriptive Statistics)

Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html

Load the data

The book walks through simulating nearly all of the data sets … check those sections out, as they cover much more about R.

For today, we'll just load the data from the website:

store.df <- read.csv("http://goo.gl/QPDdMl",
                     stringsAsFactors=TRUE)  # country as a factor (not the default since R 4.0)
summary(store.df)
    storeNum          Year          Week          p1sales   
 Min.   :101.0   Min.   :1.0   Min.   : 1.00   Min.   : 73  
 1st Qu.:105.8   1st Qu.:1.0   1st Qu.:13.75   1st Qu.:113  
 Median :110.5   Median :1.5   Median :26.50   Median :129  
 Mean   :110.5   Mean   :1.5   Mean   :26.50   Mean   :133  
 3rd Qu.:115.2   3rd Qu.:2.0   3rd Qu.:39.25   3rd Qu.:150  
 Max.   :120.0   Max.   :2.0   Max.   :52.00   Max.   :263  

    p2sales         p1price         p2price         p1prom   
 Min.   : 51.0   Min.   :2.190   Min.   :2.29   Min.   :0.0  
 1st Qu.: 84.0   1st Qu.:2.290   1st Qu.:2.49   1st Qu.:0.0  
 Median : 96.0   Median :2.490   Median :2.59   Median :0.0  
 Mean   :100.2   Mean   :2.544   Mean   :2.70   Mean   :0.1  
 3rd Qu.:113.0   3rd Qu.:2.790   3rd Qu.:2.99   3rd Qu.:0.0  
 Max.   :225.0   Max.   :2.990   Max.   :3.19   Max.   :1.0  

     p2prom       country 
 Min.   :0.0000   AU:104  
 1st Qu.:0.0000   BR:208  
 Median :0.0000   CN:208  
 Mean   :0.1385   DE:520  
 3rd Qu.:0.0000   GB:312  
 Max.   :1.0000   JP:416  
                  US:312  

Descriptives 1

table() counts the occurrences of each value of a categorical (discrete) variable

table(store.df$p1price)

2.19 2.29 2.49 2.79 2.99 
 395  444  423  443  375 

The counts can be converted to proportions with prop.table()

prop.table(table(store.df$p1price))

     2.19      2.29      2.49      2.79      2.99 
0.1899038 0.2134615 0.2033654 0.2129808 0.1802885 
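
As a rough sketch of how the proportions relate to the counts, the toy vector below (made-up prices, not the store data) suggests that prop.table() on a one-way table simply divides each count by the total. The names prices, counts, and props are illustrative only.

```r
# Toy data (hypothetical values, not taken from store.df)
prices <- c(2.19, 2.29, 2.29, 2.49, 2.49, 2.49, 2.99, 2.99)

counts <- table(prices)        # frequency of each distinct value
props  <- prop.table(counts)   # each count divided by the total

# For a one-way table this matches counts / sum(counts)
all.equal(as.vector(props), as.vector(counts / sum(counts)))

# Express the proportions as percentages, rounded to one decimal place
round(100 * props, 1)
```

Rounding only at the display step keeps the underlying proportions exact for any later calculation.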

Table as an object

Tables are objects that can be assigned and indexed:

p1.table <- table(store.df$p1price)
p1.table

2.19 2.29 2.49 2.79 2.99 
 395  444  423  443  375 
p1.table[3]
2.49 
 423 
str(p1.table)
 'table' int [1:5(1d)] 395 444 423 443 375
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:5] "2.19" "2.29" "2.49" "2.79" ...
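
Because the str() output shows that the table's names are character strings, a table can also be indexed by name rather than by position. The following is a minimal sketch on toy data (hypothetical values; price.table is an illustrative name) of a few common ways to pull information out of a table object.

```r
# Toy data (hypothetical values, not taken from store.df)
prices <- c(2.19, 2.29, 2.29, 2.49, 2.49, 2.49, 2.99, 2.99)
price.table <- table(prices)

price.table[3]          # by position; the result keeps its name
price.table["2.49"]     # by name; note the name is a character string
price.table[["2.49"]]   # [[ ]] returns the bare count without the name

names(price.table)              # the distinct values, as character strings
as.numeric(names(price.table))  # convert the names back to numbers if needed

sort(price.table, decreasing=TRUE)  # most frequent value first
names(which.max(price.table))       # the most frequent value (the mode)
```

Indexing by name tends to be more robust than indexing by position, since positions shift if a new value appears in the data.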

Plotting a table (basic)

plot(p1.table)


We'll see better plots later!

Two-way tables

table(store.df$p1price, store.df$p1prom)

         0   1
  2.19 354  41
  2.29 398  46
  2.49 381  42
  2.79 396  47
  2.99 343  32

Note that two-way tables are indexed as [row, column], like matrices and data frames in R!
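
To illustrate [row, column] indexing on a two-way table, here is a small sketch using toy vectors (hypothetical values; price, promo, and pp.table are illustrative names, not the store data). It also shows prop.table() with a margin and addmargins(), two base R helpers that are often useful with two-way tables.

```r
# Toy data (hypothetical): a price level and a 0/1 promotion flag per row
price <- c(2.19, 2.19, 2.29, 2.29, 2.29, 2.49)
promo <- c(0,    1,    0,    0,    1,    0)

pp.table <- table(price, promo)   # rows = price, columns = promo

pp.table["2.29", "1"]   # one cell: price 2.29 with promotion
pp.table["2.29", ]      # one row: both promo levels at price 2.29
pp.table[, "1"]         # one column: promoted counts at every price

prop.table(pp.table, margin=1)   # row proportions (each row sums to 1)
addmargins(pp.table)             # append row and column totals
```

With margin=1 the proportions are computed within each row, which here would answer "at each price, how often was the product promoted?"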

Core Descriptive Functions

min(store.df$p1sales)
[1] 73
max(store.df$p2sales)
[1] 225
mean(store.df$p1prom)
[1] 0.1
median(store.df$p2sales)
[1] 96
var(store.df$p1sales)
[1] 805.0044
sd(store.df$p1sales)
[1] 28.3726
IQR(store.df$p1sales)
[1] 37
mad(store.df$p1sales)
[1] 26.6868
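
The spread measures above are related to one another. As a hedged sketch on a toy vector (hypothetical values, with one large value standing in for an outlier), the code below checks that sd() is the square root of var(), and that mad() is the median absolute deviation from the median scaled by its default constant of 1.4826.

```r
# Toy data (hypothetical), including one unusually large value
x <- c(73, 100, 113, 129, 133, 150, 263)

# Standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))

# mad() = 1.4826 * median(|x - median(x)|) with the default constant
all.equal(mad(x), 1.4826 * median(abs(x - median(x))))

# Compare the two side by side; mad() is less sensitive to the outlier
c(sd=sd(x), mad=mad(x))
```

The scaling constant makes mad() roughly comparable to sd() for normally distributed data, which is why the two are often reported together.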

Percentile (Quantile) function

quantile(store.df$p1sales)   # default = 0:4*0.25
  0%  25%  50%  75% 100% 
  73  113  129  150  263 
quantile(store.df$p1sales, probs=c(0.25, 0.75)) # quartiles bounding the IQR
25% 75% 
113 150 
quantile(store.df$p1sales, probs=c(0.025, 0.975)) # central 95%
 2.5% 97.5% 
   88   199 
quantile(store.df$p1sales, probs=1:10/10)  # shortcut
  10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
100.0 109.0 117.0 122.6 129.0 136.0 145.0 156.0 171.0 263.0 
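
As a small sketch of how quantile() connects to the other descriptive functions, the toy example below (hypothetical values; x and q are illustrative names) checks that IQR() is the difference between the 75% and 25% quantiles, and shows the reverse question: what share of observations falls at or below a given value.

```r
# Toy data (hypothetical values, not taken from store.df)
x <- c(73, 100, 113, 129, 133, 150, 263)

q <- quantile(x, probs=c(0.25, 0.75))
q

# IQR() is the distance between the 25% and 75% quantiles
all.equal(IQR(x), unname(q["75%"] - q["25%"]))

# Deciles written with seq(); essentially the same probabilities as 1:10/10
quantile(x, probs=seq(0.1, 1, by=0.1))

# The reverse lookup: proportion of observations at or below 129
mean(x <= 129)
```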

Summary of data frame

summary(store.df)
    storeNum          Year          Week          p1sales   
 Min.   :101.0   Min.   :1.0   Min.   : 1.00   Min.   : 73  
 1st Qu.:105.8   1st Qu.:1.0   1st Qu.:13.75   1st Qu.:113  
 Median :110.5   Median :1.5   Median :26.50   Median :129  
 Mean   :110.5   Mean   :1.5   Mean   :26.50   Mean   :133  
 3rd Qu.:115.2   3rd Qu.:2.0   3rd Qu.:39.25   3rd Qu.:150  
 Max.   :120.0   Max.   :2.0   Max.   :52.00   Max.   :263  

    p2sales         p1price         p2price         p1prom   
 Min.   : 51.0   Min.   :2.190   Min.   :2.29   Min.   :0.0  
 1st Qu.: 84.0   1st Qu.:2.290   1st Qu.:2.49   1st Qu.:0.0  
 Median : 96.0   Median :2.490   Median :2.59   Median :0.0  
 Mean   :100.2   Mean   :2.544   Mean   :2.70   Mean   :0.1  
 3rd Qu.:113.0   3rd Qu.:2.790   3rd Qu.:2.99   3rd Qu.:0.0  
 Max.   :225.0   Max.   :2.990   Max.   :3.19   Max.   :1.0  

     p2prom       country 
 Min.   :0.0000   AU:104  
 1st Qu.:0.0000   BR:208  
 Median :0.0000   CN:208  
 Mean   :0.1385   DE:520  
 3rd Qu.:0.0000   GB:312  
 Max.   :1.0000   JP:416  
                  US:312  

Summary of data frame elements

summary(store.df$p1sales)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     73     113     129     133     150     263 
summary(store.df$p1sales, digits=2)  # digits = significant digits
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     73     110     130     130     150     260 
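
The result of summary() on a vector is a named object, so individual statistics can be extracted by name. The sketch below uses a toy data frame (hypothetical values; toy.df and s are illustrative names) and also shows sapply() as one possible way to compute a single statistic for every column at once.

```r
# Toy data frame (hypothetical values, not taken from store.df)
toy.df <- data.frame(sales=c(73, 100, 113, 129, 133, 150, 263),
                     price=c(2.19, 2.29, 2.29, 2.49, 2.49, 2.79, 2.99))

s <- summary(toy.df$sales)
names(s)         # the labels of the six statistics
s["Median"]      # extract one statistic, keeping its label
s[["Median"]]    # extract the bare number

# Apply one function to every column of the data frame
sapply(toy.df, mean)
sapply(toy.df, quantile, probs=c(0.25, 0.75))
```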

Visualization: Steps to Prettify (1)

hist() for basic plot

hist(store.df$p1sales)


Improve it with labels

hist(store.df$p1sales, 
     main="Product 1 Weekly Sales Frequencies, All Stores",
     xlab="Product 1 Sales (Units)",
     ylab="Count" )           


Make it more granular and colorful

hist(store.df$p1sales, 
     main="Product 1 Weekly Sales Frequencies, All Stores",
     xlab="Product 1 Sales (Units)",
     ylab="Count",
     breaks=30,             # more columns 
     col="lightblue")       # color the bars


Change counts to proportions

hist(store.df$p1sales, 
     main="Product 1 Weekly Sales Frequencies, All Stores",
     xlab="Product 1 Sales (Units)",
     ylab="Relative frequency", # changed
     breaks=30, 
     col="lightblue", 
     freq=FALSE )                # freq=FALSE plots density, not counts


Add density curve

hist(store.df$p1sales, 
     main="Product 1 Weekly Sales Frequencies, All Stores",
     xlab="Product 1 Sales", ylab="Relative frequency",
     breaks=30, col="lightblue", freq=FALSE)

lines(density(store.df$p1sales, bw=10),  # bw = smoothing
      type="l", col="darkred", lwd=2)    # lwd = line width
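
To make the density scale and the bw argument more concrete, here is a minimal sketch on simulated toy data (the rnorm() parameters are arbitrary assumptions, not estimates from the store data). It checks that the bar areas of a density histogram sum to 1, then overlays two density curves with different bandwidths for comparison.

```r
set.seed(42)                          # reproducible toy data (hypothetical)
x <- rnorm(500, mean=130, sd=28)

# Compute the histogram bins without drawing them
h <- hist(x, breaks=30, plot=FALSE)
sum(h$density * diff(h$breaks))       # total bar area; should be 1

d.narrow <- density(x, bw=3)          # small bandwidth: wigglier curve
d.wide   <- density(x, bw=10)         # larger bandwidth: smoother curve

hist(x, breaks=30, freq=FALSE, col="lightblue",
     main="Toy data: effect of bandwidth", xlab="x")
lines(d.narrow, col="darkgreen", lwd=2)
lines(d.wide,   col="darkred",   lwd=2)
```

Because the y-axis is a density, bar heights depend on bin width; it is the areas, not the heights, that sum to 1.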