Chris Chapman and Elea McDonnell Feit
January 2016
Chapter 3: Describing Data (Descriptive Statistics)
Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html
The book walks through simulating nearly all of the data sets. Those sections are worth reading, as they cover much more about R.
For today, we'll just load the data from the website:
store.df <- read.csv("http://goo.gl/QPDdMl")
summary(store.df)
    storeNum          Year          Week          p1sales   
 Min.   :101.0   Min.   :1.0   Min.   : 1.00   Min.   : 73  
 1st Qu.:105.8   1st Qu.:1.0   1st Qu.:13.75   1st Qu.:113  
 Median :110.5   Median :1.5   Median :26.50   Median :129  
 Mean   :110.5   Mean   :1.5   Mean   :26.50   Mean   :133  
 3rd Qu.:115.2   3rd Qu.:2.0   3rd Qu.:39.25   3rd Qu.:150  
 Max.   :120.0   Max.   :2.0   Max.   :52.00   Max.   :263  
    p2sales         p1price         p2price         p1prom   
 Min.   : 51.0   Min.   :2.190   Min.   :2.29   Min.   :0.0  
 1st Qu.: 84.0   1st Qu.:2.290   1st Qu.:2.49   1st Qu.:0.0  
 Median : 96.0   Median :2.490   Median :2.59   Median :0.0  
 Mean   :100.2   Mean   :2.544   Mean   :2.70   Mean   :0.1  
 3rd Qu.:113.0   3rd Qu.:2.790   3rd Qu.:2.99   3rd Qu.:0.0  
 Max.   :225.0   Max.   :2.990   Max.   :3.19   Max.   :1.0  
     p2prom       country 
 Min.   :0.0000   AU:104  
 1st Qu.:0.0000   BR:208  
 Median :0.0000   CN:208  
 Mean   :0.1385   DE:520  
 3rd Qu.:0.0000   GB:312  
 Max.   :1.0000   JP:416  
                  US:312  
table() for categorical variables
table(store.df$p1price)
2.19 2.29 2.49 2.79 2.99
395 444 423 443 375
The counts can be converted to proportions with prop.table():
prop.table(table(store.df$p1price))
     2.19      2.29      2.49      2.79      2.99 
0.1899038 0.2134615 0.2033654 0.2129808 0.1802885 
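To see what prop.table() is doing, here is a minimal sketch that rebuilds the proportions by hand. The counts are copied from the table(store.df$p1price) output above; the object names p1.counts and p1.props are made up for this illustration.

```r
# Counts copied from the table(store.df$p1price) output above
p1.counts <- c("2.19"=395, "2.29"=444, "2.49"=423, "2.79"=443, "2.99"=375)

# A proportion is each count divided by the total count
p1.props <- p1.counts / sum(p1.counts)
round(p1.props, 4)

# prop.table() with no margin argument performs the same division
all.equal(p1.props, prop.table(p1.counts))
```

The proportions always sum to 1, which makes for a quick sanity check.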
Tables are objects that can be assigned and indexed:
p1.table <- table(store.df$p1price)
p1.table
2.19 2.29 2.49 2.79 2.99
395 444 423 443 375
p1.table[3]
2.49
423
str(p1.table)
'table' int [1:5(1d)] 395 444 423 443 375
- attr(*, "dimnames")=List of 1
..$ : chr [1:5] "2.19" "2.29" "2.49" "2.79" ...
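The str() output shows that the table labels are stored as character strings. A small sketch of what that implies, using a short hypothetical price vector rather than store.df (the names prices, price.table, and price.df are assumptions for this example):

```r
# Hypothetical prices, not the real store data
prices <- c(2.19, 2.29, 2.29, 2.49, 2.49, 2.49)
price.table <- table(price = prices)

price.table["2.49"]              # index by label instead of position
names(price.table)               # labels are character strings
as.numeric(names(price.table))   # convert back to numbers when needed

# A table can also be converted to a data frame of labels and counts
price.df <- as.data.frame(price.table, stringsAsFactors=FALSE)
price.df
```

Indexing by label is usually safer than indexing by position, because the position of a level changes if a new price appears in the data.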
plot(p1.table)
We'll see better plots later!
table(store.df$p1price, store.df$p1prom)
         0   1
  2.19 354  41
  2.29 398  46
  2.49 381  42
  2.79 396  47
  2.99 343  32
Note that two-way tables are indexed as [row, column], like matrices and data frames in R!
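A short sketch of [row, column] indexing on the two-way table. So that it runs without the download, the table is rebuilt from the counts printed above; the object names p1.tab2 and promo.share, and the dimension names price and promo, are assumptions for this example.

```r
# Rebuild the price-by-promotion table from the counts shown above
# (matrix() fills column by column: first promo = 0, then promo = 1)
p1.tab2 <- as.table(matrix(c(354, 398, 381, 396, 343,
                              41,  46,  42,  47,  32),
                           nrow=5,
                           dimnames=list(price=c("2.19", "2.29", "2.49", "2.79", "2.99"),
                                         promo=c("0", "1"))))

p1.tab2[3, 2]          # row 3, column 2: promoted weeks at price 2.49
p1.tab2["2.49", "1"]   # the same cell, selected by label

# Share of weeks on promotion at each price: column "1" over the row totals
promo.share <- p1.tab2[, "1"] / rowSums(p1.tab2)
round(promo.share, 3)
```

Leaving the row index empty, as in p1.tab2[, "1"], selects a whole column, exactly as it would for a matrix.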
min(store.df$p1sales)
[1] 73
max(store.df$p2sales)
[1] 225
mean(store.df$p1prom)
[1] 0.1
median(store.df$p2sales)
[1] 96
var(store.df$p1sales)
[1] 805.0044
sd(store.df$p1sales)
[1] 28.3726
IQR(store.df$p1sales)
[1] 37
mad(store.df$p1sales)
[1] 26.6868
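How the spread measures relate to each other, as a minimal sketch on a small hypothetical vector (x is made up for this example, not taken from store.df). The constant 1.4826 is the default scaling that mad() applies to make it comparable to sd() for normally distributed data.

```r
# Hypothetical sales figures
x <- c(73, 88, 113, 129, 133, 150, 199, 263)

# Standard deviation is the square root of the variance
sd.manual <- sqrt(var(x))

# IQR is the distance between the 25th and 75th percentiles
iqr.manual <- unname(diff(quantile(x, probs=c(0.25, 0.75))))

# MAD is the median absolute deviation from the median, times 1.4826
mad.manual <- 1.4826 * median(abs(x - median(x)))

all.equal(sd(x), sd.manual)
all.equal(IQR(x), iqr.manual)
all.equal(mad(x), mad.manual)
```

IQR and MAD are less sensitive to extreme values than var() and sd(), which is useful for skewed sales data.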
quantile(store.df$p1sales) # default = 0:4*0.25
0% 25% 50% 75% 100%
73 113 129 150 263
quantile(store.df$p1sales, probs=c(0.25, 0.75)) # interquartile range endpoints
25% 75%
113 150
quantile(store.df$p1sales, probs=c(0.025, 0.975)) # central 95%
2.5% 97.5%
88 199
quantile(store.df$p1sales, probs=1:10/10) # deciles; 1:10/10 is shorthand for 0.1, 0.2, ..., 1.0
  10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
100.0 109.0 117.0 122.6 129.0 136.0 145.0 156.0 171.0 263.0 
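Quantiles and the other summary functions can be combined into a custom summary table. A minimal sketch, assuming simulated stand-in data (sales.df is generated here with rpois() and is not the real store.df; the column names median, IQR, q05, and q95 are choices made for this example):

```r
# Simulated stand-in for two sales columns (not the real data)
set.seed(98250)
sales.df <- data.frame(p1sales = rpois(200, lambda=130),
                       p2sales = rpois(200, lambda=100))

# One row per product, one column per statistic
sales.summary <- data.frame(
  median = sapply(sales.df, median),
  IQR    = sapply(sales.df, IQR),
  q05    = sapply(sales.df, quantile, probs=0.05, names=FALSE),
  q95    = sapply(sales.df, quantile, probs=0.95, names=FALSE))
sales.summary
```

sapply() applies each function to every column, so adding a third product column would add a row to the summary without changing the code.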
summary(store.df) # output is identical to the first summary(store.df) call above
summary(store.df$p1sales)
Min. 1st Qu. Median Mean 3rd Qu. Max.
73 113 129 133 150 263
summary(store.df$p1sales, digits=2) # round output
Min. 1st Qu. Median Mean 3rd Qu. Max.
73 110 130 130 150 260
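The result of summary() on a single vector can be stored and indexed by name, like the table objects earlier. A minimal sketch on a short hypothetical vector (x and s are names made up for this example):

```r
# Hypothetical values, not the real store data
x <- c(73, 113, 129, 150, 263)
s <- summary(x)

names(s)          # labels of the six statistics
s[["Median"]]     # pull out a single value by name
s[["Max."]] - s[["Min."]]   # for example, the range
```

This is handy when a single statistic from the summary is needed in a later calculation.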
hist() for a basic plot
hist(store.df$p1sales)
hist(store.df$p1sales,
     main="Product 1 Weekly Sales Frequencies, All Stores",
     xlab="Product 1 Sales (Units)",
     ylab="Count" )
hist(store.df$p1sales,
     main="Product 1 Weekly Sales Frequencies, All Stores",
     xlab="Product 1 Sales (Units)",
     ylab="Count",
     breaks=30, # more columns
     col="lightblue") # color the bars
hist(store.df$p1sales,
     main="Product 1 Weekly Sales Frequencies, All Stores",
     xlab="Product 1 Sales (Units)",
     ylab="Relative frequency", # changed
     breaks=30,
     col="lightblue",
     freq=FALSE ) # freq=FALSE for density
hist(store.df$p1sales,
     main="Product 1 Weekly Sales Frequencies, All Stores",
     xlab="Product 1 Sales", ylab="Relative frequency",
     breaks=30, col="lightblue", freq=FALSE)
lines(density(store.df$p1sales, bw=10), # bw = smoothing
      type="l", col="darkred", lwd=2) # lwd = line width
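Why freq=FALSE is needed before adding the density curve: with freq=FALSE the bar heights are densities, scaled so that the bar areas sum to 1, which puts them on the same scale as density(). A minimal sketch with simulated stand-in data (sales is generated with rpois() and is not the real p1sales column); plot=FALSE makes hist() return its computed values without drawing.

```r
# Simulated stand-in for weekly sales (not the real data)
set.seed(42)
sales <- rpois(500, lambda=130)

# hist() returns its breaks, counts, and densities as a list
h <- hist(sales, breaks=30, plot=FALSE)

# Bar area = density * bar width; the areas sum to 1
bar.area <- sum(h$density * diff(h$breaks))
bar.area

# The counts are still available and sum to the number of observations
sum(h$counts)
```

Strictly speaking, the y-axis under freq=FALSE shows density rather than relative frequency; the two coincide only when the bars have width 1.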