R for Marketing Research and Analytics ======================================================== Author: Chris Chapman and Elea McDonnell Feit Date: January 2016 css: ../chapman-feit-slides.css width: 1024 height: 768 **Chapter 3: Describing Data (Descriptive Statistics)** Website for all data files: [http://r-marketing.r-forge.r-project.org/data.html](http://r-marketing.r-forge.r-project.org/data.html) Load the data ======== The book walks through **simulation** of nearly all the data sets ... check that out, as there is much more about R in those sections. For today, we'll just load the data from the website: ```{r} store.df <- read.csv("http://goo.gl/QPDdMl") summary(store.df) ``` Descriptives 1 ======== table() for categorical variable ```{r} table(store.df$p1price) ``` The counts can be converted to proportions with prop.table() ```{r} prop.table(table(store.df$p1price)) ``` Table as an object ======== Tables are objects that can be assigned and indexed: ```{r} p1.table <- table(store.df$p1price) p1.table p1.table[3] str(p1.table) ``` Plotting a table (basic) ======== ```{r} plot(p1.table) ``` We'll see better plots later! Two-way tables ======== ```{r} table(store.df$p1price, store.df$p1prom) ``` Note that tables index [row, column] like most things in R! Core Descriptive Functions ======== ```{r} min(store.df$p1sales) max(store.df$p2sales) mean(store.df$p1prom) median(store.df$p2sales) ``` *** ```{r} var(store.df$p1sales) sd(store.df$p1sales) IQR(store.df$p1sales) mad(store.df$p1sales) ``` Percentile (Quantile) function ======== ```{r} quantile(store.df$p1sales) # default = 0:4*0.25 quantile(store.df$p1sales, probs=c(0.25, 0.75)) # Interquartile quantile(store.df$p1sales, probs=c(0.025, 0.975)) # central 95% quantile(store.df$p1sales, probs=1:10/10) # shortcut ``` Summary of data frame ======== ```{r} summary(store.df) ``` Summary of data frame elements ======== ```{r} summary(store.df$p1sales) summary(store.df$p1sales, digits=2) # round output ``` Visualization: Steps to Prettify (1) ======== hist() for basic plot ```{r} hist(store.df$p1sales) ``` Improve it with labels ======== ```{r} hist(store.df$p1sales, main="Product 1 Weekly Sales Frequencies, All Stores", xlab="Product 1 Sales (Units)", ylab="Count" ) ``` Make it more granular and colorful ======== ```{r} hist(store.df$p1sales, main="Product 1 Weekly Sales Frequencies, All Stores", xlab="Product 1 Sales (Units)", ylab="Count", breaks=30, # more columns col="lightblue") # color the bars ``` Change counts to proportions ======== ```{r} hist(store.df$p1sales, main="Product 1 Weekly Sales Frequencies, All Stores", xlab="Product 1 Sales (Units)", ylab="Relative frequency", # changed breaks=30, col="lightblue", freq=FALSE ) # freq=FALSE for density ``` Add density curve ======== ```{r} hist(store.df$p1sales, main="Product 1 Weekly Sales Frequencies, All Stores", xlab="Product 1 Sales", ylab="Relative frequency", breaks=30, col="lightblue", freq=FALSE) lines(density(store.df$p1sales, bw=10), # bw = smoothing type="l", col="darkred", lwd=2) # lwd = line width ``` # boxplots and stripcharts ## boxplot Boxplot ======== Basic boxplot is good for data exploration: ```{r} boxplot(store.df$p2sales, xlab="Weekly sales", ylab="P2", main="Weekly sales of P2, All stores", horizontal=TRUE) ``` Boxplot broken out by factor ======== Plot **DV ~ IV** to condition on a factor: ```{r} boxplot(store.df$p2sales ~ store.df$storeNum, horizontal=TRUE, ylab="Store", xlab="Weekly unit sales", las=1, main="Weekly Sales of P2 by Store") ``` Boxplot with data and axes ======== Use **data=** to specify df, and **axis** to label axes better: ```{r} boxplot(p2sales ~ p2prom, data=store.df, horizontal=TRUE, yaxt="n", ylab="P2 promoted in store?", xlab="Weekly sales", main="Sales of P2 vs promotion") axis(side=2, at=c(1,2), labels=c("No", "Yes"), las=1) ``` See the book for more ======== type: alert 1. QQ plots to check distribution 2. QQ plots for raw vs. transformed data 3. Cumulative distribution plots, such as: ```{r, echo=FALSE} plot(ecdf(store.df$p1sales), main="Cumulative distribution of P1 Weekly Sales", ylab="Cumulative Proportion", xlab=c("P1 weekly sales, all stores", "90% of weeks sold <= 171 units"), yaxt="n") axis(side=2, at=seq(0, 1, by=0.1), las=1, labels=paste(seq(0,100,by=10), "%", sep="")) # add lines for 90% abline(h=0.9, lty=3) abline(v=quantile(store.df$p1sales, pr=0.9), lty=3) ``` by() ======== **`by()`** is one way to split data by a factor and apply a function to each group: ```{r} by(store.df$p1sales, store.df$storeNum, mean) ``` by(store.df$p1sales, list(store.df$storeNum, store.df$Year), mean) aggregate() ======== `aggregate()` collects results similar to `by()` into an object: ```{r} storeMean <- aggregate(store.df$p1sales, by=list(store=store.df$storeNum), mean) storeMean ``` Exercise! ======= Access the `Salaries` data set: ```{r} library(car) # install.packages("car") if needed data(Salaries) ``` 1. What are the counts of men and women by rank? The proportions? 2. Draw a histogram for years of service. Add a density line in red. (Hint: plot proportions, not frequency.) 3. Draw a box plot for salary. 4. Draw a box plot for salary by rank. Make it horizontal. Answers (1) ======= What are the counts of men and women by rank? The proportions? ```{r} table(Salaries$rank, Salaries$sex) prop.table(table(Salaries$rank, Salaries$sex)) prop.table(table(Salaries$rank, Salaries$sex), margin=2) ``` Answers (2) ======= Draw a histogram for years of service. Add a density line in red. ```{r} hist(Salaries$yrs.service, freq=FALSE) lines(density(Salaries$yrs.service), col="red") ``` Answers (3) ======= Draw a box plot for salary ```{r} boxplot(Salaries$salary) ``` Answers (4) ======= Draw a box plot for salary by rank. Make it horizontal. ```{r} boxplot(Salaries$salary ~ Salaries$rank, horizontal=TRUE) ``` Optional Slides ===== type: section - Indexing with tables - Using `describe()` esp. for survey data - Easy choropleth world map Indexing with tables ======== ```{r} table(store.df$p1price, store.df$p1prom) ``` Note that tables index [row, column] like most things in R! Two-way tables are also assignable \& indexable: ```{r} p1.table2 <- table(store.df$p1price, store.df$p1prom) p1.table2[, 2] / (p1.table2[, 1] + p1.table2[, 2]) ``` Describe (psych package) ======== **`describe()`** is especially useful for survey data. ```{r, size="tiny"} library(psych) # must install first describe(store.df) ``` Aggregate sales by country ======== ```{r} p1sales.sum <- aggregate(store.df$p1sales, by=list(country=store.df$country), sum) p1sales.sum ``` Plot sales by country with rworldmap() ======== First we load the packages, and set up a map. In our aggregated data, we use the **country** column to tell the map where to put the **p1sales.sum** aggregated mean. ```{r} library(rworldmap) # must be installed library(RColorBrewer) # must be installed p1sales.map <- joinCountryData2Map(p1sales.sum, joinCode = "ISO2", nameJoinColumn = "country") ``` Draw the map ======== Once the data is "mapped" to the locations, we can draw the visualization: ```{r} mapCountryData(p1sales.map, nameColumnToPlot="x", mapTitle="Total P1 sales by Country", colourPalette=brewer.pal(7, "Greens"), catMethod="fixedWidth", addLegend=FALSE) ``` That's all for Chapter 3! ========= type: section # Break time Notes ======== This presentation is based on Chapter 6 of Chapman and Feit, *R for Marketing Research and Analytics* © 2015 Springer. http://r-marketing.r-forge.r-project.org/ Exercises here use the `Salaries` data set from the `car` package, John Fox and Sanford Weisberg (2011). *An R Companion to Applied Regression*, Second Edition. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion All code in the presentation is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.