Chris Chapman and Elea McDonnell Feit
January 2016
Chapter 4: Relationships between Variables (Bivariate Statistics)
Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html
As always, see the book for details about how the data was simulated. Meanwhile, we'll load it. The example data covers customers' visits, transactions, and spending for online and retail purchases:
cust.df <- read.csv("http://goo.gl/PmPkaG", stringsAsFactors=TRUE)  # R >= 4.0 needs this to get factors
str(cust.df)
'data.frame': 1000 obs. of 12 variables:
$ cust.id : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 22.9 28 35.9 30.5 38.7 ...
$ credit.score : num 631 749 733 830 734 ...
$ email : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 1 1 ...
$ distance.to.store: num 2.58 48.18 1.29 5.25 25.04 ...
$ online.visits : int 20 121 39 1 35 1 1 48 0 14 ...
$ online.trans : int 3 39 14 0 11 1 1 13 0 6 ...
$ online.spend : num 58.4 756.9 250.3 0 204.7 ...
$ store.trans : int 4 0 0 2 0 0 2 4 0 3 ...
$ store.spend : num 140.3 0 0 95.9 0 ...
$ sat.service : int 3 3 NA 4 1 NA 3 2 4 3 ...
$ sat.selection : int 3 3 NA 2 1 NA 3 3 2 2 ...
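In R versions before 4.0, read.csv() converts text columns to factors automatically; in R 4.0 and later it does so only with stringsAsFactors=TRUE. However, some data that looks numeric is really a label rather than a quantity, such as the customer ID.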
The factor() function will convert data to nominal factors:
str(cust.df$cust.id)
int [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
cust.df$cust.id <- factor(cust.df$cust.id)
str(cust.df$cust.id)
Factor w/ 1000 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
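To try the reading step without downloading anything, here is a minimal sketch. The three-row CSV and the names csv.text, demo.chr, and demo.fac are made up for illustration and are not part of the book's data:

```r
# Hypothetical three-row CSV, inlined so the example needs no download
csv.text <- "cust.id,email,age
1,yes,22.9
2,no,28.0
3,yes,35.9"

# Default call: text columns stay character in R >= 4.0
demo.chr <- read.csv(text=csv.text)
# Explicit request: text columns become factors in any R version
demo.fac <- read.csv(text=csv.text, stringsAsFactors=TRUE)

class(demo.fac$email)     # "factor"
levels(demo.fac$email)    # "no" "yes" (levels are sorted alphabetically)
class(demo.fac$cust.id)   # "integer": IDs still need factor() by hand
```

Either way, the ID column is read as an integer, which is why the explicit factor() conversion is still needed.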
Option: passing ordered=TRUE to factor() (or using the ordered() function) creates ordinal factors, whose levels have a rank order.
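As a rough illustration of the difference between nominal and ordinal factors, the sketch below uses a short vector of ratings. The vector ratings and the 1 to 5 scale are assumptions chosen for the example:

```r
# Hypothetical satisfaction ratings on an assumed 1-5 scale, one missing
ratings <- c(3, 3, NA, 4, 1)

# Nominal factor: levels are labels with no rank order
ratings.nom <- factor(ratings, levels=1:5)
# Ordinal factor: levels are ranked 1 < 2 < 3 < 4 < 5
ratings.ord <- factor(ratings, levels=1:5, ordered=TRUE)

is.ordered(ratings.nom)           # FALSE
is.ordered(ratings.ord)           # TRUE
ratings.ord[4] > ratings.ord[1]   # TRUE: 4 ranks above 3
```

Comparisons such as > are meaningful only for the ordered version; on a nominal factor R returns NA with a warning.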
Let's look at scatterplots. How does age relate to credit score?
plot(x=cust.df$age, y=cust.df$credit.score)
Add color, labels, and adjust the axis limits:
plot(cust.df$age, cust.df$credit.score,
     col="blue",
     xlim=c(15, 55), ylim=c(500, 900),
     main="Active Customers as of June 2014",
     xlab="Customer Age (years)", ylab="Credit Score")