# R for Marketing Research and Analytics

Chris Chapman and Elea McDonnell Feit
January 2016

Chapter 2: Basics of the R Language

Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html

## Objects

$$\rightarrow$$ We'll cover some of the basic object types in R

$$\rightarrow$$ Objects in R include variables, data sets, and functions

### Basic objects

The assignment operator <- assigns a value to a named object.

x <- c(2, 4, 6, 8)
x

[1] 2 4 6 8


Object names are case sensitive. We defined 'x', so asking for 'X' produces an error:

X

Error in eval(expr, envir, enclos): object 'X' not found


### Vectors

We've just seen how to create a vector: the c() function concatenates individual items into a vector.

xNum  <- c(1, 3.14159, 5, 7)
xNum

[1] 1.00000 3.14159 5.00000 7.00000

xLog  <- c(TRUE, FALSE, TRUE, TRUE)
xLog

[1]  TRUE FALSE  TRUE  TRUE

xChar <- c("foo", "bar", "boo", "far")
xChar

[1] "foo" "bar" "boo" "far"


### Vectors: Type Coercion

A vector can hold only a single type of value (number, text, etc.). When types are mixed, all values are coerced to the most general type:

xMix  <- c(1, TRUE, 3, "Hello, world!")
xMix

[1] "1"             "TRUE"          "3"             "Hello, world!"
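To make "most general type" concrete, here is a small sketch of the usual ordering (logical, then integer, then numeric, then character). The variable names are illustrative only and are not part of the original slides; class() reports the type each vector ended up with.

```r
# Illustrative names; not part of the original slides
v.int <- c(TRUE, 2L)     # logical + integer    -> integer
v.num <- c(2L, 3.5)      # integer + double     -> numeric
v.chr <- c(3.5, "a")     # double  + character  -> character

class(v.int)   # "integer"
class(v.num)   # "numeric"
class(v.chr)   # "character"
```

Character is the most general type, which is why a single text value turns the whole of xMix into text.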


c() can be used to concatenate vectors just as it combines single items:

x2 <- c(x, x)
x2

[1] 2 4 6 8 2 4 6 8


Type coercion will be applied as needed:

c(x2, 100)

[1]   2   4   6   8   2   4   6   8 100

c(x2, "Hello")

[1] "2"     "4"     "6"     "8"     "2"     "4"     "6"     "8"     "Hello"


### Forcing coercion

xMix

[1] "1"             "TRUE"          "3"             "Hello, world!"

xMix[1]   # we'll see more on indices later

[1] "1"

as.numeric(xMix[1])   # forces it to "numeric"

[1] 1

as.numeric(xMix[1]) + 1.5

[1] 2.5
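Forced coercion can also fail. The sketch below shows what as.numeric() does with text it cannot read as a number; the name xMixNum is hypothetical, and suppressWarnings() is used only to keep the example quiet.

```r
xMix <- c(1, TRUE, 3, "Hello, world!")

# as.numeric() returns NA (with a warning, suppressed here) for any
# element that cannot be parsed as a number
xMixNum <- suppressWarnings(as.numeric(xMix))

xMixNum          # 1 NA 3 NA
is.na(xMixNum)   # FALSE TRUE FALSE TRUE
```

Note that the text "TRUE" also becomes NA: once a value has been stored as character, R no longer treats it as a logical.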


### Help!

There are many ways to get help for R:

| Command/Source | Note |
|---|---|
| R: ?someword | Get help on someword that R knows |
| R: ??someword | Search all R help files for the word in text |
| R: ?"some string" or ??"some string" | Search for a string, character, etc. that doesn't work as a word |
| R: vignette() | List all the vignettes available |
| R: vignette("zoo") | Open the vignette named (for package) "zoo" |
| Web: CRAN task view | Suggested packages by area such as Econometrics, Clustering, etc. https://cran.r-project.org/web/views/ |
| Web: R help list | Monitored by volunteers with many R experts and authors |
| Web: Google | Understands "R" in many contexts |
| Web: Stack Overflow | Often great contributions, http://stackoverflow.com/questions/tagged/r |

### Summary

The summary() function summarizes an object in a way that is (usually) appropriate for its data type:

summary(xNum)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.606   4.071   4.035   5.500   7.000

summary(xChar)

   Length     Class      Mode
        4 character character


### Math on Vectors

x2

[1] 2 4 6 8 2 4 6 8

x2 + 1

[1] 3 5 7 9 3 5 7 9

x2 * pi

[1]  6.283185 12.566371 18.849556 25.132741  6.283185 12.566371 18.849556
[8] 25.132741


### More complex math

R applies operations element by element across vectors (or matrices). When the lengths differ, the shorter vector is recycled to match the longer one:

x

[1] 2 4 6 8

x2    # longer than x

[1] 2 4 6 8 2 4 6 8

(x+cos(0.5)) * x2     # x is recycled to match x2

[1]  5.755165 19.510330 41.265495 71.020660  5.755165 19.510330 41.265495
[8] 71.020660
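Recycling is silent only when the longer length is a multiple of the shorter one. The following sketch, with illustrative variable names that are not in the original slides, contrasts the two cases.

```r
x <- c(2, 4, 6, 8)

# Length 2 divides length 4 evenly: recycled without comment
r.even <- x + c(10, 20)                      # 12 24 16 28

# Length 3 does not divide length 4: R still recycles, but warns
# (the warning is suppressed here to keep the example quiet)
r.odd <- suppressWarnings(x + c(1, 2, 3))    # 3 6 9 9
```

A recycling warning usually signals a bug, such as two vectors that were meant to be the same length.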


### Length and structure

So vectors can be recycled. How do you find the length?

length(x)

[1] 4

length(x2)

[1] 8


A more general solution is to investigate the structure:

str(x2)

 num [1:8] 2 4 6 8 2 4 6 8

str(xChar)

 chr [1:4] "foo" "bar" "boo" "far"


## Sequences and Indexing

### Sequences

Basic 1-by-1 integer sequences are constructed with the “:” operator:

xSeq <- 1:10
xSeq

 [1]  1  2  3  4  5  6  7  8  9 10


Be careful with operator precedence … clarify liberally with parentheses:

1:5*2

[1]  2  4  6  8 10

1:(5*2)

 [1]  1  2  3  4  5  6  7  8  9 10
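A common place where this precedence rule bites is a sequence that ends at n-1. The sketch below uses a hypothetical variable n; the ":" operator binds more tightly than + - * / but less tightly than ^.

```r
n <- 5

s1 <- 1:n-1      # parsed as (1:n) - 1   -> 0 1 2 3 4
s2 <- 1:(n-1)    # probably what was meant -> 1 2 3 4
s3 <- 1:2^2      # ^ binds tighter than ":" -> 1 2 3 4
```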


### Indexing a vector 1

Basic indexing uses a set of integers to select positions:

xNum

[1] 1.00000 3.14159 5.00000 7.00000

xNum[2:4]

[1] 3.14159 5.00000 7.00000

xNum[c(1,3)]

[1] 1 5


Variables and math operators can be used as well:

myStart <- 2
xNum[myStart:sqrt(myStart+7)]

[1] 3.14159 5.00000


### Negative indexing

A negative index omits elements (returns everything else):

xSeq

 [1]  1  2  3  4  5  6  7  8  9 10

xSeq[-5:-7]

[1]  1  2  3  4  8  9 10
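The expression -5:-7 works because unary minus is applied before ":", so it means (-5):(-7). A short sketch of equivalent and related forms follows; the result names are illustrative only.

```r
xSeq <- 1:10

a <- xSeq[-5:-7]            # (-5):(-7): drops positions 5, 6, 7
b <- xSeq[-(5:7)]           # same result, arguably easier to read
no.ends <- xSeq[-c(1, 10)]  # drop the first and last elements

identical(a, b)             # TRUE
```

Writing -(5:7) avoids the trap of -5:7, which would run from -5 up to 7 and mix negative and positive indices, something R does not allow.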


### Indexing with Boolean

Boolean values of TRUE and FALSE can be used to select items:

xNum

[1] 1.00000 3.14159 5.00000 7.00000

xNum[c(FALSE, TRUE, TRUE, TRUE)]

[1] 3.14159 5.00000 7.00000


This is most often used with comparison operators to select items:

xNum > 3

[1] FALSE  TRUE  TRUE  TRUE

xNum[xNum > 3]

[1] 3.14159 5.00000 7.00000
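Logical conditions can be combined with & (and) and | (or), and the logical vector itself can be counted or converted to positions. This is a minimal sketch with illustrative result names.

```r
xNum <- c(1, 3.14159, 5, 7)

mid   <- xNum[xNum > 3 & xNum < 6]   # both conditions hold: 3.14159 5
ends  <- xNum[xNum < 2 | xNum > 6]   # either condition holds: 1 7
pos   <- which(xNum > 3)             # positions of the TRUE values: 2 3 4
n.big <- sum(xNum > 3)               # TRUE counts as 1, so this is 3
```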


### Missing and interesting values

my.test.scores <- c(91, 93, NA, NA)

mean(my.test.scores)

[1] NA

max(my.test.scores)

[1] NA

mean(my.test.scores, na.rm=TRUE)

[1] 92

max(my.test.scores, na.rm=TRUE)

[1] 93


### Other ways to omit

na.omit(my.test.scores)

[1] 91 93
attr(,"na.action")
[1] 3 4
attr(,"class")
[1] "omit"

mean(na.omit(my.test.scores))

[1] 92

is.na(my.test.scores)

[1] FALSE FALSE  TRUE  TRUE

my.test.scores[!is.na(my.test.scores)]

[1] 91 93
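Because is.na() returns a logical vector, it can also be used to count missing values. The sketch below also shows why is.na() is needed at all: comparing with == NA never yields TRUE. The result names are illustrative.

```r
my.test.scores <- c(91, 93, NA, NA)

n.missing     <- sum(is.na(my.test.scores))    # 2
share.missing <- mean(is.na(my.test.scores))   # 0.5

# Any comparison with NA is itself NA, so this cannot find missing values
wrong <- my.test.scores == NA                  # NA NA NA NA
```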


## More Complex Structures

### Lists

• Skipping lists for today's tutorial
• They are important but not directly used as often as other data formats
• See the book for detail, section 2.4.7

## Data frames

$$\rightarrow$$ Data frames are the most common way to handle data sets in R.

### Data frames 1

x.df <- data.frame(xNum, xLog, xChar)
x.df

     xNum  xLog xChar
1 1.00000  TRUE   foo
2 3.14159 FALSE   bar
3 5.00000  TRUE   boo
4 7.00000  TRUE   far


Data frames have column names and are indexed as [row, column].

x.df[2,1]

[1] 3.14159

x.df[1,3]

[1] foo
Levels: bar boo far foo


### Data frames 2

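In R versions before 4.0.0, text data is converted to factors by default (since R 4.0.0 the default is stringsAsFactors=FALSE). Under the older default, you'll often want to turn that off: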

x.df[1,3]

[1] foo
Levels: bar boo far foo

x.df <- data.frame(xNum, xLog, xChar, stringsAsFactors=FALSE)
x.df[1,3]

[1] "foo"


### Indexing data frames

x.df[2, ]  # all of row 2

     xNum  xLog xChar
2 3.14159 FALSE   bar

x.df[ ,3]  # all of column 3

[1] "foo" "bar" "boo" "far"

x.df[2:3, ]

     xNum  xLog xChar
2 3.14159 FALSE   bar
3 5.00000  TRUE   boo

x.df[ ,1:2]

     xNum  xLog
1 1.00000  TRUE
2 3.14159 FALSE
3 5.00000  TRUE
4 7.00000  TRUE
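Besides numeric positions, columns can be selected by name and rows by a logical condition, using the same ideas as for vectors. This is a sketch that rebuilds x.df so that it runs on its own; the result names are illustrative only.

```r
xNum  <- c(1, 3.14159, 5, 7)
xLog  <- c(TRUE, FALSE, TRUE, TRUE)
xChar <- c("foo", "bar", "boo", "far")
x.df  <- data.frame(xNum, xLog, xChar, stringsAsFactors = FALSE)

by.name   <- x.df$xChar                   # one column, by name
big.chars <- x.df[x.df$xNum > 3, "xChar"] # rows where xNum > 3: "bar" "boo" "far"
true.rows <- x.df[x.df$xLog, ]            # rows where xLog is TRUE (rows 1, 3, 4)
```

Selecting by name keeps working if the column order changes, which makes it less fragile than selecting by position.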


### Negative indexing data frames

x.df[-3, ]  # omit the third observation

     xNum  xLog xChar
1 1.00000  TRUE   foo
2 3.14159 FALSE   bar
4 7.00000  TRUE   far

x.df[, -2]  # omit the second column

     xNum xChar
1 1.00000   foo
2 3.14159   bar
3 5.00000   boo
4 7.00000   far


### Let's create more interesting data

Warning: first we delete every object in the workspace.

rm(list=ls())    # caution, deletes all objects


### Store data

store.num <- factor(c(3, 14, 21, 32, 54)) # store id
store.rev <- c(543, 654, 345, 678, 234)   # store revenue, $K
store.visits <- c(45, 78, 32, 56, 34)     # visits, 1000s
store.manager <- c("Annie", "Bert", "Carla", "Dave", "Ella")
(store.df <- data.frame(store.num, store.rev, store.visits,
                        store.manager, stringsAsFactors=FALSE))

  store.num store.rev store.visits store.manager
1         3       543           45         Annie
2        14       654           78          Bert
3        21       345           32         Carla
4        32       678           56          Dave
5        54       234           34          Ella


### Some data checks

summary(store.df)  # always recommended!

 store.num   store.rev      store.visits store.manager
 3 :1      Min.   :234.0   Min.   :32   Length:5
 14:1      1st Qu.:345.0   1st Qu.:34   Class :character
 21:1      Median :543.0   Median :45   Mode  :character
 32:1      Mean   :490.8   Mean   :49
 54:1      3rd Qu.:654.0   3rd Qu.:56
           Max.   :678.0   Max.   :78

store.df$store.manager

[1] "Annie" "Bert"  "Carla" "Dave"  "Ella"

mean(store.df$store.rev)

[1] 490.8


### Read and write CSVs

write.csv(store.df, row.names=FALSE)

"store.num","store.rev","store.visits","store.manager"
"3",543,45,"Annie"
"14",654,78,"Bert"
"21",345,32,"Carla"
"32",678,56,"Dave"
"54",234,34,"Ella"

write.csv(store.df, file="store-df.csv", row.names=FALSE)
read.csv("store-df.csv")  # "file=" is optional

  store.num store.rev store.visits store.manager
1         3       543           45         Annie
2        14       654           78          Bert
3        21       345           32         Carla
4        32       678           56          Dave
5        54       234           34          Ella


## Exercises

### Exercise!

Access the Salaries data set:

library(car)    # install.packages("car") if needed
data(Salaries)

1. How many variables and observations are there in the data set?
2. How many professors have more than 40 years of service? ($$\rightarrow$$ hint: you can sum() a logical vector)
3. How many have salary >$150000?
4. What is the mean salary for professors with >20 years service?
5. How do you find out more about the data set?

1. How many variables and observations are there in the data?
2. How many professors have more than 40 years of service?
3. Which observations have < 1 year of service?
4. What is the mean salary for professors with >20 years service?
5. How do you find out more about the data set?

dim(Salaries)                             # or even better: str(Salaries)

[1] 397   6

sum(Salaries$yrs.service > 40)

[1] 21

Salaries[Salaries$yrs.service > 20, ]      # output not shown

mean(Salaries[Salaries$yrs.service > 20, "salary"])

[1] 122103.9

?Salaries


## Optional Topics

• Basic Functions
• Sequences, again
• Interesting numbers
• Load and save raw data

### Writing Basic Functions

se <- function(x) { sd(x) / sqrt(length(x)) }
se(store.df$store.visits)

[1] 8.42615

mean(store.df$store.visits) + 1.96 * se(store.df$store.visits)

[1] 65.51525
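The calculation above could itself be wrapped in a function. The following is only a sketch: ci.mean and its z argument are hypothetical names, not from the original slides, and the interval is the simple normal approximation implied by the 1.96 multiplier.

```r
# se() as defined in the slides
se <- function(x) { sd(x) / sqrt(length(x)) }

# Hypothetical helper: approximate confidence interval for a mean.
# z has a default value, so it can be omitted when calling the function.
ci.mean <- function(x, z = 1.96) {
  m <- mean(x)
  half.width <- z * se(x)
  c(lower = m - half.width, upper = m + half.width)
}

store.visits <- c(45, 78, 32, 56, 34)
ci <- ci.mean(store.visits)          # uses the default z = 1.96
ci                                   # upper bound matches 65.51525 above
ci.mean(store.visits, z = 1.645)     # override the default for a narrower interval
```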


A function has:

• an assigned name (created with '<-')
• zero or more arguments that it operates on (in () )
• a body (usually in { }) with lines of code
• a return value (the last computed value, by default)

se <- function(x) {
  # computes standard error of the mean
  tmp.sd <- sd(x)                  # standard deviation
  tmp.N  <- length(x)              # sample size
  tmp.se <- tmp.sd / sqrt(tmp.N)   # std error of the mean
  return(tmp.se)                   # return() is optional but clear
}

se(store.df$store.visits)

[1] 8.42615

This is much better! You can examine it to see what it does:

se

function(x) {
  # computes standard error of the mean
  tmp.sd <- sd(x)                  # standard deviation
  tmp.N  <- length(x)              # sample size
  tmp.se <- tmp.sd / sqrt(tmp.N)   # std error of the mean
  return(tmp.se)                   # return() is optional but clear
}


### Other ways to make sequences

The seq() function constructs sequences in various ways:

seq(from=-5, to=28, by=4)

[1] -5 -1  3  7 11 15 19 23 27

seq(from=-5, to=28, length=6)

[1] -5.0  1.6  8.2 14.8 21.4 28.0

The rep() (repeat) function is also useful. It is especially good for constructing indices into data sets with repeating structure:

rep(c(1,2,3), each=3)

[1] 1 1 1 2 2 2 3 3 3

rep(seq(from=-3, to=13, by=4), c(1, 2, 3, 2, 1))

[1] -3  1  1  5  5  5  9  9 13


### Infinite and Impossible numbers

1/0

[1] Inf

log(c(-1,0,1))

[1]  NaN -Inf    0

sqrt(-2)

[1] NaN

sqrt(2i)

[1] 1+1i

You can use these values yourself (occasionally it makes sense):

10 < Inf

[1] TRUE


### Loading and saving raw data formats

save(store.df, file="store-df-backup.RData")
rm(store.df)
mean(store.df$store.rev)     # error

Error in mean(store.df$store.rev): object 'store.df' not found

load("store-df-backup.RData")
mean(store.df$store.rev)     # works now

[1] 490.8


Note that load() silently overwrites any existing object with the same name:

store.df <- 5
store.df

[1] 5

load("store-df-backup.RData")
store.df

  store.num store.rev store.visits store.manager
1         3       543           45         Annie
2        14       654           78          Bert
3        21       345           32         Carla
4        32       678           56          Dave
5        54       234           34          Ella
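If silent overwriting is a concern, base R's saveRDS() and readRDS() offer an alternative: they store a single object and return it on reading, so you choose the name it is assigned to. This is a sketch only, using a two-column subset of the store data and a temporary file; rds.file and restored are illustrative names.

```r
store.df <- data.frame(store.num = factor(c(3, 14, 21, 32, 54)),
                       store.rev = c(543, 654, 345, 678, 234))

rds.file <- tempfile(fileext = ".rds")   # temporary path, for this sketch only
saveRDS(store.df, file = rds.file)

store.df <- 5                    # reuse the name for something else
restored <- readRDS(rds.file)    # returns the saved object; store.df is untouched

store.df                         # still 5
restored                         # the saved data frame
unlink(rds.file)                 # clean up the temporary file
```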


### Saving images (but generally, don't!)

Save to the default file, ".RData":

• save.image()

Save to an arbitrary filename:

• save.image("mywork.RData")

Exercises here use the Salaries data set from the car package, John Fox and Sanford Weisberg (2011). An R Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion