R for Marketing Research and Analytics

Chris Chapman and Elea McDonnell Feit
January 2016

Chapter 2: Basics of the R Language

Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html

Objects

$ \rightarrow $ We'll cover some of the basic object types in R

$ \rightarrow $ Objects in R include variables, data sets, and functions

Basic objects

The assignment operator <- assigns a value to a named object.

x <- c(2, 4, 6, 8)
x

[1] 2 4 6 8

Object names are case sensitive. Typing 'X' instead of 'x' produces an error:

X

Error in eval(expr, envir, enclos): object 'X' not found

Vectors

We've just seen how to create a vector: the c() function concatenates individual items into a vector.

xNum  <- c(1, 3.14159, 5, 7)
xNum

[1] 1.00000 3.14159 5.00000 7.00000

xLog  <- c(TRUE, FALSE, TRUE, TRUE)
xLog

[1]  TRUE FALSE  TRUE  TRUE

xChar <- c("foo", "bar", "boo", "far")
xChar

[1] "foo" "bar" "boo" "far"

Vectors: Type Coercion

A vector can hold only a single type of value (number, text, etc.). Mixed values are coerced to the most general type.

xMix  <- c(1, TRUE, 3, "Hello, world!") 
xMix

[1] "1"             "TRUE"          "3"             "Hello, world!"
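A minimal sketch of the coercion order, using class() to inspect what c() settled on. The example values are illustrative and not from the original slides: logical values are promoted to numeric, and anything mixed with text becomes character.

```r
# Illustrative values only: class() reports the type chosen by c()
cls.logical <- class(c(TRUE, FALSE))        # "logical"
cls.numeric <- class(c(TRUE, 2))            # "numeric": TRUE becomes 1
cls.char    <- class(c(TRUE, 2, "three"))   # "character": everything becomes text
c(cls.logical, cls.numeric, cls.char)
```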

More about vectors

c() can combine whole vectors just as it combines single items:

x2 <- c(x, x)
x2

[1] 2 4 6 8 2 4 6 8

Type coercion will be applied as needed:

c(x2, 100)

[1]   2   4   6   8   2   4   6   8 100

c(x2, "Hello")

[1] "2"     "4"     "6"     "8"     "2"     "4"     "6"     "8"     "Hello"

Forcing coercion

xMix

[1] "1"             "TRUE"          "3"             "Hello, world!"

xMix[1]   # we'll see more on indices later

[1] "1"

as.numeric(xMix[1])   # forces it to "numeric"

[1] 1

as.numeric(xMix[1]) + 1.5

[1] 2.5
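Forced coercion can fail for individual elements. As a small sketch (the variable name xMixNum is hypothetical), converting the whole xMix vector shows that text which does not look like a number becomes NA, including the string "TRUE":

```r
xMix <- c(1, TRUE, 3, "Hello, world!")
# as.numeric() normally warns about text it cannot parse;
# the warning is suppressed here only to keep the example quiet
xMixNum <- suppressWarnings(as.numeric(xMix))
xMixNum                      # [1]  1 NA  3 NA
sum(xMixNum, na.rm=TRUE)     # [1] 4
```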

Help!

There are many ways to get help for R:

Command/Source	Note
R: ?someword	to get help on someword that R knows
R: ??someword	to search all R help files for the word in text
R: ?"some string" or ??"some string"	search for a string, character, etc. that doesn't work as a word
R: vignette()	list all the vignettes available
R: vignette("zoo")	open the vignette named (for package) "zoo"
Web: CRAN task view	Suggested packages by area such as Econometrics, Clustering, etc. https://cran.r-project.org/web/views/
Web: R help list	Monitored by volunteers with many R experts and authors
Web: Google	Understands “R” in many contexts
Web: Stack Overflow	Often great contributions, http://stackoverflow.com/questions/tagged/r

Summary

The summary() function summarizes an object in a way that is (usually) appropriate for its data type:

summary(xNum)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.606   4.071   4.035   5.500   7.000

summary(xChar)

   Length     Class      Mode 
        4 character character

Math on Vectors

x2

[1] 2 4 6 8 2 4 6 8

x2 + 1

[1] 3 5 7 9 3 5 7 9

x2 * pi

[1]  6.283185 12.566371 18.849556 25.132741  6.283185 12.566371 18.849556
[8] 25.132741

More complex math

R will generalize operations across multiple vectors (or matrices) as best it can:

x

[1] 2 4 6 8

x2    # longer than x

[1] 2 4 6 8 2 4 6 8

(x+cos(0.5)) * x2     # x is recycled to match x2

[1]  5.755165 19.510330 41.265495 71.020660  5.755165 19.510330 41.265495
[8] 71.020660
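Recycling also happens when the longer length is not a multiple of the shorter one, although R then issues a warning. A brief sketch, assuming a hypothetical vector x3 of length 6:

```r
x  <- c(2, 4, 6, 8)
x3 <- 1:6        # hypothetical vector; 6 is not a multiple of 4
# R recycles x as 2 4 6 8 2 4 and would normally warn about the length mismatch;
# the warning is suppressed here only to keep the example quiet
x.sum <- suppressWarnings(x + x3)
x.sum            # [1]  3  6  9 12  7 10
```

Such a warning usually signals a mistake, so it is worth checking lengths before combining vectors.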

Length and structure

So vectors can be recycled. How do you find the length?

length(x)

[1] 4

length(x2)

[1] 8

A more general solution is to investigate the structure:

str(x2)

 num [1:8] 2 4 6 8 2 4 6 8

str(xChar)

 chr [1:4] "foo" "bar" "boo" "far"

Sequences and Indexing

Sequences

Basic 1-by-1 integer sequences are constructed with the “:” operator:

xSeq <- 1:10
xSeq

 [1]  1  2  3  4  5  6  7  8  9 10

Be careful with operator precedence: ":" is evaluated before "*", so clarify liberally with parentheses:

1:5*2

[1]  2  4  6  8 10

1:(5*2)

 [1]  1  2  3  4  5  6  7  8  9 10

Indexing a vector 1

Basic indexing uses a set of integers to select positions:

xNum

[1] 1.00000 3.14159 5.00000 7.00000

xNum[2:4]

[1] 3.14159 5.00000 7.00000

xNum[c(1,3)]

[1] 1 5

Variables and math operators can be used as well:

myStart <- 2
xNum[myStart:sqrt(myStart+7)]

[1] 3.14159 5.00000

Negative indexing

A negative index omits elements (returns everything else):

xSeq

 [1]  1  2  3  4  5  6  7  8  9 10

xSeq[-5:-7]

[1]  1  2  3  4  8  9 10
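A short sketch of why -5:-7 works, and of an equivalent form that some find easier to read. The unary minus is applied before ":", so -5:-7 is the sequence -5, -6, -7:

```r
xSeq <- 1:10
neg.idx <- -5:-7          # [1] -5 -6 -7
omit.a  <- xSeq[-5:-7]
omit.b  <- xSeq[-(5:7)]   # same elements omitted, written with parentheses
omit.ends <- xSeq[-c(1, 10)]   # drop the first and last elements
# Note: positive and negative indices cannot be mixed, so xSeq[c(-1, 2)] is an error
omit.ends                 # [1] 2 3 4 5 6 7 8 9
```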

Indexing with Boolean

Boolean values of TRUE and FALSE can be used to select items:

xNum

[1] 1.00000 3.14159 5.00000 7.00000

xNum[c(FALSE, TRUE, TRUE, TRUE)]

[1] 3.14159 5.00000 7.00000

This is most often used with comparison operators to select items:

xNum > 3

[1] FALSE  TRUE  TRUE  TRUE

xNum[xNum > 3]

[1] 3.14159 5.00000 7.00000
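Logical tests can be combined, counted, or converted to positions. A minimal sketch using the same xNum vector (the result names are hypothetical):

```r
xNum <- c(1, 3.14159, 5, 7)
mid.vals <- xNum[xNum > 3 & xNum < 6]   # combine conditions with &
mid.vals                                # [1] 3.14159 5.00000
big.pos <- which(xNum > 3)              # positions where the test is TRUE
big.pos                                 # [1] 2 3 4
big.n <- sum(xNum > 3)                  # TRUE counts as 1, so this counts matches
big.n                                   # [1] 3
```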

Missing and interesting values

my.test.scores <- c(91, 93, NA, NA)

mean(my.test.scores)

[1] NA

max(my.test.scores)

[1] NA

mean(my.test.scores, na.rm=TRUE)

[1] 92

max(my.test.scores, na.rm=TRUE)

[1] 93

Other ways to omit

na.omit(my.test.scores)

[1] 91 93
attr(,"na.action")
[1] 3 4
attr(,"class")
[1] "omit"

mean(na.omit(my.test.scores))

[1] 92

is.na(my.test.scores)

[1] FALSE FALSE  TRUE  TRUE

my.test.scores[!is.na(my.test.scores)]

[1] 91 93
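Because is.na() returns a logical vector, it can also be used to count missing values. A small sketch (the result names are hypothetical); note that comparing with NA directly yields NA rather than TRUE or FALSE, which is why is.na() is needed:

```r
my.test.scores <- c(91, 93, NA, NA)
n.missing <- sum(is.na(my.test.scores))      # number of missing values
n.missing                                    # [1] 2
prop.missing <- mean(is.na(my.test.scores))  # proportion missing
prop.missing                                 # [1] 0.5
na.compare <- my.test.scores == NA           # every element is NA: not a usable test
na.compare                                   # [1] NA NA NA NA
```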

More Complex Structures

Lists

Skipping lists for today's tutorial
They are important but not directly used as often as other data formats
See the book for detail, section 2.4.7

Data frames

$ \rightarrow $ Data frames are the most common way to handle data sets in R.

Data frames 1

x.df <- data.frame(xNum, xLog, xChar)
x.df

     xNum  xLog xChar
1 1.00000  TRUE   foo
2 3.14159 FALSE   bar
3 5.00000  TRUE   boo
4 7.00000  TRUE   far

Data frames have column names and are indexed as [row, column].

x.df[2,1]

[1] 3.14159

x.df[1,3]

[1] foo
Levels: bar boo far foo

Data frames 2

In versions of R before 4.0.0, text data is converted to factors by default (as in the output shown here). You'll often want to turn that off:

x.df[1,3]

[1] foo
Levels: bar boo far foo

x.df <- data.frame(xNum, xLog, xChar, stringsAsFactors=FALSE)
x.df[1,3]

[1] "foo"

Indexing data frames

x.df[2, ]  # all of row 2

     xNum  xLog xChar
2 3.14159 FALSE   bar

x.df[ ,3]  # all of column 3

[1] "foo" "bar" "boo" "far"

x.df[2:3, ]

     xNum  xLog xChar
2 3.14159 FALSE   bar
3 5.00000  TRUE   boo

x.df[ ,1:2]

     xNum  xLog
1 1.00000  TRUE
2 3.14159 FALSE
3 5.00000  TRUE
4 7.00000  TRUE

Negative indexing data frames

x.df[-3, ]  # omit the third observation

     xNum  xLog xChar
1 1.00000  TRUE   foo
2 3.14159 FALSE   bar
4 7.00000  TRUE   far

x.df[, -2]  # omit the second column

     xNum xChar
1 1.00000   foo
2 3.14159   bar
3 5.00000   boo
4 7.00000   far
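Columns can also be selected by name, and rows by a logical test, which is often more readable than numeric positions. A minimal sketch that rebuilds x.df from the earlier vectors (the result names are hypothetical):

```r
xNum  <- c(1, 3.14159, 5, 7)
xLog  <- c(TRUE, FALSE, TRUE, TRUE)
xChar <- c("foo", "bar", "boo", "far")
x.df  <- data.frame(xNum, xLog, xChar, stringsAsFactors=FALSE)

col.a <- x.df$xNum            # one column by name, using $
col.b <- x.df[ , "xNum"]      # the same column, using [row, column]
rows.true <- x.df[x.df$xLog, ]                         # rows where xLog is TRUE
rows.big  <- x.df[x.df$xNum > 3, c("xNum", "xChar")]   # filter rows, pick columns
rows.big
```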

Let's create more interesting data

Warning: we're about to delete everything first

rm(list=ls())    # caution, deletes all objects

Store data

store.num <- factor(c(3, 14, 21, 32, 54)) # store id
store.rev <- c(543, 654, 345, 678, 234)   # store revenue, $K
store.visits <- c(45, 78, 32, 56, 34)     # visits, 1000s
store.manager <- c("Annie", "Bert", "Carla", "Dave", "Ella")

(store.df <- data.frame(store.num, store.rev, store.visits,
                        store.manager, stringsAsFactors=FALSE))

  store.num store.rev store.visits store.manager
1         3       543           45         Annie
2        14       654           78          Bert
3        21       345           32         Carla
4        32       678           56          Dave
5        54       234           34          Ella

Some data checks

summary(store.df)   # always recommended!

 store.num   store.rev      store.visits store.manager     
 3 :1      Min.   :234.0   Min.   :32    Length:5          
 14:1      1st Qu.:345.0   1st Qu.:34    Class :character  
 21:1      Median :543.0   Median :45    Mode  :character  
 32:1      Mean   :490.8   Mean   :49                      
 54:1      3rd Qu.:654.0   3rd Qu.:56                      
           Max.   :678.0   Max.   :78

store.df$store.manager

[1] "Annie" "Bert"  "Carla" "Dave"  "Ella"

mean(store.df$store.rev)

[1] 490.8
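Beyond summary(), a few quick lookups help confirm the data looks right. A sketch that rebuilds the store data and uses which.max() and a logical test (the result names are hypothetical):

```r
store.df <- data.frame(
  store.num     = factor(c(3, 14, 21, 32, 54)),
  store.rev     = c(543, 654, 345, 678, 234),
  store.visits  = c(45, 78, 32, 56, 34),
  store.manager = c("Annie", "Bert", "Carla", "Dave", "Ella"),
  stringsAsFactors = FALSE)

str(store.df)      # compact view of each column's type and first values
top.manager <- store.df$store.manager[which.max(store.df$store.rev)]
top.manager        # [1] "Dave"
above.avg <- store.df[store.df$store.rev > mean(store.df$store.rev), "store.manager"]
above.avg          # [1] "Annie" "Bert"  "Dave"
```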

Read and write CSVs

write.csv(store.df, row.names=FALSE)

"store.num","store.rev","store.visits","store.manager"
"3",543,45,"Annie"
"14",654,78,"Bert"
"21",345,32,"Carla"
"32",678,56,"Dave"
"54",234,34,"Ella"

write.csv(store.df, file="store-df.csv", row.names=FALSE)
read.csv("store-df.csv")  # "file=" is optional

  store.num store.rev store.visits store.manager
1         3       543           45         Annie
2        14       654           78          Bert
3        21       345           32         Carla
4        32       678           56          Dave
5        54       234           34          Ella
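A CSV file stores only text, so column types such as factors are not preserved when the data is read back. A minimal sketch, assuming a temporary file from tempfile() rather than the file name used above (store.df2 is a hypothetical name):

```r
store.df <- data.frame(
  store.num     = factor(c(3, 14, 21, 32, 54)),
  store.rev     = c(543, 654, 345, 678, 234),
  store.visits  = c(45, 78, 32, 56, 34),
  store.manager = c("Annie", "Bert", "Carla", "Dave", "Ella"),
  stringsAsFactors = FALSE)

tmp.file <- tempfile(fileext=".csv")    # temporary file for the round trip
write.csv(store.df, file=tmp.file, row.names=FALSE)
store.df2 <- read.csv(tmp.file, stringsAsFactors=FALSE)

was.factor <- is.factor(store.df2$store.num)   # FALSE: the factor was not preserved
store.df2$store.num <- factor(store.df2$store.num)   # convert it back explicitly
is.factor(store.df2$store.num)                 # [1] TRUE
```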

Exercises

Exercise!

Access the Salaries data set:

library(car)    # install.packages("car") if needed
data(Salaries)

How many variables and observations are there in the data set?
How many professors have more than 40 years of service?
($ \rightarrow $ hint: you can sum() a logical vector)
How many have salary > $150000?
What is the mean salary for professors with >20 years service?
How do you find out more about the data set?

One Set of Answers

How many variables and observations are there in the data?
How many professors have more than 40 years of service?
Which observations have < 1 year of service?
What is the mean salary for professors with >20 years service?
How do you find out more about the data set?

dim(Salaries)                             # or even better: str(Salaries)

[1] 397   6

sum(Salaries$yrs.service > 40)

[1] 21

Salaries[Salaries$yrs.service < 1, ]       # output not shown

mean(Salaries[Salaries$yrs.service > 20, "salary"])

[1] 122103.9

?Salaries

Optional Topics

Basic Functions
Sequences, again
Interesting numbers
Load and save raw data

Writing Basic Functions

se <- function(x) { sd(x) / sqrt(length(x)) }

se(store.df$store.visits)

[1] 8.42615

mean(store.df$store.visits) + 1.96 * se(store.df$store.visits)

[1] 65.51525

A function has:

an assigned name (created with '<-')
zero or more arguments that it operates on (in () )
a body (usually in { }) with lines of code
a return value (the last computed value, by default)

Document your functions inline!

se <- function(x) {
  # computes standard error of the mean
  tmp.sd <- sd(x)      # standard deviation
  tmp.N  <- length(x)  # sample size
  tmp.se <- tmp.sd / sqrt(tmp.N)   # std error of the mean
  return(tmp.se)       # return() is optional but clear
}

se(store.df$store.visits)

[1] 8.42615

This is much better! You can examine it to see what it does:

se

function(x) {
  # computes standard error of the mean
  tmp.sd <- sd(x)      # standard deviation
  tmp.N  <- length(x)  # sample size
  tmp.se <- tmp.sd / sqrt(tmp.N)   # std error of the mean
  return(tmp.se)       # return() is optional but clear
}
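Functions can take more than one argument, and arguments can have default values. As a sketch, the hypothetical helper ci.mean() below builds on se() to return both ends of an approximate interval around the mean; the name and the default z=1.96 are assumptions for illustration:

```r
se <- function(x) { sd(x) / sqrt(length(x)) }

ci.mean <- function(x, z=1.96) {
  # hypothetical helper: approximate interval for the mean, mean +/- z * se
  tmp.mean <- mean(x)
  tmp.half <- z * se(x)      # half-width of the interval
  return(c(lower=tmp.mean - tmp.half, upper=tmp.mean + tmp.half))
}

store.visits <- c(45, 78, 32, 56, 34)
ci.default <- ci.mean(store.visits)          # uses the default z=1.96
ci.default                                   # about 32.48 and 65.52
ci.wide <- ci.mean(store.visits, z=2.58)     # override the default
```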

Other ways to make sequences

The seq() function constructs sequences in various ways:

seq(from=-5, to=28, by=4)

[1] -5 -1  3  7 11 15 19 23 27

seq(from=-5, to=28, length.out=6)

[1] -5.0  1.6  8.2 14.8 21.4 28.0

The rep() (repeat) function is also useful. It is especially good for constructing indices into data sets with repeating structure:

rep(c(1,2,3), each=3)

[1] 1 1 1 2 2 2 3 3 3

rep(seq(from=-3, to=13, by=4), c(1, 2, 3, 2, 1))

[1] -3  1  1  5  5  5  9  9 13
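Two related helpers, seq_len() and seq_along(), are handy when building indices, because they behave sensibly for empty inputs where ":" does not. A brief sketch with illustrative values:

```r
idx.len   <- seq_len(3)                            # [1] 1 2 3
idx.along <- seq_along(c("foo", "bar", "boo"))     # [1] 1 2 3
colon.zero <- 1:0            # [1] 1 0  (counts down, usually not what is wanted)
len.zero   <- seq_len(0)     # integer(0), an empty sequence
rep.times  <- rep(c("A", "B"), times=3)
rep.times                    # [1] "A" "B" "A" "B" "A" "B"
```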

Infinite and Impossible numbers

1/0

[1] Inf

log(c(-1,0,1))

[1]  NaN -Inf    0

sqrt(-2)

[1] NaN

sqrt(2i)

[1] 1+1i

You can use these values yourself (occasionally it makes sense):

10 < Inf

[1] TRUE
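To detect these values in data, R provides is.finite(), is.nan(), and is.na(). A small sketch on an illustrative vector (not from the slides); note that is.na() is TRUE for NaN as well as NA:

```r
v <- c(1, Inf, NaN, NA)     # illustrative vector
fin <- is.finite(v)         # [1]  TRUE FALSE FALSE FALSE
nan <- is.nan(v)            # [1] FALSE FALSE  TRUE FALSE
mis <- is.na(v)             # [1] FALSE FALSE  TRUE  TRUE
v[is.finite(v)]             # [1] 1
```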

Loading and saving raw data formats

save(store.df, file="store-df-backup.RData")

rm(store.df)        
mean(store.df$store.rev)     # error

Error in mean(store.df$store.rev): object 'store.df' not found

load("store-df-backup.RData")
mean(store.df$store.rev)     # works now

[1] 490.8

Loading data silently overwrites existing objects

store.df <- 5
store.df

[1] 5

load("store-df-backup.RData")
store.df

  store.num store.rev store.visits store.manager
1         3       543           45         Annie
2        14       654           78          Bert
3        21       345           32         Carla
4        32       678           56          Dave
5        54       234           34          Ella
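One way to avoid an accidental overwrite is to load the file into a separate environment and inspect it first. This is a sketch of one option, not the slides' own method; it uses a temporary file and a cut-down stand-in for store.df, and the names tmp.env and loaded.names are hypothetical:

```r
store.df <- data.frame(store.rev=c(543, 654, 345, 678, 234))  # stand-in data
tmp.file <- tempfile(fileext=".RData")
save(store.df, file=tmp.file)

store.df <- 5                    # the name now holds something else
tmp.env <- new.env()             # a separate place to load into
loaded.names <- load(tmp.file, envir=tmp.env)   # load() returns the object names
loaded.names                     # [1] "store.df"
store.df                         # [1] 5  (not overwritten)
mean(tmp.env$store.df$store.rev) # [1] 490.8
```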

Saving images (but generally, don't!)

Save to ".RData":

save.image()

Save to an arbitrary filename:

save.image("mywork.RData")

Load an image:

load("mywork.RData")

That's all for Chapter 2!

Break time

Notes

Exercises here use the Salaries data set from the car package, John Fox and Sanford Weisberg (2011). An R Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion

All code in the presentation is licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.