Chris Chapman and Elea McDonnell Feit

January 2016

**Chapter 2: Basics of the R Language**

Website for all data files:

http://r-marketing.r-forge.r-project.org/data.html

\( \rightarrow \) We'll cover some of the basic object types in R

\( \rightarrow \) Objects in R include variables, data sets, and functions

The assignment operator <- assigns a value to a named object.

```
x <- c(2, 4, 6, 8)
x
```

```
[1] 2 4 6 8
```

Object names are case sensitive. Typing 'X' instead of 'x' produces an error:

```
X
```

```
Error in eval(expr, envir, enclos): object 'X' not found
```

We've just seen how to create a vector: the c() function concatenates individual items into a vector.

```
xNum <- c(1, 3.14159, 5, 7)
xNum
```

```
[1] 1.00000 3.14159 5.00000 7.00000
```

```
xLog <- c(TRUE, FALSE, TRUE, TRUE)
xLog
```

```
[1] TRUE FALSE TRUE TRUE
```

```
xChar <- c("foo", "bar", "boo", "far")
xChar
```

```
[1] "foo" "bar" "boo" "far"
```

A vector can hold only a single type of value (number, text, etc.). Mixed values are coerced to the most general type.

```
xMix <- c(1, TRUE, 3, "Hello, world!")
xMix
```

```
[1] "1" "TRUE" "3" "Hello, world!"
```
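As a small sketch of that coercion order (logical, then numeric, then character), `class()` reports the type R settled on. The object names `logNum` and `numChar` are illustrative and not part of the original slides.

```r
# Illustrative names (not from the slides): each vector mixes two types
logNum  <- c(TRUE, 2)      # logical + numeric   -> numeric
numChar <- c(2, "two")     # numeric + character -> character

class(logNum)    # "numeric"
class(numChar)   # "character"
logNum           # TRUE has been converted to 1
```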

c() can also combine whole vectors, just as it combines single items:

```
x2 <- c(x, x)
x2
```

```
[1] 2 4 6 8 2 4 6 8
```

Type coercion will be applied as needed:

```
c(x2, 100)
```

```
[1] 2 4 6 8 2 4 6 8 100
```

```
c(x2, "Hello")
```

```
[1] "2" "4" "6" "8" "2" "4" "6" "8" "Hello"
```

```
xMix
```

```
[1] "1" "TRUE" "3" "Hello, world!"
```

```
xMix[1] # we'll see more on indices later
```

```
[1] "1"
```

```
as.numeric(xMix[1]) # forces it to "numeric"
```

```
[1] 1
```

```
as.numeric(xMix[1]) + 1.5
```

```
[1] 2.5
```
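Conversion only works when the text looks like a number. The sketch below, which uses a copy of the values under the hypothetical name `xMixDemo`, shows that other elements become NA (R also issues a warning, suppressed here to keep the output quiet).

```r
# xMixDemo is an illustrative copy of the xMix values used above
xMixDemo <- c(1, TRUE, 3, "Hello, world!")

# "TRUE" and "Hello, world!" are text that cannot be read as numbers,
# so as.numeric() returns NA for them (normally with a warning)
converted <- suppressWarnings(as.numeric(xMixDemo))
converted         # 1 NA 3 NA
is.na(converted)  # FALSE TRUE FALSE TRUE
```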

There are many ways to get help for R:

Command/Source | Note |
---|---|
R: ?someword | Get help on someword, if R knows it |
R: ??someword | Search all R help files for the word in text |
R: ?"some string" or ??"some string" | Search for a string, character, etc. that doesn't work as a word |
R: vignette() | List all the vignettes available |
R: vignette("zoo") | Open the vignette named (for package) “zoo” |
Web: CRAN task view | Suggested packages by area such as Econometrics, Clustering, etc. https://cran.r-project.org/web/views/ |
Web: R help list | Monitored by volunteers with many R experts and authors |
Web: Google | Understands “R” in many contexts |
Web: Stack Overflow | Often great contributions, http://stackoverflow.com/questions/tagged/r |

The summary() function summarizes an object in a way that is (usually) appropriate for its data type:

```
summary(xNum)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.606   4.071   4.035   5.500   7.000
```

```
summary(xChar)
```

```
   Length     Class      Mode
        4 character character
```

```
x2
```

```
[1] 2 4 6 8 2 4 6 8
```

```
x2 + 1
```

```
[1] 3 5 7 9 3 5 7 9
```

```
x2 * pi
```

```
[1] 6.283185 12.566371 18.849556 25.132741 6.283185 12.566371 18.849556
[8] 25.132741
```

R applies operations element by element across vectors (or matrices), reusing the values of the shorter one as needed:

```
x
```

```
[1] 2 4 6 8
```

```
x2 # longer than x
```

```
[1] 2 4 6 8 2 4 6 8
```

```
(x+cos(0.5)) * x2 # x is recycled to match x2
```

```
[1] 5.755165 19.510330 41.265495 71.020660 5.755165 19.510330 41.265495
[8] 71.020660
```
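Recycling is silent only when the longer length is a multiple of the shorter one. The following sketch uses hypothetical vectors (`short`, `long3`, `long4`) to show both cases; the warning in the second case is suppressed only so that the example runs quietly.

```r
# Hypothetical example vectors, not from the slides
short <- c(10, 20)
long3 <- c(1, 2, 3)
long4 <- c(1, 2, 3, 4)

long4 + short    # 11 22 13 24: length 4 is a multiple of 2, no warning

# Length 3 is not a multiple of 2: R still recycles, but warns that the
# longer object length is not a multiple of the shorter object length
res <- suppressWarnings(long3 + short)
res              # 11 22 13
```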

So vectors can be recycled. How do you find the length?

```
length(x)
```

```
[1] 4
```

```
length(x2)
```

```
[1] 8
```

A more general solution is to investigate the structure:

```
str(x2)
```

```
num [1:8] 2 4 6 8 2 4 6 8
```

```
str(xChar)
```

```
chr [1:4] "foo" "bar" "boo" "far"
```

Basic 1-by-1 integer sequences are constructed with the “:” operator:

```
xSeq <- 1:10
xSeq
```

```
[1] 1 2 3 4 5 6 7 8 9 10
```

Be careful with operator precedence: “:” is evaluated before “*”, so 1:5*2 means (1:5)*2. Clarify liberally with parentheses:

```
1:5*2
```

```
[1] 2 4 6 8 10
```

```
1:(5*2)
```

```
[1] 1 2 3 4 5 6 7 8 9 10
```
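The same precedence rule applies to addition and subtraction, which is a common source of off-by-one mistakes. A minimal sketch, with `n` as an illustrative value:

```r
n <- 5         # illustrative value

1:n - 1        # parsed as (1:n) - 1, giving 0 1 2 3 4
1:(n - 1)      # parentheses give the intended 1 2 3 4
```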

Basic indexing uses a set of integers to select positions:

```
xNum
```

```
[1] 1.00000 3.14159 5.00000 7.00000
```

```
xNum[2:4]
```

```
[1] 3.14159 5.00000 7.00000
```

```
xNum[c(1,3)]
```

```
[1] 1 5
```

Variables and math operators can be used as well:

```
myStart <- 2
xNum[myStart:sqrt(myStart+7)]
```

```
[1] 3.14159 5.00000
```

A negative index omits elements (returns everything else):

```
xSeq
```

```
[1] 1 2 3 4 5 6 7 8 9 10
```

```
xSeq[-5:-7]
```

```
[1] 1 2 3 4 8 9 10
```
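Precedence matters for negative indices too. In this sketch (`xSeqDemo` is an illustrative copy of the sequence above), `-5:-7` and `-(5:7)` select the same elements, while `-5:7` builds a sequence that runs from -5 up to 7 and cannot be used as an index.

```r
xSeqDemo <- 1:10     # illustrative copy of the xSeq values

xSeqDemo[-5:-7]      # drops positions 5, 6, 7
xSeqDemo[-(5:7)]     # same result, arguably easier to read

-5:7                 # -5 -4 ... 6 7: negative, zero, and positive values
# xSeqDemo[-5:7] stops with an error, because negative and positive
# subscripts cannot be mixed in a single index
```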

Boolean values of TRUE and FALSE can be used to select items:

```
xNum
```

```
[1] 1.00000 3.14159 5.00000 7.00000
```

```
xNum[c(FALSE, TRUE, TRUE, TRUE)]
```

```
[1] 3.14159 5.00000 7.00000
```

This is most often used with comparison operators to select items:

```
xNum > 3
```

```
[1] FALSE TRUE TRUE TRUE
```

```
xNum[xNum > 3]
```

```
[1] 3.14159 5.00000 7.00000
```
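Conditions can be combined with `&` (and) and `|` (or), and `which()` returns the matching positions instead of the values. A short sketch, using `xNumDemo` as an illustrative copy of the values above:

```r
xNumDemo <- c(1, 3.14159, 5, 7)          # illustrative copy of xNum

xNumDemo[xNumDemo > 3 & xNumDemo < 7]    # both conditions hold: 3.14159 5
xNumDemo[xNumDemo < 2 | xNumDemo > 6]    # either condition holds: 1 7
which(xNumDemo > 3)                      # positions of the matches: 2 3 4
```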

```
my.test.scores <- c(91, 93, NA, NA)
mean(my.test.scores)
```

```
[1] NA
```

```
max(my.test.scores)
```

```
[1] NA
```

```
mean(my.test.scores, na.rm=TRUE)
```

```
[1] 92
```

```
max(my.test.scores, na.rm=TRUE)
```

```
[1] 93
```

```
na.omit(my.test.scores)
```

```
[1] 91 93
attr(,"na.action")
[1] 3 4
attr(,"class")
[1] "omit"
```

```
mean(na.omit(my.test.scores))
```

```
[1] 92
```

```
is.na(my.test.scores)
```

```
[1] FALSE FALSE TRUE TRUE
```

```
my.test.scores[!is.na(my.test.scores)]
```

```
[1] 91 93
```
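One reason `is.na()` is needed is that comparing against NA with `==` never returns TRUE. The sketch below uses `scores` as an illustrative copy of the test scores above.

```r
scores <- c(91, 93, NA, NA)   # illustrative copy of my.test.scores

NA == NA               # NA: the result of comparing unknowns is unknown
scores == NA           # NA NA NA NA, so == cannot locate missing values
sum(is.na(scores))     # 2: number of missing values
which(is.na(scores))   # 3 4: their positions
```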

- Skipping lists for today's tutorial
- They are important but not directly used as often as other data formats
- See the book for detail, section 2.4.7

\( \rightarrow \) Data frames are the most common way to handle data sets in R.

```
x.df <- data.frame(xNum, xLog, xChar)
x.df
```

```
xNum xLog xChar
1 1.00000 TRUE foo
2 3.14159 FALSE bar
3 5.00000 TRUE boo
4 7.00000 TRUE far
```

Data frames have *names* and are indexed *row, column*.

```
x.df[2,1]
```

```
[1] 3.14159
```

```
x.df[1,3]
```

```
[1] foo
Levels: bar boo far foo
```

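In R versions before 4.0.0, data.frame() converts text data to factors by default, as in the output shown here; since R 4.0.0 the default is stringsAsFactors=FALSE. You'll often want to turn the conversion off explicitly: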

```
x.df[1,3]
```

```
[1] foo
Levels: bar boo far foo
```

```
x.df <- data.frame(xNum, xLog, xChar, stringsAsFactors=FALSE)
x.df[1,3]
```

```
[1] "foo"
```

```
x.df[2, ] # all of row 2
```

```
xNum xLog xChar
2 3.14159 FALSE bar
```

```
x.df[ ,3] # all of column 3
```

```
[1] "foo" "bar" "boo" "far"
```

```
x.df[2:3, ]
```

```
xNum xLog xChar
2 3.14159 FALSE bar
3 5.00000 TRUE boo
```

```
x.df[ ,1:2]
```

```
xNum xLog
1 1.00000 TRUE
2 3.14159 FALSE
3 5.00000 TRUE
4 7.00000 TRUE
```

```
x.df[-3, ] # omit the third observation
```

```
xNum xLog xChar
1 1.00000 TRUE foo
2 3.14159 FALSE bar
4 7.00000 TRUE far
```

```
x.df[, -2] # omit the second column
```

```
xNum xChar
1 1.00000 foo
2 3.14159 bar
3 5.00000 boo
4 7.00000 far
```
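Rows and columns can also be selected by logical condition and by column name, which tends to be more robust than numeric positions when columns are reordered. This is a sketch on a rebuilt copy of the data frame (named `demo.df` here for illustration).

```r
# demo.df is an illustrative copy of x.df, rebuilt so the block runs alone
demo.df <- data.frame(xNum  = c(1, 3.14159, 5, 7),
                      xLog  = c(TRUE, FALSE, TRUE, TRUE),
                      xChar = c("foo", "bar", "boo", "far"),
                      stringsAsFactors = FALSE)

demo.df[demo.df$xNum > 3, c("xNum", "xChar")]   # rows 2 to 4, two named columns
demo.df[demo.df$xLog, "xChar"]                  # "foo" "boo" "far"
demo.df[ , "xNum", drop = FALSE]                # stays a one-column data frame
```

Without `drop = FALSE`, selecting a single column returns a plain vector, as in the `x.df[ ,3]` example above.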

Warning: the next command deletes every object in the workspace.

```
rm(list=ls()) # caution, deletes all objects
```

```
store.num <- factor(c(3, 14, 21, 32, 54)) # store id
store.rev <- c(543, 654, 345, 678, 234) # store revenue, $K
store.visits <- c(45, 78, 32, 56, 34) # visits, 1000s
store.manager <- c("Annie", "Bert", "Carla", "Dave", "Ella")
(store.df <- data.frame(store.num, store.rev, store.visits,
                        store.manager, stringsAsFactors=FALSE))
```

```
store.num store.rev store.visits store.manager
1 3 543 45 Annie
2 14 654 78 Bert
3 21 345 32 Carla
4 32 678 56 Dave
5 54 234 34 Ella
```

```
summary(store.df) # always recommended!
```

```
 store.num   store.rev      store.visits store.manager
 3 :1      Min.   :234.0   Min.   :32    Length:5
 14:1      1st Qu.:345.0   1st Qu.:34    Class :character
 21:1      Median :543.0   Median :45    Mode  :character
 32:1      Mean   :490.8   Mean   :49
 54:1      3rd Qu.:654.0   3rd Qu.:56
           Max.   :678.0   Max.   :78
```

```
store.df$store.manager
```

```
[1] "Annie" "Bert" "Carla" "Dave" "Ella"
```

```
mean(store.df$store.rev)
```

```
[1] 490.8
```

```
write.csv(store.df, row.names=FALSE)
```

```
"store.num","store.rev","store.visits","store.manager"
"3",543,45,"Annie"
"14",654,78,"Bert"
"21",345,32,"Carla"
"32",678,56,"Dave"
"54",234,34,"Ella"
```

```
write.csv(store.df, file="store-df.csv", row.names=FALSE)
read.csv("store-df.csv") # "file=" is optional
```

```
store.num store.rev store.visits store.manager
1 3 543 45 Annie
2 14 654 78 Bert
3 21 345 32 Carla
4 32 678 56 Dave
5 54 234 34 Ella
```
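A CSV file stores only text, so column types are guessed again on reading. The sketch below writes a small stand-in for the store data (the name `demo.df` and the temporary file path are assumptions made for illustration) and shows that a factor of store numbers comes back as integers unless it is converted again.

```r
# Small stand-in for store.df so the sketch is self-contained
demo.df <- data.frame(store.num = factor(c(3, 14, 21)),
                      store.rev = c(543, 654, 345),
                      stringsAsFactors = FALSE)

tmp.file <- tempfile(fileext = ".csv")   # temporary path (an assumption)
write.csv(demo.df, file = tmp.file, row.names = FALSE)

back.df <- read.csv(tmp.file)
class(demo.df$store.num)   # "factor"
class(back.df$store.num)   # "integer": the factor type was not preserved

back.df$store.num <- factor(back.df$store.num)   # restore it explicitly
```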

Access the `Salaries` data set:

```
library(car) # install.packages("car") if needed
data(Salaries)
```

- How many variables and observations are there in the data set?
- How many professors have more than 40 years of service?

(\( \rightarrow \) hint: you can `sum()` a logical vector)

- How many have salary > $150000?
- What is the mean salary for professors with >20 years service?
- How do you find out more about the data set?

- How many variables and observations are there in the data?
- How many professors have more than 40 years of service?
- Which observations have < 1 year of service?
- What is the mean salary for professors with >20 years service?
- How do you find out more about the data set?

```
dim(Salaries) # or even better: str(Salaries)
```

```
[1] 397 6
```

```
sum(Salaries$yrs.service > 40)
```

```
[1] 21
```

```
Salaries[Salaries$yrs.service > 20, ] # rows with >20 years of service; output not shown
```

```
mean(Salaries[Salaries$yrs.service > 20, "salary"])
```

```
[1] 122103.9
```

```
?Salaries
```

- Basic Functions
- Sequences, again
- Interesting numbers
- Load and save raw data

```
se <- function(x) { sd(x) / sqrt(length(x)) }
se(store.df$store.visits)
```

```
[1] 8.42615
```

```
mean(store.df$store.visits) + 1.96 * se(store.df$store.visits)
```

```
[1] 65.51525
```

A function has:

- an assigned name (created with '<-')
- zero or more arguments that it operates on (in () )
- a body (usually in { }) with lines of code
- a return value (the last computed value, by default)

```
se <- function(x) {
  # computes standard error of the mean
  tmp.sd <- sd(x)                   # standard deviation
  tmp.N  <- length(x)               # sample size
  tmp.se <- tmp.sd / sqrt(tmp.N)    # std error of the mean
  return(tmp.se)                    # return() is optional but clear
}
se(store.df$store.visits)
```

```
[1] 8.42615
```

This is much better! You can examine it to see what it does:

```
se
```

```
function(x) {
  # computes standard error of the mean
  tmp.sd <- sd(x)                   # standard deviation
  tmp.N  <- length(x)               # sample size
  tmp.se <- tmp.sd / sqrt(tmp.N)    # std error of the mean
  return(tmp.se)                    # return() is optional but clear
}
```
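Arguments can also be given default values, and the last computed value is returned when return() is omitted. The helper below, `ci.upper`, is a hypothetical example that is not part of the original slides; it reproduces the upper bound computed earlier with 1.96 standard errors.

```r
# Hypothetical helper (not from the slides): upper bound of an approximate
# confidence interval, with a default value for the argument z
ci.upper <- function(x, z = 1.96) {
  mean(x) + z * sd(x) / sqrt(length(x))   # last value is returned
}

visits <- c(45, 78, 32, 56, 34)   # same values as store.visits above
ci.upper(visits)                  # uses the default z = 1.96
ci.upper(visits, z = 2.58)        # overrides the default
```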

The seq() function constructs sequences in various ways:

```
seq(from=-5, to=28, by=4)
```

```
[1] -5 -1 3 7 11 15 19 23 27
```

```
seq(from=-5, to=28, length=6)
```

```
[1] -5.0 1.6 8.2 14.8 21.4 28.0
```

The rep() (repeat) function is also useful. It is especially good for constructing indices into data sets with repeating structure:

```
rep(c(1,2,3), each=3)
```

```
[1] 1 1 1 2 2 2 3 3 3
```

```
rep(seq(from=-3, to=13, by=4), c(1, 2, 3, 2, 1))
```

```
[1] -3 1 1 5 5 5 9 9 13
```
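Two related base R helpers, `seq_len()` and `seq_along()`, are often safer than `1:length(x)` when a vector might be empty. A brief sketch, with `empty` as an illustrative object:

```r
empty <- numeric(0)            # a zero-length vector, for illustration

1:length(empty)                # 1 0: counts down, usually not what is wanted
seq_along(empty)               # integer(0): correctly empty
seq_len(3)                     # 1 2 3
rep(c("a", "b"), times = 2)    # "a" "b" "a" "b" (compare with each= above)
```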

```
1/0
```

```
[1] Inf
```

```
log(c(-1,0,1))
```

```
[1] NaN -Inf 0
```

```
sqrt(-2)
```

```
[1] NaN
```

```
sqrt(2i)
```

```
[1] 1+1i
```

You can use these values yourself (occasionally it makes sense):

```
10 < Inf
```

```
[1] TRUE
```
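When such values turn up in real data, base R provides tests for them. The vector `vals` below is an illustrative example, built with `0/0` so that NaN is produced without a warning.

```r
vals <- c(1/0, 0/0, sqrt(2), NA)   # Inf, NaN, an ordinary number, NA

is.finite(vals)         # FALSE FALSE  TRUE FALSE
is.nan(vals)            # FALSE  TRUE FALSE FALSE
is.na(vals)             # FALSE  TRUE FALSE  TRUE (NaN also counts as NA)
vals[is.finite(vals)]   # keeps only sqrt(2)
```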

```
save(store.df, file="store-df-backup.RData")
rm(store.df)
mean(store.df$store.rev) # error
```

```
Error in mean(store.df$store.rev): object 'store.df' not found
```

```
load("store-df-backup.RData")
mean(store.df$store.rev) # works now
```

```
[1] 490.8
```

```
store.df <- 5
store.df
```

```
[1] 5
```

```
load("store-df-backup.RData")
store.df
```

```
store.num store.rev store.visits store.manager
1 3 543 45 Annie
2 14 654 78 Bert
3 21 345 32 Carla
4 32 678 56 Dave
5 54 234 34 Ella
```
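Because load() silently overwrites objects of the same name, it can help to load into a separate environment first. The sketch below assumes a throwaway object `demo.df` and a temporary file; it also shows that load() returns the names of the objects it restored.

```r
demo.df  <- data.frame(id = 1:3)            # hypothetical object to save
tmp.file <- tempfile(fileext = ".RData")    # temporary path (an assumption)
save(demo.df, file = tmp.file)

demo.df <- 5                                # current value in the workspace
backup.env <- new.env()
loaded <- load(tmp.file, envir = backup.env)   # workspace is left untouched

loaded                # "demo.df": names of the objects that were loaded
demo.df               # still 5
backup.env$demo.df    # the saved data frame
```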

Save to “.RData”:

- save.image()

Save to an arbitrary filename:

- save.image("mywork.RData")

Load an image:

- load("mywork.RData")

This presentation is based on Chapter 2 of Chapman and Feit, *R for Marketing Research and Analytics* © 2015 Springer. http://r-marketing.r-forge.r-project.org/

Exercises here use the `Salaries` data set from the `car` package, John Fox and Sanford Weisberg (2011). *An R Companion to Applied Regression*, Second Edition. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion

All code in the presentation is licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.