Chris Chapman and Elea McDonnell Feit
January 2016
Chapter 2: Basics of the R Language
Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html
\( \rightarrow \) We'll cover some of the basic object types in R
\( \rightarrow \) Objects in R include variables, data sets, and functions
The assignment operator <- assigns a value to a named object.
x <- c(2, 4, 6, 8)
x
[1] 2 4 6 8
Object names are case sensitive. Typing 'X' instead of 'x' produces an error:
X
Error in eval(expr, envir, enclos): object 'X' not found
We've just seen how to create a vector: the c() function concatenates individual items into a vector.
xNum <- c(1, 3.14159, 5, 7)
xNum
[1] 1.00000 3.14159 5.00000 7.00000
xLog <- c(TRUE, FALSE, TRUE, TRUE)
xLog
[1] TRUE FALSE TRUE TRUE
xChar <- c("foo", "bar", "boo", "far")
xChar
[1] "foo" "bar" "boo" "far"
A vector can only hold a single type of value (number, text, etc.). When types are mixed, values are coerced to the most general type present (here, character).
xMix <- c(1, TRUE, 3, "Hello, world!")
xMix
[1] "1" "TRUE" "3" "Hello, world!"
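The coercion order can be checked with class(). This is a minimal sketch with illustrative values (the names mixed.num and mixed.chr are not from the slides): logical values are promoted to numeric, and anything combined with text becomes character.

```r
# Logical mixed with numeric: TRUE becomes 1 (FALSE would become 0)
mixed.num <- c(TRUE, 2.5)
class(mixed.num)      # "numeric"
mixed.num             # 1.0 2.5

# Anything mixed with text becomes character
mixed.chr <- c(1, TRUE, "a")
class(mixed.chr)      # "character"
```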
c() can also combine whole vectors, just as it combines single items:
x2 <- c(x, x)
x2
[1] 2 4 6 8 2 4 6 8
Type coercion will be applied as needed:
c(x2, 100)
[1] 2 4 6 8 2 4 6 8 100
c(x2, "Hello")
[1] "2" "4" "6" "8" "2" "4" "6" "8" "Hello"
xMix
[1] "1" "TRUE" "3" "Hello, world!"
xMix[1] # we'll see more on indices later
[1] "1"
as.numeric(xMix[1]) # forces it to "numeric"
[1] 1
as.numeric(xMix[1]) + 1.5
[1] 2.5
There are many ways to get help for R:
Command/Source | Note |
---|---|
R: ?someword | to get help on someword that R knows |
R: ??someword | to search all R help files for the word in text |
R: ?"some string" or ??"some string" | to search for a string, character, etc. that doesn't work as a bare word |
R: vignette() | list all the vignettes available |
R: vignette("zoo") | open the vignette named (for package) "zoo" |
Web: CRAN task view | Suggested packages by area such as Econometrics, Clustering, etc. https://cran.r-project.org/web/views/ |
Web: R help list | Monitored by volunteers with many R experts and authors |
Web: Google | Understands “R” in many contexts |
Web: Stack Overflow | Often great contributions, http://stackoverflow.com/questions/tagged/r |
The summary() function summarizes an object in a way that is (usually) appropriate for its data type:
summary(xNum)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.606 4.071 4.035 5.500 7.000
summary(xChar)
Length Class Mode
4 character character
x2
[1] 2 4 6 8 2 4 6 8
x2 + 1
[1] 3 5 7 9 3 5 7 9
x2 * pi
[1] 6.283185 12.566371 18.849556 25.132741 6.283185 12.566371 18.849556
[8] 25.132741
R will generalize operations across multiple vectors (or matrices) as best it can:
x
[1] 2 4 6 8
x2 # longer than x
[1] 2 4 6 8 2 4 6 8
(x+cos(0.5)) * x2 # x is recycled to match x2
[1] 5.755165 19.510330 41.265495 71.020660 5.755165 19.510330 41.265495
[8] 71.020660
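Recycling is easiest to see with small vectors. The sketch below uses illustrative names (short, long, uneven) that are not part of the slides. When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is suppressed here only to keep the example quiet.

```r
short <- c(10, 20)
long  <- c(1, 2, 3, 4)
long + short          # 11 22 13 24: short is used twice

# Lengths that do not divide evenly still recycle, with a warning
uneven <- suppressWarnings(c(1, 2, 3) + c(10, 20))
uneven                # 11 22 13
```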
So vectors can be recycled. How do you find the length?
length(x)
[1] 4
length(x2)
[1] 8
A more general solution is to investigate the structure:
str(x2)
num [1:8] 2 4 6 8 2 4 6 8
str(xChar)
chr [1:4] "foo" "bar" "boo" "far"
Basic 1-by-1 integer sequences are constructed with the “:” operator:
xSeq <- 1:10
xSeq
[1] 1 2 3 4 5 6 7 8 9 10
Be careful with operator precedence … clarify liberally with parentheses:
1:5*2
[1] 2 4 6 8 10
1:(5*2)
[1] 1 2 3 4 5 6 7 8 9 10
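The same precedence issue appears with subtraction, which is a frequent source of off-by-one errors. A short sketch, where n is an illustrative variable:

```r
n <- 5
seq.a <- 1:n-1      # ":" binds tighter than "-": (1:n) - 1
seq.a               # 0 1 2 3 4
seq.b <- 1:(n-1)    # parentheses force the subtraction first
seq.b               # 1 2 3 4
```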
Basic indexing uses a set of integers to select positions:
xNum
[1] 1.00000 3.14159 5.00000 7.00000
xNum[2:4]
[1] 3.14159 5.00000 7.00000
xNum[c(1,3)]
[1] 1 5
Variables and math operators can be used as well:
myStart <- 2
xNum[myStart:sqrt(myStart+7)]
[1] 3.14159 5.00000
A negative index omits elements (returns everything else):
xSeq
[1] 1 2 3 4 5 6 7 8 9 10
xSeq[-5:-7]
[1] 1 2 3 4 8 9 10
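Negative indices can also be written by negating a whole vector of positions, which some readers find easier to scan. A small sketch reusing xSeq (the names drop.mid and drop.ends are illustrative):

```r
xSeq <- 1:10
drop.mid  <- xSeq[-(5:7)]      # same result as xSeq[-5:-7]
drop.mid                       # 1 2 3 4 8 9 10
drop.ends <- xSeq[-c(1, 10)]   # omit the first and last elements
drop.ends                      # 2 3 4 5 6 7 8 9
```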
Boolean values of TRUE and FALSE can be used to select items:
xNum
[1] 1.00000 3.14159 5.00000 7.00000
xNum[c(FALSE, TRUE, TRUE, TRUE)]
[1] 3.14159 5.00000 7.00000
This is most often used with comparison operators to select items:
xNum > 3
[1] FALSE TRUE TRUE TRUE
xNum[xNum > 3]
[1] 3.14159 5.00000 7.00000
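Comparisons can be combined with & (and) and | (or), and which() reports the matching positions instead of the values. A brief sketch using the same xNum vector (in.range, extremes, and pos are illustrative names):

```r
xNum <- c(1, 3.14159, 5, 7)
in.range <- xNum[xNum > 3 & xNum < 6]   # both conditions must hold
in.range                                # 3.14159 5.00000
extremes <- xNum[xNum < 2 | xNum > 6]   # either condition may hold
extremes                                # 1 7
pos <- which(xNum > 3)                  # positions, not values
pos                                     # 2 3 4
```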
my.test.scores <- c(91, 93, NA, NA)
mean(my.test.scores)
[1] NA
max(my.test.scores)
[1] NA
mean(my.test.scores, na.rm=TRUE)
[1] 92
max(my.test.scores, na.rm=TRUE)
[1] 93
na.omit(my.test.scores)
[1] 91 93
attr(,"na.action")
[1] 3 4
attr(,"class")
[1] "omit"
mean(na.omit(my.test.scores))
[1] 92
is.na(my.test.scores)
[1] FALSE FALSE TRUE TRUE
my.test.scores[!is.na(my.test.scores)]
[1] 91 93
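Two related checks are often useful with missing data: counting the NA values, and remembering that comparing against NA with == does not work. A minimal sketch on the same vector (n.missing and cmp are illustrative names):

```r
my.test.scores <- c(91, 93, NA, NA)
n.missing <- sum(is.na(my.test.scores))   # TRUE counts as 1
n.missing                                 # 2
anyNA(my.test.scores)                     # TRUE

# Comparison with NA yields NA, never TRUE; use is.na() instead
cmp <- my.test.scores == NA
cmp                                       # NA NA NA NA
```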
\( \rightarrow \) Data frames are the most common way to handle data sets in R.
x.df <- data.frame(xNum, xLog, xChar)
x.df
xNum xLog xChar
1 1.00000 TRUE foo
2 3.14159 FALSE bar
3 5.00000 TRUE boo
4 7.00000 TRUE far
Data frames have names and are indexed row, column.
x.df[2,1]
[1] 3.14159
x.df[1,3]
[1] foo
Levels: bar boo far foo
In R versions before 4.0.0, data.frame() converts text data to factors by default. You'll often want to turn that off (since R 4.0.0, stringsAsFactors=FALSE is the default):
x.df[1,3]
[1] foo
Levels: bar boo far foo
x.df <- data.frame(xNum, xLog, xChar, stringsAsFactors=FALSE)
x.df[1,3]
[1] "foo"
x.df[2, ] # all of row 2
xNum xLog xChar
2 3.14159 FALSE bar
x.df[ ,3] # all of column 3
[1] "foo" "bar" "boo" "far"
x.df[2:3, ]
xNum xLog xChar
2 3.14159 FALSE bar
3 5.00000 TRUE boo
x.df[ ,1:2]
xNum xLog
1 1.00000 TRUE
2 3.14159 FALSE
3 5.00000 TRUE
4 7.00000 TRUE
x.df[-3, ] # omit the third observation
xNum xLog xChar
1 1.00000 TRUE foo
2 3.14159 FALSE bar
4 7.00000 TRUE far
x.df[, -2] # omit the second column
xNum xChar
1 1.00000 foo
2 3.14159 bar
3 5.00000 boo
4 7.00000 far
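Rows and columns can also be selected by column name or by a logical condition, which tends to be more robust than numeric positions when columns are reordered. A sketch that rebuilds x.df so it runs on its own (big.rows and true.rows are illustrative names):

```r
xNum  <- c(1, 3.14159, 5, 7)
xLog  <- c(TRUE, FALSE, TRUE, TRUE)
xChar <- c("foo", "bar", "boo", "far")
x.df  <- data.frame(xNum, xLog, xChar, stringsAsFactors = FALSE)

x.df[, "xChar"]                          # same as x.df[, 3]
big.rows  <- x.df[x.df$xNum > 3, ]       # rows 2, 3, and 4
true.rows <- x.df[x.df$xLog, c("xNum", "xChar")]   # rows 1, 3, and 4
true.rows
```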
Warning: the first command below deletes every object in the workspace.
rm(list=ls()) # caution, deletes all objects
store.num <- factor(c(3, 14, 21, 32, 54)) # store id
store.rev <- c(543, 654, 345, 678, 234) # store revenue, $K
store.visits <- c(45, 78, 32, 56, 34) # visits, 1000s
store.manager <- c("Annie", "Bert", "Carla", "Dave", "Ella")
(store.df <- data.frame(store.num, store.rev, store.visits,
store.manager, stringsAsFactors=FALSE))
store.num store.rev store.visits store.manager
1 3 543 45 Annie
2 14 654 78 Bert
3 21 345 32 Carla
4 32 678 56 Dave
5 54 234 34 Ella
summary(store.df) # always recommended!
store.num store.rev store.visits store.manager
3 :1 Min. :234.0 Min. :32 Length:5
14:1 1st Qu.:345.0 1st Qu.:34 Class :character
21:1 Median :543.0 Median :45 Mode :character
32:1 Mean :490.8 Mean :49
54:1 3rd Qu.:654.0 3rd Qu.:56
Max. :678.0 Max. :78
store.df$store.manager
[1] "Annie" "Bert" "Carla" "Dave" "Ella"
mean(store.df$store.rev)
[1] 490.8
write.csv(store.df, row.names=FALSE)
"store.num","store.rev","store.visits","store.manager"
"3",543,45,"Annie"
"14",654,78,"Bert"
"21",345,32,"Carla"
"32",678,56,"Dave"
"54",234,34,"Ella"
write.csv(store.df, file="store-df.csv", row.names=FALSE)
read.csv("store-df.csv") # "file=" is optional
store.num store.rev store.visits store.manager
1 3 543 45 Annie
2 14 654 78 Bert
3 21 345 32 Carla
4 32 678 56 Dave
5 54 234 34 Ella
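A CSV file stores only values, not R types, so a round trip can change a column's class. The sketch below rebuilds the store data and writes to a temporary file (tmp.file, store.back, and class.after.read are names assumed for this example). It suggests that store.num comes back as a number and has to be converted to a factor again.

```r
store.df <- data.frame(
  store.num     = factor(c(3, 14, 21, 32, 54)),
  store.rev     = c(543, 654, 345, 678, 234),
  store.visits  = c(45, 78, 32, 56, 34),
  store.manager = c("Annie", "Bert", "Carla", "Dave", "Ella"),
  stringsAsFactors = FALSE
)

tmp.file <- tempfile(fileext = ".csv")    # temporary path for this sketch
write.csv(store.df, file = tmp.file, row.names = FALSE)
store.back <- read.csv(tmp.file, stringsAsFactors = FALSE)

class(store.df$store.num)                 # "factor"
class.after.read <- class(store.back$store.num)
class.after.read                          # "integer": factor status is not stored

# Restore the factor after reading
store.back$store.num <- factor(store.back$store.num)
```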
Access the Salaries data set:
library(car) # install.packages("car") if needed
data(Salaries)
dim(Salaries) # or even better: str(Salaries)
[1] 397 6
Applying sum() to a logical vector counts its TRUE values:
sum(Salaries$yrs.service > 40)
[1] 21
Salaries[Salaries$yrs.service > 20, ] # output not shown
mean(Salaries[Salaries$yrs.service > 20, "salary"])
[1] 122103.9
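The same count-then-subset pattern works on any data frame. As a sketch that runs without the car package, here it is applied to the store data from earlier in the deck (busy, n.busy, and busy.rev are illustrative names):

```r
store.df <- data.frame(
  store.rev    = c(543, 654, 345, 678, 234),   # store revenue, $K
  store.visits = c(45, 78, 32, 56, 34)         # visits, 1000s
)

busy     <- store.df$store.visits > 40         # logical vector, one per row
n.busy   <- sum(busy)                          # how many stores qualify
n.busy                                         # 3
busy.rev <- mean(store.df[busy, "store.rev"])  # mean revenue of those stores
busy.rev                                       # 625
```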
?Salaries
se <- function(x) { sd(x) / sqrt(length(x)) }
se(store.df$store.visits)
[1] 8.42615
mean(store.df$store.visits) + 1.96 * se(store.df$store.visits)
[1] 65.51525
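The upper bound above is one half of an approximate 95% interval, mean ± 1.96 × standard error. A small sketch wraps both bounds in a helper; ci95 is a hypothetical name, not a function from the slides or from base R.

```r
se <- function(x) { sd(x) / sqrt(length(x)) }

# Hypothetical helper: approximate 95% interval around the mean
ci95 <- function(x) {
  mean(x) + c(-1, 1) * 1.96 * se(x)   # lower bound, then upper bound
}

store.visits <- c(45, 78, 32, 56, 34)
visits.ci <- ci95(store.visits)
visits.ci                             # 32.48475 65.51525
```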
A function has a name (here se), arguments (x), a body enclosed in braces, and a return value:
se <- function(x) {
# computes standard error of the mean
tmp.sd <- sd(x) # standard deviation
tmp.N <- length(x) # sample size
tmp.se <- tmp.sd / sqrt(tmp.N) # std error of the mean
return(tmp.se) # return() is optional but clear
}
se(store.df$store.visits)
[1] 8.42615
This is much better! You can examine it to see what it does:
se
function(x) {
# computes standard error of the mean
tmp.sd <- sd(x) # standard deviation
tmp.N <- length(x) # sample size
tmp.se <- tmp.sd / sqrt(tmp.N) # std error of the mean
return(tmp.se) # return() is optional but clear
}
The seq() function constructs sequences in various ways:
seq(from=-5, to=28, by=4)
[1] -5 -1 3 7 11 15 19 23 27
seq(from=-5, to=28, length.out=6)
[1] -5.0 1.6 8.2 14.8 21.4 28.0
The rep() (repeat) function is also useful. It is especially good for constructing indices into data sets with repeating structure:
rep(c(1,2,3), each=3)
[1] 1 1 1 2 2 2 3 3 3
rep(seq(from=-3, to=13, by=4), c(1, 2, 3, 2, 1))
[1] -3 1 1 5 5 5 9 9 13
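As an illustration of indices for repeating structure, suppose 3 respondents each complete 2 tasks (a made-up layout; resp.id and task.id are illustrative names). The each and times arguments generate the two index columns:

```r
resp.id <- rep(1:3, each = 2)    # 1 1 2 2 3 3: each respondent repeated
task.id <- rep(1:2, times = 3)   # 1 2 1 2 1 2: the task cycle repeated
design  <- data.frame(resp.id, task.id)
design                           # 6 rows, one per respondent-task pair
```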
1/0
[1] Inf
log(c(-1,0,1))
[1] NaN -Inf 0
sqrt(-2)
[1] NaN
sqrt(2i)
[1] 1+1i
You can use these values yourself (occasionally it makes sense):
10 < Inf
[1] TRUE
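When cleaning data it helps to test for these special values explicitly. A short sketch with an illustrative vector (vals):

```r
vals <- c(1/0, 0/0, NA, 42)   # Inf NaN NA 42
is.finite(vals)               # FALSE FALSE FALSE  TRUE
is.nan(vals)                  # FALSE  TRUE FALSE FALSE
is.na(vals)                   # FALSE  TRUE  TRUE FALSE (NaN counts as NA)
clean <- vals[is.finite(vals)]
clean                         # 42
```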
save(store.df, file="store-df-backup.RData")
rm(store.df)
mean(store.df$store.rev) # error
Error in mean(store.df$store.rev): object 'store.df' not found
load("store-df-backup.RData")
mean(store.df$store.rev) # works now
[1] 490.8
store.df <- 5
store.df
[1] 5
load("store-df-backup.RData")
store.df
store.num store.rev store.visits store.manager
1 3 543 45 Annie
2 14 654 78 Bert
3 21 345 32 Carla
4 32 678 56 Dave
5 54 234 34 Ella
Save to ".RData":
Save to arbitrary filename
Load an image
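The commands for these three steps are not shown here. One plausible sketch, assuming they refer to base R's save.image() and load(), is below. The file name img.file and the object x.keep are made up for the example, and the image is loaded into a separate environment so nothing in the current session is overwritten.

```r
x.keep <- c(2, 4, 6, 8)                    # something to save

# save.image() with no arguments writes the workspace to ".RData"
# in the working directory; here an explicit file name is given instead
img.file <- tempfile(fileext = ".RData")   # hypothetical file name
save.image(file = img.file)

# Load the image; envir is optional and defaults to the current environment
restored <- new.env()
load(img.file, envir = restored)
ls(restored)                               # names of the restored objects
```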
This presentation is based on Chapter 2 of Chapman and Feit, R for Marketing Research and Analytics © 2015 Springer. http://r-marketing.r-forge.r-project.org/
Exercises here use the Salaries data set from the car package, John Fox and Sanford Weisberg (2011). An R Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion
All code in the presentation is licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.