R for Marketing Research and Analytics
========================================================
Author: Chris Chapman and Elea McDonnell Feit
Date: January 2016
css: ../chapman-feit-slides.css
width: 1300
height: 768
**Chapter 2: Basics of the R Language**
Website for all data files:
[http://r-marketing.r-forge.r-project.org/data.html](http://r-marketing.r-forge.r-project.org/data.html)
Objects
========
type: section
$\rightarrow$ We'll cover some of the basic object types in R
$\rightarrow$ Objects in R include variables, data sets, and functions
Basic objects
========
The assignment operator <- assigns a value to a named object.
```{r}
x <- c(2, 4, 6, 8)
x
```
Object names are case sensitive. Instead of 'x', 'X' produces an error:
```{r, error=TRUE}
X
```
Vectors
========
We've just seen how to create a vector: the c() function concatenates individual items into a vector.
```{r}
xNum <- c(1, 3.14159, 5, 7)
xNum
xLog <- c(TRUE, FALSE, TRUE, TRUE)
xLog
xChar <- c("foo", "bar", "boo", "far")
xChar
```
Vectors: Type Coercion
========
A vector can only hold a single type of value (number, text, etc).
Values are coerced to the most general type.
```{r}
xMix <- c(1, TRUE, 3, "Hello, world!")
xMix
```
More about vectors
======
c() can be used to add vectors just as it adds single items:
```{r, collapse=TRUE}
x2 <- c(x, x)
x2
```
Type coercion will be applied as needed:
```{r}
c(x2, 100)
c(x2, "Hello")
```
Forcing coercion
======
```{r}
xMix
xMix[1] # we'll see more on indices later
as.numeric(xMix[1]) # forces it to "numeric"
as.numeric(xMix[1]) + 1.5
```
Help!
========
There are many ways to get help for R:
Command/Source | Note
---------------------|--------------------------------------------
**R**: ?*someword* | to get help on *someword* that R knows
**R**: ??*someword* | to search all R help files for the word in text
**R**: ? or ??"some string" | search for a string, character, etc. that doesn't work as a word
**R**: vignette() | list all the vignettes available
**R**: vignette("zoo") | open the vignette named (for package) "zoo"
**Web**: CRAN task view | Suggested packages by area such as Econometrics, Clustering, etc. https://cran.r-project.org/web/views/
**Web**: R help list | Monitored by volunteers with many R experts and authors
**Web**: Google | Understands "R" in many contexts
**Web**: Stack Overflow | Often great contributions, http://stackoverflow.com/questions/tagged/r
Summary
======
The summary() function summarizes an object in a way that is (usually) appropriate for its data type:
```{r}
summary(xNum)
summary(xChar)
```
Math on Vectors
======
```{r}
x2
x2 + 1
x2 * pi
```
More complex math
======
R will generalize operations across multiple vectors (or matrices) as best it can:
```{r}
x
x2 # longer than x
(x+cos(0.5)) * x2 # x is recycled to match x2
```
Length and structure
======
So vectors can be recycled. How do you find the length?
```{r}
length(x)
length(x2)
```
A more general solution is to investigate the structure:
```{r}
str(x2)
str(xChar)
```
Sequences and Indexing
========
type: section
Sequences
========
Basic 1-by-1 integer sequences are constructed with the ":" operator:
```{r}
xSeq <- 1:10
xSeq
```
Be careful with operator precedence ... clarify liberally with parentheses:
```{r}
1:5*2
1:(5*2)
```
Indexing a vector 1
========
Basic indexing uses a set of integers to select positions:
```{r}
xNum
xNum[2:4]
xNum[c(1,3)]
```
Variables and math operators can be used as well:
```{r}
myStart <- 2
xNum[myStart:sqrt(myStart+7)]
```
Negative indexing
========
A negative index omits elements (returns everything else):
```{r}
xSeq
xSeq[-5:-7]
```
Indexing with Boolean
========
Boolean values of TRUE and FALSE can be used to select items:
```{r}
xNum
xNum[c(FALSE, TRUE, TRUE, TRUE)]
```
This is most often used with comparative operators to select items:
```{r}
xNum > 3
xNum[xNum > 3]
```
Missing and interesting values
========
```{r}
my.test.scores <- c(91, 93, NA, NA)
mean(my.test.scores)
max(my.test.scores)
mean(my.test.scores, na.rm=TRUE)
max(my.test.scores, na.rm=TRUE)
```
Other ways to omit
========
```{r}
na.omit(my.test.scores)
mean(na.omit(my.test.scores))
is.na(my.test.scores)
my.test.scores[!is.na(my.test.scores)]
```
More Complex Structures
=======
type: section
Lists
=======
* Skipping lists for today's tutorial
* They are important but not directly used as often as other data formats
* See the book for detail, section 2.4.7
Data frames
========
type: section
$\rightarrow$ Data frames are the most common way to handle data sets in R.
Data frames 1
========
```{r}
x.df <- data.frame(xNum, xLog, xChar)
x.df
```
Data frames have *names* and are indexed *row, column*.
```{r}
x.df[2,1]
x.df[1,3]
```
Data frames 2
========
By default, text data is converted to factors. You'll often want to turn that off:
```{r}
x.df[1,3]
x.df <- data.frame(xNum, xLog, xChar, stringsAsFactors=FALSE)
x.df[1,3]
```
Indexing data frames
========
```{r}
x.df[2, ] # all of row 2
x.df[ ,3] # all of column 3
x.df[2:3, ]
x.df[ ,1:2]
```
Negative indexing data frames
========
```{r}
x.df[-3, ] # omit the third observation
x.df[, -2] # omit the second column
```
Let's create more interesting data
========
type: alert
Warning: we're about to delete everything first
```{r}
rm(list=ls()) # caution, deletes all objects
```
Store data
========
```{r}
store.num <- factor(c(3, 14, 21, 32, 54)) # store id
store.rev <- c(543, 654, 345, 678, 234) # store revenue, $K
store.visits <- c(45, 78, 32, 56, 34) # visits, 1000s
store.manager <- c("Annie", "Bert", "Carla", "Dave", "Ella")
(store.df <- data.frame(store.num, store.rev, store.visits,
store.manager, stringsAsFactors=F))
```
Some data checks
========
```{r}
summary(store.df) # always recommended!
store.df$store.manager
mean(store.df$store.rev)
```
Read and write CSVs
========
```{r}
write.csv(store.df, row.names=FALSE)
write.csv(store.df, file="store-df.csv", row.names=FALSE)
read.csv("store-df.csv") # "file=" is optional
```
Exercises
=====
type: section
Exercise!
=======
Access the `Salaries` data set:
```{r}
library(car) # install.packages("car") if needed
data(Salaries)
```
1. How many variables and observations are there in the data set?
2. How many professors have more than 40 years of service?
($\rightarrow$ hint: you can **`sum()`** a logical vector)
3. How many have salary > $150000?
4. What is the mean salary for professors with >20 years service?
5. How do you find out more about the data set?
One Set of Answers
=======
1. How many variables and observations are there in the data?
2. How many professors have more than 40 years of service?
3. Which observations have < 1 year of service?
4. What is the mean salary for professors with >20 years service?
5. How do you find out more about the data set?
```{r}
dim(Salaries) # or even better: str(Salaries)
sum(Salaries$yrs.service > 40)
```
```{r, results="hide"}
Salaries[Salaries$yrs.service > 20, ] # output not shown
```
```{r}
mean(Salaries[Salaries$yrs.service > 20, "salary"])
?Salaries
```
Optional Topics
=====
type: section
- Basic Functions
- Sequences, again
- Interesting numbers
- Load and save raw data
Writing Basic Functions
========
```{r}
se <- function(x) { sd(x) / sqrt(length(x)) }
se(store.df$store.visits)
mean(store.df$store.visits) + 1.96 * se(store.df$store.visits)
```
A function has:
* an assigned name (created with '<-')
* zero or more arguments that it operates on (in () )
* a body (usually in { }) with lines of code
* a return value (the last computed value, by default)
Document your functions inline!
========
```{r}
se <- function(x) {
# computes standard error of the mean
tmp.sd <- sd(x) # standard deviation
tmp.N <- length(x) # sample size
tmp.se <- tmp.sd / sqrt(tmp.N) # std error of the mean
return(tmp.se) # return() is optional but clear
}
se(store.df$store.visits)
```
This is much better! You can examine it to see what it does:
```{r}
se
```
Other ways to make sequences
========
The seq() function constructs sequences in various ways:
```{r}
seq(from=-5, to=28, by=4)
seq(from=-5, to=28, length=6)
```
The rep() (repeat) function is also useful. It is especially good for constructing indices into data sets with repeating structure:
```{r}
rep(c(1,2,3), each=3)
rep(seq(from=-3, to=13, by=4), c(1, 2, 3, 2, 1))
```
Infinite and Impossible numbers
========
```{r}
1/0
log(c(-1,0,1))
sqrt(-2)
sqrt(2i)
```
You can use these values yourself (occasionally it makes sense):
```{r}
10 < Inf
```
Loading and saving raw data formats
========
```{r, error=TRUE}
save(store.df, file="store-df-backup.RData")
rm(store.df)
mean(store.df$store.rev) # error
load("store-df-backup.RData")
mean(store.df$store.rev) # works now
```
Loading data has silent overwrite
========
```{r}
store.df <- 5
store.df
load("store-df-backup.RData")
store.df
```
Saving images (but generally, don't!)
========
Save to ".Rdata":
* save.image()
Save to arbitrary filename
* save.image("mywork.RData")
Load an image
* load("mywork.RData")
That's all for Chapter 2!
=========
type: section
# Break time
Notes
========
This presentation is based on Chapter 6 of Chapman and Feit, *R for Marketing Research and Analytics* © 2015 Springer. http://r-marketing.r-forge.r-project.org/
Exercises here use the `Salaries` data set from the `car` package, John Fox and Sanford Weisberg (2011). *An R Companion to Applied Regression*, Second Edition. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion
All code in the presentation is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.