Chris Chapman and Elea McDonnell Feit
February 2016
Chapter 7: Identifying Drivers of Outcomes: Linear Models
Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html
Data represents customer responses to a survey about their satisfaction with different aspects of their recent visit to an amusement park.
To load the data:
sat.df <- read.csv("http://goo.gl/HKnl74")
summary(sat.df)
weekend num.child distance rides
no :259 Min. :0.000 Min. : 0.5267 Min. : 72.00
yes:241 1st Qu.:0.000 1st Qu.: 10.3181 1st Qu.: 82.00
Median :2.000 Median : 19.0191 Median : 86.00
Mean :1.738 Mean : 31.0475 Mean : 85.85
3rd Qu.:3.000 3rd Qu.: 39.5821 3rd Qu.: 90.00
Max. :5.000 Max. :239.1921 Max. :100.00
games wait clean overall
Min. : 57.00 Min. : 40.0 Min. : 74.0 Min. : 6.00
1st Qu.: 73.00 1st Qu.: 62.0 1st Qu.: 84.0 1st Qu.: 40.00
Median : 78.00 Median : 70.0 Median : 88.0 Median : 50.00
Mean : 78.67 Mean : 69.9 Mean : 87.9 Mean : 51.26
3rd Qu.: 85.00 3rd Qu.: 77.0 3rd Qu.: 91.0 3rd Qu.: 62.00
Max. :100.00 Max. :100.0 Max. :100.0 Max. :100.00
weekend: was the visit on a weekend
num.child: how many children were in the party
distance: how far did the party travel to the park
rides, games, wait, clean, overall: satisfaction ratings
We'll cover how to fit a linear model, i.e. a linear regression, using the lm() function in R. Linear models relate one or more predictors (independent variables) to an outcome (the dependent variable).
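For a single predictor, the model has the form \( y_i = \beta_0 + \beta_1 x_i + \epsilon_i \), where \( \beta_0 \) is the intercept, \( \beta_1 \) is the slope, and the errors \( \epsilon_i \) are assumed to be independent and normally distributed.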
Key steps in linear modeling:
A scatterplot matrix can help you quickly visualize the relationships between pairs of variables in the data. Skewness of predictors or correlations between predictors are potential problems.
library(gpairs)
gpairs(sat.df)
The distance variable is highly right-skewed, so we create a log-transformed version for modeling.
hist(sat.df$distance)
sat.df$logdist <- log(sat.df$distance)  # store the transformed variable
hist(sat.df$logdist)
library(corrplot)
corrplot.mixed(cor(sat.df[ , c(2, 4:9)]), upper="ellipse")  # numeric columns, with logdist in place of raw distance
To fit a model relating the ratings for rides to the overall satisfaction rating, we use lm(). lm() uses R's formula notation overall ~ rides to indicate that you want a model predicting overall as a function of rides.
lm(overall ~ rides, data=sat.df)
Call:
lm(formula = overall ~ rides, data = sat.df)
Coefficients:
(Intercept) rides
-94.962 1.703
So, if a visitor gives a rating of 95 for rides, then our prediction for their overall rating is:
-94.962 + 1.703*95
[1] 66.823
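The same prediction is easier with predict() once the model is stored in an object (as we do next); a minimal sketch:
predict(lm(overall ~ rides, data=sat.df), newdata=data.frame(rides=95))
# about 66.8; any small difference from the hand calculation reflects rounding of the printed coefficients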
Typically, we assign the output of lm() to a variable.
m1 <- lm(overall ~ rides, data=sat.df)
str(m1)
List of 12
$ coefficients : Named num [1:2] -95 1.7
..- attr(*, "names")= chr [1:2] "(Intercept)" "rides"
$ residuals : Named num [1:500] -6.22 11.78 11.18 -17.93 19.89 ...
..- attr(*, "names")= chr [1:500] "1" "2" "3" "4" ...
$ effects : Named num [1:500] -1146.2 -207.9 11.5 -17.9 20.3 ...
..- attr(*, "names")= chr [1:500] "(Intercept)" "rides" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:500] 53.2 53.2 49.8 54.9 48.1 ...
..- attr(*, "names")= chr [1:500] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:500, 1:2] -22.3607 0.0447 0.0447 0.0447 0.0447 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:500] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "rides"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.04 1.01
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 498
$ xlevels : Named list()
$ call : language lm(formula = overall ~ rides, data = sat.df)
$ terms :Classes 'terms', 'formula' language overall ~ rides
.. ..- attr(*, "variables")= language list(overall, rides)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "overall" "rides"
.. .. .. ..$ : chr "rides"
.. ..- attr(*, "term.labels")= chr "rides"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(overall, rides)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "overall" "rides"
$ model :'data.frame': 500 obs. of 2 variables:
..$ overall: int [1:500] 47 65 61 37 68 27 40 30 58 36 ...
..$ rides : int [1:500] 87 87 85 88 84 81 77 82 90 88 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language overall ~ rides
.. .. ..- attr(*, "variables")= language list(overall, rides)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "overall" "rides"
.. .. .. .. ..$ : chr "rides"
.. .. ..- attr(*, "term.labels")= chr "rides"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(overall, rides)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "overall" "rides"
- attr(*, "class")= chr "lm"
Using the model object m1 we can draw our linear model on a plot. The function abline() takes the model object m1 as input.
plot(overall ~ rides, data=sat.df,
xlab="Satisfaction with Rides", ylab="Overall Satisfaction")
abline(m1, col='blue')
To get a complete report on our model we type:
summary(m1)
Call:
lm(formula = overall ~ rides, data = sat.df)
Residuals:
Min 1Q Median 3Q Max
-33.597 -10.048 0.425 8.694 34.699
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -94.9622 9.0790 -10.46 <2e-16 ***
rides 1.7033 0.1055 16.14 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.88 on 498 degrees of freedom
Multiple R-squared: 0.3434, Adjusted R-squared: 0.3421
F-statistic: 260.4 on 1 and 498 DF, p-value: < 2.2e-16
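Confidence intervals for the coefficients come from confint(); a quick addition (standard R, not shown in the original slides):
confint(m1)  # default 95% intervals for (Intercept) and rides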
Linear models assume that the relationship between the predictor and the outcome is linear. If this is not the case, you may get unexpected results.
x <- rnorm(500)
y <- x^2 + rnorm(500)  # y is a quadratic, not linear, function of x
toy.model <- lm(y ~ x)
summary(toy.model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.9241 -1.0369 -0.1899 0.6904 13.2698
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.98632 0.08111 12.160 <2e-16 ***
x 0.14705 0.08380 1.755 0.0799 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.814 on 498 degrees of freedom
Multiple R-squared: 0.006146, Adjusted R-squared: 0.00415
F-statistic: 3.08 on 1 and 498 DF, p-value: 0.0799
A plot will often reveal the violation of linear model assumptions. So, plot early and often!
plot(y ~ x)
abline(toy.model)
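Overlaying the true quadratic relationship shows how badly the straight line misses; a small sketch (my addition to the toy example):
curve(x^2, add=TRUE, col='red', lwd=2)  # the true, nonlinear relationship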
Returning to our satisfaction drivers example, we can use R's standard set of diagnostic plots (residuals vs. fitted values, normal Q-Q, scale-location, and residuals vs. leverage) for assessing model fit and detecting violations of model assumptions.
par(mfrow=c(2,2))
plot(m1)
We should inspect any outliers to see if there is a problem with the data, such as a miscoding or a data-entry error.
sat.df[c(57, 129, 295),]
weekend num.child distance rides games wait clean overall logdist
57 yes 2 63.29248 98 87 89 100 100 4.147767
129 yes 0 11.89550 76 77 51 77 6 2.476161
295 no 0 11.74474 98 83 63 92 45 2.463406
Nothing unusual or surprising here!
What we really want to know is which features of the park are most closely related to overall satisfaction. To do that, we fit a model where all the features of the park are included as predictors.
Fit a model relating overall satisfaction to satisfaction with all four features.
m2 <- lm(overall ~ rides + games + wait + clean, data=sat.df)
summary(m2)
Call:
lm(formula = overall ~ rides + games + wait + clean, data = sat.df)
Residuals:
Min 1Q Median 3Q Max
-29.944 -6.841 1.072 7.167 28.618
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -131.40919 8.33377 -15.768 < 2e-16 ***
rides 0.52908 0.14207 3.724 0.000219 ***
games 0.15334 0.06908 2.220 0.026903 *
wait 0.55333 0.04781 11.573 < 2e-16 ***
clean 0.98421 0.15987 6.156 1.54e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.59 on 495 degrees of freedom
Multiple R-squared: 0.5586, Adjusted R-squared: 0.5551
F-statistic: 156.6 on 4 and 495 DF, p-value: < 2.2e-16
The model coefficients tell us the association between rides, games, wait, and clean and overall satisfaction. This can help the park come up with strategies to improve overall satisfaction.
Note that relationships may not be causal. The model tells us that in the past, customers who were more satisfied overall were also more satisfied with cleanliness. Improving satisfaction with clean may not ultimately raise overall satisfaction.
In deciding what actions to take to improve satisfaction, the park also needs to consider how easy or hard it would be to change satisfaction with clean versus rides or wait.
A nice way to visualize the findings from the model is to plot the coefficients, which describe the relationship between each feature and overall satisfaction.
# library(coefplot)  # NB: recent library problems, so this code is commented out; see below
# coefplot(m2, intercept=FALSE, outerCI=1.96, lwdOuter=1.5,
# ylab="Rating of Feature",
# xlab="Association with Overall Satisfaction")
R-squared measures how much of the variation in the outcome the model explains:
summary(m1)$r.squared # single predictor: rides
[1] 0.3433799
summary(m2)$r.squared # multiple predictors
[1] 0.558621
Adjusted R-squared adjusts for the fact that a model with more predictors will fit the sample better even when the extra predictors add little real information.
summary(m1)$adj.r.squared
[1] 0.3420614
summary(m2)$adj.r.squared
[1] 0.5550543
We can compare two models visually. Plot early, plot often!
plot(sat.df$overall, fitted(m1), col='red',
xlim=c(0,100), ylim=c(0,100),
xlab="Actual Overall Satisfaction", ylab="Fitted Overall Satisfaction")
points(sat.df$overall, fitted(m2), col='blue')
legend("topleft", legend=c("model 1", "model 2"),
col=c("red", "blue"), pch=1)
For nested models (the predictors in m2 include all the predictors in m1 plus some additional predictors), there is a formal test of whether the additional variables improve the fit of the model:
anova(m1, m2)
Analysis of Variance Table
Model 1: overall ~ rides
Model 2: overall ~ rides + games + wait + clean
Res.Df RSS Df Sum of Sq F Pr(>F)
1 498 82612
2 495 55532 3 27080 80.463 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can make a prediction of the overall rating a customer would give if she rated all four park features as 100.
coef(m2)["(Intercept)"] + coef(m2)["rides"]*100 + coef(m2)["games"]*100 +
coef(m2)["wait"]*100 + coef(m2)["clean"]*100
(Intercept)
90.58612
Or more easily using R's matrix multiplication:
coef(m2) %*% c(1, 100, 100, 100, 100)
[,1]
[1,] 90.58612
Or using the built-in functions.
predict(m2, sat.df[1:10, ]) # first 10 observations
1 2 3 4 5 6 7 8
46.60864 54.26012 51.17289 50.30434 52.94625 27.87214 36.27435 43.13123
9 10
66.91439 45.38024
fitted(m2)[1:10] # same, automatically in model object
1 2 3 4 5 6 7 8
46.60864 54.26012 51.17289 50.30434 52.94625 27.87214 36.27435 43.13123
9 10
66.91439 45.38024
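predict() also accepts new data, so the hypothetical customer who rates every feature 100 can be handled the same way; a minimal sketch:
predict(m2, newdata=data.frame(rides=100, games=100, wait=100, clean=100))
# about 90.6, matching the matrix calculation above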
When you want to compare coefficients, it can be helpful to put the predictors on the same scale by standardizing them.
You can do that by hand:
head((sat.df$rides - mean(sat.df$rides)) / sd(sat.df$rides))
[1] 0.2112477 0.2112477 -0.1548662 0.3943047 -0.3379232 -0.8870941
Or using R's built-in function, scale():
head(scale(sat.df$rides))
[,1]
[1,] 0.2112477
[2,] 0.2112477
[3,] -0.1548662
[4,] 0.3943047
[5,] -0.3379232
[6,] -0.8870941
So that we don't have to worry about scale, we will standardize the whole data set.
sat.std <- sat.df[ , -3]  # copy of sat.df, dropping distance (column 3)
sat.std[ , 3:7] <- scale(sat.std[ , 3:7])  # standardize the rating columns
sat.std$logdist <- log(sat.df$distance)  # add transformed distance
head(sat.std)
weekend num.child rides games wait clean
1 yes 0 0.2112477 -0.69750817 -0.918784090 0.21544189
2 yes 2 0.2112477 -0.08198737 0.566719693 -0.17555973
3 no 1 -0.1548662 0.16422095 0.009655775 0.01994108
4 yes 0 0.3943047 -0.82061233 -0.361720171 0.21544189
5 no 4 -0.3379232 1.02595006 0.381031720 -0.17555973
6 no 5 -0.8870941 0.04111679 -2.032911927 -1.73956621
overall logdist
1 -0.2681587 4.741869
2 0.8654385 3.296359
3 0.6135280 4.147901
4 -0.8979350 3.254626
5 1.0543714 4.002198
6 -1.5277112 3.121454
summary(sat.std)
weekend num.child rides games
no :259 Min. :0.000 Min. :-2.53461 Min. :-2.66717
yes:241 1st Qu.:0.000 1st Qu.:-0.70404 1st Qu.:-0.69751
Median :2.000 Median : 0.02819 Median :-0.08199
Mean :1.738 Mean : 0.00000 Mean : 0.00000
3rd Qu.:3.000 3rd Qu.: 0.76042 3rd Qu.: 0.77974
Max. :5.000 Max. : 2.59099 Max. : 2.62630
wait clean overall
Min. :-2.775664 Min. :-2.71707 Min. :-2.85024
1st Qu.:-0.733096 1st Qu.:-0.76206 1st Qu.:-0.70900
Median : 0.009656 Median : 0.01994 Median :-0.07923
Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.659564 3rd Qu.: 0.60644 3rd Qu.: 0.67651
Max. : 2.794975 Max. : 2.36595 Max. : 3.06966
logdist
Min. :-0.6411
1st Qu.: 2.3339
Median : 2.9454
Mean : 2.9782
3rd Qu.: 3.6784
Max. : 5.4773
What is odd about this model?
m3 <- lm(overall ~ rides + games + wait + clean +
weekend + logdist + num.child, data = sat.std)
summary(m3)
Call:
lm(formula = overall ~ rides + games + wait + clean + weekend +
logdist + num.child, data = sat.std)
Residuals:
Min 1Q Median 3Q Max
-1.51427 -0.40271 0.01142 0.41613 1.69000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.56814 0.09275 -6.125 1.85e-09 ***
rides 0.21288 0.04197 5.073 5.57e-07 ***
games 0.07066 0.03026 2.335 0.0199 *
wait 0.38138 0.02777 13.734 < 2e-16 ***
clean 0.29690 0.04415 6.725 4.89e-11 ***
weekendyes -0.04589 0.05141 -0.893 0.3725
logdist 0.06562 0.02608 2.516 0.0122 *
num.child 0.22717 0.01711 13.274 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5709 on 492 degrees of freedom
Multiple R-squared: 0.6786, Adjusted R-squared: 0.674
F-statistic: 148.4 on 7 and 492 DF, p-value: < 2.2e-16
What is odd is that num.child enters m3 as a number, implicitly assuming that each additional child changes satisfaction by the same amount. Converting num.child to a factor lets each party size have its own effect:
sat.std$num.child.factor <- factor(sat.std$num.child)
m4 <- lm(overall ~ rides + games + wait + clean +
weekend + logdist + num.child.factor, data=sat.std)
summary(m4)
Call:
lm(formula = overall ~ rides + games + wait + clean + weekend +
logdist + num.child.factor, data = sat.std)
Residuals:
Min 1Q Median 3Q Max
-1.25923 -0.35048 -0.00154 0.31400 1.52690
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.78728 0.07941 -9.914 < 2e-16 ***
rides 0.22313 0.03541 6.301 6.61e-10 ***
games 0.04258 0.02551 1.669 0.0958 .
wait 0.38472 0.02338 16.453 < 2e-16 ***
clean 0.30917 0.03722 8.308 9.72e-16 ***
weekendyes -0.02227 0.04322 -0.515 0.6065
logdist 0.03233 0.02203 1.467 0.1429
num.child.factor1 1.01610 0.07130 14.250 < 2e-16 ***
num.child.factor2 1.03732 0.05640 18.393 < 2e-16 ***
num.child.factor3 0.98000 0.07022 13.955 < 2e-16 ***
num.child.factor4 0.93154 0.08032 11.598 < 2e-16 ***
num.child.factor5 1.00193 0.10369 9.663 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4795 on 488 degrees of freedom
Multiple R-squared: 0.7751, Adjusted R-squared: 0.77
F-statistic: 152.9 on 11 and 488 DF, p-value: < 2.2e-16
The coefficients for all five num.child.factor levels are nearly identical (roughly 1), suggesting it is the presence of children, not the number, that matters. A two-level factor captures this more simply:
sat.std$has.child <- factor(sat.std$num.child > 0)
m5 <- lm(overall ~ rides + games + wait + clean + logdist + has.child,
data=sat.std)
summary(m5)
Call:
lm(formula = overall ~ rides + games + wait + clean + logdist +
has.child, data = sat.std)
Residuals:
Min 1Q Median 3Q Max
-1.23491 -0.35539 -0.00838 0.32435 1.46624
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.80802 0.07391 -10.932 < 2e-16 ***
rides 0.22272 0.03512 6.342 5.12e-10 ***
games 0.04424 0.02539 1.742 0.0821 .
wait 0.38582 0.02326 16.589 < 2e-16 ***
clean 0.30876 0.03696 8.354 6.75e-16 ***
logdist 0.03562 0.02179 1.635 0.1027
has.childTRUE 1.00565 0.04683 21.472 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4782 on 493 degrees of freedom
Multiple R-squared: 0.7741, Adjusted R-squared: 0.7713
F-statistic: 281.5 on 6 and 493 DF, p-value: < 2.2e-16
An interaction between two predictors occurs when the effect of one predictor depends on the value of the other predictor.
(m7 <- lm(overall ~ rides + games + wait + clean + logdist + has.child +
wait:has.child,
data=sat.std))
Call:
lm(formula = overall ~ rides + games + wait + clean + logdist +
has.child + wait:has.child, data = sat.std)
Coefficients:
(Intercept) rides games
-0.78133 0.21264 0.04870
wait clean logdist
0.15095 0.30244 0.02961
has.childTRUE wait:has.childTRUE
0.99830 0.34688
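To interpret the interaction: the wait coefficient by itself is the slope for parties without children, while for parties with children the slope adds the interaction term:
coef(m7)["wait"]                                   # about 0.15: parties without children
coef(m7)["wait"] + coef(m7)["wait:has.childTRUE"]  # about 0.50: parties with children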
library(coefplot) # NB: recent library problems, using image
coefplot(m7, intercept=FALSE, outerCI=1.96, lwdOuter=1.5,
ylab="Rating of Feature",
xlab="Association with Overall Satisfaction")
Formulas are used in many R functions and there are many variations on them.
You can play around with adding or removing interactions, squared predictors and so forth, but be cautious about overfitting.
Consider transforming skewed variables with log() or other transforms. Put predictors on a common scale with scale() when you want to compare coefficients. Use plot() to get standard diagnostic plots and inspect them for model fit and outliers. Compare nested models with anova().
The Bayesian version of the linear model is also easy to run in R.
library(MCMCpack)
m7.b <- MCMCregress(overall ~ rides + games + wait + clean + logdist +
has.child + wait:has.child, data=sat.std)
summary(m7.b)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
(Intercept) -0.78162 0.07010 0.0007010 0.0007010
rides 0.21262 0.03351 0.0003351 0.0003297
games 0.04885 0.02400 0.0002400 0.0002400
wait 0.15096 0.03683 0.0003683 0.0003683
clean 0.30205 0.03515 0.0003515 0.0003515
logdist 0.02953 0.02066 0.0002066 0.0002066
has.childTRUE 0.99872 0.04414 0.0004414 0.0004478
wait:has.childTRUE 0.34732 0.04359 0.0004359 0.0004359
sigma2 0.20374 0.01306 0.0001306 0.0001306
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
(Intercept) -0.915802 -0.82914 -0.78189 -0.73484 -0.64289
rides 0.145492 0.19004 0.21289 0.23501 0.27819
games 0.001618 0.03284 0.04883 0.06457 0.09679
wait 0.079464 0.12628 0.15060 0.17601 0.22354
clean 0.233651 0.27852 0.30216 0.32582 0.37075
logdist -0.011536 0.01575 0.02959 0.04371 0.06920
has.childTRUE 0.911439 0.96929 0.99842 1.02837 1.08416
wait:has.childTRUE 0.261227 0.31781 0.34720 0.37714 0.43193
sigma2 0.179781 0.19454 0.20311 0.21213 0.23094
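Because MCMCregress() returns a coda mcmc object, the standard coda plot method provides trace and density plots for checking convergence:
plot(m7.b)  # trace and density plots for each parameter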
Access the Salaries data set:
library(car) # install.packages("car") if needed
data(Salaries)
Visualize salary. Should it be transformed? If so, how? Do the transformation and include it as a new variable.
par(mfrow=c(1,2))
hist(Salaries$salary)       # inspect the raw distribution first
hist(log(Salaries$salary))  # also try sqrt() etc.
Salaries$logsalary <- log(Salaries$salary)  # store the chosen transform
Fit a linear model for salary as a response of sex, rank, discipline, and years of service. Which effects are statistically significant?
salary.lm <- lm(salary ~ sex + rank + discipline + yrs.service,
data=Salaries)
summary(salary.lm)
Call:
lm(formula = salary ~ sex + rank + discipline + yrs.service,
data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-64202 -14255 -1533 10571 99163
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68351.67 4482.20 15.250 < 2e-16 ***
sexMale 4771.25 3878.00 1.230 0.219311
rankAssocProf 14560.40 4098.32 3.553 0.000428 ***
rankProf 49159.64 3834.49 12.820 < 2e-16 ***
disciplineB 13473.38 2315.50 5.819 1.24e-08 ***
yrs.service -88.78 111.64 -0.795 0.426958
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22650 on 391 degrees of freedom
Multiple R-squared: 0.4478, Adjusted R-squared: 0.4407
F-statistic: 63.41 on 5 and 391 DF, p-value: < 2.2e-16
Plot the coefficients for those (linear model) effects. [Note: this plot shows ±1 standard error, not a 95% confidence interval.]
library(coefplot) # install if needed
coefplot(salary.lm)
Repeat the linear model (salary in response to sex, rank, discipline, and years of service) using Bayesian estimation. Summarize the results.
library(MCMCpack)
set.seed(98108)
salary.lm.b <- MCMCregress(
salary ~ sex + rank + discipline + yrs.service,
data=Salaries)
options("scipen"=100, "digits"=4) # force non-scientific notation
summary(salary.lm.b)$quantiles
2.5% 25% 50% 75%
(Intercept) 59547.9 65230.0 68373.96 71426.16
sexMale -2796.3 2103.3 4747.86 7405.16
rankAssocProf 6451.9 11861.5 14570.61 17311.99
rankProf 41611.2 46577.3 49200.72 51737.03
disciplineB 8910.1 11895.7 13460.51 15007.37
yrs.service -307.6 -163.5 -89.82 -12.29
sigma2 448001458.5 489995080.7 513985664.78 539697427.20
97.5%
(Intercept) 77157.3
sexMale 12498.5
rankAssocProf 22629.5
rankProf 56678.2
disciplineB 17920.9
yrs.service 133.7
sigma2 593796060.9
This presentation is based on Chapter 7 of Chapman and Feit, R for Marketing Research and Analytics © 2015 Springer.
Exercises here use the Salaries data set from the car package: John Fox and Sanford Weisberg (2011). An R Companion to Applied Regression, Second Edition. Thousand Oaks, CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion
All code in the presentation is licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.