MULTIPLE REGRESSION
Preliminaries
Model Formulae
If you haven't yet read the tutorial on Model
Formulae, now would be a good time!
Statistical Modeling
There is not space in this tutorial, and probably not on this server, to
cover the complex issue of statistical modeling. For an excellent discussion, I
refer you to Chapter 9 of Crawley (2007).
Here I will restrict myself to a discussion of linear relationships. However,
transforming variables to make them linear (see
Simple Nonlinear Correlation and Regression) is straightforward, and a
model including interaction terms can be created as easily as changing plus
signs to asterisks. (If you don't know what I'm talking about, read the Model Formulae tutorial.) R also includes a wealth of
tools for more complex modeling, among them...
- rlm( ) for robust fitting of linear models (MASS package)
- glm( ) for generalized linear models (stats package)
- gam( ) for generalized additive models (gam package)
- lme( ) and lmer( ) for linear mixed-effects models (nlme and lme4 packages)
- nls( ) and nlme( ) for nonlinear models (stats and nlme packages)
- and I'm sure there are others I'm leaving out
My familiarity with these functions is "less than thorough" (!), so I shall
refrain from mentioning them further.
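For the curious, here is a brief, hedged illustration only: the formula interface for most of these is essentially the same as lm()'s, so (using the st data frame created below, and assuming the relevant packages are installed) the calls would look something like this. These models are not pursued further in this tutorial.
> ### Not run
> library(MASS)
> rlm(Life.Exp ~ Murder + HS.Grad + Frost, data=st)   # robust linear model
> glm(Life.Exp ~ Murder + HS.Grad + Frost, data=st)   # gaussian family by default; same fit as lm() here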
Warning: An opinion follows! I am very much opposed to the current common
practice in the social sciences (and education) of "letting the software
decide" what statistical model is best. Statistical software is an assistant,
not a master. It should remain in the hands of the researcher or data analyst
which variables are retained in the model and which are excluded. Hopefully,
these decisions will be based on a thorough knowledge of the phenomenon being
modeled. Therefore, I will not discuss techniques such as stepwise regression
or hierarchical regression here. Stepwise regression is implemented in R (see
below for a VERY brief mention), and hierarchical regression (SPSS-like) can
easily enough be done, although it is not automated (that I am aware of). The
basic rule of statistical modeling should be: "all statistical models are wrong"
(George Box).
Letting the software have free rein to include and exclude model variables
in no way changes or improves upon that situation. It is not more "objective,"
and it does not result in better models, only less well understood ones.
Preliminary Examination of the Data
For this analysis, we will use a built-in data set called state.x77. It is
one of many helpful data sets R includes on the states of the U.S. However, it
is not a data frame, so after loading it, we will coerce it into being a data
frame.
> state.x77 # output not shown
> str(state.x77) # clearly not a data frame! (it's a matrix)
> st = as.data.frame(state.x77) # so we'll make it one
> str(st)
'data.frame': 50 obs. of 8 variables:
$ Population: num 3615 365 2212 2110 21198 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life Exp : num 69.0 69.3 70.5 70.7 71.7 ...
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
$ Area : num 50708 566432 113417 51945 156361 ...
A couple of things have to be fixed before this will be convenient to use, and
I also want to add a population density variable. Note that population density
is calculated directly from Population and Area, so these three variables will
not be independent of one another.
> colnames(st)[4] = "Life.Exp" # no spaces in variable names, please
> colnames(st)[6] = "HS.Grad" # ditto
> st$Density = st$Population * 1000 / st$Area # create a new column in st
If you are unclear on what these variables are, or want more information, see
the help page: ?state.x77.
A straightforward summary is always a good place to begin, not least because
it will reveal any variables that have missing values. Then, since we are
looking for interrelationships among these variables, we should look at a
correlation matrix, or perhaps a scatterplot matrix.
> summary(st)
Population Income Illiteracy Life.Exp Murder
Min. : 365 Min. :3098 Min. :0.500 Min. :67.96 Min. : 1.400
1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12 1st Qu.: 4.350
Median : 2838 Median :4519 Median :0.950 Median :70.67 Median : 6.850
Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88 Mean : 7.378
3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89 3rd Qu.:10.675
Max. :21198 Max. :6315 Max. :2.800 Max. :73.60 Max. :15.100
HS.Grad Frost Area Density
Min. :37.80 Min. : 0.00 Min. : 1049 Min. : 0.6444
1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985 1st Qu.: 25.3352
Median :53.25 Median :114.50 Median : 54277 Median : 73.0154
Mean :53.11 Mean :104.46 Mean : 70736 Mean :149.2245
3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81162 3rd Qu.:144.2828
Max. :67.30 Max. :188.00 Max. :566432 Max. :975.0033
>
> cor(st) # correlation matrix (not shown, yet)
> round(cor(st), 3) # rounding makes it easier to look at
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area Density
Population 1.000 0.208 0.108 -0.068 0.344 -0.098 -0.332 0.023 0.246
Income 0.208 1.000 -0.437 0.340 -0.230 0.620 0.226 0.363 0.330
Illiteracy 0.108 -0.437 1.000 -0.588 0.703 -0.657 -0.672 0.077 0.009
Life.Exp -0.068 0.340 -0.588 1.000 -0.781 0.582 0.262 -0.107 0.091
Murder 0.344 -0.230 0.703 -0.781 1.000 -0.488 -0.539 0.228 -0.185
HS.Grad -0.098 0.620 -0.657 0.582 -0.488 1.000 0.367 0.334 -0.088
Frost -0.332 0.226 -0.672 0.262 -0.539 0.367 1.000 0.059 0.002
Area 0.023 0.363 0.077 -0.107 0.228 0.334 0.059 1.000 -0.341
Density 0.246 0.330 0.009 0.091 -0.185 -0.088 0.002 -0.341 1.000
>
> pairs(st) # scatterplot matrix; plot(st) does the same thing
That gives us plenty to look at, doesn't it?
I propose we cast "Life.Exp" into the role of our response variable and
attempt to model (predict) it from the other variables. Already we see some
interesting relationships. Life expectancy appears to be moderately to strongly
related to "Income" (positively), "Illiteracy" (negatively), "Murder"
(negatively), "HS.Grad" (positively), and "Frost" (no. of days per year with
frost, positively), and I call particular attention to this last relationship.
These are all bivariate relationships--i.e., they don't take into account any
of the remaining variables.
Here's another interesting relationship, which we would find statistically
significant. The correlation between average SAT scores by state and
average teacher pay by state is negative. That is to say, the states with higher
teacher pay have lower SAT scores. (How much political hay has been made of
that? These politicians either don't understand the nature of regression
relationships, or they are spinning the data to tell the story they want you
to hear!!) We would also find a very strong negative correlation between SAT
scores and percent of eligible students who take the test. I.e., the higher the
percentage of HS seniors who take the SAT, the lower the average score. Here is
the crucial point. Teacher pay is positively correlated with percent of students
who take the test. There is a potential confound here. In regression terms, we
would call percent of students who take the test a "lurking variable."
The advantage of looking at these variables in a multiple regression is that we
are able to see the relationship between two variables with the effects of the
others removed. To say this another way, the regression analysis "holds the potential
lurking variables constant" while it is figuring out the relationship between
the variables of interest. When SAT scores, teacher pay, and percent taking the
test are all entered into the regression relationship, the confound between
pay and percent is controlled, and the relationship between teacher pay and SAT
scores is shown to be significantly positive. (Note: If you get the Faraway
data sets when we do the ANCOVA tutorial, you will
have a data set, sat.txt, that will allow you to examine this. It's an
interesting problem and a recommended exercise.)
The point of this long discussion is this. Life expectancy does have a
bivariate relationship with a lot of the other variables, but many of those
variables are also related to each other. Multiple regression allows us to tease
all of this apart and see in a "purer" (but not pure) form what the relationship
really is between, say, life expectancy and average number of days with frost. I
suggest you watch this relationship as we proceed through the analysis.
We should also examine the variables for outliers and distribution before
proceeding, but for the sake of brevity, we'll skip it. But a quick way to do
it all at once would be:
> par(mfrow=c(3,3))
> for(i in 1:9) qqnorm(st[,i], main=colnames(st)[i])
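If you want a slightly fuller quick look, here is one possible sketch (not the only way) that adds reference lines to the normal quantile plots and draws boxplots as a quick check for outliers.
> par(mfrow=c(3,3))
> for(i in 1:9) {
+    qqnorm(st[,i], main=colnames(st)[i])
+    qqline(st[,i])                          # reference line for each quantile plot
+ }
> for(i in 1:9) boxplot(st[,i], main=colnames(st)[i])   # boxplots flag potential outliers
> par(mfrow=c(1,1))                          # reset the graphics defaults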
The Maximal or Full Model (Sans Interactions)
We begin by throwing all the predictors into a linear additive model.
> options(show.signif.stars=F) # I don't like significance stars!
> names(st) # for handy reference
[1] "Population" "Income" "Illiteracy" "Life.Exp" "Murder"
[6] "HS.Grad" "Frost" "Area" "Density"
> ### model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
+ ### HS.Grad + Frost + Area + Density, data=st)
> ### This is what we're going to accomplish, but more economically, by
> ### simply placing a dot after the tilde, which means "everything else."
> model1 = lm(Life.Exp ~ ., data=st)
> summary(model1)
Call:
lm(formula = Life.Exp ~ ., data = st)
Residuals:
Min 1Q Median 3Q Max
-1.47514 -0.45887 -0.06352 0.59362 1.21823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.995e+01 1.843e+00 37.956 < 2e-16
Population 6.480e-05 3.001e-05 2.159 0.0367
Income 2.701e-04 3.087e-04 0.875 0.3867
Illiteracy 3.029e-01 4.024e-01 0.753 0.4559
Murder -3.286e-01 4.941e-02 -6.652 5.12e-08
HS.Grad 4.291e-02 2.332e-02 1.840 0.0730
Frost -4.580e-03 3.189e-03 -1.436 0.1585
Area -1.558e-06 1.914e-06 -0.814 0.4205
Density -1.105e-03 7.312e-04 -1.511 0.1385
Residual standard error: 0.7337 on 41 degrees of freedom
Multiple R-squared: 0.7501, Adjusted R-squared: 0.7013
F-statistic: 15.38 on 8 and 41 DF, p-value: 3.787e-10
It appears higher populations are related to increased life expectancy and
higher murder rates are strongly related to decreased life expectancy. High
school graduation rate is marginal. Other
than that, we're not seeing much. Another kind of summary of the model can be
obtained like this.
> summary.aov(model1)
Df Sum Sq Mean Sq F value Pr(>F)
Population 1 0.4089 0.4089 0.7597 0.388493
Income 1 11.5946 11.5946 21.5413 3.528e-05
Illiteracy 1 19.4207 19.4207 36.0811 4.232e-07
Murder 1 27.4288 27.4288 50.9591 1.051e-08
HS.Grad 1 4.0989 4.0989 7.6152 0.008612
Frost 1 2.0488 2.0488 3.8063 0.057916
Area 1 0.0011 0.0011 0.0020 0.964381
Density 1 1.2288 1.2288 2.2830 0.138472
Residuals 41 22.0683 0.5383
These are sequential tests on the terms, as if they were entered one at a
time in a hierarchical fashion, beginning with Population by itself, then
Income added to Population, then Illiteracy added to Population and Income,
etc. Notice that the test on Density yields the same p-value as it did in the
regression summary, because in the sequential tests Density is entered last,
after everything else, which is effectively how every term is treated in the
regression summary. Make what you will of
these tests. I'm not saying they aren't useful or interesting, but I won't be
referring to them again.
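To convince yourself that these sequential tests depend on the order of the terms (unlike the t-tests in the regression summary), you could refit the model with the terms listed in a different order and compare. A sketch, output not shown:
> ### Not run: the sequential (Type I) sums of squares change with term order,
> ### but the coefficients and t-tests in summary() do not.
> model1b = lm(Life.Exp ~ Density + Area + Frost + HS.Grad + Murder +
+              Illiteracy + Income + Population, data=st)
> summary.aov(model1b)   # different from summary.aov(model1)
> summary(model1b)       # same coefficients and t-tests as summary(model1)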
A note on interactions. What about the possibility that there might
be important interactions between some of these terms? It's a legitimate
question, and I'll address it in a little bit, but including interactions in
a regression model raises some special issues. In this case particularly, it
would probably be a bad idea to test too many because of the dependence among
some of our variables (Density is computed from Population and Area). Even
without that problem though, testing
interactions usually involves centering the variables, AND just imagine how
many of them there would be! Plus, interpreting higher order interactions can
be difficult, to say the least. I think we need to simplify our model a bit
before we start fooling with interactions. I will mention them again below.
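Since centering was mentioned, here is a minimal sketch of what that would look like; it is only an illustration of the idea (we will not actually do this here): subtract each predictor's mean before forming the product term.
> ### Not run: centering predictors before testing an interaction
> st.c = st                                       # copy the data frame
> st.c$Murder  = st$Murder  - mean(st$Murder)     # center Murder
> st.c$HS.Grad = st$HS.Grad - mean(st$HS.Grad)    # center HS.Grad
> lm(Life.Exp ~ Murder * HS.Grad, data=st.c)      # main effects plus their interaction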
Now we need to start winnowing down our model to a minimal adequate one. The
least significant of the slopes in the above summary table (two tables ago--NOT
the sequential ANOVA table) was that for
"Illiteracy", so some would say that term must go first. I beg to differ. Rather
than letting a bunch of (I assume) insentient electrons make this decision for
me, I'm going to take matters into my own (hopefully) capable hands and let the
sodium and potassium ions in my brain make the decision. I should decide, not
the machine. The machine is my assistant, not my boss!
I like the "Illiteracy" term, and I can see how it might well be related to
life expectancy. I cannot see how "Area" of the state in which a person lives
could be related, except in that it might be confounded with something else.
"Area" has almost as high a p-value as "Illiteracy", so that is the first term
I am going to toss from my model.
The Minimal Adequate Model
We want to reduce our model to a point where all the remaining predictors
are significant, and we want to do this by throwing out one predictor at a time.
"Area" will go first. To do this, we could just recast the model without the
"Area" variable, but R supplies a handier way.
> model2 = update(model1, .~. -Area)
> summary(model2)
Call:
lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +
HS.Grad + Frost + Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.5025 -0.4047 -0.0608 0.5868 1.4386
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.094e+01 1.378e+00 51.488 < 2e-16
Population 6.249e-05 2.976e-05 2.100 0.0418
Income 1.485e-04 2.690e-04 0.552 0.5840
Illiteracy 1.452e-01 3.512e-01 0.413 0.6814
Murder -3.319e-01 4.904e-02 -6.768 3.12e-08
HS.Grad 3.746e-02 2.225e-02 1.684 0.0996
Frost -5.533e-03 2.955e-03 -1.873 0.0681
Density -7.995e-04 6.251e-04 -1.279 0.2079
Residual standard error: 0.7307 on 42 degrees of freedom
Multiple R-squared: 0.746, Adjusted R-squared: 0.7037
F-statistic: 17.63 on 7 and 42 DF, p-value: 1.173e-10
The syntax of update() is cryptic, so here's what
it means. Inside the
new formula, a dot means "same." So read the above update command as follows:
"Update model1 using the same response variables and the same predictor
variables, except remove (minus) "Area". Notice that removing "Area" has cost
us very little in terms of R-squared, and the adjusted R-squared actually went
up, due to there being fewer predictors. The two models can be compared as
follows.
> anova(model2, model1) # enter the name of the reduced model first
Analysis of Variance Table
Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Density
Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Area + Density
Res.Df RSS Df Sum of Sq F Pr(>F)
1 42 22.425
2 41 22.068 1 0.35639 0.6621 0.4205
As you can see, removing "Area" had no significant effect on the model
(p = .4205). Compare the p-value to that for "Area" in the first
summary table above.
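In fact, for a single-df term like this, the F statistic from the model comparison is just the square of the t statistic for "Area" in the full-model summary, and the p-values are identical. A quick check:
> (-0.814)^2    # t for Area in summary(model1), squared
[1] 0.662596
> ### which matches (to rounding) the F of 0.6621 in the anova() table above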
What goes next, "Income" or "Illiteracy"? I can see how either one of these
could be related to life expectancy. In fact, the only one of these variables
I'm a little leery of at this point is "Frost". But I'm also intrigued by it,
and it is close to being significant. So, my assistant is telling me to toss
out "Illiteracy", and reluctantly I agree.
> model3 = update(model2, .~. -Illiteracy)
> summary(model3)
Call:
lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad +
Frost + Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.49555 -0.41246 -0.05336 0.58399 1.50535
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.131e+01 1.042e+00 68.420 < 2e-16
Population 5.811e-05 2.753e-05 2.110 0.0407
Income 1.324e-04 2.636e-04 0.502 0.6181
Murder -3.208e-01 4.054e-02 -7.912 6.32e-10
HS.Grad 3.499e-02 2.122e-02 1.649 0.1065
Frost -6.191e-03 2.465e-03 -2.512 0.0158
Density -7.324e-04 5.978e-04 -1.225 0.2272
Residual standard error: 0.7236 on 43 degrees of freedom
Multiple R-squared: 0.745, Adjusted R-squared: 0.7094
F-statistic: 20.94 on 6 and 43 DF, p-value: 2.632e-11
Things are starting to change a bit. R-squared went down again, as it will
always do when a predictor is removed, but once more, adjusted R-squared
increased. "Frost" is also now a significant predictor of life expectancy, and
unlike in the bivariate relationship we saw originally, the relationship is now
negative.
Looks like "Income" should go next, but gosh darn it, how can income not be
related to life expectancy? Well, you have to be careful how you say that. The
model is NOT telling us that income is unrelated to life expectancy! It's telling
us that, within the parameters of this model, "Income" is now the worst
predictor. A look back at the correlation matrix shows that "Income" is strongly
correlated with "HS.Grad". Maybe "HS.Grad" is stealing some its thunder. On the
other hand, income is important because it can buy you things, like a house in a
neighborhood with good schools and low murder rate, and in the "burbs" (lower
population density). Sorry "Income" but I gotta give you the ax. (NOTE: The
important thing is, I thought about it and came to this conclusion rationally. I
didn't blindly let the numbers tell me what to do--entirely.)
> model4 = update(model3, .~. -Income)
> summary(model4)
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost +
Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.56877 -0.40951 -0.04554 0.57362 1.54752
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.142e+01 1.011e+00 70.665 < 2e-16
Population 6.083e-05 2.676e-05 2.273 0.02796
Murder -3.160e-01 3.910e-02 -8.083 3.07e-10
HS.Grad 4.233e-02 1.525e-02 2.776 0.00805
Frost -5.999e-03 2.414e-03 -2.485 0.01682
Density -5.864e-04 5.178e-04 -1.132 0.26360
Residual standard error: 0.7174 on 44 degrees of freedom
Multiple R-squared: 0.7435, Adjusted R-squared: 0.7144
F-statistic: 25.51 on 5 and 44 DF, p-value: 5.524e-12
R-squared went down hardly at all. Adjusted R-squared went up. "Income" will
not be missed FROM THIS MODEL. Now all the predictors are significant except
"Density", and while my heart says stay, I have to admit that looking at the
correlation matrix tells me "Density" isn't really strongly related to
anything.
> model5 = update(model4, .~. -Density)
> summary(model5)
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
data = st)
Residuals:
Min 1Q Median 3Q Max
-1.47095 -0.53464 -0.03701 0.57621 1.50683
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16
Population 5.014e-05 2.512e-05 1.996 0.05201
Murder -3.001e-01 3.661e-02 -8.199 1.77e-10
HS.Grad 4.658e-02 1.483e-02 3.142 0.00297
Frost -5.943e-03 2.421e-03 -2.455 0.01802
Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
Adjusted R-squared slipped a bit that time, but not significantly as we could
confirm by...
> anova(model5, model4) # output not shown
What I really want to do here is compare this model to the original full
model.
> anova(model5, model1)
Analysis of Variance Table
Model 1: Life.Exp ~ Population + Murder + HS.Grad + Frost
Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Area + Density
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 23.308
2 41 22.068 4 1.2397 0.5758 0.6818
It looks like we have lost nothing of significance, so to speak. I.e., the two
models are not significantly different.
We have (implicitly by not saying otherwise) set our alpha level at .05, so
now "Population" must go. So say the bean counters. This could have a substantial
effect on the model, as the slope for "Population" is very nearly significant.
We already know that creating a model without it and comparing it to the current
model (model5) would result in a p-value of 0.052. Gosh, it's so close! But as
my old stat professor (Tom Wickens) would say, close only counts in horseshoes
and hand grenades. I have to disagree. I am not one of those who worships the
magic number point-oh-five. I don't see much difference between 0.052 and 0.048.
Before we make this decision, let's look at something else first.
I raised the issue of interactions above. Is it possible that there are some
interactions among these variables that might be important? Granted, we've
already tossed out quite a few of the variables, so maybe it's a little late to
be asking, but better late than never. (You can quote me!) Now that we have few
enough terms that interactions might start to make sense, let's test them all
at once. I will test everything up to the three-way interactions. I'm leaving
out the four-way interaction for no good reason other than to demonstrate that
I can. Include it if you want.
> model.int = update(model5, .~.^3)
> anova(model5, model.int)
Analysis of Variance Table
Model 1: Life.Exp ~ Population + Murder + HS.Grad + Frost
Model 2: Life.Exp ~ Population + Murder + HS.Grad + Frost + Population:Murder +
Population:HS.Grad + Population:Frost + Murder:HS.Grad +
Murder:Frost + HS.Grad:Frost + Population:Murder:HS.Grad +
Population:Murder:Frost + Population:HS.Grad:Frost + Murder:HS.Grad:Frost
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 23.308
2 35 15.869 10 7.4395 1.6409 0.1356
In a model formula, ^ is a way to specify interactions, so .^3 means include
all the terms in the current model (the dot) and all interactions up to
three-way. If you want to limit it to two-way, then use .^2. If you want the
four-way, then it's .^4. It appears from our comparison of the models with and
without the interactions that the interactions do not make a significant
contribution to the model. So I'm not going to worry about them further.
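If you are curious exactly which terms a formula like .^3 generates, you can ask R to list them. A small sketch, output not shown (the term labels in the anova() table above come from this same expansion):
> attr(terms(model.int), "term.labels")   # every main effect and interaction actually in the model
> formula(model.int)                      # the formula as stored (may keep the compact ^ notation)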
Now we have to make a decision about "Population". My gut says it should
stay, and it's a pretty substantial gut! But to keep the point-oh-fivers
happy, let's get rid of it.
> model6 = update(model5, .~. -Population)
> summary(model6)
Call:
lm(formula = Life.Exp ~ Murder + HS.Grad + Frost, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.5015 -0.5391 0.1014 0.5921 1.2268
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.036379 0.983262 72.246 < 2e-16
Murder -0.283065 0.036731 -7.706 8.04e-10
HS.Grad 0.049949 0.015201 3.286 0.00195
Frost -0.006912 0.002447 -2.824 0.00699
Residual standard error: 0.7427 on 46 degrees of freedom
Multiple R-squared: 0.7127, Adjusted R-squared: 0.6939
F-statistic: 38.03 on 3 and 46 DF, p-value: 1.634e-12
We have reached the minimal adequate model in which all the slopes are
statistically significant.
Stepwise Regression
What we did in the previous section can be automated using the
step() function.
> step(model1, direction="backward")
Start: AIC=-22.89
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Area + Density
Df Sum of Sq RSS AIC
- Illiteracy 1 0.305 22.373 -24.208
- Area 1 0.356 22.425 -24.093
- Income 1 0.412 22.480 -23.969
<none> 22.068 -22.894
- Frost 1 1.110 23.178 -22.440
- Density 1 1.229 23.297 -22.185
- HS.Grad 1 1.823 23.891 -20.926
- Population 1 2.510 24.578 -19.509
- Murder 1 23.817 45.886 11.706
Step: AIC=-24.21
Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Area +
Density
Df Sum of Sq RSS AIC
- Area 1 0.143 22.516 -25.890
- Income 1 0.232 22.605 -25.693
<none> 22.373 -24.208
- Density 1 0.929 23.302 -24.174
- HS.Grad 1 1.522 23.895 -22.917
- Population 1 2.205 24.578 -21.508
- Frost 1 3.132 25.506 -19.656
- Murder 1 26.707 49.080 13.072
Step: AIC=-25.89
Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Density
Df Sum of Sq RSS AIC
- Income 1 0.132 22.648 -27.598
- Density 1 0.786 23.302 -26.174
<none> 22.516 -25.890
- HS.Grad 1 1.424 23.940 -24.824
- Population 1 2.332 24.848 -22.962
- Frost 1 3.304 25.820 -21.043
- Murder 1 32.779 55.295 17.033
Step: AIC=-27.6
Life.Exp ~ Population + Murder + HS.Grad + Frost + Density
Df Sum of Sq RSS AIC
- Density 1 0.660 23.308 -28.161
<none> 22.648 -27.598
- Population 1 2.659 25.307 -24.046
- Frost 1 3.179 25.827 -23.030
- HS.Grad 1 3.966 26.614 -21.529
- Murder 1 33.626 56.274 15.910
Step: AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost
Df Sum of Sq RSS AIC
<none> 23.308 -28.161
- Population 1 2.064 25.372 -25.920
- Frost 1 3.122 26.430 -23.876
- HS.Grad 1 5.112 28.420 -20.246
- Murder 1 34.816 58.124 15.528
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, data = st)
Coefficients:
(Intercept) Population Murder HS.Grad Frost
7.103e+01 5.014e-05 -3.001e-01 4.658e-02 -5.943e-03
I'm against it! Having said that, I have to point out that the stepwise
regression reached the same model we did, through almost the same sequence of
steps, EXCEPT it kept the population term. This procedure does not make decisions
based on p-values but on a statistic called AIC (Akaike's Information Criterion,
which is like a golf score in that lower values are better). Notice in the last
step, the model throwing out nothing had an AIC of -28.161. If "Population"
is thrown out, that rises (bad!) to -25.920. So "Population" was retained. I
hate to say I told you so, especially based on a method I don't believe in,
but there you go. In the step() function,
the "direction=" option can be set to "forward", "backward", or "both".
Confidence Limits on the Estimated Coefficients
I'm going to work with model5.
> confint(model5)
2.5 % 97.5 %
(Intercept) 6.910798e+01 72.9462729104
Population -4.543308e-07 0.0001007343
Murder -3.738840e-01 -0.2264135705
HS.Grad 1.671901e-02 0.0764454870
Frost -1.081918e-02 -0.0010673977
Set the desired confidence level with the "level=" option (not conf.level as in
many other R functions).
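For example, 90% limits would be obtained like this (output not shown):
> confint(model5, level=0.90)   # 90% confidence limits on the coefficients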
Predictions
Predictions can be made from a model equation using the
predict() function.
> predict(model5, list(Population=2000, Murder=10.5, HS.Grad=48, Frost=100))
1
69.61746
Feed it the model name and a list of labeled values for the predictors.
The values to be predicted from can be vectors.
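For example, to predict for three hypothetical states at once (the values below are made up purely for illustration), supply vectors of equal length, either in a data frame or in a list:
> predict(model5, data.frame(Population=c(1000, 2000, 5000),
+                            Murder=c(5, 10.5, 12),
+                            HS.Grad=c(60, 48, 50),
+                            Frost=c(120, 100, 50)))   # three predictions at once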
Regression Diagnostics
The usual diagnostic plots are available.
> par(mfrow=c(2,2)) # visualize four graphs at once
> plot(model5)
> par(mfrow=c(1,1)) # reset the graphics defaults
The first of these plots (upper left) is a standard residuals plot with which
we can evaluate linearity (looks good) and homoscedasticity (also good). The
second plot (upper right) is a normal quantile plot of the residuals, which
should be normally distributed. We have a little deviation from normality
in the tails,
but it's not too bad. The scale-location plot is supposedly a more sensitive
way to look for heteroscedasticity, but I've never found it very useful, to
be honest. The last plot shows possible high leverage and influential cases.
These graphs could also be obtained with the following commands.
> plot(model5, 1) # syntax: plot(model.object.name, 1)
> plot(model5, 2)
> plot(model5, 3)
> plot(model5, 5)
There is also plot(model5, 4), which gives a
Cook's Distance plot.
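If you want the influence statistics themselves rather than the plots, a short sketch (output not shown):
> sort(cooks.distance(model5), decreasing=TRUE)[1:5]   # the five most influential states
> summary(influence.measures(model5))                  # cases flagged as potentially influential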
Extracting Elements of the Model Object
The model object is a list containing quite a lot of information.
> names(model5)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values"
[6] "assign" "qr" "df.residual" "xlevels" "call"
[11] "terms" "model"
There are three ways to get at this information that I'm aware of.
> coef(model5) # an extractor function
(Intercept) Population Murder HS.Grad Frost
7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03
> model5$coefficients # list indexing
(Intercept) Population Murder HS.Grad Frost
7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03
> model5[[1]] # recall by position in the list (double brackets for lists)
(Intercept) Population Murder HS.Grad Frost
7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03
Aside from the coefficients, the residuals are the most interesting (to me).
Just enough of the list item name has to be given so that R can recognize what
it is you're asking for.
> model5$resid
Alabama Alaska Arizona Arkansas California
0.56888134 -0.54740399 -0.86415671 1.08626119 -0.08564599
Colorado Connecticut Delaware Florida Georgia
0.95645816 0.44541028 -1.06646884 0.04460505 -0.09694227
Hawaii Idaho Illinois Indiana Iowa
1.50683146 0.37010714 -0.05244160 -0.02158526 0.16347124
Kansas Kentucky Louisiana Maine Maryland
0.67648037 0.85582067 -0.39044846 -1.47095411 -0.29851996
Massachusetts Michigan Minnesota Mississippi Missouri
-0.61105391 0.76106640 0.69440380 -0.91535384 0.58389969
Montana Nebraska Nevada New Hampshire New Jersey
-0.84024805 0.42967691 -0.49482393 -0.49635615 -0.66612086
New Mexico New York North Carolina North Dakota Ohio
0.28880945 -0.07937149 -0.07624179 0.90350550 -0.26548767
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
0.26139958 -0.28445333 -0.95045527 0.13992982 -1.10109172
South Dakota Tennessee Texas Utah Vermont
0.06839119 0.64416651 0.92114057 0.84246817 0.57865019
Virginia Washington West Virginia Wisconsin Wyoming
-0.06691392 -0.96272426 -0.96982588 0.47004324 -0.58678863
Here is a particularly interesting way to look at the residuals.
> sort(model5$resid) # extract residuals and sort them
Maine South Carolina Delaware West Virginia Washington
-1.47095411 -1.10109172 -1.06646884 -0.96982588 -0.96272426
Pennsylvania Mississippi Arizona Montana New Jersey
-0.95045527 -0.91535384 -0.86415671 -0.84024805 -0.66612086
Massachusetts Wyoming Alaska New Hampshire Nevada
-0.61105391 -0.58678863 -0.54740399 -0.49635615 -0.49482393
Louisiana Maryland Oregon Ohio Georgia
-0.39044846 -0.29851996 -0.28445333 -0.26548767 -0.09694227
California New York North Carolina Virginia Illinois
-0.08564599 -0.07937149 -0.07624179 -0.06691392 -0.05244160
Indiana Florida South Dakota Rhode Island Iowa
-0.02158526 0.04460505 0.06839119 0.13992982 0.16347124
Oklahoma New Mexico Idaho Nebraska Connecticut
0.26139958 0.28880945 0.37010714 0.42967691 0.44541028
Wisconsin Alabama Vermont Missouri Tennessee
0.47004324 0.56888134 0.57865019 0.58389969 0.64416651
Kansas Minnesota Michigan Utah Kentucky
0.67648037 0.69440380 0.76106640 0.84246817 0.85582067
North Dakota Texas Colorado Arkansas Hawaii
0.90350550 0.92114057 0.95645816 1.08626119 1.50683146
That settles it! I'm moving to Hawaii. This could all just be random error, of
course, but more technically, it is unexplained variation. It could mean that
Maine is performing the worst against our model, and Hawaii the best, due to
factors we have not included in the model.
Beta Coefficients
Beta or standardized coefficients are the slopes we would get if all the
variables were on the same scale, which is done by converting them to z-scores
before doing the regression. Betas allow a comparison of the approximate
relative importance of the predictors, which neither the unstandardized
coefficients nor the p-values do. Scaling, or standardizing, the data vectors
can be done using the scale() function. Once the
scaled variables are created, the regression is redone using them. The
resulting coefficients are the beta coefficients.
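In other words, something like the following sketch. The slopes it produces should match the values calculated below, and the intercept will be essentially zero.
> ### Not run: standardize everything first, then refit
> st.z = as.data.frame(scale(st[,c("Life.Exp","Population","Murder","HS.Grad","Frost")]))
> lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data=st.z)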
That sounds like a lot of fuss and bother, so here is the way I do it. It's
based on the fact that a coefficient times the standard deviation of that
variable divided by the standard deviation of the response variable gives the
beta coefficient for the variable.
> coef(model5)
(Intercept) Population Murder HS.Grad Frost
7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03
> sdexpl = c(sd(st$Population), sd(st$Murder), sd(st$HS.Grad), sd(st$Frost))
> sdresp = sd(st$Life.Exp)
> coef(model5)[2:5] * sdexpl / sdresp
Population Murder HS.Grad Frost
0.1667540 -0.8253997 0.2802790 -0.2301391
So an increase of one standard deviation in the murder rate is associated with a
drop of 0.825 standard deviations in life expectancy, if all the other variables
are held constant. VERY roughly and approximately, these values can be
interpreted as
correlations with the other predictors controlled. We can see that HS.Grad and
Frost are of roughly the same importance in predicting Life.Exp. Population is
about right in there with them, or maybe a little less. Clearly, in our model,
Murder is the most important predictor of Life.Exp.
Partial Correlations
Another way to remove the effect of a possible lurking variable from the
correlation of two other variables is to calculate a partial correlation. There
is no partial correlation function in R base packages. There are optional
packages that have such functions, including the package "ggm", in which
the function is pcor().
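If you prefer not to install a package, a partial correlation can also be obtained in base R by correlating residuals. Here is a sketch for the partial correlation of Life.Exp and HS.Grad with Murder held constant (the variables are chosen only for illustration):
> ### Not run: partial correlation of Life.Exp and HS.Grad, controlling for Murder
> res1 = resid(lm(Life.Exp ~ Murder, data=st))   # the part of Life.Exp unrelated to Murder
> res2 = resid(lm(HS.Grad ~ Murder, data=st))    # the part of HS.Grad unrelated to Murder
> cor(res1, res2)                                # the partial correlation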
Making Predictions From a Model
To illustrate how predictions are made from a model, I will fit a new model to
another data set.
> rm(list=ls()) # clean up (WARNING! this will wipe your workspace!)
> data(airquality) # see ?airquality for details on these data
> na.omit(airquality) -> airquality # toss cases with missing values
> lm.out = lm(Ozone ~ Solar.R + Wind + Temp + Month, data=airquality)
> coef(lm.out)
(Intercept) Solar.R Wind Temp Month
-58.05383883 0.04959683 -3.31650940 1.87087379 -2.99162786
> confint(lm.out)
2.5 % 97.5 %
(Intercept) -1.035964e+02 -12.51131722
Solar.R 3.088529e-03 0.09610512
Wind -4.596849e+00 -2.03616955
Temp 1.328372e+00 2.41337586
Month -5.997092e+00 0.01383629
The model has been fitted and the regression coefficients displayed. Suppose
now we wish to predict for new values: Solar.R=200, Wind=11, Temp=80, Month=6.
One way to do this is as follows.
> (prediction = c(1, 200, 11, 80, 6) * coef(lm.out))
(Intercept) Solar.R Wind Temp Month
-58.053839 9.919365 -36.481603 149.669903 -17.949767
> ### Note: putting the whole statement in parentheses not only stores the values
> ### but also prints them to the Console.
> ### Now we have the contribution of each variable to the prediction. It remains
> ### but to sum them.
> sum(prediction)
[1] 47.10406
Our predicted value is 47.1. How accurate is it?
We can get a confidence interval for the predicted value, but first we have to
ask an important question. Is the prediction being made for the mean response for
cases like this? Or is it being made for a new single case? The difference this
makes to the CI is shown below.
> ### Prediction of mean response for cases like this...
> predict(lm.out, list(Solar.R=200, Wind=11, Temp=80, Month=6), interval="conf")
fit lwr upr
1 47.10406 41.10419 53.10393
> ### Prediction for a single new case...
> predict(lm.out, list(Solar.R=200, Wind=11, Temp=80, Month=6), interval="pred")
fit lwr upr
1 47.10406 5.235759 88.97236
As you can see, the CI is much wider in the second case. These are 95% CIs. You
can set a different confidence level using the "level=" option. The function
predict() has a somewhat tricky syntax,
so here's how it works. The first
argument should be the name of the model object. The second argument should be
a list of new data values labeled with the variable names. That's all
that's really required. The "interval=" option can take values of "none" (the
default), "confidence", or "prediction". You need only type enough of it so R
knows which one you want. The confidence level is changed with the "level="
option, and not with "conf.level=" as in other R functions.
Sooooo, when we did confint(lm.out) above,
what are those confidence limits? Are they for predicting the mean response
for cases like this? Or for predicting for a single new case? Let's find out!
> intervals = confint(lm.out)
> t(intervals)
(Intercept) Solar.R Wind Temp Month
2.5 % -103.59636 0.003088529 -4.596849 1.328372 -5.99709200
97.5 % -12.51132 0.096105125 -2.036170 2.413376 0.01383629
> values = c(1,200,11,80,6)
> sum(colMeans(t(intervals)) * values)
[1] 47.10406
> t(intervals) %*% matrix(values)
[,1]
2.5 % -83.25681
97.5 % 177.46493
Neither! The coefficient confidence limits cannot simply be combined this way,
because the estimated coefficients are correlated with one another, and a
proper interval for a prediction must account for that covariance (and, for a
single new case, for the residual variation as well). To get a sensible
interval for a prediction, use predict() with the "interval=" option as shown
above.
revised 2016 February 17