| Table of Contents | Function Reference | Function Finder | R Project | MULTIPLE REGRESSION Preliminaries Model Formulae If you haven't yet read the tutorial on Model Formulae, now would be a good time! Statistical Modeling There is not space in this tutorial, and probably not on this server, to cover the complex issue of statistical modeling. For an excellent discussion, I refer you to Chapter 9 of Crawley (2007). Here I will restrict myself to a discussion of linear relationships. However, transforming variables to make them linear (see Simple Nonlinear Correlation and Regression) is straightforward, and a model including interaction terms is as easily created as it is to change plus signs to asterisks. (If you don't know what I'm talking about, read the Model Formulae tutorial.) R also includes a wealth of tools for more complex modeling, among them... rlm( ) for robust fitting of linear models (MASS package) glm( ) for generalized linear models (stats package) gam( ) for generalized additive models (gam package) lme( ) and lmer( ) for linear mixed-effects models (nlme and lme4 packages) nls( ) and nlme( ) for nonlinear models (stats and nlme packages) and I'm sure there are others I'm leaving out My familiarity with these functions is "less than thorough" (!), so I shall refrain from mentioning them further. Warning: An opinion follows! I am very much opposed to the current common practice in the social sciences (and education) of "letting the software decide" what statistical model is best. Statistical software is an assistant, not a master. It should remain in the hands of the researcher or data analyst which variables are retained in the model and which are excluded. Hopefully, these decisions will be based on a thorough knowledge of the phenomenon being modeled. Therefore, I will not discuss techniques such as stepwise regression or hierarchical regression here. Stepwise regression is implemented in R (see below for a VERY brief mention), and hierarchical regression (SPSS-like) can easily enough be done, although it is not automated (that I am aware of). The basic rule of statistical modeling should be: "all statistical models are wrong" (George Box). Letting the software have free reign to include and exclude model variables in no way changes or improves upon that situation. It is not more "objective," and it does not result in better models, only less well understood ones. Preliminary Examination of the Data For this analysis, we will use a built-in data set called state.x77. It is one of many helpful data sets R includes on the states of the U.S. However, it is not a data frame, so after loading it, we will coerce it into being a data frame. ```> state.x77 # output not shown > str(state.x77) # clearly not a data frame! (it's a matrix) > st = as.data.frame(state.x77) # so we'll make it one > str(st) 'data.frame': 50 obs. of 8 variables: \$ Population: num 3615 365 2212 2110 21198 ... \$ Income : num 3624 6315 4530 3378 5114 ... \$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ... \$ Life Exp : num 69.0 69.3 70.5 70.7 71.7 ... \$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ... \$ HS Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ... \$ Frost : num 20 152 15 65 20 166 139 103 11 60 ... \$ Area : num 50708 566432 113417 51945 156361 ...``` A couple things have to be fixed before this will be convenient to use, and I also want to add a population density variable. Note that population density is a linear function of Population and Area, so these three vectors will not be linearly independent. ```> colnames(st)[4] = "Life.Exp" # no spaces in variable names, please > colnames(st)[6] = "HS.Grad" # ditto > st\$Density = st\$Population * 1000 / st\$Area # create a new column in st``` If you are unclear on what these variables are, or want more information, see the help page: ?state.x77. A straightforward summary is always a good place to begin, because for one thing it will find any variables that have missing values. Then, seeing as how we are looking for interrelationships between these variables, we should look at a correlation matrix, or perhaps a scatterplot matrix. ```> summary(st) Population Income Illiteracy Life.Exp Murder Min. : 365 Min. :3098 Min. :0.500 Min. :67.96 Min. : 1.400 1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12 1st Qu.: 4.350 Median : 2838 Median :4519 Median :0.950 Median :70.67 Median : 6.850 Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88 Mean : 7.378 3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89 3rd Qu.:10.675 Max. :21198 Max. :6315 Max. :2.800 Max. :73.60 Max. :15.100 HS.Grad Frost Area Density Min. :37.80 Min. : 0.00 Min. : 1049 Min. : 0.6444 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985 1st Qu.: 25.3352 Median :53.25 Median :114.50 Median : 54277 Median : 73.0154 Mean :53.11 Mean :104.46 Mean : 70736 Mean :149.2245 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81162 3rd Qu.:144.2828 Max. :67.30 Max. :188.00 Max. :566432 Max. :975.0033 > > cor(st) # correlation matrix (not shown, yet) > round(cor(st), 3) # rounding makes it easier to look at Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area Density Population 1.000 0.208 0.108 -0.068 0.344 -0.098 -0.332 0.023 0.246 Income 0.208 1.000 -0.437 0.340 -0.230 0.620 0.226 0.363 0.330 Illiteracy 0.108 -0.437 1.000 -0.588 0.703 -0.657 -0.672 0.077 0.009 Life.Exp -0.068 0.340 -0.588 1.000 -0.781 0.582 0.262 -0.107 0.091 Murder 0.344 -0.230 0.703 -0.781 1.000 -0.488 -0.539 0.228 -0.185 HS.Grad -0.098 0.620 -0.657 0.582 -0.488 1.000 0.367 0.334 -0.088 Frost -0.332 0.226 -0.672 0.262 -0.539 0.367 1.000 0.059 0.002 Area 0.023 0.363 0.077 -0.107 0.228 0.334 0.059 1.000 -0.341 Density 0.246 0.330 0.009 0.091 -0.185 -0.088 0.002 -0.341 1.000 > > pairs(st) # scatterplot matrix; plot(st) does the same thing``` That, gives us plenty to look at, doesn't it? I propose we cast "Life.Exp" into the role of our response variable and attempt to model (predict) it from the other variables. Already we see some interesting relationships. Life expectancy appears to be moderately to strongly related to "Income" (positively), "Illiteracy" (negatively), "Murder" (negatively), "HS.Grad" (positively), and "Frost" (no. of days per year with frost, positively), and I call particular attention to this last relationship. These are all bivariate relationships--i.e., they don't take into account any of the remaining variables. Here's another interesting relationship, which we would find statistically significant. The correlation between average SAT scores by state and average teacher pay by state is negative. That is to say, the states with higher teacher pay have lower SAT scores. (How much political hay has been made of that? These politicians either don't understand the nature of regression relationships, or they are spinning the data to tell the story they want you to hear!!) We would also find a very strong negative correlation between SAT scores and percent of eligible students who take the test. I.e., the higher the percentage of HS seniors who take the SAT, the lower the average score. Here is the crucial point. Teacher pay is positively correlated with percent of students who take the test. There is a potential confound here. In regression terms, we would call percent of students who take the test a "lurking variable." The advantage of looking at these variables in a multiple regression is we are able to see the relationship between two variables with the others taken out. To say this another way, the regression analysis "holds the potential lurking variables constant" while it is figuring out the relationship between the variables of interest. When SAT scores, teacher pay, and percent taking the test are all entered into the regression relationship, the confound between pay and percent is controlled, and the relationship between teacher pay and SAT scores is shown to be significantly positive. (Note: If you get the Faraway data sets when we do the ANCOVA tutorial, you will have a data set, sat.txt, that will allow you to examine this. It's an interesting problem and a recommended exercise.) The point of this long discussion is this. Life expectancy does have a bivariate relationship with a lot of the other variables, but many of those variables are also related to each other. Multiple regression allows us to tease all of this apart and see in a "purer" (but not pure) form what the relationship really is between, say, life expectancy and average number of days with frost. I suggest you watch this relationship as we proceed through the analysis. We should also examine the variables for outliers and distribution before proceeding, but for the sake of brevity, we'll skip it. But a quick way to do it all at once would be: ```> par(mfrow=c(3,3)) > for(i in 1:9) qqnorm(st[,i], main=colnames(st)[i])``` The Maximal or Full Model (Sans Interactions) We begin by throwing all the predictors into a linear additive model. ```> options(show.signif.stars=F) # I don't like significance stars! > names(st) # for handy reference [1] "Population" "Income" "Illiteracy" "Life.Exp" "Murder" [6] "HS.Grad" "Frost" "Area" "Density" > ### model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + + ### HS.Grad + Frost + Area + Density, data=st) > ### This is what we're going to accomplish, but more economically, by > ### simply placing a dot after the tilde, which means "everything else." > model1 = lm(Life.Exp ~ ., data=st) > summary(model1) Call: lm(formula = Life.Exp ~ ., data = st) Residuals: Min 1Q Median 3Q Max -1.47514 -0.45887 -0.06352 0.59362 1.21823 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.995e+01 1.843e+00 37.956 < 2e-16 Population 6.480e-05 3.001e-05 2.159 0.0367 Income 2.701e-04 3.087e-04 0.875 0.3867 Illiteracy 3.029e-01 4.024e-01 0.753 0.4559 Murder -3.286e-01 4.941e-02 -6.652 5.12e-08 HS.Grad 4.291e-02 2.332e-02 1.840 0.0730 Frost -4.580e-03 3.189e-03 -1.436 0.1585 Area -1.558e-06 1.914e-06 -0.814 0.4205 Density -1.105e-03 7.312e-04 -1.511 0.1385 Residual standard error: 0.7337 on 41 degrees of freedom Multiple R-squared: 0.7501, Adjusted R-squared: 0.7013 F-statistic: 15.38 on 8 and 41 DF, p-value: 3.787e-10``` It appears higher populations are related to increased life expectancy and higher murder rates are strongly related to decreased life expectancy. High school graduation rate is marginal. Other than that, we're not seeing much. Another kind of summary of the model can be obtained like this. ```> summary.aov(model1) Df Sum Sq Mean Sq F value Pr(>F) Population 1 0.4089 0.4089 0.7597 0.388493 Income 1 11.5946 11.5946 21.5413 3.528e-05 Illiteracy 1 19.4207 19.4207 36.0811 4.232e-07 Murder 1 27.4288 27.4288 50.9591 1.051e-08 HS.Grad 1 4.0989 4.0989 7.6152 0.008612 Frost 1 2.0488 2.0488 3.8063 0.057916 Area 1 0.0011 0.0011 0.0020 0.964381 Density 1 1.2288 1.2288 2.2830 0.138472 Residuals 41 22.0683 0.5383``` These are sequential tests on the terms, as if they were entered one at a time in a hierarchical fashion, beginning with Population by itself, then Income added to Population, then Illiteracy added to Population and Income, etc. Notice that the test on Density yields the same p-value, because in the sequential tests it is entered last, after everything else, just as it is (and all the other terms are) in the regression analysis. Make what you will of these tests. I'm not saying they aren't useful or interesting, but I won't be referring to them again. A note on interactions. What about the possibility that there might be important interactions between some of these terms? It's a legitimate question, and I'll address it in a little bit, but including interactions in a regression model raises some special issues. In this case particularly, it would probably be a bad idea to test too many because of the linear dependence among some of our variables. Even without that problem though, testing interactions usually involves centering the variables, AND just imagine how many of them there would be! Plus, interpreting higher order interactions can be difficult, to say the least. I think we need to simplify our model a bit before we start fooling with interactions. I will mention them again below. Now we need to start winnowing down our model to a minimal adequate one. The least significant of the slopes in the above summary table (two tables ago--NOT the sequential ANOVA table) was that for "Illiteracy", so some would say that term must go first. I beg to differ. Rather than letting a bunch of (I assume) insentient electrons make this decision for me, I'm going to take matters into my own (hopefully) capable hands and let the sodium and potassium ions in my brain make the decision. I should decide, not the machine. The machine is my assistant, not my boss! I like the "Illiteracy" term, and I can see how it might well be related to life expectancy. I cannot see how "Area" of the state in which a person lives could be related, except in that it might be confounded with something else. "Area" has almost as high a p-value as "Illiteracy", so that is the first term I am going to toss from my model. The Minimal Adequate Model We want to reduce our model to a point where all the remaining predictors are significant, and we want to do this by throwing out one predictor at a time. "Area" will go first. To do this, we could just recast the model without the "Area" variable, but R supplies a handier way. ```> model2 = update(model1, .~. -Area) > summary(model2) Call: lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Density, data = st) Residuals: Min 1Q Median 3Q Max -1.5025 -0.4047 -0.0608 0.5868 1.4386 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.094e+01 1.378e+00 51.488 < 2e-16 Population 6.249e-05 2.976e-05 2.100 0.0418 Income 1.485e-04 2.690e-04 0.552 0.5840 Illiteracy 1.452e-01 3.512e-01 0.413 0.6814 Murder -3.319e-01 4.904e-02 -6.768 3.12e-08 HS.Grad 3.746e-02 2.225e-02 1.684 0.0996 Frost -5.533e-03 2.955e-03 -1.873 0.0681 Density -7.995e-04 6.251e-04 -1.279 0.2079 Residual standard error: 0.7307 on 42 degrees of freedom Multiple R-squared: 0.746, Adjusted R-squared: 0.7037 F-statistic: 17.63 on 7 and 42 DF, p-value: 1.173e-10``` The syntax of update() is cryptic, so here's what it means. Inside the new formula, a dot means "same." So read the above update command as follows: "Update model1 using the same response variables and the same predictor variables, except remove (minus) "Area". Notice that removing "Area" has cost us very little in terms of R-squared, and the adjusted R-squared actually went up, due to there being fewer predictors. The two models can be compared as follows. ```> anova(model2, model1) # enter the name of the reduced model first Analysis of Variance Table Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Density Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area + Density Res.Df RSS Df Sum of Sq F Pr(>F) 1 42 22.425 2 41 22.068 1 0.35639 0.6621 0.4205``` As you can see, removing "Area" had no significant effect on the model (p = .4205). Compare the p-value to that for "Area" in the first summary table above. What goes next, "Income" or "Illiteracy"? I can see how either one of these could be related to life expectancy. In fact, the only one of these variables I'm a little leary of at this point is "Frost". But I'm also intrigued by it, and it is close to being significant. So, my assistant is telling me to toss out "Illiteracy", and reluctantly I agree. ```> model3 = update(model2, .~. -Illiteracy) > summary(model3) Call: lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Density, data = st) Residuals: Min 1Q Median 3Q Max -1.49555 -0.41246 -0.05336 0.58399 1.50535 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.131e+01 1.042e+00 68.420 < 2e-16 Population 5.811e-05 2.753e-05 2.110 0.0407 Income 1.324e-04 2.636e-04 0.502 0.6181 Murder -3.208e-01 4.054e-02 -7.912 6.32e-10 HS.Grad 3.499e-02 2.122e-02 1.649 0.1065 Frost -6.191e-03 2.465e-03 -2.512 0.0158 Density -7.324e-04 5.978e-04 -1.225 0.2272 Residual standard error: 0.7236 on 43 degrees of freedom Multiple R-squared: 0.745, Adjusted R-squared: 0.7094 F-statistic: 20.94 on 6 and 43 DF, p-value: 2.632e-11``` Things are starting to change a bit. R-squared went down again, as it will always do when a predictor is removed, but once more, adjusted R-squared increased. "Frost" is also now a significant predictor of life expectancy, and unlike in the bivariate relationship we saw originally, the relationship is now negative. Looks like "Income" should go next, but gosh darn it, how can income not be related to life expectancy? Well, you have to be careful how you say that. The model is NOT telling us that income is unrelated to life expectancy! It's telling us that, within the parameters of this model, "Income" is now the worst predictor. A look back at the correlation matrix shows that "Income" is strongly correlated with "HS.Grad". Maybe "HS.Grad" is stealing some its thunder. On the other hand, income is important because it can buy you things, like a house in a neighborhood with good schools and low murder rate, and in the "burbs" (lower population density). Sorry "Income" but I gotta give you the ax. (NOTE: The important thing is, I thought about it and came to this conclusion rationally. I didn't blindly let the numbers tell me what to do--entirely.) ```> model4 = update(model3, .~. -Income) > summary(model4) Call: lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost + Density, data = st) Residuals: Min 1Q Median 3Q Max -1.56877 -0.40951 -0.04554 0.57362 1.54752 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.142e+01 1.011e+00 70.665 < 2e-16 Population 6.083e-05 2.676e-05 2.273 0.02796 Murder -3.160e-01 3.910e-02 -8.083 3.07e-10 HS.Grad 4.233e-02 1.525e-02 2.776 0.00805 Frost -5.999e-03 2.414e-03 -2.485 0.01682 Density -5.864e-04 5.178e-04 -1.132 0.26360 Residual standard error: 0.7174 on 44 degrees of freedom Multiple R-squared: 0.7435, Adjusted R-squared: 0.7144 F-statistic: 25.51 on 5 and 44 DF, p-value: 5.524e-12``` R-squared went down hardly at all. Adjusted R-squared went up. "Income" will not be missed FROM THIS MODEL. Now all the predictors are significant except "Density", and while my heart says stay, I have to admit that looking at the correlation matrix tells me "Density" isn't really strongly related to anything. ```> model5 = update(model4, .~. -Density) > summary(model5) Call: lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, data = st) Residuals: Min 1Q Median 3Q Max -1.47095 -0.53464 -0.03701 0.57621 1.50683 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 Population 5.014e-05 2.512e-05 1.996 0.05201 Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 Frost -5.943e-03 2.421e-03 -2.455 0.01802 Residual standard error: 0.7197 on 45 degrees of freedom Multiple R-squared: 0.736, Adjusted R-squared: 0.7126 F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12``` Adjusted R-squared slipped a bit that time, but not significantly as we could confirm by... `> anova(model5, model4) # output not shown` What I really want to do here is compare this model to the original full model. ```> anova(model5, model1) Analysis of Variance Table Model 1: Life.Exp ~ Population + Murder + HS.Grad + Frost Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area + Density Res.Df RSS Df Sum of Sq F Pr(>F) 1 45 23.308 2 41 22.068 4 1.2397 0.5758 0.6818``` It looks like we have lost nothing of significance, so to speak. I.e., the two models are not significantly different. We have (implicitly by not saying otherwise) set our alpha level at .05, so now "Population" must go. So say the bean counters. This could have a substantial effect on the model, as the slope for "Population" is very nearly significant. We already know that creating a model without it and comparing it to the current model (model5) would result in a p-value of 0.052. Gosh, it's so close! But as my old stat professor (Tom Wickens) would say, close only counts in horseshoes and handgrenades. I have to disagree. I am not one of those who worships the magic number point-oh-five. I don't see much difference between 0.052 and 0.048. Before we make this decision, let's look at something else first. I raised the issue of interactions above. Is it possible that there are some interactions among these variables that might be important? Granted, we've already tossed out quite a few of the variables, so maybe it's a little late to be asking, but better late than never. (You can quote me!) Now that we have few enough terms where interactions might start to make sense, let's test them all at once. I will test everything up to the three-way interactions. I'm not leaving out the four-way interaction for any good reason other than to demonstrate that I can. Include it if you want. ```> model.int = update(model5, .~.^3) > anova(model5, model.int) Analysis of Variance Table Model 1: Life.Exp ~ Population + Murder + HS.Grad + Frost Model 2: Life.Exp ~ Population + Murder + HS.Grad + Frost + Population:Murder + Population:HS.Grad + Population:Frost + Murder:HS.Grad + Murder:Frost + HS.Grad:Frost + Population:Murder:HS.Grad + Population:Murder:Frost + Population:HS.Grad:Frost + Murder:HS.Grad:Frost Res.Df RSS Df Sum of Sq F Pr(>F) 1 45 23.308 2 35 15.869 10 7.4395 1.6409 0.1356``` In a model formula, ^ is a way to specify interactions, so .^3 means include all the terms in the current model (the dot) and all interactions up to three-way. If you want to limit it to two-way, then use .^2. If you want the four-way, then it's .^4. It appears from our resultant comparison of the models with and without the interactions, that the interactions do not make a significant contribution to the model. So I'm not going to worry about them further. Now we have to make a decision about "Population". My gut says it should stay, and it's a pretty substantial gut! But to keep the point-oh-fivers happy, let's get rid of it. ```> model6 = update(model5, .~. -Population) > summary(model6) Call: lm(formula = Life.Exp ~ Murder + HS.Grad + Frost, data = st) Residuals: Min 1Q Median 3Q Max -1.5015 -0.5391 0.1014 0.5921 1.2268 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 71.036379 0.983262 72.246 < 2e-16 Murder -0.283065 0.036731 -7.706 8.04e-10 HS.Grad 0.049949 0.015201 3.286 0.00195 Frost -0.006912 0.002447 -2.824 0.00699 Residual standard error: 0.7427 on 46 degrees of freedom Multiple R-squared: 0.7127, Adjusted R-squared: 0.6939 F-statistic: 38.03 on 3 and 46 DF, p-value: 1.634e-12``` We have reached the minimal adequate model in which all the slopes are statistically significant. Stepwise Regression What we did in the previous section can be automated using the step() function. ```> step(model1, direction="backward") Start: AIC=-22.89 Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area + Density Df Sum of Sq RSS AIC - Illiteracy 1 0.305 22.373 -24.208 - Area 1 0.356 22.425 -24.093 - Income 1 0.412 22.480 -23.969 22.068 -22.894 - Frost 1 1.110 23.178 -22.440 - Density 1 1.229 23.297 -22.185 - HS.Grad 1 1.823 23.891 -20.926 - Population 1 2.510 24.578 -19.509 - Murder 1 23.817 45.886 11.706 Step: AIC=-24.21 Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Area + Density Df Sum of Sq RSS AIC - Area 1 0.143 22.516 -25.890 - Income 1 0.232 22.605 -25.693 22.373 -24.208 - Density 1 0.929 23.302 -24.174 - HS.Grad 1 1.522 23.895 -22.917 - Population 1 2.205 24.578 -21.508 - Frost 1 3.132 25.506 -19.656 - Murder 1 26.707 49.080 13.072 Step: AIC=-25.89 Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Density Df Sum of Sq RSS AIC - Income 1 0.132 22.648 -27.598 - Density 1 0.786 23.302 -26.174 22.516 -25.890 - HS.Grad 1 1.424 23.940 -24.824 - Population 1 2.332 24.848 -22.962 - Frost 1 3.304 25.820 -21.043 - Murder 1 32.779 55.295 17.033 Step: AIC=-27.6 Life.Exp ~ Population + Murder + HS.Grad + Frost + Density Df Sum of Sq RSS AIC - Density 1 0.660 23.308 -28.161 22.648 -27.598 - Population 1 2.659 25.307 -24.046 - Frost 1 3.179 25.827 -23.030 - HS.Grad 1 3.966 26.614 -21.529 - Murder 1 33.626 56.274 15.910 Step: AIC=-28.16 Life.Exp ~ Population + Murder + HS.Grad + Frost Df Sum of Sq RSS AIC 23.308 -28.161 - Population 1 2.064 25.372 -25.920 - Frost 1 3.122 26.430 -23.876 - HS.Grad 1 5.112 28.420 -20.246 - Murder 1 34.816 58.124 15.528 Call: lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, data = st) Coefficients: (Intercept) Population Murder HS.Grad Frost 7.103e+01 5.014e-05 -3.001e-01 4.658e-02 -5.943e-03``` I'm against it! Having said that, I have to point out that the stepwise regression reached the same model we did, through almost the same sequence of steps, EXCEPT it kept the population term. This procedure does not make decisions based on p-values but on a statistic called AIC (Akaike's Information Criterion, which is like a golf score in that lower values are better). Notice in the last step, the model throwing out nothing had an AIC of -28.161. If "Population" is thrown out, that rises (bad!) to -25.920. So "Population" was retained. I hate to say I told you so, especially based on a method I don't believe in, but there you go. In the stepwise() function, the "direction=" option can be set to "forward", "backward", or "both". Confidence Limits on the Estimated Coefficients I'm going to work with model5. ```> confint(model5) 2.5 % 97.5 % (Intercept) 6.910798e+01 72.9462729104 Population -4.543308e-07 0.0001007343 Murder -3.738840e-01 -0.2264135705 HS.Grad 1.671901e-02 0.0764454870 Frost -1.081918e-02 -0.0010673977``` Set the desired confidence level with the "level=" option (not conf.level as in many other R functions). Predictions Predictions can be made from a model equation using the predict() function. ```> predict(model5, list(Population=2000, Murder=10.5, HS.Grad=48, Frost=100)) 1 69.61746``` Feed it the model name and a list of labeled values for the predictors. The values to be predicted from can be vectors. Regression Diagnostics The usual diagnostic plots are available. ```> par(mfrow=c(2,2)) # visualize four graphs at once > plot(model5) > par(mfrow=c(1,1)) # reset the graphics defaults``` The first of these plots (upper left) is a standard residuals plot with which we can evaluate linearity (looks good) and homoscedasticity (also good). The second plot (upper right) is a normal quantile plot of the residuals, which should be normally distributed. We have a little deviation from normality in the tails, but it's not too bad. The scale-location plot is supposedly a more sensitive way to look for heteroscedasticity, but I've never found it very useful, to be honest. The last plot shows possible high leverage and influential cases. These graphs could also be obtained with the following commands. ```> plot(model5, 1) # syntax: plot(model.object.name, 1) > plot(model5, 2) > plot(model5, 3) > plot(model5, 5)``` There is also plot(model5, 4), which gives a Cook's Distance plot. Extracting Elements of the Model Object The model object is a list containing quite a lot of information. ```> names(model5) [1] "coefficients" "residuals" "effects" "rank" "fitted.values" [6] "assign" "qr" "df.residual" "xlevels" "call" [11] "terms" "model"``` There are three ways to get at this information that I'm aware of. ```> coef(model5) # an extractor function (Intercept) Population Murder HS.Grad Frost 7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03 > model5\$coefficients # list indexing (Intercept) Population Murder HS.Grad Frost 7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03 > model5[[1]] # recall by position in the list (double brackets for lists) (Intercept) Population Murder HS.Grad Frost 7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03``` Aside from the coefficients, the residuals are the most interesting (to me). Just enough of the list item name has to be given so that R can recognize what it is you're asking for. ```> model5\$resid Alabama Alaska Arizona Arkansas California 0.56888134 -0.54740399 -0.86415671 1.08626119 -0.08564599 Colorado Connecticut Delaware Florida Georgia 0.95645816 0.44541028 -1.06646884 0.04460505 -0.09694227 Hawaii Idaho Illinois Indiana Iowa 1.50683146 0.37010714 -0.05244160 -0.02158526 0.16347124 Kansas Kentucky Louisiana Maine Maryland 0.67648037 0.85582067 -0.39044846 -1.47095411 -0.29851996 Massachusetts Michigan Minnesota Mississippi Missouri -0.61105391 0.76106640 0.69440380 -0.91535384 0.58389969 Montana Nebraska Nevada New Hampshire New Jersey -0.84024805 0.42967691 -0.49482393 -0.49635615 -0.66612086 New Mexico New York North Carolina North Dakota Ohio 0.28880945 -0.07937149 -0.07624179 0.90350550 -0.26548767 Oklahoma Oregon Pennsylvania Rhode Island South Carolina 0.26139958 -0.28445333 -0.95045527 0.13992982 -1.10109172 South Dakota Tennessee Texas Utah Vermont 0.06839119 0.64416651 0.92114057 0.84246817 0.57865019 Virginia Washington West Virginia Wisconsin Wyoming -0.06691392 -0.96272426 -0.96982588 0.47004324 -0.58678863``` Here is a particularly interesting way to look at the residuals. ```> sort(model5\$resid) # extract residuals and sort them Maine South Carolina Delaware West Virginia Washington -1.47095411 -1.10109172 -1.06646884 -0.96982588 -0.96272426 Pennsylvania Mississippi Arizona Montana New Jersey -0.95045527 -0.91535384 -0.86415671 -0.84024805 -0.66612086 Massachusetts Wyoming Alaska New Hampshire Nevada -0.61105391 -0.58678863 -0.54740399 -0.49635615 -0.49482393 Louisiana Maryland Oregon Ohio Georgia -0.39044846 -0.29851996 -0.28445333 -0.26548767 -0.09694227 California New York North Carolina Virginia Illinois -0.08564599 -0.07937149 -0.07624179 -0.06691392 -0.05244160 Indiana Florida South Dakota Rhode Island Iowa -0.02158526 0.04460505 0.06839119 0.13992982 0.16347124 Oklahoma New Mexico Idaho Nebraska Connecticut 0.26139958 0.28880945 0.37010714 0.42967691 0.44541028 Wisconsin Alabama Vermont Missouri Tennessee 0.47004324 0.56888134 0.57865019 0.58389969 0.64416651 Kansas Minnesota Michigan Utah Kentucky 0.67648037 0.69440380 0.76106640 0.84246817 0.85582067 North Dakota Texas Colorado Arkansas Hawaii 0.90350550 0.92114057 0.95645816 1.08626119 1.50683146``` That settles it! I'm moving to Hawaii. This could all just be random error, of course, but more technically, it is unexplained variation. It could mean that Maine is performing the worst against our model, and Hawaii the best, due to factors we have not included in the model. Beta Coefficients Beta or standardized coefficients are the slopes we would get if all the variables were on the same scale, which is done by converting them to z-scores before doing the regression. Betas allow a comparison of the approximate relative importance of the predictors, which neither the unstandardized coefficients nor the p-values do. Scaling, or standardizing, the data vectors can be done using the scale() function. Once the scaled variables are created, the regression is redone using them. The resulting coefficients are the beta coefficients. That sounds like a lot of fuss and bother, so here is the way I do it. It's based on the fact that a coefficient times the standard deviation of that variable divided by the standard deviation of the response variable gives the beta coefficient for the variable. ```> coef(model5) (Intercept) Population Murder HS.Grad Frost 7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03 > sdexpl = c(sd(st\$Population), sd(st\$Murder), sd(st\$HS.Grad), sd(st\$Frost)) > sdresp = sd(st\$Life.Exp) > coef(model5)[2:5] * sdexpl / sdresp Population Murder HS.Grad Frost 0.1667540 -0.8253997 0.2802790 -0.2301391``` So an increase of one standard deviation in the murder rate is associated with a drop of 0.825 standard deviations in life expectancy, if all the other variables are held constant. VERY roughly and approximately, these values can be interpreted as correlations with the other predictors controlled. We can see that HS.Grad and Frost are of roughly the same importance in predicting Life.Exp. Population is about right in there with them, or maybe a little less. Clearly, in our model, Murder is the most important predictor of Life.Exp. Partial Correlations Another way to remove the effect of a possible lurking variable from the correlation of two other variables is to calculate a partial correlation. There is no partial correlation function in R base packages. There are optional packages that have such functions, including the package "ggm", in which the function is pcor(). Making Predictions From a Model To illustrate how predictions are made from a model, I will fit a new model to another data set. ```> rm(list=ls()) # clean up (WARNING! this will wipe your workspace!) > data(airquality) # see ?airquality for details on these data > na.omit(airquality) -> airquality # toss cases with missing values > lm.out = lm(Ozone ~ Solar.R + Wind + Temp + Month, data=airquality) > coef(lm.out) (Intercept) Solar.R Wind Temp Month -58.05383883 0.04959683 -3.31650940 1.87087379 -2.99162786 > confint(lm.out) 2.5 % 97.5 % (Intercept) -1.035964e+02 -12.51131722 Solar.R 3.088529e-03 0.09610512 Wind -4.596849e+00 -2.03616955 Temp 1.328372e+00 2.41337586 Month -5.997092e+00 0.01383629``` The model has been fitted and the regression coefficients displayed. Suppose now we wish to predict for new values: Solar.R=200, Wind=11, Temp=80, Month=6. One way to do this is as follows. ```> (prediction = c(1, 200, 11, 80, 6) * coef(lm.out)) (Intercept) Solar.R Wind Temp Month -58.053839 9.919365 -36.481603 149.669903 -17.949767 > ### Note: putting the whole statement in parentheses not only stores the values > ### but also prints them to the Console. > ### Now we have the contribution of each variable to the prediction. It remains > ### but to sum them. > sum(prediction) [1] 47.10406``` Our predicted value is 47.1. How accurate is it? We can get a confidence interval for the predicted value, but first we have to ask an important question. Is the prediction being made for the mean response for cases like this? Or is it being made for a new single case? The difference this makes to the CI is shown below. ```> ### Prediction of mean response for cases like this... > predict(lm.out, list(Solar.R=200, Wind=11, Temp=80, Month=6), interval="conf") fit lwr upr 1 47.10406 41.10419 53.10393 > ### Prediction for a single new case... > predict(lm.out, list(Solar.R=200, Wind=11, Temp=80, Month=6), interval="pred") fit lwr upr 1 47.10406 5.235759 88.97236``` As you can see, the CI is much wider in the second case. These are 95% CIs. You can set a different confidence level using the "level=" option. The function predict() has a somewhat tricky syntax, so here's how it works. The first argument should be the name of the model object. The second argument should be a list of new data values labeled with the variable names. That's all that's really required. The "interval=" option can take values of "none" (the default), "confidence", or "prediction". You need only type enough of it so R knows which one you want. The confidence level is changed with the "level=" option, and not with "conf.level=" as in other R functions. Sooooo, when we did confint(lm.out) above, what are those confidence limits? Are they for predicting the mean response for cases like this? Or for predicting for a single new case? Let's find out! ```> intervals = confint(lm.out) > t(intervals) (Intercept) Solar.R Wind Temp Month 2.5 % -103.59636 0.003088529 -4.596849 1.328372 -5.99709200 97.5 % -12.51132 0.096105125 -2.036170 2.413376 0.01383629 > values = c(1,200,11,80,6) > sum(colMeans(t(intervals)) * values) [1] 47.10406 > t(intervals) %*% matrix(values) [,1] 2.5 % -83.25681 97.5 % 177.46493``` Neither! revised 2016 February 17 | Table of Contents | Function Reference | Function Finder | R Project |