MULTIPLE REGRESSION WITH INTERACTIONS Preliminary Notes "Seriously? This has to be covered separately?" I'm afraid it does. Interaction of continuous variables is not a simple thing. Some people refer to the analysis of interactions in multiple regression as moderator analysis and to the interacting variable as a moderating variable. I'll talk a little about that below, but moderator analysis is only a subset of interaction analysis. I'll be discussing one particular type of interaction, called a bilinear interaction, which is represented in the regression equation as a product term. You should be aware that there are other ways that continuous variables can interact. Failing to find a significiant product term in a multiple regression analysis is not sufficient evidence to say that the variables are not interacting. Also, if you haven't read the Model Formulae tutorial yet, well, I'm sure by now you know the drill. It would also be helpful if you were familiar with Multiple Regression. (Come on, King! Did you really need to say that?) Product Terms and What They Mean In the case of simple linear regression, our analysis resulted in a regression equation of this form. Y-hat = b_{0} + b_{1}X Y-hat stands for the predicted value of Y and is usually written as a Y with a caret symbol, ^, over it. I'm not sure I can program that it HTML, and even if I could, there's no guarantee that your browser would faithfully reproduce it. The traditional reading of the symbol is "Y hat". I don't want to type that a whole lot, but I also don't want to simply use Y because students, especially, often forget that we are using the regression equation to calculate predicted values of Y, and not the actual observed values. So I'm going to break with precedent and use Pr to stand for the predicted value we get from a regression equation. And since I'm not a big fan of subscripts, especially when I have to program them into HTML, that will allow me to use X, Y, and Z as the predictor variables in place of X_{1}, etc. So our simple linear regression equation becomes... Pr = b_{0} + b_{1}X In multiple regression without interactions, we added more predictors to the regression equation. Pr = b_{0} + b_{1}X + b_{2}Y Pr = b_{0} + b_{1}X + b_{2}Y + b_{3}Z Etc. In principle, there is no limit to the number of predictors that can be added, although with n subjects, once we get to k-1 predictors, no matter what they are, we are making perfect predictions and learning nothing. So generally we are advised to hold it down to a reasonable, i.e., parsimonious, number of predictors. To test for the presence of an interaction between X and Y, customarily an XY product term is added to the equation. Pr = b_{0} + b_{1}X + b_{2}Y + b_{3}XY As I've already pointed out, that's not the only way that predictors can interact, but when you ask your software to test for an interaction, that's almost certainly the test you're going to get, unless you go to some trouble to specify something different. That will be the test you get in R if you use the X:Y or X*Y notation in the model formula. What does it mean? In the case of simple linear regression (one predictor), the regression equation specifies the graph of a straight line, called the regression line. In the case of multiple regression with two predictors and no interaction, the regression line become a regression plane in 3D space. (Don't try to visualize what it is with more predictors. You could sprain your visual cortex!) The plane represents the predicted values (Pr), but the actual or observed values, in most cases, will not be on the plane but some distance above or below it. The vertical distance from a predicted value to the corresponding observed value is called a residual. Points above the plane, like the illustrated one, have positive residuals. Points below the plane have negative residuals. All of the lines in the grid representing this plane are straight. (Geometrically, they are "lines," not "curves.") All of the lines running from left to right have the same slope, which is the value of b_{1} in the regression equation, and is called the effect of X (sometimes the main effect of X). All of the lines running from front to back have the same slope, which is the value of b_{2} in the regression equation, and is called the effect of Y (sometimes the main effect of Y). The point where the plane touches the Pr axis is the intercept, which has the value b_{0} in the regression equation. Add a product interaction term, b_{3}XY, and this causes the plane to become twisted. Not bowed, twisted. All of the lines in the grid are still straight (they are still "lines"), but the lines running from left to right no longer all have the same slope, and the lines running from front to back no longer all have the same slope. Each of the lines in this grid running from left to right (and all other lines that you might imagine doing so) are called simple effects of X. Likewise, all of the lines running from front to back are called simple effects of Y. Those simple effects are still linear, hence, bilinear interaction. But notice that the size of any simple effect of X depends upon the value of Y, and the size of any simple effect of Y depends upon the value of X. This can be shown from the regression equation with a little bit of algebra. Pr = b_{0} + b_{1}X + b_{2}Y + b_{3}XY The slope of any line going from left to right, a simple effect of X, is b_{1} + b_{3}Y. The slope of any line going from front to back, a simple effect of Y, is b_{2} + b_{3}X. Thus, the coefficient of the product term, b_{3}, specifies how much the simple effect of X changes as Y changes, and how much the simple effect of Y changes as X changes. Perhaps most importantly, without the product term (interaction), b_{1} is the effect of X at any value of Y, and b_{2} is the effect of Y at any value of X. Hence, the coefficients are rightfully referred to as "effects." With an interaction in force, that is no longer true. With a product term, b_{1} now represents the effect of X only when Y is equal to 0, and b_{2} represents the effect of Y only when X is equal to 0. In other words, b_{1} and b_{2} are now simple effect of X and Y when the other equals 0. You might want to study the above diagram for a moment to get the full implications of this. Notice that X can have a pretty big effect on Pr, but when Y=0, the effect of X is minimal. Same for Y. Finally, An Example! I'm going to try to figure out what variables are related to gas mileage in
cars using some data from the 1970s (because that's the kinda guy I am). The
built-in data set is called "mtcars", and I'm going to take only some of the
variables available in it to keep things "simple."
> head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 > cars = mtcars[,c(1,3,4,6)] > dim(cars) [1] 32 4 > summary(cars) mpg disp hp wt Min. :10.40 Min. : 71.1 Min. : 52.0 Min. :1.513 1st Qu.:15.43 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:2.581 Median :19.20 Median :196.3 Median :123.0 Median :3.325 Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.217 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.610 Max. :33.90 Max. :472.0 Max. :335.0 Max. :5.424 > cor(cars) # r = .449 is the critical value, two-tailed, at alpha = .01 mpg disp hp wt mpg 1.0000000 -0.8475514 -0.7761684 -0.8676594 disp -0.8475514 1.0000000 0.7909486 0.8879799 hp -0.7761684 0.7909486 1.0000000 0.6587479 wt -0.8676594 0.8879799 0.6587479 1.0000000The response is "mpg" (miles per U.S. gallon), and we have three predictors available, "disp" (engine displacement), "hp" (horsepower), and "wt" (weight of the car). All three predictors are very strongly negatively correlated with the response. They are also strongly correlated with each other, meaning we might have a problem with collinearity here. Adding product terms to the model will very likely make that problem even more severe. For example... > with(cars, cor(wt, wt*hp)) [1] 0.8647691Not pretty! Let's check to see if the relationships are reasonably linear.
> pairs(cars, cex=.6)It appears to me that the relationship of mpg to all three of the predictors might be a tad curvilinear. Once again, to keep things simple, I'm going to ignore it. We'll worry about the collinearity problem below. In the last tutorial, we got additive models by separating the predictors
in the model formula with plus signs. Now we want models including the
interactions, so instead of using +, we use asterisks, *.
> lm.out = lm(mpg ~ disp * hp * wt, data=cars)There are several extractor functions we can use to get information from our model object. > coef(lm.out) # not shown > fitted(lm.out) # not shown > residuals(lm.out) # not shownBut by far the most useful one is... > summary(lm.out) Call: lm(formula = mpg ~ disp * hp * wt, data = cars) Residuals: Min 1Q Median 3Q Max -2.9439 -1.5613 -0.5286 1.5307 3.9042 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.255e+01 9.596e+00 5.476 1.25e-05 *** disp -6.416e-02 9.324e-02 -0.688 0.4980 hp -1.426e-01 8.282e-02 -1.722 0.0980 . wt -7.511e+00 3.722e+00 -2.018 0.0549 . disp:hp 3.297e-04 4.180e-04 0.789 0.4379 disp:wt 1.237e-02 2.598e-02 0.476 0.6382 hp:wt 2.614e-02 2.936e-02 0.890 0.3822 disp:hp:wt -6.295e-05 1.203e-04 -0.523 0.6055 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2.269 on 24 degrees of freedom Multiple R-squared: 0.8903, Adjusted R-squared: 0.8583 F-statistic: 27.82 on 7 and 24 DF, p-value: 4.908e-10"What? Are you trying to tell me that we can't predict mpg from any combination of these three variables? I don't buy it!" Well, good for you! We've already seen that all three predictors are significantly correlated with the response well beyond alpha = .01. We could use any one of these predictors to get useful predictions of mpg. So something isn't right here! The first thing we have to remember is that, other than the three-way interaction, we're not looking at effects. Our regression equation is of this form. Pr = b_{0} + b_{1}X + b_{2}Y + b_{3}Z + b_{4}XY + b_{5}XZ + b_{6}YZ + b_{7}XYZ Therefore, just to look at one example...Pr = b_{0} + (b_{1} + b_{4}Y + b_{5}Z + b_{7}YZ)X + b_{2}Y + b_{3}Z + b_{6}YZ We're not looking at "an effect of X." We're looking at a simple effect of X (displacement) when Y (horsepower) and Z (weight) are both 0. We should know immediately, that's not interesting! How many cars do you think there were back in the 1970s with zero horsepower and zero weight? I'm guessing, not a lot! (Today maybe, but not in the seventies.)Something else that should have clued us in here is that, even though we have no significant coefficients (no coefficients significantly different from zero), we have R^{2} = .89, which is "off the charts" significant, F(7,24) = 27.82, p = 5x10^{-10}. There's something useful in there somewhere! If you have the "car" package installed (see
Package Management), you can do this to see another problem we're having.
> library("car") # gives an error if "car" is not installed > vif(lm.out) disp hp wt disp:hp disp:wt hp:wt 804.17158 194.14141 79.86471 1230.15630 1811.01199 606.98058 disp:hp:wt 2271.83347The function vif() calculates variance inflation factors. Collinearity produces variance inflation that affects the error terms for the coefficient tests. The variance inflation factors tell us by how much. An ideal variance inflation factor is 1 (no inflation). Anything greater than 10 is considered to be high. I'd say we have a problem! No matter how many problems we fix, however, that three-way interaction is
never going to get any better, so I suggest we drop it and proceed from there.
> lm.out = lm(mpg ~ (disp + hp + wt)^2, data=cars) > summary(lm.out) Call: lm(formula = mpg ~ (disp + hp + wt)^2, data = cars) Residuals: Min 1Q Median 3Q Max -3.1300 -1.5822 -0.5335 1.5777 4.0419 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 48.2558247 4.9028899 9.842 4.41e-10 *** disp -0.0198651 0.0385595 -0.515 0.6110 hp -0.1193411 0.0688717 -1.733 0.0955 . wt -6.2016994 2.7159733 -2.283 0.0312 * disp:hp 0.0001238 0.0001388 0.892 0.3812 disp:wt -0.0005743 0.0078207 -0.073 0.9420 hp:wt 0.0181979 0.0247729 0.735 0.4694 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2.236 on 25 degrees of freedom Multiple R-squared: 0.889, Adjusted R-squared: 0.8624 F-statistic: 33.38 on 6 and 25 DF, p-value: 9.166e-11Well, at last, a significant coefficient. But once again, it's a meaningless coefficient, because it's the simple effect of weight at displacement and horsepower equal to 0. (Note: I should point out that these simple effects are not automatically meaningless. They are meaningless in this case because there can't be a car with displacement and horsepower equal to 0. It's not only outside the range of our data, it's outside the range of possibility.) > vif(lm.out) disp hp wt disp:hp disp:wt hp:wt 141.63402 138.27734 43.79523 139.80537 169.07598 444.99536And our variance inflation factors are still pretty high. Nevertheless, R-squared is still lingering around .89, and adjusted R-squared actually went up (because we dropped a useless term from the model). Let's fix the "meaningless simple effects" problem by making 0 a meaningful
value for each of the predictors. We do that by a procedure called mean
centering. To do that, we transform each of the variables by subtracting
the mean, creating a mean-centered variable. That will make 0 the mean of
each of those variables, and that will make the simple effects shown in the
regression output, for example, the simple effect of X at the
means of Y and Z (because the means now happen to be 0 if we use the
centered version of these variables).
> cars$disp.c = cars$disp - mean(cars$disp) > cars$hp.c = cars$hp - mean(cars$hp) > cars$wt.c = cars$wt - mean(cars$wt) > summary(cars) # not shown, but notice the means of our centered variablesNow we'll recreate our model using the centered variables. > lm.out = lm(mpg ~ (disp.c + hp.c + wt.c)^2, data=cars) > summary(lm.out) Call: lm(formula = mpg ~ (disp.c + hp.c + wt.c)^2, data = cars) Residuals: Min 1Q Median 3Q Max -3.1300 -1.5822 -0.5335 1.5777 4.0419 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 18.5654742 0.6383450 29.084 < 2e-16 *** disp.c -0.0035550 0.0092818 -0.383 0.704947 hp.c -0.0322341 0.0111916 -2.880 0.008037 ** wt.c -3.6647913 0.9398638 -3.899 0.000641 *** disp.c:hp.c 0.0001238 0.0001388 0.892 0.381151 disp.c:wt.c -0.0005743 0.0078207 -0.073 0.942049 hp.c:wt.c 0.0181979 0.0247729 0.735 0.469425 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2.236 on 25 degrees of freedom Multiple R-squared: 0.889, Adjusted R-squared: 0.8624 F-statistic: 33.38 on 6 and 25 DF, p-value: 9.166e-11Now we're cookin'! Now we have a significant simple effect of weight at the means of displacement and horsepower, and we have a significant simple effect of horsepower at the means of displacement and weight. Those are some simple effects we wrap our minds around. We also got another advantage from centering the predictors. > vif(lm.out) disp.c hp.c wt.c disp.c:hp.c disp.c:wt.c hp.c:wt.c 8.206679 3.651385 5.244518 4.009015 7.780673 10.535377The variance inflation factors are a lot more reasonable. However, notice that the evaluation of the interactions hasn't changed at
all. Should we just toss them out and go with an additive model? Let's not get
impatient! Let's compare the additive model to the one we currently have.
> lm.add = lm(mpg ~ (disp.c + hp.c + wt.c), data=cars) > anova(lm.add, lm.out) Analysis of Variance Table Model 1: mpg ~ (disp.c + hp.c + wt.c) Model 2: mpg ~ (disp.c + hp.c + wt.c)^2 Res.Df RSS Df Sum of Sq F Pr(>F) 1 28 194.99 2 25 124.97 3 70.02 4.669 0.01006 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Throwing out the interactions would significantly change the model, and since deleting terms never makes the model better, that means the model without the interactions is significantly worse. Somewhere among those interactions is something useful. Model Reduction We now have three choices, so far as I can see. We can use theory (something we know about cars) to pick one of those interactions to delete. Or we could start by deleting the least significant one. Or we could use a stepwise regression to delete terms. In any event, once we start our process of model simplification, we should do it ONE TERM AT A TIME. Anybody know enough about cars to pick one of those interactions to delete?
Not me. I'm going to run the stepwise regression, but I'm not going to look at
the result until I've pruned this model by hand. Then I'll check to see if the
computer agrees with me.
> lm.step = step(lm.out, trace=0) # trace=0 hides the resultsNow I'm going to toss the least significant interaction. > lm.out = lm(mpg ~ disp.c + hp.c + wt.c + disp.c:wt.c + hp.c:wt.c, data=cars) > summary(lm.out) # some output not shown Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 18.899488 0.514790 36.713 < 2e-16 *** disp.c -0.002443 0.009161 -0.267 0.791797 hp.c -0.029497 0.010720 -2.752 0.010659 * wt.c -3.853619 0.912067 -4.225 0.000259 *** disp.c:wt.c -0.001644 0.007698 -0.214 0.832508 hp.c:wt.c 0.031830 0.019414 1.640 0.113151Okay, disp.c:wt.c goes next. > lm.out = lm(mpg ~ disp.c + hp.c + wt.c + hp.c:wt.c, data=cars) > summary(lm.out) # some output not shown Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 18.890970 0.504092 37.475 < 2e-16 *** disp.c -0.003021 0.008597 -0.351 0.727982 hp.c -0.028512 0.009505 -3.000 0.005751 ** wt.c -3.885147 0.883996 -4.395 0.000155 *** hp.c:wt.c 0.028022 0.007555 3.709 0.000950 ***Aha! Looks like we have one more term to delete to get to the minimal adequate model. Check to see if it's involved in any interactions first. No? Then out it goes! > lm.out = lm(mpg ~ hp.c + wt.c + hp.c:wt.c, data=cars) > summary(lm.out) Call: lm(formula = mpg ~ hp.c + wt.c + hp.c:wt.c, data = cars) Residuals: Min 1Q Median 3Q Max -3.0632 -1.6491 -0.7362 1.4211 4.5513 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 18.898400 0.495703 38.124 < 2e-16 *** hp.c -0.030508 0.007503 -4.066 0.000352 *** wt.c -4.131649 0.529558 -7.802 1.69e-08 *** hp.c:wt.c 0.027848 0.007420 3.753 0.000811 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2.153 on 28 degrees of freedom Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724 F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13Finally, let's see how I did relative to the computer. > lm.step Call: lm(formula = mpg ~ hp.c + wt.c + hp.c:wt.c, data = cars) Coefficients: (Intercept) hp.c wt.c hp.c:wt.c 18.89840 -0.03051 -4.13165 0.02785Looks like the computer and I have arrived at the same model. Good job, computer! Of the three variables we started with, we have retained horsepower and weight
and their mutual interaction. Let's see how we're doing with variance inflation.
> vif(lm.out) hp.c wt.c hp.c:wt.c 1.770198 1.795911 1.019382Outstanding! Regression diagnostics? > par(mfrow=c(2,2)) > plot(lm.out, c(1,2,4,5)) > confint(lm.out, level=.99) 0.5 % 99.5 % (Intercept) 17.5286437 20.268156481 hp.c -0.0512404 -0.009774873 wt.c -5.5949569 -2.668341231 hp.c:wt.c 0.0073459 0.048350396 What Does The Interaction Look Like? There are several plots that can be used to visualize interactions, but I
prefer the conditioning plot. It has a lot of options. I'll try to illustrate
the most important ones.
> coplot(mpg ~ hp | wt, # formula: mpg by hp given wt + data=cars, # where the data are (the function can be ended here) + rows=1, # number of rows to put the graphs in + number=3, # number of graphs to plot + overlap=0) # don't overlap values of wt across graphs The top panel shows the ranges of "wt" that are covered in each of the graphs. The bottom panel shows a scatterplot of "mpg" by "hp" within each of those ranges. The only other option I might have set here was "panel=panel.smooth", which would have plotted a smoothed regression line on each of the plots. And possibly "show.given=F", which would turn off the top panel of the graphic. You can play with the options to see how they alter the graph. The only necessary parts are the formula and the data= argument. The coplot is telling us that, at low values of vehicle weight, the relationship between mpg and hp is very steep. That is, gas mileage drops off rapidly as horsepower increases. At moderate values of vehicle weight the relationship is less steep but still strongly negative. Finally, at high vehicle weights, increasing horsepower has much less effect but still appears to be slightly negative. Moderator Effects In many cases, our primary interest is in one of our predictors, which we think has a causal relationship to the response. The causal relationship must be justified through the way we designed the experiment, randomization to groups for example, or through some theoretical consideration. Let's say our primary interest in this example is in the predictor variable engine horsepower (hp). However, we don't think the relationship between gas mileage and horsepower is a simple, straightforward one. Rather, we think the precise form of the relationship depends upon the weight of the vehicle. In other words, we think weight is a moderator variable. If it is, then the moderator effect will show up as a horsepower-by-weight interaction in the regression analysis. Fred, in the lab across the street, is primarily interested in the relationship between gas mileage and the weight of the vehicle. He thinks the precise form of that relationship is altered depending upon the horsepower of the engine. Thus, to Fred, horsepower is the moderator variable. The key element of moderator analysis is our belief that the influences we are seeing are causal. We can't say, in many regression analyses, what kind of causal relationship there might be between our variables. If we are trying to predict college GPA from SAT scores, for example, we can't really say that SAT scores cause GPA. There is clearly a relationship that we can use for prediction and decision making, but it's not a causal one. Therefore, if we include a variable that we think interacts with SAT in our analysis, we shouldn't really call what we're doing a moderator analysis. Not all interactions are moderator effects. Furthermore, in a moderator effect, unlike in a mediator effect, neither the IV nor the moderator variable have causal priority. They don't act directly on each other. They both act simultaneously, more or less, through the dependent variable. Finally, "it is desirable that the moderator variable be uncorrelated with both the predictor and the criterion (the dependent variable) to provide a clearly interpretable interaction term" (Baron and Kenny, 1986).
created 2016 March 27 |