R Tutorials by William B. King, Ph.D.

MULTIPLE REGRESSION WITH INTERACTIONS


Preliminary Notes

"Seriously? This has to be covered separately?" I'm afraid it does. Interaction of continuous variables is not a simple thing. Some people refer to the analysis of interactions in multiple regression as moderator analysis and to the interacting variable as a moderating variable. I'll talk a little about that below, but moderator analysis is only a subset of interaction analysis.

I'll be discussing one particular type of interaction, called a bilinear interaction, which is represented in the regression equation as a product term. You should be aware that there are other ways that continuous variables can interact. Failing to find a significant product term in a multiple regression analysis is not sufficient evidence to say that the variables are not interacting.

Also, if you haven't read the Model Formulae tutorial yet, well, I'm sure by now you know the drill. It would also be helpful if you were familiar with Multiple Regression. (Come on, King! Did you really need to say that?)


Product Terms and What They Mean

In the case of simple linear regression, our analysis resulted in a regression equation of this form.

Y-hat = b0 + b1X

Y-hat stands for the predicted value of Y and is usually written as a Y with a caret symbol, ^, over it. I'm not sure I can program that in HTML, and even if I could, there's no guarantee that your browser would faithfully reproduce it. The traditional reading of the symbol is "Y hat". I don't want to type that a whole lot, but I also don't want to simply use Y, because students, especially, often forget that we are using the regression equation to calculate predicted values of Y, and not the actual observed values. So I'm going to break with precedent and use Pr to stand for the predicted value we get from a regression equation. And since I'm not a big fan of subscripts, especially when I have to program them into HTML, that will allow me to use X, Y, and Z as the predictor variables in place of X1, etc. So our simple linear regression equation becomes...

Pr = b0 + b1X

In multiple regression without interactions, we added more predictors to the regression equation.

Pr = b0 + b1X + b2Y

Pr = b0 + b1X + b2Y + b3Z

Etc. In principle, there is no limit to the number of predictors that can be added, although with n subjects, once we get to n-1 predictors, no matter what they are, we are making perfect predictions and learning nothing. So generally we are advised to hold it down to a reasonable, i.e., parsimonious, number of predictors.
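If you want to see the perfect-prediction problem for yourself, here's a quick sketch using made-up noise (my own toy example, not part of the tutorial data). With 6 cases and 5 predictors, plus the intercept, we are estimating 6 coefficients, and the fit is "perfect" no matter what the numbers are.

> set.seed(42)                                     # any seed will do
> junk = data.frame(matrix(rnorm(36), nrow=6))     # 6 cases of pure noise, columns X1-X6
> summary(lm(X1 ~ X2 + X3 + X4 + X5 + X6, data=junk))$r.squared     # 1, give or take rounding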

To test for the presence of an interaction between X and Y, customarily an XY product term is added to the equation.

Pr = b0 + b1X + b2Y + b3XY

As I've already pointed out, that's not the only way that predictors can interact, but when you ask your software to test for an interaction, that's almost certainly the test you're going to get, unless you go to some trouble to specify something different. That will be the test you get in R if you use the X:Y or X*Y notation in the model formula. What does it mean?
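To be explicit about that notation, here's a sketch (the names dat, response, X, and Y are placeholders, not from this tutorial) showing that X*Y in a model formula is just shorthand for the two main effects plus the product-type interaction. These two calls fit exactly the same model.

> lm(response ~ X * Y, data=dat)           # the shorthand
> lm(response ~ X + Y + X:Y, data=dat)     # the same model written out the long way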

In the case of simple linear regression (one predictor), the regression equation specifies the graph of a straight line, called the regression line. In the case of multiple regression with two predictors and no interaction, the regression line becomes a regression plane in 3D space. (Don't try to visualize what it is with more predictors. You could sprain your visual cortex!)

[Figure: the regression plane for two predictors, no interaction]

The plane represents the predicted values (Pr), but the actual or observed values, in most cases, will not be on the plane but some distance above or below it. The vertical distance from a predicted value to the corresponding observed value is called a residual. Points above the plane, like the illustrated one, have positive residuals. Points below the plane have negative residuals.
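In R terms, a residual is just the observed value minus the fitted (predicted) value, which is easy to check on any fitted model. A sketch with hypothetical names (fit, dat, y, x1, and x2 are placeholders):

> fit = lm(y ~ x1 + x2, data=dat)        # hypothetical data frame 'dat'
> head(cbind(observed=dat$y, predicted=fitted(fit), residual=residuals(fit)))
> # each residual equals observed minus predicted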

All of the lines in the grid representing this plane are straight. (Geometrically, they are "lines," not "curves.") All of the lines running from left to right have the same slope, which is the value of b1 in the regression equation, and is called the effect of X (sometimes the main effect of X). All of the lines running from front to back have the same slope, which is the value of b2 in the regression equation, and is called the effect of Y (sometimes the main effect of Y). The point where the plane touches the Pr axis is the intercept, which has the value b0 in the regression equation.

Add a product interaction term, b3XY, and this causes the plane to become twisted. Not bowed, twisted. All of the lines in the grid are still straight (they are still "lines"), but the lines running from left to right no longer all have the same slope, and the lines running from front to back no longer all have the same slope.

[Figure: the regression plane twisted by a bilinear interaction]

Each of the lines in this grid running from left to right (and any other such line you might imagine) is called a simple effect of X. Likewise, each of the lines running from front to back is called a simple effect of Y. Those simple effects are still linear, hence, bilinear interaction. But notice that the size of any simple effect of X depends upon the value of Y, and the size of any simple effect of Y depends upon the value of X. This can be shown from the regression equation with a little bit of algebra.

Pr = b0 + b1X + b2Y + b3XY
Pr = b0 + (b1 + b3Y)X + b2Y
Pr = b0 + b1X + (b2 + b3X)Y

The slope of any line going from left to right, a simple effect of X, is b1 + b3Y. The slope of any line going from front to back, a simple effect of Y, is b2 + b3X. Thus, the coefficient of the product term, b3, specifies how much the simple effect of X changes as Y changes, and how much the simple effect of Y changes as X changes.

Perhaps most importantly, without the product term (interaction), b1 is the effect of X at any value of Y, and b2 is the effect of Y at any value of X. Hence, the coefficients are rightfully referred to as "effects." With an interaction in force, that is no longer true. With a product term, b1 now represents the effect of X only when Y is equal to 0, and b2 represents the effect of Y only when X is equal to 0. In other words, b1 and b2 are now the simple effects of X and Y when the other predictor equals 0. You might want to study the above diagram for a moment to get the full implications of this. Notice that X can have a pretty big effect on Pr, but when Y=0, the effect of X is minimal. Same for Y.
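Once you have a fitted model, those simple slopes are easy to calculate from its coefficients. A sketch, again using hypothetical names (dat, response, X, Y are placeholders):

> fit = lm(response ~ X * Y, data=dat)
> b = coef(fit)                                   # named "(Intercept)", "X", "Y", "X:Y"
> y.values = quantile(dat$Y, c(.25, .50, .75))    # some values of Y worth looking at
> b["X"] + b["X:Y"] * y.values                    # simple slope of X at each of those values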


Finally, An Example!

I'm going to try to figure out what variables are related to gas mileage in cars using some data from the 1970s (because that's the kinda guy I am). The built-in data set is called "mtcars", and I'm going to take only some of the variables available in it to keep things "simple."

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

> cars = mtcars[,c(1,3,4,6)]
> dim(cars)
[1] 32  4
> summary(cars)
      mpg             disp             hp              wt       
 Min.   :10.40   Min.   : 71.1   Min.   : 52.0   Min.   :1.513  
 1st Qu.:15.43   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:2.581  
 Median :19.20   Median :196.3   Median :123.0   Median :3.325  
 Mean   :20.09   Mean   :230.7   Mean   :146.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :472.0   Max.   :335.0   Max.   :5.424
 
> cor(cars)        # r = .449 is the critical value, two-tailed, at alpha = .01
            mpg       disp         hp         wt
mpg   1.0000000 -0.8475514 -0.7761684 -0.8676594
disp -0.8475514  1.0000000  0.7909486  0.8879799
hp   -0.7761684  0.7909486  1.0000000  0.6587479
wt   -0.8676594  0.8879799  0.6587479  1.0000000
The response is "mpg" (miles per U.S. gallon), and we have three predictors available, "disp" (engine displacement), "hp" (horsepower), and "wt" (weight of the car). All three predictors are very strongly negatively correlated with the response. They are also strongly correlated with each other, meaning we might have a problem with collinearity here. Adding product terms to the model will very likely make that problem even more severe. For example...
> with(cars, cor(wt, wt*hp))
[1] 0.8647691
Not pretty!

[Figure: pairs plot of the cars data]

Let's check to see if the relationships are reasonably linear.

> pairs(cars, cex=.6)
It appears to me that the relationship of mpg to all three of the predictors might be a tad curvilinear. Once again, to keep things simple, I'm going to ignore it. We'll worry about the collinearity problem below.

In the last tutorial, we got additive models by separating the predictors in the model formula with plus signs. Now we want models including the interactions, so instead of using +, we use asterisks, *.

> lm.out = lm(mpg ~ disp * hp * wt, data=cars)
There are several extractor functions we can use to get information from our model object.
> coef(lm.out)          # not shown
> fitted(lm.out)        # not shown
> residuals(lm.out)     # not shown
But by far the most useful one is...
> summary(lm.out)

Call:
lm(formula = mpg ~ disp * hp * wt, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9439 -1.5613 -0.5286  1.5307  3.9042 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.255e+01  9.596e+00   5.476 1.25e-05 ***
disp        -6.416e-02  9.324e-02  -0.688   0.4980    
hp          -1.426e-01  8.282e-02  -1.722   0.0980 .  
wt          -7.511e+00  3.722e+00  -2.018   0.0549 .  
disp:hp      3.297e-04  4.180e-04   0.789   0.4379    
disp:wt      1.237e-02  2.598e-02   0.476   0.6382    
hp:wt        2.614e-02  2.936e-02   0.890   0.3822    
disp:hp:wt  -6.295e-05  1.203e-04  -0.523   0.6055    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.269 on 24 degrees of freedom
Multiple R-squared:  0.8903,	Adjusted R-squared:  0.8583 
F-statistic: 27.82 on 7 and 24 DF,  p-value: 4.908e-10
"What? Are you trying to tell me that we can't predict mpg from any combination of these three variables? I don't buy it!" Well, good for you! We've already seen that all three predictors are significantly correlated with the response well beyond alpha = .01. We could use any one of these predictors to get useful predictions of mpg. So something isn't right here!

The first thing we have to remember is that, other than the three-way interaction, we're not looking at effects. Our regression equation is of this form.

Pr = b0 + b1X + b2Y + b3Z + b4XY + b5XZ + b6YZ + b7XYZ

Therefore, just to look at one example...

Pr = b0 + (b1 + b4Y + b5Z + b7YZ)X + b2Y + b3Z + b6YZ

We're not looking at "an effect of X." We're looking at a simple effect of X (displacement) when Y (horsepower) and Z (weight) are both 0. We should know immediately, that's not interesting! How many cars do you think there were back in the 1970s with zero horsepower and zero weight? I'm guessing, not a lot! (Today maybe, but not in the seventies.)

Something else that should have clued us in here is that, even though we have no significant coefficients (no coefficients significantly different from zero), we have R-squared = .89, which is "off the charts" significant, F(7,24) = 27.82, p = 5 x 10^-10. There's something useful in there somewhere!
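By the way, if you'd rather pull that overall F test out of the model object than read it off the printed summary, the pieces are stored in the summary object. A sketch:

> fstat = summary(lm.out)$fstatistic           # F value, numerator df, denominator df
> fstat                                        # not shown
> pf(fstat[1], fstat[2], fstat[3], lower.tail=FALSE)     # the overall p-value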

If you have the "car" package installed (see Package Management), you can do this to see another problem we're having.

> library("car")        # gives an error if "car" is not installed
> vif(lm.out)
      disp         hp         wt    disp:hp    disp:wt      hp:wt 
 804.17158  194.14141   79.86471 1230.15630 1811.01199  606.98058 
disp:hp:wt 
2271.83347
The function vif() calculates variance inflation factors. Collinearity inflates the variances, and therefore the standard errors, of the coefficient estimates, which weakens the coefficient tests. The variance inflation factors tell us by how much. An ideal variance inflation factor is 1 (no inflation). Anything greater than 10 is generally considered high. I'd say we have a problem!
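For the record, a variance inflation factor is just 1/(1 - R-squared) from regressing one predictor column on all of the others. This sketch checks the "disp" value by hand; it should agree with what vif() reported, up to rounding.

> X = model.matrix(lm.out)[,-1]             # the predictor columns, intercept dropped
> r2.disp = summary(lm(X[,"disp"] ~ X[,-1]))$r.squared
> 1 / (1 - r2.disp)                         # compare to vif(lm.out)["disp"]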

No matter how many problems we fix, however, that three-way interaction is never going to get any better, so I suggest we drop it and proceed from there.
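The refit below uses the ^2 shorthand in the model formula. If that notation is new to you, it expands a sum of predictors into all of the main effects plus all of the two-way interactions (but not the three-way), so these two calls specify exactly the same model.

> lm(mpg ~ (disp + hp + wt)^2, data=cars)                            # the shorthand
> lm(mpg ~ disp + hp + wt + disp:hp + disp:wt + hp:wt, data=cars)    # written out in full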

> lm.out = lm(mpg ~ (disp + hp + wt)^2, data=cars)
> summary(lm.out)

Call:
lm(formula = mpg ~ (disp + hp + wt)^2, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1300 -1.5822 -0.5335  1.5777  4.0419 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 48.2558247  4.9028899   9.842 4.41e-10 ***
disp        -0.0198651  0.0385595  -0.515   0.6110    
hp          -0.1193411  0.0688717  -1.733   0.0955 .  
wt          -6.2016994  2.7159733  -2.283   0.0312 *  
disp:hp      0.0001238  0.0001388   0.892   0.3812    
disp:wt     -0.0005743  0.0078207  -0.073   0.9420    
hp:wt        0.0181979  0.0247729   0.735   0.4694    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.236 on 25 degrees of freedom
Multiple R-squared:  0.889,	Adjusted R-squared:  0.8624 
F-statistic: 33.38 on 6 and 25 DF,  p-value: 9.166e-11
Well, at last, a significant coefficient. But once again, it's a meaningless coefficient, because it's the simple effect of weight at displacement and horsepower equal to 0. (Note: I should point out that these simple effects are not automatically meaningless. They are meaningless in this case because there can't be a car with displacement and horsepower equal to 0. It's not only outside the range of our data, it's outside the range of possibility.)
> vif(lm.out)
     disp        hp        wt   disp:hp   disp:wt     hp:wt 
141.63402 138.27734  43.79523 139.80537 169.07598 444.99536
And our variance inflation factors are still pretty high. Nevertheless, R-squared is still lingering around .89, and adjusted R-squared actually went up (because we dropped a useless term from the model).

Let's fix the "meaningless simple effects" problem by making 0 a meaningful value for each of the predictors. We do that by a procedure called mean centering: we transform each variable by subtracting its mean, creating a mean-centered variable. That makes 0 the mean of each of those variables, so the simple effects shown in the regression output become, for example, the simple effect of X at the means of Y and Z (because those means are now 0 on the centered scale).

> cars$disp.c = cars$disp - mean(cars$disp)
> cars$hp.c = cars$hp - mean(cars$hp)
> cars$wt.c = cars$wt - mean(cars$wt)
> summary(cars)         # not shown, but notice the means of our centered variables
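(As an aside, the same centering can be done with scale() and scale=FALSE; it returns a one-column matrix, hence the as.numeric(). This is just an equivalent sketch, not something we need to run in addition to the subtraction above.)
> cars$disp.c = as.numeric(scale(cars$disp, center=TRUE, scale=FALSE))     # same as above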
Now we'll recreate our model using the centered variables.
> lm.out = lm(mpg ~ (disp.c + hp.c + wt.c)^2, data=cars)
> summary(lm.out)

Call:
lm(formula = mpg ~ (disp.c + hp.c + wt.c)^2, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1300 -1.5822 -0.5335  1.5777  4.0419 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18.5654742  0.6383450  29.084  < 2e-16 ***
disp.c      -0.0035550  0.0092818  -0.383 0.704947    
hp.c        -0.0322341  0.0111916  -2.880 0.008037 ** 
wt.c        -3.6647913  0.9398638  -3.899 0.000641 ***
disp.c:hp.c  0.0001238  0.0001388   0.892 0.381151    
disp.c:wt.c -0.0005743  0.0078207  -0.073 0.942049    
hp.c:wt.c    0.0181979  0.0247729   0.735 0.469425    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.236 on 25 degrees of freedom
Multiple R-squared:  0.889,	Adjusted R-squared:  0.8624 
F-statistic: 33.38 on 6 and 25 DF,  p-value: 9.166e-11
Now we're cookin'! Now we have a significant simple effect of weight at the means of displacement and horsepower, and we have a significant simple effect of horsepower at the means of displacement and weight. Those are simple effects we can wrap our minds around. We also got another advantage from centering the predictors.
> vif(lm.out)
     disp.c        hp.c        wt.c disp.c:hp.c disp.c:wt.c   hp.c:wt.c 
   8.206679    3.651385    5.244518    4.009015    7.780673   10.535377
The variance inflation factors are a lot more reasonable.

However, notice that the evaluation of the interactions hasn't changed at all. Should we just toss them out and go with an additive model? Let's not get impatient! Let's compare the additive model to the one we currently have.

> lm.add = lm(mpg ~ (disp.c + hp.c + wt.c), data=cars)
> anova(lm.add, lm.out)
Analysis of Variance Table

Model 1: mpg ~ (disp.c + hp.c + wt.c)
Model 2: mpg ~ (disp.c + hp.c + wt.c)^2
  Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
1     28 194.99                             
2     25 124.97  3     70.02 4.669 0.01006 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Throwing out the interactions would significantly change the model, and since deleting terms never makes the model better, that means the model without the interactions is significantly worse. Somewhere among those interactions is something useful.


Model Reduction

We now have three choices, so far as I can see. We can use theory (something we know about cars) to pick one of those interactions to delete. Or we could start by deleting the least significant one. Or we could use a stepwise regression to delete terms. In any event, once we start our process of model simplification, we should do it ONE TERM AT A TIME.
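If you go the delete-the-least-significant-term route, drop1() is a handy helper: it refits the model leaving out each removable term in turn, and it respects marginality, meaning it won't offer to drop a main effect that is still involved in an interaction. (A sketch; output not shown.)

> drop1(lm.out, test="F")        # F test for dropping each deletable term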

Anybody know enough about cars to pick one of those interactions to delete? Not me. I'm going to run the stepwise regression, but I'm not going to look at the result until I've pruned this model by hand. Then I'll check to see if the computer agrees with me.

> lm.step = step(lm.out, trace=0)      # trace=0 hides the results
Now I'm going to toss the least significant interaction.
> lm.out = lm(mpg ~ disp.c + hp.c + wt.c + disp.c:wt.c + hp.c:wt.c, data=cars)
> summary(lm.out)       # some output not shown
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18.899488   0.514790  36.713  < 2e-16 ***
disp.c      -0.002443   0.009161  -0.267 0.791797    
hp.c        -0.029497   0.010720  -2.752 0.010659 *  
wt.c        -3.853619   0.912067  -4.225 0.000259 ***
disp.c:wt.c -0.001644   0.007698  -0.214 0.832508    
hp.c:wt.c    0.031830   0.019414   1.640 0.113151
Okay, disp.c:wt.c goes next.
> lm.out = lm(mpg ~ disp.c + hp.c + wt.c + hp.c:wt.c, data=cars)
> summary(lm.out)       # some output not shown
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18.890970   0.504092  37.475  < 2e-16 ***
disp.c      -0.003021   0.008597  -0.351 0.727982    
hp.c        -0.028512   0.009505  -3.000 0.005751 ** 
wt.c        -3.885147   0.883996  -4.395 0.000155 ***
hp.c:wt.c    0.028022   0.007555   3.709 0.000950 ***
Aha! Looks like we have one more term to delete, disp.c, to get to the minimal adequate model. Check to see if it's still involved in any interactions first. No? Then out it goes!
> lm.out = lm(mpg ~ hp.c + wt.c + hp.c:wt.c, data=cars)
> summary(lm.out)

Call:
lm(formula = mpg ~ hp.c + wt.c + hp.c:wt.c, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0632 -1.6491 -0.7362  1.4211  4.5513 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18.898400   0.495703  38.124  < 2e-16 ***
hp.c        -0.030508   0.007503  -4.066 0.000352 ***
wt.c        -4.131649   0.529558  -7.802 1.69e-08 ***
hp.c:wt.c    0.027848   0.007420   3.753 0.000811 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared:  0.8848,	Adjusted R-squared:  0.8724 
F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13
Finally, let's see how I did relative to the computer.
> lm.step

Call:
lm(formula = mpg ~ hp.c + wt.c + hp.c:wt.c, data = cars)

Coefficients:
(Intercept)         hp.c         wt.c    hp.c:wt.c  
   18.89840     -0.03051     -4.13165      0.02785
Looks like the computer and I have arrived at the same model. Good job, computer!

Of the three variables we started with, we have retained horsepower and weight and their mutual interaction. Let's see how we're doing with variance inflation.

> vif(lm.out)
     hp.c      wt.c hp.c:wt.c 
 1.770198  1.795911  1.019382
Outstanding! Regression diagnostics?
> par(mfrow=c(2,2))
> plot(lm.out, c(1,2,4,5))
[Figure: regression diagnostic plots for the final model]
We have a few cases which some might consider influential, but all things considered, I'm pleased with that. Finally, let's get some confidence intervals on our regression coefficients.
> confint(lm.out, level=.99)
                 0.5 %       99.5 %
(Intercept) 17.5286437 20.268156481
hp.c        -0.0512404 -0.009774873
wt.c        -5.5949569 -2.668341231
hp.c:wt.c    0.0073459  0.048350396


What Does The Interaction Look Like?

There are several plots that can be used to visualize interactions, but I prefer the conditioning plot. It has a lot of options. I'll try to illustrate the most important ones.

> coplot(mpg ~ hp | wt,      # formula: mpg by hp given wt
+        data=cars,          # where the data are (the function can be ended here)
+        rows=1,             # number of rows to put the graphs in
+        number=3,           # number of graphs to plot
+        overlap=0)          # don't overlap values of wt across graphs
[Figure: conditioning plot (coplot) of mpg by hp given wt]

The top panel shows the ranges of "wt" that are covered in each of the graphs. The bottom panel shows a scatterplot of "mpg" by "hp" within each of those ranges. The only other option I might have set here was "panel=panel.smooth", which would have plotted a smoothed regression line on each of the plots. And possibly "show.given=F", which would turn off the top panel of the graphic. You can play with the options to see how they alter the graph. The only necessary parts are the formula and the data= argument.

The coplot is telling us that, at low values of vehicle weight, the relationship between mpg and hp is very steep. That is, gas mileage drops off rapidly as horsepower increases. At moderate values of vehicle weight the relationship is less steep but still strongly negative. Finally, at high vehicle weights, increasing horsepower has much less effect but still appears to be slightly negative.
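You can tell the same story from the fitted model itself by plotting predicted mpg against horsepower at a few fixed weights. Here's a sketch; the quartiles of wt are my own choice, purely for illustration.

> wt.q = quantile(cars$wt, c(.25, .50, .75))
> new = expand.grid(hp=seq(min(cars$hp), max(cars$hp), length.out=50), wt=wt.q)
> new$hp.c = new$hp - mean(cars$hp)            # center with the original sample means
> new$wt.c = new$wt - mean(cars$wt)
> new$mpg.hat = predict(lm.out, newdata=new)
> matplot(matrix(new$hp, ncol=3), matrix(new$mpg.hat, ncol=3), type="l", lty=1,
+         col=1:3, xlab="hp", ylab="predicted mpg")       # one line per weight
> legend("topright", legend=round(wt.q, 2), lty=1, col=1:3, title="wt")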


Moderator Effects

In many cases, our primary interest is in one of our predictors, which we think has a causal relationship to the response. The causal relationship must be justified through the way we designed the experiment, randomization to groups for example, or through some theoretical consideration. Let's say our primary interest in this example is in the predictor variable engine horsepower (hp).

However, we don't think the relationship between gas mileage and horsepower is a simple, straightforward one. Rather, we think the precise form of the relationship depends upon the weight of the vehicle. In other words, we think weight is a moderator variable. If it is, then the moderator effect will show up as a horsepower-by-weight interaction in the regression analysis.

Fred, in the lab across the street, is primarily interested in the relationship between gas mileage and the weight of the vehicle. He thinks the precise form of that relationship is altered depending upon the horsepower of the engine. Thus, to Fred, horsepower is the moderator variable.

The key element of moderator analysis is our belief that the influences we are seeing are causal. We can't say, in many regression analyses, what kind of causal relationship there might be between our variables. If we are trying to predict college GPA from SAT scores, for example, we can't really say that SAT scores cause GPA. There is clearly a relationship that we can use for prediction and decision making, but it's not a causal one. Therefore, if we include a variable that we think interacts with SAT in our analysis, we shouldn't really call what we're doing a moderator analysis. Not all interactions are moderator effects.

Furthermore, in a moderator effect, unlike in a mediator effect, neither the IV nor the moderator variable has causal priority. They don't act directly on each other. They both act simultaneously, more or less, on the dependent variable. Finally, "it is desirable that the moderator variable be uncorrelated with both the predictor and the criterion (the dependent variable) to provide a clearly interpretable interaction term" (Baron and Kenny, 1986).

  • Baron, R. M. and Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173-1182.

created 2016 March 27