R Tutorials by William B. King, Ph.D.

ANALYSIS OF COVARIANCE


Some people view ANCOVA as being a multiple regression technique. Others see it as an offshoot of ANOVA. In R, ANCOVA can be done with equal success using either the lm() function or the aov() function. I find it easier to treat ANCOVA as regression, so that's what I'll do.

Everything is regression, after all. The previous tutorial was called Multiple Regression, and we treated only cases where all the explanatory, or predictor, variables were numeric. It's possible to include categorical predictors (IVs) if they are coded correctly. In fact, if all we have are categorical IVs, then we are essentially doing ANOVA. If there is a mix of numeric and categorical IVs, then we are in the realm of ANCOVA.
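
To see what "coded correctly" means, here's a quick sketch. (The little factor below is made up purely for illustration.) The model.matrix() function shows the 0/1 dummy variables that R builds behind the scenes from a factor.

> f = factor(c("a", "a", "b", "c"))    # a toy factor with three levels
> model.matrix(~ f)                    # one 0/1 dummy column per non-reference level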


Gas Mileage

A car with an eight cylinder engine... What teenaged boy hasn't dreamed of owning one? Well, get one now before they become extinct, because they are hard on gasoline! Of course, cars with big engines also tend to weigh more than cars with smaller engines, so maybe it's not the number of cylinders in the engine so much as it is the weight of the car that costs us at the gas pump. I wonder how we might tease those two variables apart.

> names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
> cars = mtcars[,c(1,2,6)]
> cars$cyl = factor(cars$cyl)
> summary(cars)
      mpg        cyl          wt       
 Min.   :10.40   4:11   Min.   :1.513  
 1st Qu.:15.43   6: 7   1st Qu.:2.581  
 Median :19.20   8:14   Median :3.325  
 Mean   :20.09          Mean   :3.217  
 3rd Qu.:22.80          3rd Qu.:3.610  
 Max.   :33.90          Max.   :5.424
The "mtcars" data set contains information on cars that were road tested by Motor Trend magazine, with the results published in 1974. That's a long time ago, unless you're me, and you're still dreaming of that 1967 Mustang! It's the data we have, so we'll go with it. You can update the data on your own time if you want to see whether the relationships we discover in the 1974 data still hold today.

We have extracted three columns from that data frame, namely, mpg (gas mileage in miles per U.S. gallon), cyl (the number of cylinders in the engine), and wt (weight in 1000s of pounds). The "cyl" variable has sensibly been declared a factor. Let's begin by looking at some graphs.

> par(mfrow=c(1,3))
> boxplot(mpg ~ cyl, data=cars, xlab="cylinders", ylab="mpg")
> boxplot(wt ~ cyl, data=cars, xlab="cylinders", ylab="weight")
> plot(mpg ~ wt, data=cars)
[Figure: Car Plots: boxplots of mpg and wt by cyl, and a scatterplot of mpg vs. wt]
That doesn't leave much to the imagination, does it?! The more cylinders the car has (had!), the harder it is on gas. The more cylinders the car has, the more it weighs. And the more the car weighs, the lower its gas mileage. All three of those relationships appear impressively strong in the graphs.

An analysis of variance will no doubt show us in p-values what we can already plainly see from looking at the boxplots.

> lm.out1 = lm(mpg ~ cyl, data=cars)   # aov() would also work; it uses lm() internally
> anova(lm.out1)                       # same as summary.aov(lm.out1)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
cyl        2 824.78  412.39  39.697 4.979e-09 ***
Residuals 29 301.26   10.39                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(lm.out1)

Call:
lm(formula = mpg ~ cyl, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2636 -1.8357  0.0286  1.3893  7.2364 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
cyl6         -6.9208     1.5583  -4.441 0.000119 ***
cyl8        -11.5636     1.2986  -8.905 8.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325,	Adjusted R-squared:  0.714 
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09
No surprises in the ANOVA output. The interesting stuff is in the regression output. The (estimated) coefficients relate to the means of "cyl". The base, or reference, level of "cyl" is the one not listed with a coefficient in the regression output, in this case cyl4. The intercept (26.66) is the mean gas mileage for that level of "cyl". The coefficient for cyl6 is -6.92, which means that six cylinder cars get (got) 6.92 mpg less than four cylinder cars, on the average. Eight cylinder cars got 11.56 mpg less than four cylinder cars. The significance tests on these coefficients are equivalent to Fisher LSD tests on the differences in means. These figures are easily enough confirmed. (Important note: This interpretation of the coefficients depends upon having contrasts for unordered factors set to "treatment". Do options("contrasts") to check.)
> with(cars, tapply(mpg, cyl, mean))
       4        6        8 
26.66364 19.74286 15.10000
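
If you're in doubt about how your factors are coded, the check mentioned in the note above looks like this. ("contr.treatment" for unordered factors is the usual R default.)

> options("contrasts")                 # should show "contr.treatment" for unordered factors
> contrasts(cars$cyl)                  # the actual 0/1 dummy coding used for "cyl"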
The effect size is R-squared = .7325. In the language of ANOVA, that statistic would be called eta-squared. That is the "proportion of explained variation" in mpg accounted for by the explanatory variable. I don't think there's any doubt of a cause-and-effect relationship in this case (well, maybe a wee bit), but this does not mean that number of cylinders explains that whole 73% of the variability in gas mileage by itself. A more accurate statement would be that cylinders, plus any other variables that might be confounded with cylinders, account for 73% of the variability in gas mileage. The last line of the regression output duplicates the ANOVA output, because there is only one explanatory variable at this point.

We think one of the variables confounded with cylinders is weight of the vehicle. In fact, we think once those gas mileages are adjusted for the weight of the vehicle, the differences in the means by number of cylinders will be a lot less impressive. Dare we imagine that the differences might vanish altogether? I don't think so. But, let's see. Now for the analysis of covariance part.

> lm.out2 = lm(mpg ~ wt + cyl, data=cars)
> anova(lm.out2)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq  F value    Pr(>F)    
wt         1 847.73  847.73 129.6650 5.079e-12 ***
cyl        2  95.26   47.63   7.2856  0.002835 ** 
Residuals 28 183.06    6.54                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(lm.out2)

Call:
lm(formula = mpg ~ wt + cyl, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5890 -1.2357 -0.5159  1.3845  5.7915 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
wt           -3.2056     0.7539  -4.252 0.000213 ***
cyl6         -4.2556     1.3861  -3.070 0.004718 ** 
cyl8         -6.0709     1.6523  -3.674 0.000999 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.557 on 28 degrees of freedom
Multiple R-squared:  0.8374,	Adjusted R-squared:   0.82 
F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11
In the ANOVA, we have added "wt" (weight) as a covariate, a numeric variable that we think (in this case, know perfectly well) is related to the DV. Ordinarily, at least in the social sciences, we would do such a thing because we think the covariate will help discriminate among levels of the categorical IV, once the confound with the covariate is removed. Here we believe just the opposite is going to occur. We're not so terribly interested in the relationship between weight and gas mileage. We just want to get it out of the mix, so that we can see a "purer" relationship between cylinders and gas mileage.

Two very important points need to be made at this juncture. First, notice we didn't ask to see any interactions between "wt" and "cyl". That means we're assuming there aren't any, a common assumption in ANCOVA, but also an assumption that is commonly incorrect. It is worth checking. Second, in R, the ANOVA tests are sequential. That means if we want weight taken out before we look at the effect of cylinders, we MUST enter it first into the model formula. This won't make any difference in the regression output, but it makes a BIG difference in the ANOVA output!
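
To see the order dependence for yourself, reverse the terms in the formula. The regression coefficients won't change, but the sequential sums of squares (and F tests) for the individual terms will. (A sketch; output omitted.)

> anova(lm(mpg ~ cyl + wt, data=cars)) # "cyl" now enters first, unadjusted for "wt"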

Overall, we are now accounting for R-squared = .8374, or 84% of the total variability in gas mileage. However, notice that the part of the explained variability that we can attribute to "cyl" (eta-squared for "cyl") is now only...

> 95.26 / (847.73 + 95.26 + 183.06)
[1] 0.0845966
...or about 8%. The mighty have, indeed, fallen, but not so far that "cyl" is no longer a statistically significant effect. Are there other confounds with "cyl" that might remove that last 8%? I'll leave it up to you to look for them.
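
Incidentally, you don't have to retype numbers from the table to get that figure. The sums of squares can be pulled directly out of the anova object. (A sketch; the column is named "Sum Sq".)

> ss = anova(lm.out2)[, "Sum Sq"]      # SSs for wt, cyl, and residuals, in that order
> ss[2] / sum(ss)                      # eta-squared for cyl, about .08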

We can now calculate adjusted means of "cyl" by level. This is done using the regression equation, and we have a different regression equation for each of the three levels of "cyl".

mpg-hat = 33.9908 - 3.2056 * wt for cyl4
mpg-hat = 33.9908 - 3.2056 * wt - 4.2556 for cyl6
mpg-hat = 33.9908 - 3.2056 * wt - 6.0709 for cyl8

The terms for specific levels of "cyl" are usually combined with the intercept, which they change, giving each group its own intercept, but I have left them separated here to illustrate what's happening. To get the adjusted means for levels of "cyl", we fill in the overall average of "wt" and solve each of these regression equations.
> mean(cars$wt)
[1] 3.21725
> 33.9908 - 3.2056 * mean(cars$wt)               # adjusted mean for cyl4
[1] 23.67758
> 33.9908 - 3.2056 * mean(cars$wt) - 4.2556      # adjusted mean for cyl6
[1] 19.42198
> 33.9908 - 3.2056 * mean(cars$wt) - 6.0709      # adjusted mean for cyl8
[1] 17.60668
Notice that the means are still different--and significantly different as we can tell from the ANOVA--but not so different as they were before we took out the effect of the weight of the car on gas mileage.
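
The same adjusted means can be had from predict() without any retyping. (A minimal sketch; the data frame name is made up.)

> newcars = data.frame(wt=mean(cars$wt), cyl=factor(c(4,6,8)))
> predict(lm.out2, newdata=newcars)    # adjusted means for cyl4, cyl6, cyl8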

Here's a nice way to graph these results.

> plot(mpg ~ wt, data=cars, pch=as.numeric(cars$cyl), xlab="weight of car in 1000s of pounds")
> abline(a=33.99, b=-3.21, lty="solid")
> abline(a=33.99-4.26, b=-3.21, lty="dashed")
> abline(a=33.99-6.07, b=-3.21, lty="dotted")
The circles and solid line are for the four-cylinder cars, the triangles and dashed line for the six-cylinder cars, and the plus symbols and dotted line for the eight-cylinder cars. I'll refer you to the More Graphics tutorial for instructions on how to add all that information to the graph.
[Figure: Cars Regression: the scatterplot of mpg vs. wt with the three fitted lines added]
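
If you can't wait for that tutorial, a legend along the following lines would do it. (A sketch; the position and labels are your choice.)

> legend("topright", c("4 cyl", "6 cyl", "8 cyl"), pch=1:3, lty=c("solid","dashed","dotted"))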

Two things distress me here that I feel compelled to comment on. First, notice there is very little overlap on the scatterplot in the weight of 4- vs. 6- vs. 8-cylinder cars. This makes the calculation of adjusted means just a little bit dicey, because we are evaluating each group's regression line at the overall mean weight, a weight at which some of the groups have few or no actual cars. Also, while I don't detect anything obvious on the graph, I'm still uncomfortable that we haven't checked for the possibility of interactions between "wt" and "cyl".

> lm.out3 = lm(mpg ~ wt * cyl, data=cars)
> anova(lm.out2, lm.out3)
Analysis of Variance Table

Model 1: mpg ~ wt + cyl
Model 2: mpg ~ wt * cyl
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     28 183.06                           
2     26 155.89  2     27.17 2.2658 0.1239
Interactions make no significant contribution to the model, but I'd be happier if that F-value were closer to 1. (Note: This method of comparing models is explained in the Multiple Regression tutorial.)
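
By the way, because "wt:cyl" was entered last, the same test also shows up as the last line of the sequential ANOVA table for the interaction model. (A sketch; output omitted.)

> anova(lm.out3)                       # the wt:cyl line reproduces the F = 2.27 test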

An Aside

Above I noted that, in R, ANOVA is done sequentially, and that makes it very important to get the covariate into the model formula first. If you don't care for sequential sums of squares, you can try this, provided you have installed the optional "car" package from CRAN. (See Package Management.)

> library("car")                       # this will give an error if "car" is not installed
> Anova(lm.out2)                       # notice the upper case A in Anova
Anova Table (Type II tests)

Response: mpg
           Sum Sq Df F value   Pr(>F)    
wt        118.204  1 18.0801 0.000213 ***
cyl        95.263  2  7.2856 0.002835 ** 
Residuals 183.059 28                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These are so-called Type II sums of squares, and they are not calculated sequentially. I would hesitate to calculate eta-squared values from this table unless I got the total SS by some method other than adding up the SSs in this table. (See the tutorial on Unbalanced Designs for a discussion.)
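
Getting the total SS some other way is easy enough. (A sketch: sequential SSs always add up to the total SS, and the total SS can also be computed from its definition.)

> sum(anova(lm.out2)[, "Sum Sq"])      # the sequential SSs sum to the total SS
> sum((cars$mpg - mean(cars$mpg))^2)   # total SS from the definition; same number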


Childhood Abuse and Adult Post-Traumatic Stress Disorder

The data for this section of the tutorial are discussed in Faraway (2005). To get the data, you can either install the "faraway" package, or you can go to Dr. Faraway's website and download the data files that accompany his book. The files download as a zip archive called "jfdata.zip", which unzips into a folder called "ascdata". Rename this folder "Faraway" and drop it into your working directory (Rspace).

If you have installed the "faraway" package, the commands to get the data are:

> library("faraway")
> data(sexab)
If you have downloaded the Faraway folder to your working directory, then do this:
> sexab = read.table("Faraway/sexab.txt", header=T, row.names=1)
Either way...
> summary(sexab)
      cpa               ptsd               csa    
 Min.   :-3.1204   Min.   :-3.349   Abused   :45  
 1st Qu.: 0.8321   1st Qu.: 6.173   NotAbused:31  
 Median : 2.0707   Median : 8.909                 
 Mean   : 2.3547   Mean   : 8.986                 
 3rd Qu.: 3.7387   3rd Qu.:12.240                 
 Max.   : 8.6469   Max.   :18.993
The data are from a study by Rodriguez et al. (1997). The subjects were adult women who were being treated at a clinic, some of whom self-reported childhood sexual abuse (csa=Abused), and some of whom did not (csa=NotAbused). All of the women filled out two standardized scales, one assessing the severity of childhood physical abuse (cpa), and one assessing the severity of adult post-traumatic stress disorder (ptsd). On both of these scales, a higher score indicates a worse outcome. The question we will attempt to answer is: How are sexual and physical abuse that occur in childhood related to the severity of PTSD as an adult, and how does the effect of physical abuse differ for women who were or were not sexually abused as children?

When I discuss this example in my class, I always start off by having students examine the following scatterplot. (DON'T close this graphics window. We will be drawing regression lines on this graph eventually.)

> plot(ptsd ~ cpa, data=sexab, pch=as.numeric(csa), col=as.numeric(csa))
[Figure: SexAb Scatterplot: ptsd vs. cpa, with plotting symbols and colors by csa]
The red triangles are for women who were not sexually abused as children, and the black open circles are for women who were. I ask my students to tell me what they can see in this graph, and I hope against hope for the following answers.
  1. There is a clear positive relationship between the severity of childhood physical abuse and the severity of adult PTSD.
  2. Overall, women who were not sexually abused as children have lower severity of adult PTSD than do women who were sexually abused as children.
  3. Overall, women who were not sexually abused as children also had lower severity of physical abuse as children than did women who were sexually abused as children.
Then I ask them the hard question. Does the effect of physical abuse add to the effect of sexual abuse, or not? When we summarize the relationships we see on this graph, are we going to have to do so with two regression lines, or will one be sufficient? Let's find out.

First, I propose we check for the importance of interactions.

> lm.sexab1 = lm(ptsd ~ cpa + csa, data=sexab)
> lm.sexab2 = lm(ptsd ~ cpa * csa, data=sexab)
> anova(lm.sexab1, lm.sexab2)
Analysis of Variance Table

Model 1: ptsd ~ cpa + csa
Model 2: ptsd ~ cpa * csa
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1     73 782.08                          
2     72 774.28  1    7.8069 0.726  0.397
I like that. The F-value is near 1, so we can assume the absence of important interactions. That will make our analysis easier.
> anova(lm.sexab1)                     # sequential tests
Analysis of Variance Table

Response: ptsd
          Df Sum Sq Mean Sq F value    Pr(>F)    
cpa        1 449.80  449.80  41.984 9.462e-09 ***
csa        1 624.03  624.03  58.247 6.907e-11 ***
Residuals 73 782.08   10.71                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(lm.sexab1)

Call:
lm(formula = ptsd ~ cpa + csa, data = sexab)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1567 -2.3643 -0.1533  2.1466  7.1417 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10.2480     0.7187  14.260  < 2e-16 ***
cpa            0.5506     0.1716   3.209  0.00198 ** 
csaNotAbused  -6.2728     0.8219  -7.632 6.91e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.273 on 73 degrees of freedom
Multiple R-squared:  0.5786,	Adjusted R-squared:  0.5671 
F-statistic: 50.12 on 2 and 73 DF,  p-value: 2.002e-14
We have an answer. The estimated coefficient for csaNotAbused is statistically significant. When we include it in the regression equation(s), it will result in different intercepts for the csaAbused and csaNotAbused lines. Furthermore, the difference in intercepts, 6.3 points on a scale with a range of scores of a little more than 22 points, is quite large. The regression equations are:

ptsd-hat(Abused) = 10.2480 + 0.5506 * cpa
ptsd-hat(NotAbused) = 10.2480 + 0.5506 * cpa - 6.2728 = 3.9752 + 0.5506 * cpa

To draw the regression lines on the pre-existing scatterplot...

> abline(a=10.25, b= 0.55, lty="dashed", col="black")
> abline(a=3.98, b=0.55, lty="dotted", col="red")
[Figure: SexAb Scatterplot with the two fitted regression lines added]
It appears that the effects of physical and sexual abuse are additive. One does not make the effects of the other more severe (no interaction), which is in no way to say that they aren't each bad enough to begin with!
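
Incidentally, those lines can be drawn without retyping rounded coefficients by pulling them straight from the fitted model object. (A sketch.)

> cf = coef(lm.sexab1)                 # (Intercept), cpa, csaNotAbused, in that order
> abline(a=cf[1], b=cf[2], lty="dashed", col="black")      # Abused
> abline(a=cf[1]+cf[3], b=cf[2], lty="dotted", col="red")  # NotAbused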

Same Aside

Here are the Type II (nonsequential) tests from the "car" package.

> Anova(lm.sexab1)
Anova Table (Type II tests)

Response: ptsd
          Sum Sq Df F value    Pr(>F)    
cpa       110.30  1  10.295  0.001982 ** 
csa       624.03  1  58.247 6.907e-11 ***
Residuals 782.08 73                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This changes nothing in the regression analysis, as regression effects are not calculated sequentially.

  • Faraway, J. J. (2005). Linear Models with R. Boca Raton, FL: Chapman & Hall/CRC.
  • Rodriguez, N., Ryan, S. W., Vande Kemp, H., & Foy, D. W. (1997). Posttraumatic stress disorder in adult female survivors of childhood sexual abuse: A comparison study. Journal of Consulting and Clinical Psychology, 65, 53-59.

created 2016 February 18