ANALYSIS OF COVARIANCE

Some people view ANCOVA as a multiple regression technique. Others see it as an offshoot of ANOVA. In R, ANCOVA can be done with equal success using either the lm() function or the aov() function. I find it easier to treat ANCOVA as regression, so that's what I'll do. Everything is regression, after all.

The previous tutorial was called Multiple Regression, and there we treated only cases in which all the explanatory, or predictor, variables were numeric. It's possible to include categorical predictors (IVs) as well, provided they are coded correctly. In fact, if all we have are categorical IVs, then we are essentially doing ANOVA. If there is a mix of numeric and categorical IVs, then we are in the realm of ANCOVA.

Gas Mileage

A car with an eight cylinder engine... What teenaged boy hasn't dreamed of
owning one? Well, get one now before they become extinct, because they are
hard on gasoline! Of course, cars with big engines also tend to weigh more than
cars with smaller engines, so maybe it's not the number of cylinders in the
engine so much as it is the weight of the car that costs us at the gas pump. I
wonder how we might tease those two variables apart.
> names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
> cars = mtcars[,c(1,2,6)]
> cars$cyl = factor(cars$cyl)
> summary(cars)
      mpg          cyl          wt       
 Min.   :10.40   4:11   Min.   :1.513  
 1st Qu.:15.43   6: 7   1st Qu.:2.581  
 Median :19.20   8:14   Median :3.325  
 Mean   :20.09          Mean   :3.217  
 3rd Qu.:22.80          3rd Qu.:3.610  
 Max.   :33.90          Max.   :5.424  

The "mtcars" data set contains information on cars that were road-tested by Motor Trend magazine in 1974 (or road tests of which were published in 1974). That's a long time ago, unless you're me, and you're still dreaming of that 1967 Mustang! It's the data we have, so we'll go with it. You can update the data on your own time if you want to see whether relationships we discover in the 1974 data still hold today.

We have extracted three columns from that data frame, namely, mpg (gas mileage in miles per U.S. gallon), cyl (the number of cylinders in the engine), and wt (weight in 1000s of pounds). The "cyl" variable has sensibly been declared a factor. Let's begin by looking at some graphs.
> par(mfrow=c(1,3))
> boxplot(mpg ~ cyl, data=cars, xlab="cylinders", ylab="mpg")
> boxplot(wt ~ cyl, data=cars, xlab="cylinders", ylab="weight")
> plot(mpg ~ wt, data=cars)

An analysis of variance will no doubt show us in p-values what we can already plainly see from looking at the boxplots.
> lm.out1 = lm(mpg ~ cyl, data=cars)    # aov() works the same as lm() internally
> anova(lm.out1)                        # same as summary.aov(lm.out1)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
cyl        2 824.78  412.39  39.697 4.979e-09 ***
Residuals 29 301.26   10.39                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(lm.out1)

Call:
lm(formula = mpg ~ cyl, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2636 -1.8357  0.0286  1.3893  7.2364 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
cyl6         -6.9208     1.5583  -4.441 0.000119 ***
cyl8        -11.5636     1.2986  -8.905 8.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325,    Adjusted R-squared:  0.714 
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09

No surprises in the ANOVA output. The interesting stuff is in the regression output. The (estimated) coefficients relate to the means of "cyl". The base, or reference, level of "cyl" is the one not listed with a coefficient in the regression output, namely cyl4. The intercept (26.66) is the mean gas mileage for that level of "cyl". The coefficient for cyl6 is -6.92, which means that six cylinder cars get (got) 6.92 mpg less than four cylinder cars, on the average. Eight cylinder cars got 11.56 mpg less than four cylinder cars. The significance tests on these coefficients are equivalent to Fisher LSD tests on the differences in means. These figures are easily enough confirmed. (Important note: this interpretation of the coefficients depends upon having contrasts for unordered factors set to "treatment". Do options("contrasts") to check.)

> with(cars, tapply(mpg, cyl, mean))
       4        6        8 
26.66364 19.74286 15.10000 

The effect size is R^2 = .7325. In the language of ANOVA, that statistic would be called eta-squared. It is the "proportion of explained variation" in mpg accounted for by the explanatory variable.
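That eta-squared can be recovered directly from the sums of squares in the ANOVA table. Here is a minimal sketch, using nothing beyond base R and the mtcars extract created above:

```r
# Recreate the extract and the one-way model from above
cars <- mtcars[, c("mpg", "cyl", "wt")]
cars$cyl <- factor(cars$cyl)
lm.out1 <- lm(mpg ~ cyl, data = cars)

# eta-squared = SS(cyl) / total SS = SS(cyl) / (SS(cyl) + SS(residual))
ss <- anova(lm.out1)[["Sum Sq"]]
ss[1] / sum(ss)                 # about 0.7325, matching Multiple R-squared
```

With a single explanatory variable, eta-squared and the multiple R-squared are the same number, which is why the two values agree.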
I don't think there's much doubt of a cause-and-effect relationship in this case (well, maybe a wee bit), but that does not mean the number of cylinders by itself explains the whole 73% of the variability in gas mileage. A more accurate statement would be: the number of cylinders, along with every other variable that is confounded with it, accounts for 73% of the variability in gas mileage. The last line of the regression output duplicates the ANOVA output, because there is only one explanatory variable at this point.

We think one of the variables confounded with cylinders is weight of the
vehicle. In fact, we think once those gas mileages are adjusted for the weight
of the vehicle, the differences in the means by number of cylinders will be a
lot less impressive. Dare we imagine that the differences might vanish
altogether? I don't think so. But, let's see. Now for the analysis of covariance
part.
> lm.out2 = lm(mpg ~ wt + cyl, data=cars)
> anova(lm.out2)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq  F value    Pr(>F)    
wt         1 847.73  847.73 129.6650 5.079e-12 ***
cyl        2  95.26   47.63   7.2856  0.002835 ** 
Residuals 28 183.06    6.54                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(lm.out2)

Call:
lm(formula = mpg ~ wt + cyl, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5890 -1.2357 -0.5159  1.3845  5.7915 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
wt           -3.2056     0.7539  -4.252 0.000213 ***
cyl6         -4.2556     1.3861  -3.070 0.004718 ** 
cyl8         -6.0709     1.6523  -3.674 0.000999 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.557 on 28 degrees of freedom
Multiple R-squared:  0.8374,    Adjusted R-squared:  0.82 
F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11

In this model, we have added "wt" (weight), a numeric variable that we think (i.e., know perfectly well in this case) is related to the DV, as a covariate. Ordinarily, at least in the social sciences, we would do such a thing because we think the covariate will help discriminate among levels of the categorical IV once the confound with the covariate is removed. Here we believe just the opposite is going to occur. We're not so terribly interested in the relationship between weight and gas mileage. We just want to get it out of the mix, so that we can see a "purer" relationship between cylinders and gas mileage.

Two very important points need to be made at this juncture. First, notice that we didn't ask to see any interactions between "wt" and "cyl". That means we're assuming there aren't any--a common assumption in ANCOVA, but also an assumption that is commonly incorrect. It is worth checking. Second, in R, the ANOVA tests are sequential. That means if we want weight taken out before we look at the effect of cylinders, we MUST enter it first in the model formula.
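To see how much the order of entry matters, compare the sequential ANOVA tables for the two possible orderings. This sketch uses only the mtcars extract from above:

```r
# Same extract as above
cars <- mtcars[, c("mpg", "cyl", "wt")]
cars$cyl <- factor(cars$cyl)

# Covariate entered first: cyl is tested after adjusting for wt
anova(lm(mpg ~ wt + cyl, data = cars))

# Covariate entered last: cyl gets all the credit for the variability
# it shares with wt (its SS jumps from about 95.3 back up to 824.8,
# the value from the one-way ANOVA)
anova(lm(mpg ~ cyl + wt, data = cars))
```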
This won't make any difference in the regression output, but it makes a BIG difference in the ANOVA output! Overall, we are now accounting for R^{2} = .8374, or 84% of the
total variability in gas mileage. However, notice that the part of the
explained variability that we can attribute to "cyl" (eta-squared for "cyl")
is now only...
> 95.26 / (847.73 + 95.26 + 183.06)
[1] 0.0845966

...or about 8%. The mighty have, indeed, fallen, but not so far that "cyl" is no longer a statistically significant effect. Are there other confounds with "cyl" that might remove that last 8%? I'll leave it up to you to look for them.

We can now calculate adjusted means of "cyl" by level. This is done using the regression equation, and we have a different regression equation for each of the three levels of "cyl".

mpg-hat = 33.9908 - 3.2056 * wt            for cyl4
mpg-hat = 33.9908 - 3.2056 * wt - 4.2556   for cyl6
mpg-hat = 33.9908 - 3.2056 * wt - 6.0709   for cyl8

The terms for the specific levels of "cyl" can be combined with, and thus will change, the intercept of the regression line--and usually they are--but I have left them separated here to illustrate what's happening. To get the adjusted means for the levels of "cyl", we fill in the overall average of "wt" and solve each of these regression equations.

> mean(cars$wt)
[1] 3.21725
> 33.9908 - 3.2056 * mean(cars$wt)            # adjusted mean for cyl4
[1] 23.67758
> 33.9908 - 3.2056 * mean(cars$wt) - 4.2556   # adjusted mean for cyl6
[1] 19.42198
> 33.9908 - 3.2056 * mean(cars$wt) - 6.0709   # adjusted mean for cyl8
[1] 17.60668

Notice that the means are still different--and significantly different, as we can tell from the ANOVA--but not so different as they were before we took out the effect of the weight of the car on gas mileage. Here's a nice way to graph these results.
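(Incidentally, the same adjusted means can be had in one step from predict(), which spares us the hand arithmetic. A sketch, again using only the objects defined above:)

```r
# Rebuild the ANCOVA model from above
cars <- mtcars[, c("mpg", "cyl", "wt")]
cars$cyl <- factor(cars$cyl)
lm.out2 <- lm(mpg ~ wt + cyl, data = cars)

# Adjusted means: predicted mpg at each cyl level, holding wt at its overall mean
newdat <- data.frame(wt = mean(cars$wt), cyl = factor(c("4", "6", "8")))
predict(lm.out2, newdata = newdat)    # about 23.68, 19.42, 17.61
```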
> plot(mpg ~ wt, data=cars, pch=as.numeric(cars$cyl), xlab="weight of car in 1000s of pounds")
> abline(a=33.99, b=-3.21, lty="solid")
> abline(a=33.99-4.26, b=-3.21, lty="dashed")
> abline(a=33.99-6.07, b=-3.21, lty="dotted")

The circles and solid line are for the four-cylinder cars, the triangles and dashed line for the six-cylinder cars, and the plus symbols and dotted line for the eight-cylinder cars. I'll refer you to the More Graphics tutorial for instructions on how to add all that information to the graph.

Two things distress me here that I feel compelled to comment on. First,
notice there is very little overlap on the scatterplot in the weight of 4- vs.
6- vs. 8-cylinder cars. This makes the calculation of adjusted means just a
little bit dicey. Also, while I don't detect anything obvious on the graph, I'm
still uncomfortable that we haven't checked for the possibility of interactions
between "wt" and "cyl".
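The overlap worry is easy to quantify. A quick look at the range of weights within each cylinder group (base R only, using the extract from above) shows how little the groups share:

```r
# Same extract as above
cars <- mtcars[, c("mpg", "cyl", "wt")]
cars$cyl <- factor(cars$cyl)

# Range of car weights (in 1000s of pounds) within each cylinder group;
# the 4-cylinder and 8-cylinder ranges barely overlap at all
with(cars, tapply(wt, cyl, range))
```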
> lm.out3 = lm(mpg ~ wt * cyl, data=cars)
> anova(lm.out2, lm.out3)
Analysis of Variance Table

Model 1: mpg ~ wt + cyl
Model 2: mpg ~ wt * cyl
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     28 183.06                           
2     26 155.89  2     27.17 2.2658 0.1239

Interactions make no significant contribution to the model, but I'd be happier if that F-value were closer to 1. (Note: this method of comparing models is explained in the Multiple Regression tutorial.)

An Aside

Above I noted that, in R, ANOVA is done sequentially, and that makes it very important to get the covariate into the model formula first. If you don't care for sequential sums of squares, you can try this, provided you have installed the optional "car" package from CRAN. (See Package Management.)
> library("car")    # this will give an error if "car" is not installed
> Anova(lm.out2)    # notice the upper case A in Anova
Anova Table (Type II tests)

Response: mpg
           Sum Sq Df F value    Pr(>F)    
wt        118.204  1 18.0801  0.000213 ***
cyl        95.263  2  7.2856  0.002835 ** 
Residuals 183.059 28                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These are so-called Type II sums of squares, and they are not calculated sequentially. I would hesitate to calculate eta-squareds from this table unless I got the total SS by some method other than adding up the SSs in this table. (See the tutorial on Unbalanced Designs for a discussion.)

Childhood Abuse and Adult Post-Traumatic Stress Disorder

The data for this section of the tutorial are discussed in Faraway (2005). To get the data, you can either install the "faraway" package, or you can go to Dr. Faraway's website and download the data files that accompany his book. The files will download as "jfdata.zip", which will unzip into a folder called "ascdata". Rename this folder "Faraway" and drop it into your working directory (Rspace). If you have installed the "faraway" package, the commands to get the data
are:
> library("faraway")
> data(sexab)

If you have downloaded the Faraway folder to your working directory, then do this:

> sexab = read.table("Faraway/sexab.txt", header=T, row.names=1)

Either way...

> summary(sexab)
      cpa               ptsd              csa    
 Min.   :-3.1204   Min.   :-3.349   Abused   :45  
 1st Qu.: 0.8321   1st Qu.: 6.173   NotAbused:31  
 Median : 2.0707   Median : 8.909                 
 Mean   : 2.3547   Mean   : 8.986                 
 3rd Qu.: 3.7387   3rd Qu.:12.240                 
 Max.   : 8.6469   Max.   :18.993                 

The data are from a study by Rodriquez, et al. (1997). The subjects were adult women who were being treated at a clinic, some of whom self-reported childhood sexual abuse (csa=Abused), and some of whom did not (csa=NotAbused). All of the women filled out two standardized scales, one assessing the severity of childhood physical abuse (cpa) and one assessing the severity of adult post-traumatic stress disorder (ptsd). On both of these scales, a higher score indicates a worse outcome.

The question we will attempt to answer is: How are sexual and physical abuse that occur in childhood related to the severity of PTSD in adulthood, and how does the effect of physical abuse differ for women who were or were not sexually abused as children?

When I discuss this example in my class, I always start off by having
students examine the following scatterplot. (DON'T close this graphics window.
We will be drawing regression lines on this graph eventually.)
> plot(ptsd ~ cpa, data=sexab, pch=as.numeric(csa), col=as.numeric(csa))
First, I propose we check for the importance of interactions.
> lm.sexab1 = lm(ptsd ~ cpa + csa, data=sexab)
> lm.sexab2 = lm(ptsd ~ cpa * csa, data=sexab)
> anova(lm.sexab1, lm.sexab2)
Analysis of Variance Table

Model 1: ptsd ~ cpa + csa
Model 2: ptsd ~ cpa * csa
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1     73 782.08                          
2     72 774.28  1    7.8069 0.726  0.397

I like that. The F-value is near 1, so we can assume the absence of important interactions. That will make our analysis easier.

> anova(lm.sexab1)    # sequential tests
Analysis of Variance Table

Response: ptsd
          Df Sum Sq Mean Sq F value    Pr(>F)    
cpa        1 449.80  449.80  41.984 9.462e-09 ***
csa        1 624.03  624.03  58.247 6.907e-11 ***
Residuals 73 782.08   10.71                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(lm.sexab1)

Call:
lm(formula = ptsd ~ cpa + csa, data = sexab)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1567 -2.3643 -0.1533  2.1466  7.1417 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10.2480     0.7187  14.260  < 2e-16 ***
cpa            0.5506     0.1716   3.209  0.00198 ** 
csaNotAbused  -6.2728     0.8219  -7.632 6.91e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.273 on 73 degrees of freedom
Multiple R-squared:  0.5786,    Adjusted R-squared:  0.5671 
F-statistic: 50.12 on 2 and 73 DF,  p-value: 2.002e-14

We have an answer. The estimated coefficient for csaNotAbused is statistically significant. When we include it in the regression equation(s), it will result in different intercepts for the csaAbused and csaNotAbused lines. Furthermore, the difference in intercepts--6.3 points on a scale with a range of scores of a little more than 22 points--is quite large. The regression equations are:

ptsd-hat(Abused)    = 10.2480 + 0.5506 * cpa
ptsd-hat(NotAbused) = 10.2480 + 0.5506 * cpa - 6.2728 = 3.9752 + 0.5506 * cpa

To draw the regression lines on the pre-existing scatterplot...
> abline(a=10.25, b=0.55, lty="dashed", col="black")
> abline(a=3.98, b=0.55, lty="dotted", col="red")

Here are the Type II (nonsequential) tests from the "car" package.

> Anova(lm.sexab1)
Anova Table (Type II tests)

Response: ptsd
          Sum Sq Df F value    Pr(>F)    
cpa       110.30  1  10.295  0.001982 ** 
csa       624.03  1  58.247 6.907e-11 ***
Residuals 782.08 73                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This changes nothing in the regression analysis, as regression effects are not calculated sequentially.
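As in the mtcars example, eta-squareds can be computed from the sequential sums of squares. A quick sketch using the values from the sequential ANOVA table shown earlier:

```r
# Sums of squares from the sequential ANOVA of ptsd ~ cpa + csa
ss.cpa <- 449.80
ss.csa <- 624.03
ss.res <- 782.08
ss.tot <- ss.cpa + ss.csa + ss.res

ss.csa / ss.tot                # eta-squared for csa, about 0.336
(ss.cpa + ss.csa) / ss.tot     # total proportion explained, .5786,
                               # matching Multiple R-squared
```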
created 2016 February 18