PSYC 480 -- Dr. King
Analysis of Covariance (ANCOVA)

ANCOVA occurs when some of our IVs (predictors, explanatory variables) are
numeric and some are categorical (factors). ANCOVA can be done either as
regression or as ANOVA. We will look at cases where there is one variable of
each type (at least to begin). The numeric variable is called the "numeric
covariate." Oftentimes, the point of including it is that it is "confounded"
with the categorical IV, our real variable of interest, and we want to remove
that confound. At other times, it is just a variable that is correlated with
the DV, and we want to remove that variability from the DV to make the test
more sensitive to the categorical IV. Still other times, we just want to see
the effects of both variables, as we would in a multiple regression.

# tresselt3.txt
# These data were assembled by the Office of Institutional Research at CCU
# and were used by James Tresselt in his Psyc 497 (senior research) project,
# Fall 1995. The data are from a random sample of students who were admitted
# as freshmen in 1990, and who survived their freshman year. The variables are:
# Gender: 0 = female, 1 = male
# SATT: total SAT scores (SATV + SATQ) on the old scoring system
# GPA91: college grade-point average two semesters after admission
# James' interest was in seeing the relationship between SAT and college GPA,
# but is there an interesting difference in this effect by Gender?

> rm(list=ls())
> file = "tresselt3.txt"   # on my Desktop; can also be retrieved from the website
> TR = read.table(file=file, header=T)   # there are no string variables
> dim(TR)
[1] 251   3
> head(TR)
  Gender SATT GPA91
1      0  620 3.692
2      0  870 2.800
3      0  860 2.125
4      1 1070 1.308
5      0 1120 1.400
6      1  740 2.094
> summary(TR)
     Gender            SATT            GPA91      
 Min.   :0.0000   Min.   : 560.0   Min.   :0.071  
 1st Qu.:0.0000   1st Qu.: 770.0   1st Qu.:1.800  
 Median :0.0000   Median : 840.0   Median :2.308  
 Mean   :0.4701   Mean   : 855.5   Mean   :2.392  
 3rd Qu.:1.0000   3rd Qu.: 930.0   3rd Qu.:3.000  
 Max.   :1.0000   Max.   :1290.0   Max.   :4.000  
> cor(TR)
            Gender       SATT      GPA91
Gender  1.00000000 0.06025216 -0.1964826
SATT    0.06025216 1.00000000  0.3330467
GPA91  -0.19648261 0.33304673  1.0000000

> source("http://ww2.coastal.edu/kingw/psyc480/functions/rcrit.R")   # get a function
Error in file(filename, "r", encoding = encoding) : 
  cannot open the connection
> source("datasets_online/functions/rcrit.R")   # don't do this! (my local copy)
> r.crit(df=249, alpha=.05)
      df    alpha   1-tail   2-tail 
249.0000   0.0500   0.1041   0.1239 

SATT-Gender: What kind of correlation? Is it significantly different from zero?
GPA91-Gender: What kind of correlation? Is it significantly different from zero?
GPA91-SATT: What kind of correlation? Is it significantly different from zero?
What does it mean that the GPA91-Gender correlation is negative?

We already know that the following t-test is going to find a significant
difference between mean GPA91 for men and women, because the point-biserial
correlation is significantly different from zero. (It's effectively the same
test.)

> t.test(GPA91 ~ Gender, data=TR, var.eq=T)

	Two Sample t-test

data:  GPA91 by Gender
t = 3.1621, df = 249, p-value = 0.001761
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1226691 0.5278557
sample estimates:
mean in group 0 mean in group 1 
       2.544669        2.219407 

Women had a freshman GPA that was more than 3/10ths of a grade point higher
than that of men, and that difference is statistically significant.
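If you want to convince yourself that the t-test and the point-biserial
correlation really are the same test, here is one quick check (a sketch,
assuming TR is still loaded):

> cor.test(~ GPA91 + Gender, data=TR)   # tests the point-biserial correlation

The t statistic and p-value should match the t-test above (t = 3.1621 apart
from sign, p = 0.001761), because testing r = 0 against a 0/1 variable is
algebraically the same thing as the pooled-variance two-sample t-test.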
Question: Why?

We can't answer that question from the data we have, because our data are
observational, and "why?" is a question about cause and effect. However,
nobody is going to put us in statistics prison if we suggest some
possibilities (with which our statistical result might be "consistent"). We
can state a hypothesis and then say our hypothesis was "supported" or "not
supported", but hypotheses are about the data (means, correlations,
regression coefficients, etc.). Statements like the two below are not about
the data. BE CAREFUL what you say about them based on a statistical analysis!

1) The women were smarter coming in.
2) The women worked harder once they got here.

What we need is a measure of previous smartness. What we have is SATT, which
may not be the best measure of smartness, but it's what we have. If we remove
the effect of SATT ("previous smartness") from GPA91, will the difference
between men and women still exist, and will it still be significant? Let's
get to it!

Note: I checked the interaction. It is not significant, p = 0.990, so I'm
dropping it from the analysis. Any variability that does exist due to an
interaction will go... where?

> lm.out = lm(GPA91 ~ SATT + Gender, data=TR)
> summary(lm.out)

Call:
lm(formula = GPA91 ~ SATT + Gender, data = TR)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.41147 -0.48714 -0.01023  0.58918  1.65192 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.668671   0.323214   2.069 0.039599 *  
SATT         0.002212   0.000373   5.930 1.01e-08 ***
Gender      -0.359788   0.096635  -3.723 0.000244 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7627 on 248 degrees of freedom
Multiple R-squared:  0.158,	Adjusted R-squared:  0.1512 
F-statistic: 23.27 on 2 and 248 DF,  p-value: 5.491e-10

The coefficients: The intercept still has the same meaning. It is the
predicted value of the DV when all the predictors are set to 0. The slope of
the numeric covariate is also still interpreted the same way. It is the
predicted change in the DV when the numeric covariate is changed by 1 and all
other predictors are held constant. When the categorical variable is dummy
coded 0/1, its coefficient can be thought of as the difference in the means
of the two levels after adjusting for the effect of the numeric covariate.

Not only is Gender still significant after "previous smartness" is removed,
the effect has actually gotten a little bigger. (Without SATT in the model,
the coefficient for Gender would have been -0.325262. How do we know that?
You learned this quite a while ago, so you may have to pick at your brain
cells a little to find it.) Now we can see that the adjusted mean difference
is 0.360, and that it favors the group coded 0 (because the coefficient is
negative).

The regression equation is:

GPA91.hat = 0.668671 + 0.002212 * SATT - 0.359788 * Gender

This can be written as separate regression equations, one for each group.

For men:
GPA91.hat = 0.668671 + 0.002212 * SATT - 0.359788 * 1
          = 0.308883 + 0.002212 * SATT

For women:
GPA91.hat = 0.668671 + 0.002212 * SATT - 0.359788 * 0
          = 0.668671 + 0.002212 * SATT

The slope for SATT stays the same, but the intercept is different for men vs.
women. Why is that?

The problem could also have been done as an ANOVA. CAUTION: when doing these
problems using aov(), ALWAYS enter the numeric covariate first! Why?

> aov.out = aov(GPA91 ~ SATT + Gender, data=TR)
> summary(aov.out)
             Df Sum Sq Mean Sq F value   Pr(>F)    
SATT          1  19.01  19.006   32.67 3.12e-08 ***
Gender        1   8.06   8.064   13.86 0.000244 ***
Residuals   248 144.28   0.582                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice the p-value for SATT has changed, but the p-value for Gender has
stayed the same. Why?
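To see for yourself why the covariate must go first, try reversing the order
of entry (a sketch, assuming TR is still loaded):

> summary(aov(GPA91 ~ Gender + SATT, data=TR))   # WRONG order for ANCOVA

aov() is sequential: each term is tested after adjusting only for the terms
entered before it. With Gender entered first, its sum of squares is no longer
adjusted for SATT, so this version does not answer our question about the
Gender difference after "previous smartness" is removed. In the lm() output,
by contrast, every coefficient is adjusted for all the others, so the order
of entry doesn't matter there.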
How large is the Gender effect? (Careful. This is a trick question.) Delta
R-squared. The aov() is a sequential analysis (or a hierarchical analysis).
When the analysis is done this way, it is customary to calculate how much
R-squared changes as each term is entered. These changes are called delta
R-squared.

            Sum Sq    Delta R-squared
SATT         19.01    19.01 / 171.35 = 0.111
Gender        8.06     8.06 / 171.35 = 0.047
Resids      144.28
            ------
SS.Total    171.35

You should notice that the delta R-squared values add up to the multiple
R-squared from the regression analysis using lm(). By this criterion, the
effect of Gender is small (but significant).

A scatterplot:

> plot(GPA91 ~ SATT, data=TR, pch=TR$Gender)
> abline(a=0.308883, b=0.002212)                 # regression line for men
> abline(a=0.668671, b=0.002212, lty="dashed")   # regression line for women
> text(x=1180, y=0.5, "women are squares\nmen are circles")
> title(main="Freshman GPA by Gender, Spring 1991")

AN IMPORTANT CAUTION: You could not do this with a categorical variable that
is coded numerically but has more than two levels. Also, if a dummy coded
categorical variable is not coded 0/1, you have to be a lot more careful with
your interpretations. When James got his data from the Office of
Institutional Research, Gender was coded 0/5. Don't ask me why! Needless to
say, we changed it to a more sensible coding. Some people dummy code their
categorical variables 1/2. That's still not as convenient as 0/1 when it
comes to interpreting your statistical output. 0/1 is preferred.

PRACTICE PROBLEM. Try this one.

> file = "http://ww2.coastal.edu/kingw/psyc480/data/hippocampus1.txt"
> hippo1 = read.table(file=file, header=T, stringsAsFactors=T)
> head(hippo1)
  volume age diagnosis
1    0.8  20    normal
2    2.4  23    normal
3    0.0  27    normal
4    0.6  29    normal
5    0.3  30    normal
6    0.9  31    normal
> summary(hippo1)
     volume             age         diagnosis 
 Min.   :-2.8000   Min.   :19.00   normal:49  
 1st Qu.:-0.8000   1st Qu.:36.00   schizo:50  
 Median :-0.1000   Median :52.00              
 Mean   :-0.1121   Mean   :50.81              
 3rd Qu.: 0.6000   3rd Qu.:66.00              
 Max.   : 2.7000   Max.   :82.00              

These are data that I have "phonied up" a bit to remove an interaction that
was in the original data. We will look at the original data later. The data
are from a study of hippocampal size in normal control subjects vs. patients
diagnosed with schizophrenia. The hippocampus is a structure in the temporal
lobe of the brain that has to do with memory and spatial cognition. It has
long been known to be abnormal in schizophrenia. This study looked at how the
volume of the hippocampus changed with age in both normal controls and
schizophrenics. Note: for some reason, the researchers changed their
hippocampal volumes to z-scores. I'm not sure why, but it will make no
difference in the effects that are seen.

The catch here is that the categorical variable is not dummy coded. It is
coded as a factor. That means cor(hippo1) will fail, because the data frame
is not entirely numeric. It also means you'll get a strange line in the
output of your regression. Instead of getting a line for "diagnosis," you'll
get this line.

diagnosisschizo -0.798155   0.173925  -4.589 1.35e-05 ***

That is the effect on volume of having a diagnosis of schizophrenia. It's
still the difference between the two group means, but here you're being told
specifically that the mean for the schizophrenia group is lower (minus) by
0.798. (In what units?)

It will not change anything in the aov() output. In this case, the delta
R-squared for the categorical variable will be much larger than it is for the
numeric covariate. In fact, it is almost all of multiple R-squared. As a
check on yourself, delta R-squared for diagnosis should come out to 0.1796.
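If you'd like to check your work on the practice problem, here is one way to
run it (a sketch, assuming hippo1 has been read in as above; the model terms
are my assumption about the intended analysis):

> lm.out = lm(volume ~ age + diagnosis, data=hippo1)
> summary(lm.out)                              # look for the diagnosisschizo line
> aov.out = aov(volume ~ age + diagnosis, data=hippo1)   # numeric covariate first!
> ss = summary(aov.out)[[1]][["Sum Sq"]]       # extract the column of sums of squares
> ss / sum(ss)                                 # delta R-squareds, one per term

The last line divides each sum of squares by SS.Total, the same hand
calculation we did for the Tresselt data. The entry for diagnosis should
come out to about 0.1796.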
A SECOND PROBLEM.

# shyness.txt
# Elizabeth Ostop Psyc 497 Spring 2010
# SAD: score on Social Avoidance and Distress Scale (high=more distress)
# Shyness: score on Cheek and Buss Shyness Scale (high=more shyness)
# LOC: score on Rotter's Locus of Control Scale (high=external)
# Age: in years
# Sex: gender coded 0=Female, 1=Male

> file = "http://ww2.coastal.edu/kingw/psyc480/data/shyness.txt"
> shy = read.table(file=file, header=T)
> head(shy)
  SAD Shyness LOC Age Sex
1   5      39  13  22   1
2  14      36   9  23   0
3  16      48  18  21   0
4   1      36   3  19   0
5  13      57  19  21   0
6  13      40  16  20   0

These are Elizabeth Ostop's shyness data, which we've worked with before. Use
Shyness as the DV, SAD as the numeric covariate, and Sex as the categorical
variable. How big is the effect of Sex alone on Shyness? How big is it after
SAD is removed?
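If you need a starting point, the setup looks like this (a sketch only; the
interpretation is up to you):

> summary(lm(Shyness ~ Sex, data=shy))          # effect of Sex alone
> summary(aov(Shyness ~ SAD + Sex, data=shy))   # numeric covariate first, then Sex

Compare Sex's delta R-squared from the second command to its R-squared in the
first, just as we did above.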