PSYC 480 -- Dr. King
Analysis of Covariance (ANCOVA)

ANCOVA occurs when some of our IVs (predictors, explanatory variables) are
numeric and some are categorical (factors). ANCOVA can be done either as
regression or as ANOVA. We will look at cases where there is one variable of
each type (at least to begin). The numeric variable is called the "numeric
covariate." Oftentimes, the point of including it is that it is "confounded"
with the categorical IV, our real variable of interest, and we want to remove
that confound. At other times, it is just a variable that is correlated with
the DV, and we want to remove that variability from the DV to make the test
more sensitive to the categorical IV. Still other times, we just want to see
the effects of both variables, as we would in a multiple regression.

# tresselt3.txt
# These data were assembled by the Office of Institutional Research at CCU
# and were used by James Tresselt in his Psyc 497 (senior research) project,
# Fall 1995. The data are from a random sample of students who were admitted
# as freshmen in 1990, and who survived their freshman year. The variables are:
# Gender: 0 = female, 1 = male
# SATT: total SAT scores (SATV + SATQ) on the old scoring system
# GPA91: college grade-point average two semesters after admission
# James' interest was in seeing the relationship between SAT and college GPA,
# but is there an interesting difference in this effect by Gender?

> rm(list=ls())
> file = "tresselt3.txt"   # on my Desktop; can also be retrieved from the website
> TR = read.table(file=file, header=T)   # there are no string variables
> dim(TR)
[1] 251   3
> head(TR)
  Gender SATT GPA91
1      0  620 3.692
2      0  870 2.800
3      0  860 2.125
4      1 1070 1.308
5      0 1120 1.400
6      1  740 2.094
> summary(TR)
     Gender            SATT            GPA91      
 Min.   :0.0000   Min.   : 560.0   Min.   :0.071  
 1st Qu.:0.0000   1st Qu.: 770.0   1st Qu.:1.800  
 Median :0.0000   Median : 840.0   Median :2.308  
 Mean   :0.4701   Mean   : 855.5   Mean   :2.392  
 3rd Qu.:1.0000   3rd Qu.: 930.0   3rd Qu.:3.000  
 Max.   :1.0000   Max.   :1290.0   Max.   :4.000  
> cor(TR)
            Gender       SATT      GPA91
Gender  1.00000000 0.06025216 -0.1964826
SATT    0.06025216 1.00000000  0.3330467
GPA91  -0.19648261 0.33304673  1.0000000

> source("http://ww2.coastal.edu/kingw/psyc480/functions/rcrit.R")   # get a function
Error in file(filename, "r", encoding = encoding) : 
  cannot open the connection
> source("datasets_online/functions/rcrit.R")   # don't do this! (my local copy)
> r.crit(df=249, alpha=.05)
      df    alpha   1-tail   2-tail 
249.0000   0.0500   0.1041   0.1239 

SATT-Gender: What kind of correlation? Is it significantly different from zero?
GPA91-Gender: What kind of correlation? Is it significantly different from zero?
GPA91-SATT: What kind of correlation? Is it significantly different from zero?
What does it mean that the GPA91-Gender correlation is negative?

We already know that the following t-test is going to find a significant
difference between mean GPA91 for men and women, because the point-biserial
correlation is significantly different from zero. (It's effectively the same
test.)

> t.test(GPA91 ~ Gender, data=TR, var.eq=T)

	Two Sample t-test

data:  GPA91 by Gender
t = 3.1621, df = 249, p-value = 0.001761
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1226691 0.5278557
sample estimates:
mean in group 0 mean in group 1 
       2.544669        2.219407 

Women had a freshman GPA that was more than 3/10ths of a grade point higher
than that of men, and that difference is statistically significant.
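If you want to convince yourself that the t-test and the point-biserial
correlation really are the same test, here is one quick check (a sketch,
assuming TR is still loaded):

> cor.test(~ GPA91 + Gender, data=TR)   # tests the point-biserial correlation

The t statistic and p-value should match the t-test above (t = 3.1621 apart
from sign, p = 0.001761), because testing r = 0 against a 0/1 variable is
algebraically the same thing as the pooled-variance two-sample t-test.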
Question: Why?

We can't answer that question from the data we have, because our data are
observational, and "why?" is a question about cause and effect. However,
nobody is going to put us in statistics prison if we suggest some
possibilities (with which our statistical result might be "consistent"). We
can state a hypothesis and then say our hypothesis was "supported" or "not
supported", but hypotheses are about the data (means, correlations,
regression coefficients, etc.). Statements like the two below are not about
the data. BE CAREFUL what you say about them based on a statistical analysis!

1) The women were smarter coming in.
2) The women worked harder once they got here.

What we need is a measure of previous smartness. What we have is SATT, which
may not be the best measure of smartness, but it's what we have. If we remove
the effect of SATT ("previous smartness") from GPA91, will the difference
between men and women still exist, and will it still be significant? Let's
get to it!

Note: I checked the interaction. It is not significant, p = 0.990, so I'm
dropping it from the analysis. Any variability that does exist due to an
interaction will go... where?

> lm.out = lm(GPA91 ~ SATT + Gender, data=TR)
> summary(lm.out)

Call:
lm(formula = GPA91 ~ SATT + Gender, data = TR)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.41147 -0.48714 -0.01023  0.58918  1.65192 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.668671   0.323214   2.069 0.039599 *  
SATT         0.002212   0.000373   5.930 1.01e-08 ***
Gender      -0.359788   0.096635  -3.723 0.000244 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7627 on 248 degrees of freedom
Multiple R-squared:  0.158,	Adjusted R-squared:  0.1512 
F-statistic: 23.27 on 2 and 248 DF,  p-value: 5.491e-10

The coefficients: The intercept still has the same meaning. It is the
predicted value of the DV when all the predictors are set to 0. The slope of
the numeric covariate is also still interpreted the same way. It is the
predicted change in the DV when the numeric covariate is changed by 1 and all
other predictors are held constant. When the categorical variable is dummy
coded 0/1, its coefficient can be thought of as the difference in the means
of the two levels after adjusting for the effect of the numeric covariate.

Not only is Gender still significant after "previous smartness" is removed,
the effect has actually gotten a little bigger. (Without SATT in the model,
the coefficient for Gender would have been -0.325262. How do we know that?
You learned this quite a while ago, so you may have to pick at your brain
cells a little to find it.) Now we can see that the adjusted mean difference
is 0.360, and that it favors the group coded 0 (because the coefficient is
negative).

The regression equation is:

GPA91.hat = 0.668671 + 0.002212 * SATT - 0.359788 * Gender

This can be written as separate regression equations, one for each group.

For men:
GPA91.hat = 0.668671 + 0.002212 * SATT - 0.359788 * 1
          = 0.308883 + 0.002212 * SATT

For women:
GPA91.hat = 0.668671 + 0.002212 * SATT - 0.359788 * 0
          = 0.668671 + 0.002212 * SATT

The slope for SATT stays the same, but the intercept is different for men vs.
women. Why is that?

The problem could also have been done as an ANOVA. CAUTION: when doing these
problems using aov(), ALWAYS enter the numeric covariate first! Why?

> aov.out = aov(GPA91 ~ SATT + Gender, data=TR)
> summary(aov.out)
             Df Sum Sq Mean Sq F value   Pr(>F)    
SATT          1  19.01  19.006   32.67 3.12e-08 ***
Gender        1   8.06   8.064   13.86 0.000244 ***
Residuals   248 144.28   0.582                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice the p-value for SATT has changed, but the p-value for Gender has
stayed the same. Why?
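To see for yourself why the covariate must go first, try reversing the order
of entry (a sketch, assuming TR is still loaded):

> summary(aov(GPA91 ~ Gender + SATT, data=TR))   # WRONG order for ANCOVA

aov() is sequential: each term is tested after adjusting only for the terms
entered before it. With Gender entered first, its sum of squares is no longer
adjusted for SATT, so this version does not answer our question about the
Gender difference after "previous smartness" is removed. In the lm() output,
by contrast, every coefficient is adjusted for all the others, so the order
of entry doesn't matter there.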
How large is the Gender effect? (Careful. This is a trick question.) Delta
R-squared. The aov() is a sequential analysis (or a hierarchical analysis).
When the analysis is done this way, it is customary to calculate how much
R-squared changes as each term is entered. These changes are called delta
R-squared.

            Sum Sq    Delta R-squared
SATT         19.01    19.01 / 171.35 = 0.111
Gender        8.06     8.06 / 171.35 = 0.047
Resids      144.28
            ------
SS.Total    171.35

You should notice that the delta R-squared values add up to the multiple
R-squared from the regression analysis using lm(). By this criterion, the
effect of Gender is small (but significant).

A scatterplot:

> plot(GPA91 ~ SATT, data=TR, pch=TR$Gender)
> abline(a=0.308883, b=0.002212)                 # regression line for men
> abline(a=0.668671, b=0.002212, lty="dashed")   # regression line for women
> text(x=1180, y=0.5, "women are squares\nmen are circles")
> title(main="Freshman GPA by Gender, Spring 1991")

AN IMPORTANT CAUTION: You could not do this with a categorical variable that
is coded numerically but has more than two levels. Also, if a dummy coded
categorical variable is not coded 0/1, you have to be a lot more careful with
your interpretations. When James got his data from the Office of
Institutional Research, Gender was coded 0/5. Don't ask me why! Needless to
say, we changed it to a more sensible coding. Some people dummy code their
categorical variables 1/2. That's still not as convenient as 0/1 when it
comes to interpreting your statistical output. 0/1 is preferred.

PRACTICE PROBLEM. Try this one.

> file = "http://ww2.coastal.edu/kingw/psyc480/data/hippocampus1.txt"
> hippo1 = read.table(file=file, header=T, stringsAsFactors=T)
> head(hippo1)
  volume age diagnosis
1    0.8  20    normal
2    2.4  23    normal
3    0.0  27    normal
4    0.6  29    normal
5    0.3  30    normal
6    0.9  31    normal
> summary(hippo1)
     volume             age         diagnosis 
 Min.   :-2.8000   Min.   :19.00   normal:49  
 1st Qu.:-0.8000   1st Qu.:36.00   schizo:50  
 Median :-0.1000   Median :52.00              
 Mean   :-0.1121   Mean   :50.81              
 3rd Qu.: 0.6000   3rd Qu.:66.00              
 Max.   : 2.7000   Max.   :82.00              

These are data that I have "phonied up" a bit to remove an interaction that
was in the original data. We will look at the original data later. The data
are from a study of hippocampal size in normal control subjects vs. patients
diagnosed with schizophrenia. The hippocampus is a structure in the temporal
lobe of the brain that has to do with memory and spatial cognition. It has
long been known to be abnormal in schizophrenia. This study looked at how the
volume of the hippocampus changed with age in both normal controls and
schizophrenics. Note: for some reason, the researchers changed their
hippocampal volumes to z-scores. I'm not sure why, but it will make no
difference in the effects that are seen.

The catch here is that the categorical variable is not dummy coded. It is
coded as a factor. That means cor(hippo1) will fail, because the data frame
is not entirely numeric. It also means you'll get a strange line in the
output of your regression. Instead of getting a line for "diagnosis," you'll
get this line.

diagnosisschizo -0.798155   0.173925  -4.589 1.35e-05 ***

That is the effect on volume of having a diagnosis of schizophrenia. It's
still the difference between the two group means, but here you're being told
specifically that the mean for the schizophrenia group is lower (minus) by
0.798. (In what units?)

It will not change anything in the aov() output. In this case, the delta
R-squared for the categorical variable will be much larger than it is for the
numeric covariate. In fact, it is almost all of multiple R-squared. As a
check on yourself, delta R-squared for diagnosis should come out to 0.1796.
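If you'd like to check your work on the practice problem, here is one way to
run it (a sketch, assuming hippo1 has been read in as above; the model terms
are my assumption about the intended analysis):

> lm.out = lm(volume ~ age + diagnosis, data=hippo1)
> summary(lm.out)                              # look for the diagnosisschizo line
> aov.out = aov(volume ~ age + diagnosis, data=hippo1)   # numeric covariate first!
> ss = summary(aov.out)[[1]][["Sum Sq"]]       # extract the column of sums of squares
> ss / sum(ss)                                 # delta R-squareds, one per term

The last line divides each sum of squares by SS.Total, the same hand
calculation we did for the Tresselt data. The entry for diagnosis should
come out to about 0.1796.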
A SECOND PROBLEM.

# shyness.txt
# Elizabeth Ostop Psyc 497 Spring 2010
# SAD: score on Social Avoidance and Distress Scale (high=more distress)
# Shyness: score on Cheek and Buss Shyness Scale (high=more shyness)
# LOC: score on Rotter's Locus of Control Scale (high=external)
# Age: in years
# Sex: gender coded 0=Female, 1=Male

> file = "http://ww2.coastal.edu/kingw/psyc480/data/shyness.txt"
> shy = read.table(file=file, header=T)
> head(shy)
  SAD Shyness LOC Age Sex
1   5      39  13  22   1
2  14      36   9  23   0
3  16      48  18  21   0
4   1      36   3  19   0
5  13      57  19  21   0
6  13      40  16  20   0

These are Elizabeth Ostop's shyness data, which we've worked with before. Use
Shyness as the DV, SAD as the numeric covariate, and Sex as the categorical
variable. How big is the effect of Sex alone on Shyness? How big is it after
SAD is removed?
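If you need a starting point, the setup looks like this (a sketch only; the
interpretation is up to you):

> summary(lm(Shyness ~ Sex, data=shy))          # effect of Sex alone
> summary(aov(Shyness ~ SAD + Sex, data=shy))   # numeric covariate first, then Sex

Compare Sex's delta R-squared from the second command to its R-squared in the
first, just as we did above.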