As I've mentioned previously, ANOVA is just a special case of regression analysis, the case in which all the IVs are categorical. Since we will need to know how to deal with categorical IVs when we do regression analysis, I've prepared what I hope will be a brief, but IMPORTANT, lesson.
We will begin by looking at the marijuana data from Scott Keats' Psyc 497 project, which we've seen before. Recall that the IV was whether or not the subject uses marijuana, and the DV was number correct on the digit span test.
> rm(list=ls())   # NOTE: this will clear your workspace; DON'T do it if you
>                 # don't want your workspace cleared; use rm() on individual
>                 # objects instead
> file = "http://ww2.coastal.edu/kingw/psyc480/data/marijuana.txt"
> MJ = read.table(file=file, header=T, stringsAsFactors=T)
> summary(MJ)     # output not shown; look at your screen
Fill in the following table of descriptive statistics.
            nonsmokers     smokers
   mean     __________    __________
   SD       __________    __________
   n        __________    __________
Can you calculate the pooled standard deviation? Hint: SS = sd^2 * df
sd.pooled =
Can you calculate the standard error of the difference between the means? Hint: Come on! Okay, se.meandiff = sd.pooled * sqrt(1/n1 + 1/n2)
se.meandiff =
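If you get stuck, here is a sketch of the calculation in R, assuming MJ has been loaded as above. (The tapply() calls just get the group SDs and sample sizes for you; you can also take them straight from your table.)

```r
# pooled SD and standard error of the mean difference, two groups
sds = with(MJ, tapply(score, group, sd))      # group standard deviations
ns  = with(MJ, tapply(score, group, length))  # group sample sizes
dfs = ns - 1
SS  = sds^2 * dfs                             # SS = sd^2 * df, as in the hint
sd.pooled = sqrt(sum(SS) / sum(dfs))
se.meandiff = sd.pooled * sqrt(1/ns[1] + 1/ns[2])
sd.pooled
se.meandiff
```

If your arithmetic is right, these should agree with numbers you'll see again shortly in the regression output.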
Two levels of the independent variable means a t-test, so let's have a look at that, and then we will do the same thing by regression.
> t.test(score ~ group, data=MJ, var.eq=T)   # pooled variance t-test

        Two Sample t-test

data:  score by group
t = 2.6659, df = 18, p-value = 0.01575
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6145934 5.1854066
sample estimates:
mean in group nonsmoker    mean in group smoker 
                   20.1                    17.2 
You're familiar with all of that, of course. Now let's see how much of it regression gives us. The function for regression is lm(), which has the same syntax as aov().
> lm.out = lm(score ~ group, data=MJ)
> summary(lm.out)

Call:
lm(formula = score ~ group, data = MJ)

Residuals:
   Min     1Q Median     3Q    Max 
-4.200 -2.125  0.350  1.900  3.800 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20.1000     0.7692  26.131 9.13e-16 ***
groupsmoker  -2.9000     1.0878  -2.666   0.0158 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.432 on 18 degrees of freedom
Multiple R-squared:  0.2831,    Adjusted R-squared:  0.2432 
F-statistic: 7.107 on 1 and 18 DF,  p-value: 0.01575
We will skip over the Residuals information for now. In this context, Residuals are how much the individual scores differ from their group means.
Now we'll look at the Coefficients table, and I'm going to let you figure most of it out, since these numbers are things you've already calculated.
The value in the Estimate column for (Intercept) is 20.1. What is this?
The value in the Estimate column for groupsmoker is -2.9. What is this? (A little bit harder, but keep at it.)
When R does regression with grouped data, it will pick one of the groups as the baseline group. Since we didn't tell it to do anything else, it chose the one that comes first alphabetically, nonsmoker. The mean for that group is 20.1, and the t-test that follows on that line is a single-sample t-test testing the null hypothesis mu = 0, which is meaningless in this example.
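If you're curious how R chose and coded the baseline, you can inspect the factor directly. A quick check, assuming MJ is loaded as above:

```r
# R uses treatment ("dummy") coding by default: the first level in
# alphabetical order becomes the baseline group.
levels(MJ$group)       # "nonsmoker" is listed first, so it is the baseline
contrasts(MJ$group)    # smokers coded 1, nonsmokers coded 0
```

That 0/1 coding is why the intercept is the nonsmoker mean: plugging in 0 for the dummy variable leaves only the intercept.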
The next line in the table is for the group that wasn't chosen as the baseline group, in this case groupsmoker. Its mean is 2.9 points LESS than the mean of the baseline group, or meandiff = -2.9. In other words, the regression is telling us that being a marijuana smoker costs you 2.9 points off your expected (or mean) digit span score.
The next number on this line is 1.0878. What is this?
That is followed by the t-test, which is an independent samples, pooled variance t-test, just like the one we did above. It got the same t-value (except negative because the subtraction in the numerator was in the opposite direction), and notice, with some rounding difference, the p-value is the same. There is your t-test.
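In fact, the t value on that line is nothing more than the Estimate divided by its Std. Error, which you can verify from the numbers already in the table:

```r
# t value for groupsmoker = Estimate / Std. Error
-2.9000 / 1.0878   # about -2.666, matching the t-test above (sign reversed)
```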
Now look at the value labeled Residual standard error. What is this?
So now that you have the difference in the means, and you have the pooled
standard deviation, can you quickly get a value for Cohen's d?
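A sketch, using the values from the regression output above (recall Cohen's d is the mean difference divided by the pooled SD):

```r
# Cohen's d from the regression output; the Residual standard error
# serves as the pooled SD here
d = 2.9 / 2.432
d   # about 1.19
```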
Multiple R-squared we'll skip over for now, but for the painfully curious, it is a measure of effect size related to eta-squared. The ANOVA result on the last line is the t-test done over again, but this time as an analysis of variance. Notice that the p-value is the same, and F = t^2.
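You can check the F = t^2 relationship yourself from the values in the output:

```r
# squaring the t value from the coefficients table gives the F-statistic
2.6659^2   # about 7.107, the F on the last line of the summary
```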
Now on to an ANOVA example. We will use Tom Prin's firemen data for this example. Recall that we have Rotter Locus of Control scores for firemen who have been evaluated with a rating that gives their likelihood of engaging in "meritorious" behavior, like rushing into a burning building to rescue someone.
> ls()
[1] "getData" "lm.out"  "MJ"      "SS"
> rm(lm.out, MJ)
> getData("firemen.txt")   # or do it the hard way if you don't still have this function
> fire = X
> rm(X)
> summary(fire)
     Rotter             Area    Risk  
 Min.   : 4.00   Charleston:25   A:26  
 1st Qu.: 8.00   Horry     :25   B:33  
 Median :10.00   NYC       :25   C:16  
 Mean   : 9.96                         
 3rd Qu.:12.00                         
 Max.   :16.00                         
Put descriptive statistics in the following table.
            risk A      risk B      risk C
   mean     ______      ______      ______
   SD       ______      ______      ______
   n        ______      ______      ______
Can you calculate the pooled standard deviation?
sd.pooled =
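The same SS = sd^2 * df trick works with three groups: pool the sums of squares and the degrees of freedom across all the groups. A sketch, assuming fire is loaded as above:

```r
# pooled SD generalized to any number of groups
sds = with(fire, tapply(Rotter, Risk, sd))      # group SDs
ns  = with(fire, tapply(Rotter, Risk, length))  # group sample sizes
sd.pooled = sqrt(sum(sds^2 * (ns-1)) / sum(ns-1))
sd.pooled
```

This is also the square root of the error mean square you're about to see in the ANOVA table.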
Now for the ANOVA.
> aov.out = aov(Rotter ~ Risk, data=fire)
> summary(aov.out)
            Df Sum Sq Mean Sq F value Pr(>F)  
Risk         2   47.5  23.733   2.919 0.0604 .
Residuals   72  585.4   8.131                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Can you calculate a value for eta-squared?
eta.squared =
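Recall eta-squared is the between-groups sum of squares over the total sum of squares, and both are in the ANOVA table above:

```r
# eta-squared = SS.between / SS.total, from the ANOVA table
47.5 / (47.5 + 585.4)   # about 0.075
```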
Just for giggles, let's get the p-values from Fisher LSD tests.
> with(fire, pairwise.t.test(Rotter, Risk, "none"))

        Pairwise comparisons using t tests with pooled SD 

data:  Rotter and Risk 

  A     B    
B 0.099 -    
C 0.023 0.331

P value adjustment method: none
Are these Fisher LSD tests justified?
Why not?
You know what's coming next, right? Hopefully, you've even beaten me to it!
> lm.out = lm(Rotter ~ Risk, data=fire)
> summary(lm.out)

Call:
lm(formula = Rotter ~ Risk, data = fire)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2121 -1.5868 -0.0625  1.7879  7.0385 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.9615     0.5592  16.025   <2e-16 ***
RiskB         1.2506     0.7477   1.672   0.0988 .  
RiskC         2.1010     0.9060   2.319   0.0232 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.851 on 72 degrees of freedom
Multiple R-squared:  0.075,     Adjusted R-squared:  0.04931 
F-statistic: 2.919 on 2 and 72 DF,  p-value: 0.06041
We'll start on the bottom line with the ANOVA. Is the ANOVA result the same?
(Same F-value, degrees of freedom, and p-value?)
What is Multiple R-squared (i.e., same as what value we've calculated previously)?
What is Residual standard error (i.e., same as what value we've calculated previously)?
Now we'll look at the Coefficients table. Which group does R consider the
baseline group in this analysis?
What is the mean of the baseline group?
Is that group mean significantly different from zero?
The next two lines give us results for Risk groups B and C. Each tells us first the difference in means between that group and the baseline group, and then gives the result of a Fisher LSD test on that mean difference.
What was the difference between the Risk group B mean and the Risk group A
mean, and was the Risk group B mean higher or lower than the Risk group A mean?
Was the difference between the Risk group B mean and the Risk group A
mean significantly different by Fisher LSD test? What was the p-value?
Of course, we shouldn't be doing Fisher LSD tests here. Why is that again? Because the null hypothesis was not rejected in the ANOVA. The Fisher LSD test must be protected (i.e., preceded by a significant ANOVA) if it's to give us a valid result as a post hoc test. Nevertheless, let us continue.
What was the difference between the Risk group C mean and the Risk group A
mean, and was the Risk group C mean higher or lower than the Risk group A mean?
Was the difference between the Risk group C mean and the Risk group A
mean significantly different by Fisher LSD test? What was the p-value?
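Notice, by the way, that all three group means can be recovered from the coefficients table: the baseline (Risk A) mean is the Intercept, and the others are the Intercept plus their Estimates.

```r
# group means recovered from the coefficients in the output above
8.9615            # Risk A mean (the Intercept)
8.9615 + 1.2506   # Risk B mean
8.9615 + 2.1010   # Risk C mean
```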
Is the Fisher LSD test given between Risk group B and Risk group C?
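The lm() output compares each group only with the baseline, so if you wanted a comparison between two non-baseline groups, one option is to re-fit with a different baseline. A sketch, assuming fire is loaded as above (Risk2 is a new variable I'm creating just for this purpose):

```r
# change the baseline group with relevel(), then re-fit
fire$Risk2 = relevel(fire$Risk, ref="B")
summary(lm(Rotter ~ Risk2, data=fire))
# now B is the baseline, so the Risk2C line gives the B vs. C comparison
```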
You can practice this on any of the ANOVA practice problems that were posted earlier. I especially recommend that you try it with szeszko.txt.