Psyc 480 -- Dr. King

Exercise: More Post Hoc Tests

Liz Morris (Psyc 497, Fall 2005) tested the experimental hypothesis that students' evaluations of their instructors are influenced by the instructor's appearance. She randomly divided a sample of CCU students into three groups. All groups were shown a biology lecture and a photo of the lecturer, and after having time to inspect these materials, the students were asked to evaluate the instructor on a standard student evaluation of teaching form. These data are from the overall evaluation of the instructor. (Note: in those days the scale went from 1 to 7.) The IV was created by showing one group a photo of an instructor with long hair, one with medium-length hair, and one with short hair. It was the same photo in all cases, photoshopped to alter the hair length.

Here are her data. Once again, numbers are students' overall evaluations of teaching on a scale of 1 to 7, with 7 being best.

long hair
5 2 5 4 3 5 4 6 6 6 2 4 5 4 6 3 5 3 5 3 4 5 4 4 5 4 3 7 4
medium hair
5 5 6 5 4 1 6 4 5 6 6 4 5 4 6 5 4 5 6 3 5 5 5 5 6 4 4 6
short hair
4 3 3 4 3 3 3 4 4 3 5 3 6 2 5 4 6 6 6 5 4 3 3 3 2 5 3

Design Question 1: Was this a true (designed, randomized) experiment? Why or why not?

Design Question 2: What does that mean for our interpretation of the results?

Design Question 3: What was the independent variable?

Design Question 3a: What were the levels of the independent variable? (In other words, what were the names of the groups?)

Design Question 4: What was the dependent variable?

Design Question 5: Since the groups are not all the same size, how would this design be described?

Type the null hypothesis here:

Let's begin by doing the ANOVA. The aov() function cannot deal with these data if we have them in three separate vectors (or variables). We need all of the values of the DV (those numbers) in one variable. Then we need an IV (grouping variable) that tells which group, long, medium, or short, each of those values is from. Ordinarily, we would have those two variables inside a data frame, BUT they do not have to be in a data frame. We can put them in the workspace.

Open R. Clear the workspace. First, we'll create the DV, which we will call evaluations. Since those numbers are separated by white space, we will use scan() to do it, but we're going to have to be careful. These are the steps you should follow.

You should be looking at this right now.

> evaluations = scan()
1: 5 2 5 4 3 5 4 6 6 6 2 4 5 4 6 3 5 3 5 3 4 5 4 4 5 4 3 7 4
30: 5 5 6 5 4 1 6 4 5 6 6 4 5 4 6 5 4 5 6 3 5 5 5 5 6 4 4 6
58: 4 3 3 4 3 3 3 4 4 3 5 3 6 2 5 4 6 6 6 5 4 3 3 3 2 5 3
85: 
Read 84 items
> summary(evaluations)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   4.000   4.357   5.000   7.000

That's simple. We've done that before. Now we're going to have to learn something new. Fill in these boxes: How many subjects were in the long hair group? How many subjects were in the medium hair group? How many subjects were in the short hair group?

So we need an IV that says long (or Long, let's make it) 29 times, Medium 28 times, and Short 27 times. Who volunteers to type that? The problem is, if we make one spelling error, say Mediun instead of Medium, or even if we leave a space on the end of one of those words one time ("Medium "), we've got a problem. It's not a problem that can't be fixed, but it would be better if we don't make the mistake to begin with. Enter the rep() function, which is short for repeat. Type this in R, and press Enter. Be careful of your punctuation.

> hair.length = rep(c("Long","Medium","Short"), times=c(29,28,27))
> summary(hair.length)
   Length     Class      Mode 
       84 character character

Well, that doesn't look at all like what we want! The rep() function creates a character variable. It does not create an IV or factor. That's easy enough to fix.

> hair.length = factor(hair.length)
> summary(hair.length)
  Long Medium  Short 
    29     28     27

Now we're cookin' with gas! Those two variables are in our workspace, not in a data frame. Take a look.

> ls()
[1] "evaluations" "getData"     "hair.length"

Note: You don't need getData. I haven't removed it yet from my workspace. Now, you're probably wondering whether things are going to be harder because we don't have a data frame. The good news is, they're not. They're going to be easier. Because those variables are in our workspace, we can address them directly. We don't need any of that $ or with() or data= business. We do need to be careful not to accidentally alter, or even erase, any of those variables (easier than you might think; see the example below). So, with reasonable care, do the ANOVA.
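First, though, here's the kind of slip I mean by "easier than you might think." This is a hypothetical example of my own; don't actually type it.

> evaluations = 5     # oops! one careless assignment and all 84 scores are replaced by a single 5
> rm(evaluations)     # and this one erases the variable entirely

Nothing is lost forever if that happens (you can always scan() the numbers in again), but it's a nuisance, so watch what you type on the left side of an = sign.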

> aov.out = aov(evaluations ~ hair.length)
> summary(aov.out)

I'm not going to show you the summary table. Get it yourself!

What is the p-value?

What is our decision regarding the null hypothesis?

What does that mean regarding our sample means?

Aside: Students sometimes have a problem with decimal numbers that are less than one. If you're uncertain if the p-value is less than .05, just ask R.

> 0.0184 < .05
[1] TRUE

This is like saying, "Hey R. Is 0.0184 less than .05?" R says TRUE (yes).

We've gotten a little ahead of ourselves. We should have done some descriptive statistics first, but I'll tell ya something. If this were your experiment, and these were your data, you'd be jumping in there and doing the ANOVA first, too!

Question 1. Why won't the sample means be the same, even if the null hypothesis is true?

Fill in the following table of descriptive statistics. Type a number in the box before clicking the check button. The check button will give you the answer, but let's not get too lazy. I had to do it, so you can bet I'm going to make you do it, and probably pretty soon! Round all decimal answers to three decimal places. I'll get you started on the first one, and I'll show you a new "trick" for sum of squares.

> tapply(evaluations, hair.length, length)   # no need for $ or with()
  Long Medium  Short 
    29     28     27

There is no sum-of-squares function in the R base packages, an oversight in my opinion. So we're going to write our own function. If you can calculate it at the command line, you can write a function for it. There are several ways that SS can be calculated at the command line, but we're going to use the following one.

> SS = function(X) sum(X^2) - sum(X)^2 / length(X)

And that's all there is to writing a function. Give it a name. Type "= function(X)" (no quotes). Type at least one space, then type out the calculation the function should perform on its argument X. Press Enter, of course. That puts the function in your workspace, where it will stay until you remove it.

> ls()
[1] "aov.out"     "evaluations" "getData"     "hair.length" "SS"

It can be used just like any other function.
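Before turning it loose on the real data, you might convince yourself the function works by trying it on a toy vector (a quick check of my own, not part of the exercise). The numbers 1 through 5 have a mean of 3, so the squared deviations are 4, 1, 0, 1, 4, and the sum of squares should be 10.

> SS(c(1,2,3,4,5))
[1] 10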

> tapply(evaluations, hair.length, SS)   # output not shown (get it yourself)
        Long    Medium    Short
n
mean
SS
var
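The rest of the table can be filled in the same way. The functions to hand to tapply() are mean and var; I'm not showing the output, because those boxes are yours to fill.

> tapply(evaluations, hair.length, mean)   # group means (output not shown)
> tapply(evaluations, hair.length, var)    # group variances (output not shown)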

Question 2. Which of the three groups of scores is the least variable?

Question 2a. Why can't we compare the sums of squares (SSes)?

Question 2b. Why can we compare the variances?

Question 3. The sample variances are not the same. Do you think there is a problem with the homogeneity of variance assumption?

We have done the ANOVA, and we have rejected the null hypothesis. Therefore, post hoc tests are justified to find out which group means are significantly different. One way to do that is with the Tukey HSD test. Another way to do it is with multiple t-tests. You were probably told in your first stat course that this is a bad idea, something about experimentwise error rate (i.e., multiple significance tests raise the effective alpha level of the tests). Don't worry. We're going to make it okay.

The first thing we need is a pooled variance. Which one, you ask? There are three groups, so three ways to pair them up two at a time for t-tests. But we are assuming homogeneity of variance across the three groups, so it turns out there is only one pooled variance, the one that combines all three groups into one "pool." Do you remember how to do it?

First we need to get the unexplained or error variability. We do that by adding the sums of squares of the three groups: SS(Long) + SS(Medium) + SS(Short).
Question 4. The result is:

Hint: try this.

> sum(tapply(evaluations, hair.length, SS))

The result produced by tapply() is a vector, so just like any other vector, it can be summed. Now, where have you seen that number before, rounded a little differently?

Then we need the error degrees of freedom. Remember the rule? N-k, total number of subjects minus the number of groups we've divided them into.
Question 5. The result is:

Then the pooled variance is the error variability, SSE, divided by error degrees of freedom, dfE.
Question 6. The result is:

Where have you seen this number before?

Stop and check! The pooled variance is an average variance and, therefore, should be like the group variances. Its value should be between the values of the lowest and highest group variances. Is it? If not, go back and find your mistake.

Question 7. If that is the pooled variance, then what is the pooled standard deviation?
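If you'd rather let R do the arithmetic for Questions 4 through 7, here is one way to chain the steps together. This is just a sketch; the names SS.error, df.error, and so on are labels I made up, and the output is left for you to get, so the boxes above are still yours to fill in.

> SS.error = sum(tapply(evaluations, hair.length, SS))   # error (unexplained) variability
> df.error = length(evaluations) - 3                     # N - k: total subjects minus number of groups
> var.pooled = SS.error / df.error                       # pooled variance
> sd.pooled = sqrt(var.pooled)                           # pooled standard deviation
> c(SS.error, df.error, var.pooled, sd.pooled)           # output not shown -- get it yourself!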

This is the pooled SD that we will use for all three of our comparisons, even though we will only be comparing two of the groups at a time (these are called pairwise comparisons). We get another bonus, too. Because the pooled SD was calculated from all three groups, we get to use the full error degrees of freedom for our t-test. Hopefully, you wrote that in one of those boxes above. In other words, if we were doing a simple independent t-test between the Long and Medium groups only, what would df be? But because we are doing pairwise comparisons with three groups and assuming homogeneity of variance, when we do the Long-Medium comparison, we get to use the full error degrees of freedom (N - k) as the df. (You shouldn't fail to notice that this will make the critical value of t smaller, which is a good thing.)

You know, I'm feeling a little nostalgic today, so let's get a critical value of t from a t table. This critical value will work for all three tests.

t-table

As always, we'll set alpha = .05, so scan across the top margin of the table until you find a two-tailed probability of .05. Then scan down that column until you come to the row with the correct number of degrees of freedom (left margin). The correct df for our test is not there, so use the next lower value.

Question 8: What is the value of t.crit for all three tests?
t.crit =

Let's compare Long vs. Medium first. The difference between the means (the order of subtraction doesn't matter; we only care about its size) is

4.821 - 4.345 = 0.476

The standard error of the mean difference is obtained in the usual way.

se.meandiff = sd.pooled * sqrt(1/n1 + 1/n2)

Question 9. What is the value of the standard error of the mean difference FOR THIS COMPARISON?

Finally, we get t by dividing the mean difference by its standard error.

Question 10. What is the value of t?

Question 11. Were the means in these two conditions significantly different?

Let's quickly do Long vs. Short, and then I'll let you do the last one for yourself.

difference between the means: 4.345 - 3.889 = 0.456
pooled standard deviation: 1.193 (from above)
standard error of the mean difference: 1.193 * sqrt(1/29 + 1/27) = 0.319
calculated value of t: 0.456 / 0.319 = 1.429
decision regarding the null: fail to reject

Now you do Medium vs. Short.

difference between the means
pooled standard deviation
standard error of the mean difference
calculated value of t
decision regarding the null

If you got t = 2.894 (or thereabouts) and rejected the null hypothesis, then you did it right. (The mean difference was 0.932 and the standard error was 0.322.)
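If you'd like R to check that arithmetic, here is the calculation using the rounded values already given above, so expect the last digit to wobble a bit.

> 4.821 - 3.889                # difference between the Medium and Short means: 0.932
> 1.193 * sqrt(1/28 + 1/27)    # standard error of the mean difference: about 0.322
> 0.932 / 0.322                # calculated t: about 2.89 (2.894 or thereabouts)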

This version of the t-test that we've been doing is called the Fisher LSD test, and before you get yourself all in a fizzle about it, LSD stands for "least significant difference" (a reference to the way the test used to be calculated in days of yore--meaning when I learned it). R does not have a function for the Fisher LSD test as such (at least not in the base package), but it will do us the favor of calculating the p-values for us.

> pairwise.t.test(evaluations, hair.length, p.adjust.method="none")

	Pairwise comparisons using t tests with pooled SD 

data:  evaluations and hair.length 

       Long   Medium
Medium 0.1356 -     
Short  0.1569 0.0048

P value adjustment method: none

The p-values are in the little table. For the Long-Medium comparison, p = 0.1356; for Long-Short, p = 0.1569; and for Medium-Short, p = 0.0048.

Two issues are raised. The first has to do with something you may remember from algebra. If A=B, and B=C, then A=C. Aren't we saying Medium=Long, and Long=Short, but Medium≠Short?

Question 12. NO, we are not saying that. Why not?

The second issue has to do with the fact that we have just done three significance tests at alpha=.05. By doing three tests and giving ourselves a 5% chance of making a Type I error on each one, we have effectively raised our alpha level to almost (but not quite) 0.15. Is there any way to avoid inflating our alpha level?
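Before we answer that, an aside on where the "almost 0.15" comes from. Treating the three tests as if they were independent (they aren't quite, but it makes the point), the probability of at least one Type I error is one minus the probability of making no Type I errors in three tries.

> 1 - .95^3
[1] 0.142625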

One way we can adjust for multiple tests is to do what's called a Bonferroni correction on the p-values from our Fisher LSD test. To do a Bonferroni correction, we take the p-values from our comparisons and multiply them by the number of comparisons we've done, in this case 3. That brings our p-values to...

> pairwise.t.test(evaluations, hair.length, p.adjust.method="bonferroni")

	Pairwise comparisons using t tests with pooled SD 

data:  evaluations and hair.length 

       Long  Medium
Medium 0.407 -     
Short  0.471 0.015 

P value adjustment method: bonferroni

Note. If the Bonferroni correction causes any of our p-values to be greater than 1, that p-value is set to 1, since p is a probability, and probabilities cannot be greater than 1. The test we have just completed is no longer called a Fisher LSD test; it is called a Bonferroni or Bonferroni-Dunn test. By the way, I should also note that the Tukey HSD test was originally intended only for balanced designs. If it is used for an unbalanced design, it is more correctly called the Tukey-Kramer procedure. So now we know four post hoc tests: Tukey HSD, Tukey-Kramer procedure, Fisher LSD, and Bonferroni-Dunn.
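You can check the multiply-by-3 rule against the table above by starting from the rounded Fisher LSD p-values. (The Medium-Short result comes out 0.0144 rather than 0.015 because 0.0048 was itself rounded; R did the multiplication on the unrounded value.)

> c(0.1356, 0.1569, 0.0048) * 3   # Long-Medium, Long-Short, Medium-Short
[1] 0.4068 0.4707 0.0144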

Question 13. Has the Bonferroni correction changed any of our conclusions about which means were significantly different?

Protected tests. Like the Tukey HSD test, the Bonferroni test is a very conservative test. Both tests correct very strictly for the number of comparisons that have been done, some would say too strictly. The Fisher LSD test makes no such correction, and thus does not correct for inflation of the alpha level due to multiple tests. For this reason, many researchers (and more importantly, journal editors) distrust the Fisher LSD test.

A general rule that is often followed is that post hoc tests should not be done unless the null hypothesis has been rejected in the ANOVA. If we follow this rule, our post hoc tests are called protected tests because they are, in a sense, protected by the ANOVA. We are no longer fishing for differences with our pairwise comparisons. We KNOW there are differences. The ANOVA has told us so. We're trying to find out where they are.

The Tukey and Bonferroni tests are so conservative that they probably don't need to be protected, but the Fisher LSD test does. You should not do the Fisher LSD test unless the null has been rejected in the overall ANOVA (statisticians call it the omnibus ANOVA). If the Fisher LSD test is protected in this way, then it will result in NO alpha inflation IF the number of comparisons is held down to three or fewer. You can even stretch the point a little and do four comparisons with very little alpha inflation. Journal editors will still be wary of you, however, if you do a Fisher LSD test, even if you protect it, because journal editors are not as statistically savvy as you are.

A Second Problem

Here's another one from a Psyc 497 project that you can do for practice.

Data are from Tom Prin (Psyc 497, Spring 2005). Tom tested firemen from three areas of the country (Horry County, SC; Charleston, SC; and New York City), recording their scores on the Rotter Internal/External inventory (low scores indicate more internal), and the firemen's risk rating: A=risk taker, B=gets the job done w/o taking too many risks, and C=generally unwilling to take risks. The risk rating can also be expressed in terms of willingness to engage in "meritorious behavior." Tom's hypothesis was that risk A firemen were most willing to take risks because of an external locus of control ("When my time is up, my time is up."). That will not be our hypothesis. We will test the null hypothesis that the populations from which these samples were drawn all have the same mean score on the Rotter test. Here are Tom's data.

Rotter scores from Risk=A firemen
 9 12  8  9 10 11  6  5 13 10  7  7  8 11  5  8  9 16  4 14 10  5 10  7 14  5
Rotter scores from Risk=B firemen
12  6 11  7 15 12 16 13  5  9  8  9  9 13  6  9 10 11 10 11  9 12  6 13 10  9
11  8  9 12  9 12 15
Rotter scores from Risk=C firemen
14 12 12 12 13  6 13 16  9 10 12 11 10 10 11  6
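Getting these data into R calls for the same scan(), rep(), and factor() recipe we used for the hair-length data. Here is a sketch; the names rotter and risk are just my choices, and the group sizes (26, 33, and 16 -- count them!) go into the times= argument.

> rotter = scan()    # paste in all 75 scores: Risk A first, then Risk B, then Risk C
> risk = factor(rep(c("A","B","C"), times=c(26,33,16)))
> summary(risk)      # check that the group counts came out right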

Descriptive Statistics

        Risk A    Risk B    Risk C
n
mean
SS
var

Necessary Intermediate Calculations for Fisher LSD

SS.error =    (Hint: this is the unexplained or error variability.)

df.error =

var.pooled =

sd.pooled =

t.crit =

Note: The critical t from the table would be for 60 degrees of freedom. The correct value for 72 degrees of freedom is 1.993.
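(An aside of my own: if you're curious where that 1.993 comes from, R's qt() function will produce it without a table.)

> qt(.975, df=72)   # two-tailed, alpha = .05, 72 df; about 1.993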

Fisher LSD Tests

A vs. B

difference between the means
pooled standard deviation
standard error of the mean difference
calculated value of t
decision regarding the null

A vs. C

difference between the means
pooled standard deviation
standard error of the mean difference
calculated value of t
decision regarding the null

B vs. C

difference between the means
pooled standard deviation
standard error of the mean difference
calculated value of t
decision regarding the null

Bonferroni Corrections

We'll do the Bonferroni correction a little differently this time. Suppose you don't have R with its convenient pairwise.t.test() function. What then? Then you use the table at the website to get a critical t corrected for the number of comparisons you're doing. That table is called Bonferroni adjusted critical t-values and you will find it with the rest of the tables at the bottom of the webpage. I've reproduced it below.

Bonferroni table

To use the table, read down the Df column at the left until you come to the correct degrees of freedom. We are using df=72 for these tests, but that's not in the table, so we'll use the next smaller value, 70. Then scan across that row until you come to the column that is marked at the top with the number of comparisons you're doing, in this case, 3.

Bonferroni-adjusted t.crit =

Using the adjusted t.crit, are any of the comparisons you did above significant? If so, which ones? (Hint: If they're not significant using Fisher LSD, they sure as heck won't be significant using Bonferroni!)
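Another aside of my own: if you don't have the Bonferroni table handy, R's qt() function can also produce the adjusted critical value directly. With 3 comparisons, each one is tested at alpha = .05/3, split between two tails.

> qt(1 - (.05/3)/2, df=72)   # Bonferroni-adjusted critical t; should agree closely with the table value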

Holm-Bonferroni Test

The Fisher LSD test is perfectly valid for either of these problems without alpha inflation, since with three groups there are only three comparisons, but some people (journal editors, graduate advisers, they are more or less people) have a bee in their bonnet about the Fisher LSD test. On the other hand, the Bonferroni-Dunn test is too conservative, especially when there are a large number of comparisons. It's even more conservative than the Tukey HSD test.

Another test that you may have been taught about, but that you should not use when you are doing pairwise comparisons, is the Scheffe test. Scheffe did not intend his test to be used for all post hoc, protected, pairwise comparisons, and he said so himself. If that's all you want, he said, use the Tukey HSD test.

It would be nice to have a test that is a compromise between the Fisher LSD and the Bonferroni-Dunn, and we do. It's called the Holm test (or Holm procedure, or Holm correction) or Holm-Bonferroni test.

The Holm-Bonferroni test can be a little confusing, so I've prepared a special dataset with which we can demonstrate its use. Get the dataset as follows (you can also use getData() if you still have it):

> file = "http://ww2.coastal.edu/kingw/psyc480/data/eysyng4.txt"
> eys = read.table(file=file, header=T, stringsAsFactors=T)
> summary(eys)
     Recall         Instructions
 Min.   : 4.00   Adjective:10   
 1st Qu.:10.00   Control  :10   
 Median :15.00   Imagery  :10   
 Mean   :14.82   Rhyming  :10   
 3rd Qu.:18.25                  
 Max.   :22.00

This is a stripped down version of eysenck.txt. It's a memory experiment in which subjects, randomly assigned to groups, are recalling words from a list of 30 words to which they have been previously exposed. The IV is the type of Instructions the subjects were given for dealing with the list. I have removed all of the older subjects from the data frame, and I have also removed one level of Instructions, so that we have four groups of younger subjects. Let's get the ANOVA without further delay.

> aov.out = aov(Recall ~ Instructions, data=eys)   # data are in a data frame
> summary(aov.out)
             Df Sum Sq Mean Sq F value   Pr(>F)    
Instructions  3  799.3  266.42   35.72 6.89e-11 ***
Residuals    36  268.5    7.46                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> 6.89e-11 < .05   # in case you weren't sure!
[1] TRUE

The p-value is quite tiny, so we are justified in proceeding to post hoc tests to see just which of these groups were significantly different. We will use the pairwise.t.test() function to get p-values from the various tests.

> with(eys, pairwise.t.test(Recall, Instructions, p.adjust="none"))   # Fisher LSD

	Pairwise comparisons using t tests with pooled SD 

data:  Recall and Instructions 

        Adjective Control Imagery
Control 0.00075   -       -      
Imagery 0.02782   0.17249 -      
Rhyming 9.6e-07   1.9e-11 9.7e-10

P value adjustment method: none 
> 
> with(eys, pairwise.t.test(Recall, Instructions, p.adjust="bonf"))   # Bonferroni

	Pairwise comparisons using t tests with pooled SD 

data:  Recall and Instructions 

        Adjective Control Imagery
Control 0.0045    -       -      
Imagery 0.1669    1.0000  -      
Rhyming 5.8e-06   1.2e-10 5.8e-09

P value adjustment method: bonferroni 
> 
> with(eys, pairwise.t.test(Recall, Instructions, p.adjust="holm"))   # Holm

	Pairwise comparisons using t tests with pooled SD 

data:  Recall and Instructions 

        Adjective Control Imagery
Control 0.0022    -       -      
Imagery 0.0556    0.1725  -      
Rhyming 3.8e-06   1.2e-10 4.9e-09

P value adjustment method: holm

We have six comparisons, so you should confirm that all the Bonferroni p-values are six times the corresponding Fisher LSD p-values. There are some slight rounding issues here, but don't let them get your shorts in a twist. By the way, you can do these multiplications in R quite simply. For example:

> 9.6e-07 * 6   # Rhyming-Adjective
[1] 5.76e-06

That one appears to be right.

Getting the Holm p-values by hand involves the following steps.

1. Put the unadjusted (Fisher LSD) p-values in order from smallest to largest.
2. Multiply the smallest by the number of comparisons (6 here), the next smallest by one less than that (5), the next by 4, and so on, until the largest is multiplied by 1.
3. If any adjusted p-value comes out smaller than the one before it in the ordering, raise it to match that earlier value (adjusted p-values are not allowed to decrease as you move down the list), and any value greater than 1 is set to 1.
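To see those steps in action, you can hand the unadjusted (Fisher LSD) p-values from the first table to R's p.adjust() function. This is just an illustration of mine; because we're typing in the rounded p-values from the printout, the results will match the Holm column above only approximately.

> p.lsd = c(0.00075, 0.02782, 0.17249, 9.6e-07, 1.9e-11, 9.7e-10)
> p.adjust(p.lsd, method="holm")   # compare to the Holm table above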

I think you can see that it's best to let your software calculate these p-values. It's easy to get lost in the matrix while you're doing this. By the way, p.adjust="holm" is the R default, so you don't really need to include that in the pairwise.t.test() command.