PSYC 480 -- Dr. King

More Repeated Measures ANOVA

This should work in RStudio Cloud (but I haven't tested it). I'm going to use R. All you need to know FOR NOW is what the output means.

Card Matching

Clear your workspace. Copy everything between the dashed lines and paste it into your R Console. Then press the Enter key.

# --------------------------------------------------------------------------
# clear the workspace
rm(list=ls())
# get a couple data files
file = "http://ww2.coastal.edu/kingw/psyc480/data/matching.txt"
cards = read.table(file=file, header=T, stringsAsFactors=T)
file = "http://ww2.coastal.edu/kingw/psyc480/data/matchinglong.txt"
cards2 = read.table(file=file, header=T, stringsAsFactors=T)
# get the rmsaov.test() function
source("http://ww2.coastal.edu/kingw/psyc480/functions/rmsaov.R")
# create the SS() function
SS = function(X) sum(X^2) - sum(X)^2 / length(X)
# inspect the workspace
ls()
# --------------------------------------------------------------------------

You now have a couple data sets in your workspace, along with the rmsaov.test() function and a custom function to calculate sum of squares.

> ls()
[1] "cards"       "cards2"      "file"        "rmsaov.test" "SS"

The data are from Wes Rowe (Psyc 497 Spring 2009). Wes was in one of my 497 classes, and when he told me he wanted to do something with background music, I probably rolled my eyes. That never works! Not only did it work, but the effect he saw was certainly one I never would have expected!

Subjects performed a card matching task under each of three different background music conditions. In this task, an array of playing cards is placed in front of the subject face down. The subject's task is to turn the cards over two at a time and find matching pairs of cards. If the cards don't match, they are turned face down again. If they do match, they are removed from the array. Time required to complete all matches was recorded (in seconds). All subjects were CCU students living in student housing.

[Figure: the card matching task. A graphic stolen... er, um, borrowed from the Internet showing a match. These two cards would be removed from the array.]

The head() function shows the first several lines of a data frame.

> head(cards, 10)   # first ten rows of data
     X Gender NMTx CMTx IRMTx
1   S1   Male  218  203   161
2   S2   Male  181  196   167
3   S3   Male  158  164   180
4   S4   Male  169  178   151
5   S5 Female  190  182   166
6   S6 Female  141  169   178
7   S7   Male  229  221   196
8   S8 Female  169  156   152
9   S9   Male  208  206   192
10 S10   Male  196  178   173

X is a subject identifier. Gender is gender. NMTx is solving time in a no-music control condition. CMTx is solving time in a classical music condition. IRMTx is solving time in an instrumental rap music condition. Long format or wide (short) format?

> head(cards2, 10)   # first ten rows of data
   times gender     music subjects
1    218   Male      none       S1
2    203   Male classical       S1
3    161   Male   instrap       S1
4    181   Male      none       S2
5    196   Male classical       S2
6    167   Male   instrap       S2
7    158   Male      none       S3
8    164   Male classical       S3
9    180   Male   instrap       S3
10   169   Male      none       S4

Long format or wide format?

The "cards" data frame is in wide (short) format. The first column, X, is a subjects identifier. We don't need that in a wide format data frame as each line (row) in the data frame represents all scores from one subject. The second column is subject gender. There was no (obvious) gender effect on any of the three types of background music scores (by t-test, but what kind of effect wouldn't be found this way? ), so we're going to get rid of that variable as well. We will "NULL" them out.

cards$X = NULL
cards$Gender = NULL
head(cards)   # six lines of data is the default
  NMTx CMTx IRMTx
1  218  203   161
2  181  196   167
3  158  164   180
4  169  178   151
5  190  182   166
6  141  169   178

Now the data frame is all numeric with one subject per row and the DV in the columns, perfect for use with the rmsaov.test() function (but we have to remember to make it a matrix first). The three kinds of background music were none (NMTx = no music solving time), classical (CMTx = classical music solving time), and instrumental rap (IRMTx = instrumental rap music solving time). I didn't even know there was such a thing as instrumental rap when Wes did this study, so this was an education for me, too.

The "cards2" data frame is in long format. The existence of the gender column does no harm. The subjects column must be there. The IV is music. The DV is times, which is solving time in seconds. The IV and DV each have their own column in a long format data frame.

> summary(cards2)
     times          gender         music       subjects  
 Min.   :141.0   Female:81   classical:52   S1     :  3  
 1st Qu.:172.8   Male  :75   instrap  :52   S10    :  3  
 Median :185.5               none     :52   S11    :  3  
 Mean   :186.6                              S12    :  3  
 3rd Qu.:199.0                              S13    :  3  
 Max.   :242.0                              S14    :  3  
                                            (Other):138

How do we know this summary is from a long format data frame? We know because there is only one summary of the DV. If you were to summarize the wide (short) format data frame (go ahead and try it), you would not get a single summary of the DV; you would get a summary of the DV at each level of the IV separately. That is, you would have a summary of each music type separately. Because cards2 is a long format data frame, the rmsaov.test() function will not work with it.

How many subjects did Wes test in his study? Be careful! It's tricky!

(Answer: You might be tempted to think 81 females plus 75 males equals 156 subjects. You'd be wrong. Remember, in a long format data frame, each subject has as many rows in the data frame as there are treatments. In this case, there are three treatments, so each subject has three rows. Thus, there are not 156 subjects, but 156/3=52 subjects, and they would be 27 females and 25 males. The repeated measures variable, music, gives that away. Each subject does each type of background music once. Therefore, there are 52 subjects.)
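You can make R do that head count. A toy long format frame (2 subjects x 3 treatments, numbers made up) shows the idiom; the same lines run on cards2 give 52:

```r
toy = data.frame(times    = c(218, 203, 161, 181, 196, 167),
                 music    = factor(rep(c("none", "classical", "instrap"), times = 2)),
                 subjects = factor(rep(c("S1", "S2"), each = 3)))
nrow(toy) / nlevels(toy$music)   # rows / treatments = 6 / 3 = 2 subjects
nlevels(toy$subjects)            # same answer by counting distinct IDs: 2
```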

This is the sort of study in which order effects might play a substantial role in the results. Let's look at the means in the wide format data frame using apply().

apply(cards, 2, mean)   # the syntax is (name of data frame, 2, function)
    NMTx     CMTx    IRMTx 
188.5577 189.3269 182.0385

Note: The 2 in the apply() syntax causes the means to be calculated down the columns. Replacing that by 1 would calculate the row means. The apply() function cannot be used with the long format data frame. Why not?
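To answer that: apply() marches over entire rows or columns, and the long frame mixes a numeric DV column with factor columns, so column means make no sense there. The long format tool is tapply(), which splits the DV by the IV before averaging. A toy illustration (made-up numbers):

```r
toy = data.frame(times = c(218, 181, 203, 196, 161, 167),
                 music = factor(rep(c("none", "classical", "instrap"), each = 2)))
# apply(toy, 2, mean) would fail: the music column isn't numeric
tapply(toy$times, toy$music, mean)   # one mean per music condition
```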

Suppose the treatments had been given to every subject in that order: no music, followed by classical music, followed by instrumental rap music. What could account for the effect we're seeing?

It could be an effect of background music type. That was Wes's hypothesis. But it could also be simply an effect of order. Maybe practice made the subjects better at the task, and that's why we see the drop in solving time when the subjects are doing the task for the third time.

What would this kind of screw-up be called? Specifically, type of background music is confounded with order of presentation, because we can't tell which of those variables is actually producing the effect, i.e., the decrease in solving times.

One way to control for this is to randomize the order in which the treatments are given to each subject. A better way is to counterbalance. Wes used a form of counterbalancing called complete counterbalancing, in which every possible order of treatment presentation is used, and subjects are assigned at random to one of those orders, with the caveat that each of the orders occurs an equal number of times. With 3 treatments, the number of possible orders is 6 (=3*2*1).
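With three treatments, the six orders are easy to enumerate in R. A sketch (expand.grid builds all 27 ordered triples; we keep the ones with no repeated treatment):

```r
conditions = c("none", "classical", "instrap")
grid = expand.grid(first = conditions, second = conditions, third = conditions)
# keep only rows where all three treatments differ: the 3! = 6 orders
orders = grid[apply(grid, 1, function(row) length(unique(row)) == 3), ]
nrow(orders)   # 6; assign subjects at random, each order equally often
```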

When counterbalancing is used, we can actually analyze for an order effect by including order as one of the IVs in our analysis. Since order would be a between groups variable, that would make the design a mixed factorial. Take my word for it. There was no order effect. So the order variable will not be analyzed. (It's not even in this version of the data, in case you were looking for it.)

Do we need to look at variances at this point to check the homogeneity of variance assumption?

Normal distribution is also not difficult to check if the data frame is in wide format. We're going to do it graphically, by plotting something called a normal QQ plot. Here are the commands I used to do these plots. You can copy and paste them into R.

par(mfrow=c(1,3))
# a graphics window appears
# at this point I used the mouse to make this window wider than it is tall
# then I drew the plots for each music condition in turn
qqnorm(cards$NMTx, main="NMTx")
qqnorm(cards$CMTx, main="CMTx")
qqnorm(cards$IRMTx, main="IRMTx")

The par() function opened a graphics window and set it so that I could plot one row of three graphs. The qqnorm() function does the plotting. All we have to do is name the variable we want to plot. I got a little fancy and gave each of the graphs a main title as well. Here are the graphs.

[Figure: normal QQ plots for NMTx, CMTx, and IRMTx]

Without going into details, what we want to see on each of those graphs is a straight line of points going from the lower left corner to the upper right corner. It's not going to be perfectly straight because these are samples and samples are... what? But I'd say they look pretty good. If those lines are bowed, the sample data are skewed. If the bow sags down towards the lower right corner, that's positive skew (skew to the right). If the bow humps up towards the upper left corner, that's negative skew (skew to the left). If that line does weirder stuff, like snakes in an S shape up across the graph, that's an even worse problem than skew! Let's just hope that doesn't happen! In this case, the solving times appear to have a very slight positive skew in all three conditions, but I think we can live with it.

(If you don't remember what a skewed distribution is from your first stat course, Google it!)

Let us analyze.

rmsaov.test(as.matrix(cards))

	Oneway ANOVA with repeated measures

data:  as.matrix(cards)
F = 11.581, df treat = 2, df error = 102, p-value = 2.935e-05

 MS.error  MS.treat  MS.subjs 
 72.00289 833.85256 922.62544 

Treatment Means
    NMTx     CMTx    IRMTx 
188.5577 189.3269 182.0385 

HSD.Tukey.95 
    3.957991 

Compound Symmetry Check (Variance-Covariance Matrix)
          NMTx     CMTx    IRMTx
NMTx  441.7025 313.6180 278.8213
CMTx  313.6180 294.2244 258.1833
IRMTx 278.8213 258.1833 330.7044

Estimated Greenhouse-Geisser Epsilon and Adjusted p-value
    epsilon    p-adjust 
0.805520067 0.000127389

I did not want to permanently wreck my data frame by converting it to a matrix forever and all time, so I temporarily converted it to a matrix "on the fly" by using the above method. Let's take a look at the result.

First, we see the ANOVA result, F(2, 102) = 11.581, p < .001. Are those degrees of freedom correct? There were three levels of the music variable, so df=2 is correct for the treatment. There were 52 subjects, so df.subjects=51, which is not shown in the output. There were 3*52=156 scores altogether, so df.total=155, also not shown. That leaves df.error=155-51-2=102. Looks like the computer got it right. Here's another trick for getting error degrees of freedom in these designs. Multiply df.treatment times df.subjects. In this case, that would be 2x51=102.

The p-value is minuscule, so we are convincingly rejecting the null hypothesis. There is at least one significant difference among the mean solving times for the different music types.

Next, the mean squares are given. We can use these to write out a complete ANOVA table, should the desire to do so overwhelm us. And it eventually will, so see if you can do it. The answer will appear below. Some of you have apparently forgotten that MS=SS/df, which means you can get the sums of squares by SS=MS*df. SSes are additive, so SS.total can be had by just adding up SS.treat, SS.subjects, and SS.error. Or, if we have a long format data frame handy...

SS(cards2$times)
[1] 56065.9

Don't forget that SS.total is just the SS of the dependent variable. We could have gotten that from the wide data frame, too, but it would have taken a little more work. For the hopelessly curious...

SS(c(cards$NMTx, cards$CMTx, cards$IRMTx))
[1] 56065.9

Ask if you don't get it (and want to).
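To spell out the SS = MS * df bookkeeping, here is the whole table reconstructed in R from the mean squares printed by rmsaov.test() above:

```r
# recover the sums of squares from the printed mean squares: SS = MS * df
MS.treat = 833.85256; MS.subjs = 922.62544; MS.error = 72.00289
df.treat = 2;         df.subjs = 51;        df.error = 102
SS.treat = MS.treat * df.treat   # about 1667.7
SS.subjs = MS.subjs * df.subjs   # about 47053.9
SS.error = MS.error * df.error   # about 7344.3
SS.treat + SS.subjs + SS.error   # SS.total: about 56065.9, same as SS(cards2$times)
```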

The treatment means are next. We've already seen those.

Next comes something called HSD.Tukey.95. That is the Tukey honestly significant difference. Any difference between two means larger than that value is statistically significant by Tukey HSD test. Are there any?

Then we have the variance-covariance matrix. The treatment variances are on the main diagonal of that matrix. In case you forgot, I'll erase everything else so that the main diagonal becomes more obvious.

          NMTx     CMTx    IRMTx
NMTx  441.7025
CMTx           294.2244
IRMTx                   330.7044

What do you think? Have we done okay on homogeneity of variance? Remember, these are sample variances, and samples are noisy, so it would be unreasonable to expect these variances to be the same. But are they close enough that we don't have to worry about the population variances? I'm mildly uncomfortable, but I'm not going to get my shorts in a twist over it.

What about the covariances? Those are the numbers that are off the main diagonal, and they are the same above and below the main diagonal, so we need look at them only once. Covariances do not have to be the same as the variances, and usually won't be, but they should be similar to each other. Are they?

          NMTx     CMTx    IRMTx
NMTx
CMTx  313.6180
IRMTx 278.8213 258.1833

Greenhouse-Geisser Epsilon (GGE) = 0.81, with a best possible value of 1.00 and a worst possible value of...? The adjusted p-value is still very small. So I'd say we've done okay here on the assumptions.
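For the curious, the standard Greenhouse-Geisser adjustment is simple: multiply both degrees of freedom by epsilon and re-evaluate the observed F on those smaller df. A sketch using the values printed above (I'm assuming rmsaov.test() computes it this standard way):

```r
# GG correction: shrink both df by epsilon, then get the p-value from the F dist
F.obs = 11.581
eps   = 0.805520067
pf(F.obs, df1 = eps * 2, df2 = eps * 102, lower.tail = FALSE)   # about 0.000127
```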

We could have done the analysis as treatment x subjects using the long format data frame as well, but we wouldn't have gotten all the fancy diagnostics. We would have gotten something else though.

aov.out = aov(times ~ subjects + music, data=cards2)   # mind the + sign
summary(aov.out)
             Df Sum Sq Mean Sq F value   Pr(>F)    
subjects     51  47054   922.6   12.81  < 2e-16 ***
music         2   1668   833.9   11.58 2.93e-05 ***
Residuals   102   7344    72.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is this the same result? By the way, this is the "conventional ANOVA table" that I referred to above, so check yours against this one.

It looks like the same result to me. But this method of analysis gives us an aov.out object, and that allows us to do this.

TukeyHSD(aov.out, which="music")
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = times ~ subjects + music, data = cards2)

$music
                        diff        lwr      upr     p adj
instrap-classical -7.2884615 -11.246453 -3.33047 0.0000851
none-classical    -0.7692308  -4.727222  3.18876 0.8890419
none-instrap       6.5192308   2.561240 10.47722 0.0004724

What have we done? What is your interpretation of that? You can reconstruct this table from the rmsaov.test() output using the means and the value for HSD.Tukey.95 (except for the p-values). Can you figure it out? Here is a quick summary of how to do it.
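In code, the reconstruction needs only the treatment means and HSD.Tukey.95 from the rmsaov.test() output:

```r
means = c(none = 188.5577, classical = 189.3269, instrap = 182.0385)
HSD = 3.957991
# the three pairwise differences, in the order TukeyHSD() listed them
diffs = c("instrap-classical" = means[["instrap"]] - means[["classical"]],
          "none-classical"    = means[["none"]]    - means[["classical"]],
          "none-instrap"      = means[["none"]]    - means[["instrap"]])
round(diffs, 4)      # -7.2884, -0.7692, 6.5192
abs(diffs) > HSD     # TRUE FALSE TRUE: two significant differences
```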

Okay, now for a hard question. When we threw out the gender variable, what did we throw out along with it? Think about it a minute. I said none of the three music types showed a gender effect. Here is one of those tests from the wide (short) format data frame. Too late to do this now, after we've nulled out the Gender column, but I did it before we did that.

> t.test(IRMTx ~ Gender, data=cards, var.eq=T)   # you can't do this now

	Two Sample t-test

data:  IRMTx by Gender
t = 0.029644, df = 50, p-value = 0.9765
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.08773  10.38995
sample estimates:
mean in group Female   mean in group Male 
            182.1111             181.9600 

The other two p-values were both 0.2. So what have we lost by tossing out that gender variable?

Got it yet?

============================================================================
What follows is the analysis of a mixed factorial design, with music as the
within subjects variable and gender as the between groups variable. You do
NOT need to know how to do this at this time. I'm just making a point. When we
throw out a variable, we lose any chance of seeing an effect of that variable.
We still have the gender variable in the long format data frame, so...
============================================================================

aov.mixed = aov(times ~ gender * music + Error(subjects/music), data=cards2)
summary(aov.mixed)

Error: subjects
          Df Sum Sq Mean Sq F value Pr(>F)
gender     1    841   840.6   0.909  0.345
Residuals 50  46213   924.3               

Error: subjects:music
              Df Sum Sq Mean Sq F value   Pr(>F)    
music          2   1668   833.9  12.107 1.96e-05 ***
gender:music   2    457   228.5   3.318   0.0403 *  
Residuals    100   6887    68.9                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How about now?

When we threw out gender, we lost the possibility of seeing an interaction between music and gender. And lookie there! That interaction is significant, too. It looks like the effect of music type is different for females vs. males. Let's see if we can figure out how.

with(cards2, tapply(times, music, mean))
classical   instrap      none 
 189.3269  182.0385  188.5577

There are the overall means from the long format data frame. We've seen those. What we need are the means for males and females separately.

> with(cards2, tapply(times, list(gender,music), mean))
       classical  instrap     none
Female  186.2222 182.1111 184.8889
Male    192.6800 181.9600 192.5200

The point you should remember is that, to see the interaction, we need a table of cell means that looks like this, however you get it.

         Mean Solving Times (sec)
          none    classical  instrap
Male     192.52    192.68    181.96
Female   184.89    186.22    182.11

Here's a preliminary question. Overall, who did better, men or women?

Now can you describe the interaction? It looks like women performed well regardless of the type of background music. In fact, if you tested the simple effect, it would not be significant for women. The men only did as well as the women on instrumental rap. They were 6-8 seconds worse in the other two conditions (and that simple effect is significant).
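Nonparallel lines are the classic picture of an interaction. You can see it by plotting the cell means table (this sketch re-types the means by hand; with cards2 still loaded, with(cards2, interaction.plot(music, gender, times)) draws essentially the same picture from the raw data):

```r
# cell means from the table above, one row per gender
cellmeans = rbind(Male   = c(192.52, 192.68, 181.96),
                  Female = c(184.89, 186.22, 182.11))
colnames(cellmeans) = c("none", "classical", "instrap")
matplot(t(cellmeans), type = "b", pch = 1:2, lty = 1:2, col = 1:2, xaxt = "n",
        xlab = "music", ylab = "mean solving time (sec)")
axis(1, at = 1:3, labels = colnames(cellmeans))
legend("topright", legend = rownames(cellmeans), pch = 1:2, lty = 1:2, col = 1:2)
```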

Now that's an interesting effect! Why would that happen?

Reaction Time

Here's another very similar analysis, also from a Psyc 497 project. I'll just show the steps in the analysis. Can you explain what's being done and what is being found?

#--------------------copy and paste these lines to follow in R--------------------
rm(list=ls())   # clear workspace (or use the menu)
# read in wide format version of the data frame (only available in csv)
file = "http://ww2.coastal.edu/kingw/psyc480/data/rtcolor.csv"
RT = read.csv(file=file, header=T, stringsAsFactors=T)
# summarize
summary(RT)
#--------------------------------------------------------------------------------
      red             white             blue            green       
 Min.   :0.4770   Min.   :0.4560   Min.   :0.4500   Min.   :0.4000  
 1st Qu.:0.6085   1st Qu.:0.6015   1st Qu.:0.5760   1st Qu.:0.5455  
 Median :0.6990   Median :0.6645   Median :0.6405   Median :0.6045  
 Mean   :0.6974   Mean   :0.6608   Mean   :0.6502   Mean   :0.6231  
 3rd Qu.:0.7598   3rd Qu.:0.7332   3rd Qu.:0.7250   3rd Qu.:0.6893  
 Max.   :0.9530   Max.   :0.8920   Max.   :0.9380   Max.   :0.8870

Here is a description of the data.

Data are from Josh Weiner PSYC 497 Fall 2009. He did a reaction time experiment in which he tested each subject for 5 trials on each of four different colored light stimuli. Recorded reaction time is the mean of those 5 trials for each color by subject. It's a wide format data frame, so each row is a subject and contains all four scores for that subject. There are no IV and DV columns as there would be in a long format data frame.

> head(RT)
    red white  blue green
1 0.662 0.672 0.665 0.602
2 0.695 0.728 0.608 0.670
3 0.659 0.640 0.725 0.673
4 0.806 0.761 0.764 0.785
5 0.517 0.536 0.507 0.400
6 0.744 0.828 0.848 0.766

What is the potential confound here? Josh had to be very careful to control for something. What? Okay, order. He did that by randomizing. But there is something else, too.

> apply(RT, 2, mean)
      red     white      blue     green 
0.6973913 0.6608261 0.6502391 0.6231304
> par(mfrow=c(1,4))
> qqnorm(RT$red, main="red")
> qqnorm(RT$white, main="white")
> qqnorm(RT$blue, main="blue")
> qqnorm(RT$green, main="green")
[Figure: normal QQ plots for red, white, blue, and green]
> source("http://ww2.coastal.edu/kingw/psyc480/functions/rmsaov.R")
> rmsaov.test(as.matrix(RT))

	Oneway ANOVA with repeated measures

data:  as.matrix(RT)
F = 18.92, df treat = 3, df error = 135, p-value = 2.653e-10

   MS.error    MS.treat    MS.subjs 
0.002298121 0.043481295 0.040506595 

Treatment Means
      red     white      blue     green 
0.6973913 0.6608261 0.6502391 0.6231304 

HSD.Tukey.95 
  0.02600257 

Compound Symmetry Check (Variance-Covariance Matrix)
              red       white       blue       green
red   0.012013799 0.008655581 0.01060839 0.009776370
white 0.008655581 0.010172814 0.00892922 0.008354757
blue  0.010608393 0.008929220 0.01324241 0.010988390
green 0.009776370 0.008354757 0.01098839 0.011971938

Estimated Greenhouse-Geisser Epsilon and Adjusted p-value
     epsilon     p-adjust 
9.292197e-01 9.976223e-10 

I hope you remember how scientific notation works!
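In case it's rusty: the e in 2.653e-10 means "times 10 to that power," and R will expand it for you:

```r
format(2.653e-10, scientific = FALSE)   # decimal point moves ten places left
2.653e-10 < 0.05                        # TRUE, so reject the null hypothesis
```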

Here is a Tukey HSD test from an ANOVA on the long format data frame, which I happen to have. (We didn't download it.) Can you reproduce the table (except for the p-values) from the result above? What is being found here?

> TukeyHSD(aov.out, which="Color")
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = RT ~ Color + factor(Subject), data = RT2)

$Color
                   diff         lwr          upr     p adj
green-blue  -0.02710870 -0.05311127 -0.001106125 0.0374772
red-blue     0.04715217  0.02114960  0.073154745 0.0000347
white-blue   0.01058696 -0.01541561  0.036589528 0.7148841
red-green    0.07426087  0.04825830  0.100263441 0.0000000
white-green  0.03769565  0.01169308  0.063698223 0.0013701
white-red   -0.03656522 -0.06256779 -0.010562646 0.0020382

Practice Problems

1) Sleep Drugs.

The data are from A.R. Cushny & A.R. Peebles (1905). The action of optical isomers. II. Hyoscines. Journal of Physiology, v.32, pp.501-510. The data come from a table in that article. The researchers were testing the ability of three hypnotic drugs to produce sleep in people with varying degrees of insomnia. The DV is sleep time in hours. Each patient was measured several times with each drug, and the sleep times are averages of those trials. The object, of course, is to find out if the drugs differ in effectiveness. (Note for future psychopharmacology students: L-hyoscyamine is also called L-atropine. It is the chemical precursor of hyoscine, which is also called scopolamine, at one time an ingredient in certain cold medicines, but since removed as it is considered too dangerous. Both are muscarinic acetylcholine antagonists.) Historical note: these data were used by W.S. Gosset, writing under the pen name "Student," in his development of the t-test.

Copy and paste the following:

# -----------------------------------------------------------------------------
rm(list=ls())
SD = read.csv("http://ww2.coastal.edu/kingw/psyc480/data/sleepdrugs.csv")
SD2 = read.csv("http://ww2.coastal.edu/kingw/psyc480/data/sleepdrugslong.csv")
source("http://ww2.coastal.edu/kingw/psyc480/functions/rmsaov.R")
SS = function(X) sum(X^2) - sum(X)^2 / length(X)
ls()
# -----------------------------------------------------------------------------

Answer: F = 10.277, df treat = 3, df error = 27, p-value = 0.0001096

2) Base Running

Hollander & Wolfe (1973), p.140ff. Comparison of three methods ("round out", "narrow angle", and "wide angle") for rounding first base. For each of 22 players and the three methods, the average time of two runs from a point on the first base line 35ft from home plate to a point 15ft short of second base is recorded. (Times are in seconds.) The obvious question is, is one method better than the others, and if so, by how much?

Copy and paste the following:

# -----------------------------------------------------------------------------
rm(list=ls())
BR = read.csv("http://ww2.coastal.edu/kingw/psyc480/data/baserunning.csv")
BR2 = read.csv("http://ww2.coastal.edu/kingw/psyc480/data/baserunlong.csv")
source("http://ww2.coastal.edu/kingw/psyc480/functions/rmsaov.R")
SS = function(X) sum(X^2) - sum(X)^2 / length(X)
ls()
# -----------------------------------------------------------------------------

Answer: F = 6.2883, df treat = 2, df error = 42, p-value = 0.004084

The End