R Tutorials by William B. King, Ph.D.

UNBALANCED FACTORIAL DESIGNS


The Long and The Short of It

I suspect there are some people who won't want to read a tutorial this long--and it is the longest one by quite a bit--to find out how to deal with unbalanced factorial designs. It's not a simple issue, and there is no universally accepted answer. To quickly summarize, when a factorial design is balanced, the factors are independent of each other. "Orthogonal" is the word that is commonly used, meaning they are uncorrelated or unconfounded. In this case, any statistical software should give the "right" answer, as there is no dispute over how the calculations should be done.

The problem arises in unbalanced designs because the factors are no longer orthogonal. Because the factors are confounded, they "explain" overlapping parts of the total variability, a problem most frequently encountered in multiple regression analyses, where the explanatory variables are almost always correlated to some degree. The problem then becomes figuring out how to tease apart these confounds. How should the total variability be partitioned among the various main effects and interactions?

Three methods are in common use, some more commonly than others! There isn't even uniform agreement on what these methods should be called, so I'm going to adopt what has become known as the "SAS terminology." The issue is over how to calculate the sums of squares, and the SAS people refer to the three methods as Type I (or Type I sums of squares, but I will shorten it to Type I, Type I tests, or Type I analysis), Type II, and Type III. (Note: Not to be confused with Type I and Type II error, which is an entirely different thing. To keep them straight, in this tutorial I'll refer to the Type I error rate as the false positive rate.)

Different statistical software packages use different methods by default. R uses Type I sums of squares by default, arguing quite vigorously in some instances that there are no other "types of sums of squares." (See this reference, for example.) Most of the large commercial packages, such as SAS, SPSS, and Minitab, use Type III (although the others are available as options).

I will explain the differences below. I will try to explain what each of these methods does, because I think a well informed statistical analysis is a better statistical analysis. I will also try to justify the recommendations that I am about to give. But for this moment, to satisfy the people who just want to get an analysis on paper, let me tell you what I would do. (Which, of course, MUST be absolutely correct!:) Be forewarned. I can only give you a partial recommendation at this time.

If the design is balanced, there is no issue. All three methods give the same result. I'm not sure those results are intended to be interpreted the same way, but they will give you the same numbers. It's when the design is unbalanced that you have to start to think about this.

If the design is a true experimental design, by which I mean subjects have been randomly assigned to conditions within the experiment, and the design is mildly unbalanced for one of the following reasons:

  1. Either the design never was balanced because the randomization put a couple more subjects in some conditions than in others, or
  2. The design began balanced but became mildly unbalanced by accident and at random with respect to the treatments, probably due to random subject attrition, equipment failure, nonresponse, etc.,
then use Type III. I hope I'm allowed to continue using R after saying such a thing, because Type III sums of squares are anathema to R people. They are not even implemented in the R base packages, and you will have to either use a workaround, or download an optional package ("car") to get them. Even then, you'll have to take special precautions to make sure you've done them right. SAS and SPSS spit them out by default.

If the design is unbalanced for some other reason, for example, if the design is quasi-experimental, using intact groups rather than random assignment to conditions, making it likely that the unbalanced condition reflects what is true in the population (the sample is unbalanced because the population is unbalanced), then the choice of an analysis is much more difficult. This is especially true if the design is more than mildly unbalanced, as I hope to demonstrate, or in the presence of an interaction that is unclear, i.e., marginally significant or not significant but still suspiciously large. In this case, you should choose an appropriate analysis depending upon what your hypotheses are and why the analysis is being done in the first place, i.e., what you hope to find out.

In the nonexperimental case, if you must have something that resembles a traditional ANOVA, then Type II is the clear choice, but be sure you're aware of the problems with Type II tests when an interaction, significant or not, is in force.

I know that's not very helpful at this point. Let me say, if the design is only very mildly unbalanced, then it probably doesn't matter a whole lot which method you choose. Try all three, and if you get similar results, flip a coin. DO NOT choose a method because you like the result it gives. That would be called "fishing for statistics." If the three methods give you different results, however, then you're going to have to understand a lot of what follows. Sorry!

In any event, you should state in the written description of your analysis how the analysis was done, i.e., what type sums of squares were calculated!


When Should We Use Type III and Why?

If you're reading these tutorials, then you're probably already an "R person" or an "R wannabe" or at least someone who's curious about R. You may already have heard Bill Venables or some other R big dog paraphrased: "Type III sums of squares--just say no!" Given that I'm not a big dog and have already recommended Type III for mildly unbalanced experimental data, I think I need quickly to justify that before R people brand me a heretic and stop reading! After you hear my arguments, you're free to stop reading if you like. So before I even tell you what they are, here's why I (and a lot of other people) think you should use Type III tests for unbalanced experimental data.

There is ONE case where Type III makes sense, in my opinion, and that is when the design is mildly unbalanced and the cell sizes are random and meaningless with respect to the treatments. This is most likely to happen in a designed experiment in which subjects have been randomly assigned to treatment conditions, and there has been minor and random subject attrition, or minor data loss for other, random, reasons.

I'll offer four arguments to support this assertion.

First, if cell sizes are random with respect to treatments, which is to say, there is no force of nature creating these cell sizes, making them effectively meaningless to the interpretation of the data, then a method of analysis should be used in which differential cell sizes do not figure into the hypotheses being tested. Of the three types, the only one that does this is Type III.

Second, if cell sizes are random and meaningless with respect to treatments, then there is no reason to believe that cell sizes would be the same or similar in a replication of the experiment. Therefore, if cell sizes are part of the hypotheses being tested, as they are in Type I and Type II tests, this could make replication of experimental results more difficult.

Third, it is generally true in experimental research that the researcher is seeking to test hypotheses about main effects of the form:

H0: mu1 = mu2 = mu3 = ... = muk

Where mu1, etc., are cell-frequency-independent population means. If we are adhering to this tradition, in which "no effect" means equality of the population means, and those means are estimated, as is traditional, by unweighted marginal means in the sample, then we are left with little choice as to a method of analysis. The only one of the methods that will test for the equality of population means estimated from the sample as unweighted marginals is Type III.

How do we know that the population means are best estimated by the unweighted marginals? Venables comes right out and calls this an entirely arbitrary hypothesis. Perhaps the population means are better estimated by the weighting scheme that is used in the Type I or Type II method.

I would consider this to be a strong argument if the DV were being measured on intact groups (i.e., a quasi-experimental design), but in experimental research, let's face it, the population is a fiction. There is no population of people who have been given a list of thirty words to memorize, no population of rats that have been injected with LSD, and no population of agricultural plots growing alfalfa and fertilized with...whatever, at least not one that we are sampling from. What I want to know is, if I tell this group to do this, and that group to do that, will it make a difference in the number of words they can recall? In this case, "no difference" is defined as equal means, or means equal enough that we can't be confident that the differences are due to anything other than random error. And if I have a second, cross-classified IV, then I'm not interested in seeing how that variable changes the outcome of my "this-that" variable. At least I'm not until I'm looking at the interaction, and presumably I already have. If there's an interaction, then I wouldn't be looking at the main effects. Thus, when I do decide to look at the main effects, I want the influence of one variable on the other removed. I don't want to see the effect of B above and beyond the effect of A. I want to see the effect of B after any and all variability due to A has been removed. That includes variability created by the possibility that A may have an interaction with B, even though we didn't find it significant. In other words, when I'm dealing with experimental data, I want the entire confound of A with B removed from the main effects.

Profile plot (figure): cell means for the Sex by Cut.Scar design discussed below.

It's trickier than you might think. Take a look at the profile plot at left. These are plotted means from a quasi-experimental design that was seriously unbalanced. What effects do you see? Clearly there's an interaction, so perhaps we should stop there. But the interaction is not significant and, of course, if students were being asked to answer this question on a test, we'd also expect them to evaluate the main effects, and that's not so hard in this case. It looks like the "average Cut.Scar line" would be pretty much flat, so there is no main effect of Cut.Scar. The two "Sex lines" are clearly separated, however. The distance between the middle of those lines appears to be about 4-5 points. So a main effect of Sex is being shown. Is that how you did it? Is that how you teach your students to do it? Then you're evaluating the main effects using unweighted marginal means. You have effectively done a Type III analysis in your head. (And that would be the case even if there were no interaction, i.e., if the lines were parallel.) The Type II analysis finds just the reverse, a significant "main effect" of Cut.Scar (p=.02) and no significant "main effect" of Sex (p=.3). The interaction was not significant no matter what type sums of squares were calculated. (The Type III analysis finds p=.1 for the interaction, p=.9 for the main effect of Cut.Scar, and p=.06 for the main effect of Sex.)

I hasten to point out that this is not a contrived data set. If the literature on unbalanced designs has taught us anything, it's that it is easy to contrive a set of data that make one or the other of the methods look bad. This is not one of those. These are real results from an actual study. They will be discussed in more detail below, where I will suggest using Type I tests to evaluate them.

Fourth, yes, it's true that Type III tests on the main effects violate the principle of marginality (to be explained eventually), but such a violation is necessary in order that the tests have the above properties. As long as the unbalanced condition is mild, such a violation will have little effect on the outcome of the tests.

Once again, in experimental research, I want to see the effect of B after A + A:B has been removed. If the design is balanced, and the effects are orthogonal, then that is the same thing as saying effect of B after A is removed. If the design is not balanced, and the variables are not orthogonal, then to get the influence of A entirely removed, A:B also has to be removed. Yes, that's a violation of marginality, and to minimize any deleterious effects such a violation might cause, the design should be kept balanced, or as close to balanced as possible.

"Yes, but if there is no interaction, why are we talking about A:B at all?" In experimental research, main effects are (almost) never interesting in the presence of an interaction, but interactions don't always show themselves. They don't always turn out to be statistically significant, that is. If we have every good reason to believe the interaction doesn't exist, then don't put it in the model to begin with. Don't analyze for it. Don't throw it out; don't even look for it. That's not usually the case, so the interaction ends up in the analysis, and it will have some of the total variability associated with it. That may be nothing but random error. Nevertheless, there it is, and it has to be dealt with.

If the interaction is nonsignificant but still suspiciously large, then it might behoove the data analyst to find some other way to look at the "main effects" other than the traditional main effect tests. After all, traditional main effects are pretty much meaningless in the presence of an interaction, and just because it's not significant doesn't mean it isn't there. Hey, the Type III people play pretty rough with the Type II people on that issue, and here it is coming back to haunt you. Type III sums of squares DO NOT, by some miracle of mathematics, make it meaningful to look at main effects in the presence of an interaction!

DO NOT take this as a blanket recommendation to use Type III tests in all cases. It most certainly is not. There are cases where Type III tests would be disastrously wrong. I hope to be able to point those out below.

(Note: You can probably tell that I'm more a Fisherian than a Neyman-Pearson type when it comes to evaluating hypotheses. I'm not going to behave as if an effect isn't there just because the p-value didn't happen to slip below some magical criterion! I'm especially not going to do that when doing it might have a damaging impact on my interpretation of other effects!)


Examples

Before we get started, clear your workspace, and open a script window (File > New Script in Windows; File > New Document on a Mac). Then very carefully type (or copy and paste) the following lines into a script.

aovIII = function (formula, data = NULL) 
{
  options(contrasts = c("contr.sum", "contr.poly"))        # Type III needs sum-to-zero contrasts
  aovIII.out <<- aov(formula, data)                        # save the fitted model to the workspace
  print(drop1(aovIII.out, . ~ ., test = "F"))              # Type III tests via single-term deletions
  options(contrasts = c("contr.treatment", "contr.poly"))  # restore the default contrasts
}
Now, if you are in your usual working directory (Rspace for me), pull down the File menu for the script window, choose Save As..., and save this script as aovIII.R. (For another way to make a permanent copy of this and other scripts, see the Making R Work For You tutorial. While you're there, you might also want to take the advice about downloading the optional "car" package.) Double check all your typing to make sure it is EXACTLY what you see here. If it is, then close the script window and let's get to some examples.

For these examples, we will look at a very simple, but unbalanced, two-way factorial design. The data are in the MASS library.

> data(genotype, package="MASS")
> ls()
[1] "genotype"
> summary(genotype)
 Litter Mother       Wt       
 A:17   A:16   Min.   :36.30  
 B:15   B:14   1st Qu.:48.20  
 I:14   I:16   Median :54.00  
 J:15   J:15   Mean   :53.97  
               3rd Qu.:60.30  
               Max.   :69.80
The data are from a crossfostering experiment on rats. (Infant rats were switched from one litter to another so that, in some instances, they were not being reared by their biological mother.) The factors, Litter and Mother, give the genotype of the pups and mother respectively, and the response, Wt, gives the litter average weight gain in grams at age 28 days. Data are from Bailey (1953).

When analyzing factorial data, we should always begin by checking for balance. This can be done by crosstabulating the factors.

> with(genotype, table(Litter, Mother))
      Mother
Litter A B I J
     A 5 3 4 5
     B 4 5 4 2
     I 3 3 5 3
     J 4 3 3 5
This will give what amounts to a design table, and in the cells of that table will be the group sizes. As you can see in this case, they range from 2 to 5. The design is not balanced. I should hasten to point out that you CANNOT tell if the design is balanced by looking at the summary() table. I should probably also point out that the following function...
> replications(Wt ~ Litter * Mother, data=genotype)   # output not shown
...will give very similar information. As the design is unbalanced, we now face the issue of what type analysis we should do.

Type I Analysis

This is easy because it's what R gives by default. Notice, however, that the result is different depending upon the order in which the factors are entered into the model formula. For this reason, most people do not like Type I, somewhat unfairly in my opinion.

> aov.out1 = aov(Wt ~ Litter * Mother, data=genotype)
> summary(aov.out1)
              Df Sum Sq Mean Sq F value  Pr(>F)   
Litter         3   60.2   20.05   0.370 0.77522   
Mother         3  775.1  258.36   4.763 0.00574 **
Litter:Mother  9  824.1   91.56   1.688 0.12005   
Residuals     45 2440.8   54.24                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> aov.out2 = aov(Wt ~ Mother * Litter, data=genotype)
> summary(aov.out2)
              Df Sum Sq Mean Sq F value  Pr(>F)   
Mother         3  771.6  257.20   4.742 0.00587 **
Litter         3   63.6   21.21   0.391 0.76000   
Mother:Litter  9  824.1   91.56   1.688 0.12005   
Residuals     45 2440.8   54.24                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case there is not a big difference in the results between the two orders, but that will not always be true. Sometimes the difference will be much more dramatic. (Note: There actually is a "big difference," but it does not appear in the numbers.)

Type II Analysis

The R base packages do not have a function for Type II tests. In this simple, two-factor case, it's not difficult to construct the Type II ANOVA summary table from what we've seen above. Notice that the interaction line and the residuals (error) line are the same in the two outputs. They will also be the same in a Type II analysis. It's the main effects that will be different. (Warning: These statements apply to the two-way design ONLY!)

However, the tests on the second main effect in the tables above, Mother in the first table and Litter in the second table, are essentially Type II tests. If you take the second line from each table and add the interaction and residual lines, you have a Type II ANOVA summary table. This will NOT be the case for more complex designs!

For more complex designs, you should definitely resort to using the optional "car" package, where Type II tests are the default. (The rationale for this is explained by Fox and Weisberg, 2011, see Refs.)

> library("car")             # this will give an error if the package is not installed
The first part of the procedure is the same. Run the ANOVA entering the factors in any order, as order will not matter in the Type II tests. Then use the Anova() extractor function from the "car" package to get the summary table. Take special notice of the upper case A. If you don't capitalize it, you'll get Type I tests again.
> Anova(aov.out1)
Anova Table (Type II tests)

Response: Wt
               Sum Sq Df F value   Pr(>F)   
Litter          63.63  3  0.3911 0.760004   
Mother         775.08  3  4.7632 0.005736 **
Litter:Mother  824.07  9  1.6881 0.120053   
Residuals     2440.82 45                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare this to the result from Anova(aov.out2). The result will be the same. The order in which the factors are listed will be different, but the statistical result is the same.

Type III Analysis

Buckle up! Here we go! There is also no built-in function for Type III tests, and don't expect one anytime soon! To do Type III tests, you must be aware of how you have your factor contrasts set. Treatment contrasts will NOT work.

> options("contrasts")
$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly" 
Those are treatment contrasts, the R default. To successfully do Type III tests, they must be changed to something else.
> options(contrasts=c("contr.sum","contr.poly"))
> options("contrasts")
$contrasts
[1] "contr.sum"  "contr.poly"
Now (and ONLY now), run the ANOVA. We cannot use the above runs because they were done when the contrasts were set to treatment. Enter the factors in any order. It won't matter to the result. Like Type II tests, Type III tests are not sensitive to the order in which the factors are entered into the model formula. Finally, use the Anova() extractor function from the "car" library to get the summary table. Once again, mind the capital A. Using a lower case A will give Type I tests.
> aov.out3 = aov(Wt ~ Litter * Mother, data=genotype)
> Anova(aov.out3, type=3)    # "car" must be loaded; set type=3
Anova Table (Type III tests)

Response: Wt
              Sum Sq Df   F value  Pr(>F)    
(Intercept)   163782  1 3019.5608 < 2e-16 ***
Litter            28  3    0.1700 0.91612    
Mother           672  3    4.1282 0.01142 *  
Litter:Mother    824  9    1.6881 0.12005    
Residuals       2441 45                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Another way to get Type III tests from a relatively simple design (I have not tested this with anything other than between-subjects designs) is to use the custom function we entered at the top of this section. We saved it in the working directory. To get access to it, we have to "source" it in.
> source("aovIII.R")
> ls()
[1] "aov.out1" "aov.out2" "aov.out3" "aovIII"   "genotype"
Notice this has added an additional object, "aovIII", to the workspace. That is our function. Use it like this.
> aovIII(Wt ~ Litter * Mother, data=genotype)
Single term deletions

Model:
Wt ~ Litter * Mother
              Df Sum of Sq    RSS    AIC F value  Pr(>F)  
<none>                     2440.8 257.04                  
Litter         3     27.66 2468.5 251.73  0.1700 0.91612  
Mother         3    671.74 3112.6 265.87  4.1282 0.01142 *
Litter:Mother  9    824.07 3264.9 256.79  1.6881 0.12005  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare this to the previous table to see what you're looking at. This function has added additional clutter to your workspace--an aovIII.out object has been saved. It has also reset your contrasts back to "treatment." One more note. If you now run the Type I and Type II tests on the aovIII.out model object, you'll find that Type I and Type II are quite happy to give you the correct results no matter how the contrasts are set. It is only the Type III tests that are persnickety about contrasts.

WARNING! Here's something I learned the hard way. You may know that the aov() function does not have a problem with dichotomous IVs coded 0 and 1 when a Type I ANOVA is desired. R will figure it out. THIS IS NOT THE CASE with the aovIII() function. Make absolutely certain that any variable being fed to this function on the right side of the tilde is recognized as a factor.
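Here is a minimal sketch of what I mean. (The data frame and variable names below are made up purely for illustration.)

dat = data.frame(score = rnorm(40, mean=50, sd=10),
                 drug  = rep(0:1, each=20),                # numeric 0/1 coding -- danger!
                 sex   = factor(rep(c("f","m"), times=20)))
dat$drug = factor(dat$drug)              # convert the 0/1 variable to a factor first
str(dat)                                 # confirm that drug and sex are both factors
aovIII(score ~ drug * sex, data=dat)     # now safe to hand to aovIII()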

(Okay, still another note: The first line in the "car" output is a test that the intercept is zero, which is rarely of interest.)

  • Bailey, D. W. (1953) The Inheritance of Maternal Influences on the Growth of the Rat. Unpublished Ph.D. thesis, University of California. Table B of the Appendix.

Further Consideration of the "genotype" Data

Notice in all the analyses we did above, the interaction term is always the same. The highest order effect term will always be the same no matter which of these analyses we choose. Thus, I guess one line of advice here might be "pray for an interaction." When the interaction term is significant, lower order effects are rarely of interest.

What if the interaction term is not significant? In the two-way analysis, that means interest is focused on the main effects. But, should we drop the interaction term and redo the analysis without it before looking at main effects? Or should the interaction term be retained even though nonsignificant? Opinions differ. The majority consensus seems to be that nonsignificant interactions should be retained in the analysis. As I tell my students, nonsignificant does not mean nonexistent, and the existence of an interaction, significant or not, will eventually become an issue for us as we explore this topic.

I'm not sure I agree with retaining the interaction. However, for now, I just want to point out one thing. If the interaction term(s) is (are) dropped, so that the model becomes additive (just main effects), Type II and Type III analyses will give the same result. Try it! We'll find out why below.
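Here is a quick way to check that claim with the genotype data (a sketch; it assumes the "car" package has been loaded as above).

add.out = aov(Wt ~ Litter + Mother, data=genotype)   # additive model, no interaction
Anova(add.out)             # Type II
Anova(add.out, type=3)     # Type III; with no interaction in the model, the main
                           # effect tests match the Type II tests, and the contrast
                           # setting no longer matters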

Concerning Type I tests, you may be objecting that if entering the factors in different orders gives different results, then those results surely must have different interpretations. Correct! But we're not there yet. We'll get to interpretations below. If you just can't wait, there is an excellent discussion of this very problem in William Venables' Exegeses document. Be aware for now, however, that Type I tests are the ones that least resemble what most people think of as a traditional ANOVA.

  • W. N. Venables (2000). Exegeses on Linear Models. Paper presented to the S-Plus users conference. Washington, D.C. 8th-9th October, 1998.

A Special Case

The "genotype" data did not present us with too very much of a problem, as the same effect, the "Mother main effect", was significant in all the analyses. That's a rather naive way to look at things, especially considering that the two "significant main effects of Mother" that we saw in the Type I analyses were not the same effect! But I'll let that slide for now.

Suppose we have the following situation. If I remember, it was the mid 1970's when Hans Eysenck introduced what has come to be called the PEN model of human personality. PEN stands for psychoticism, extraversion, neuroticism, which are supposedly three major dimensions of personality, kind of like a 3D coordinate system within which everyone has a point determined by his or her scores on three scales. We're interested in his psychoticism scale, which supposedly measures, among possibly other things, a person's susceptibility to developing a psychotic disorder such as schizophrenia.

We might imagine that such susceptibility is a function of several variables. (I.e., you don't go crazy because of just one thing.) We're going to investigate two of those variables: genetic predisposition (do you have craziness among your immediate relatives, yes/no?); and childhood trauma (were you abused, bullied, deprived, a victim of loss, etc., as a child, yes/no?).

Suppose we know the following facts from previous research. 1) 10% of people in the general population have a genetic predisposition, 90% do not. 2) 20% of people in the general population (of adults) experienced childhood trauma as we've defined it, 80% did not. 3) Experiencing childhood trauma does not make you any more or less likely to be genetically predisposed, and being genetically predisposed does not make you any more or less likely to experience childhood trauma. Thus, genetic predisposition and childhood trauma are independent of each other. (IMPORTANT NOTE: I'm not saying any of this is factual. I'm making this up for the sake of this example. I can easily imagine how fact 3, especially, could well be wrong!)

Furthermore, suppose we also know the following. 1) In the general population, scores on the psychoticism scale have a possible range of 40-160, with a mean in the mid 90s and standard deviation of around 20. (I have NO IDEA if this is correct, but it doesn't matter for the sake of this example.) 2) The base level on the psychoticism scale for people with no predisposition and no trauma is, on the average, a score of 90 (ish). 3) The presence of childhood trauma adds 10 points to that. 4) The presence of a genetic predisposition adds 15 points to that.

Here's what we don't know. What is the effect of having both trauma and predisposition? Is it additive? I.e., do the effects of the two factors simply add together to give the effect of both? In that case, the combined effect should be 25 points. Or do the two factors somehow interact with one another to make their combined effect greater than 25 points? Off to the grant writing workshop we go!

Okay, we're funded, and we've taken a random sample of 100 people from the general population. We've cross-classified them on the two factors of interest to us, and we've come up with the following table. The numbers in the cells are obtained sample sizes, n.

> xtabs(~ trauma + genetic, data=PEN)       # these data do not exist!
      genetic
trauma no yes
   no  72   8
   yes 18   2
These are exactly the cell frequencies we would expect if the two factors, trauma and genetic, are independent, according to the multiplication rule for independent events. And they came out that way because I made them up and made them come out that way! Such a design is said to be "proportionally balanced." The second column has 1/9th the number of subjects in the first column, and that's true in all rows. The second row has 1/4th the number of subjects in the first row, and that's true in all columns.
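Here is the multiplication rule applied to these numbers, just to check the expected cell counts.

100 * 0.90 * 0.80    # no predisposition, no trauma: 72
100 * 0.10 * 0.80    # predisposition, no trauma:     8
100 * 0.90 * 0.20    # no predisposition, trauma:    18
100 * 0.10 * 0.20    # predisposition, trauma:        2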

Since I made the cell frequencies come out perfectly, I might as well make the cell means and standard deviations come out perfectly as well. Here are the overall results.

                    genetic: no    genetic: yes
------------------------------------------------
trauma: no           M = 90         M = 105
                    sd = 20        sd = 20
                     n = 72         n = 8
------------------------------------------------
trauma: yes          M = 100        M = 115
                    sd = 20        sd = 20
                     n = 18         n = 2
------------------------------------------------
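(If you'd like to follow along at the console, here is a sketch that manufactures a data set with exactly these cell means, standard deviations, and cell sizes. Remember, the data are hypothetical, so the names--PEN, psychoticism, trauma, genetic--are my inventions.)

make.cell = function(m, s, n) {
  x = scale(rnorm(n))              # exactly mean 0 and sd 1
  as.vector(x * s + m)             # rescale to the desired mean and sd
}
set.seed(1)
PEN = data.frame(
  psychoticism = c(make.cell(90, 20, 72), make.cell(105, 20, 8),
                   make.cell(100, 20, 18), make.cell(115, 20, 2)),
  trauma  = factor(rep(c("no","no","yes","yes"), times=c(72, 8, 18, 2))),
  genetic = factor(rep(c("no","yes","no","yes"), times=c(72, 8, 18, 2))))
summary(aov(psychoticism ~ trauma * genetic, data=PEN))   # should match the first table below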

All of the ANOVAs can be calculated from this information (in the 2x2 case only; you'll learn how below). Here they are.

Type I: trauma first
               Df Sum Sq Mean Sq F value Pr(>F)  
trauma          1   1600    1600   4.000 0.0483 *
genetic         1   2025    2025   5.062 0.0267 *
trauma:genetic  1      0       0   0.000 1.0000  
Residuals      96  38400     400
Totals         99  42025

Type I: genetic first
               Df Sum Sq Mean Sq F value Pr(>F)  
genetic         1   2025    2025   5.062 0.0267 *
trauma          1   1600    1600   4.000 0.0483 *
genetic:trauma  1      0       0   0.000 1.0000  
Residuals      96  38400     400
Totals         99  42025

Type II
               Df Sum Sq Mean Sq F value Pr(>F)  
trauma          1   1600    1600   4.000 0.0483 *
genetic         1   2025    2025   5.062 0.0267 *
trauma:genetic  1      0       0   0.000 1.0000  
Residuals      96  38400     400
Totals         99  42025

Type III
               Df Sum Sq Mean Sq F value Pr(>F)  
trauma          1    576     576    1.44 0.2311
genetic         1   1296    1296    3.24 0.0750
trauma:genetic  1      0       0    0.00 1.0000  
Residuals      96  38400     400
Totals         99  -----
What have we learned? First, order no longer matters for the Type I tests. Even though this design is unbalanced (proportionally balanced is not balanced), order doesn't matter. Second, the Type I and Type II tests give the same result. Third, the Type III tests give a different result. Fourth, the total sum of squares, which is 42,025, is correctly partitioned by the Type I and Type II tests but not by the Type III tests. This is unusual in that Type II tests, like Type III, do not usually have this property when the design is unbalanced. (Type I always partitions correctly.)

The tests lost quite a bit of power due to the unbalanced nature of the design. With a balanced design, if that had been possible, these main effects would have been found with as few as 16 subjects per cell (64 subjects total). Nevertheless, the Type I and Type II tests still found the main effects.

We have to wonder why the Type III tests didn't find the main effects. Wonder no longer. I'll show you. Notice this is the additive case in which the interaction is exactly zero. Let's remove it from the model and see what happens.
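(In terms of the manufactured PEN data frame sketched above, that is just the additive model.)

summary(aov(psychoticism ~ trauma + genetic, data=PEN))   # interaction dropped; Type I shown,
                                                          # but II and III now agree with it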

Type I, Type II, and Type III
            Df Sum Sq Mean Sq F value Pr(>F)  
trauma       1   1600  1600.0   4.042 0.0472 *
genetic      1   2025  2025.0   5.115 0.0259 *
Residuals   97  38400   395.876
And presto, Type III has found the main effects! In fact, now all three types are giving the same sums of squares and the same tests. How can dropping a term that accounts for exactly none of the total variability have such an effect? I point out that the sums of squares for the main effects have not changed in the Type I and II analyses. The error mean square has changed a bit because we've dropped another degree of freedom into the error term, but other than that everything is the same, as you would certainly expect after dropping a term that accounts for zero variability! But suddenly the Type IIIs are getting it right. Somehow, including a zero interaction in the analysis prevented Type III from evaluating the main effects "correctly."

The following table shows the trauma main effect as it is being tested by each of the three types of tests (with the interaction included in the model, which is only relevant in the Type III case).

              Type I        Type II       Type III
------------------------------------------------------
trauma: no    M = 91.5      M = 91.5      M = 97.5
              n/row = 32    n/row = 32    n/row = 11.52
trauma: yes   M = 101.5     M = 101.5     M = 107.5
              n/row = 32    n/row = 32    n/row = 11.52
SS            1600          1600          576
------------------------------------------------------

In each case, the sum of squares is calculated as the mean difference squared times the effective row size divided by 2. If the design were balanced, all three types would be testing the same effect. With proportional balance, Type I and Type II test the same effect, but Type III tests a different effect (same size mean difference but different means and difference in effective row size). When the interaction is deleted from the model, the main effects being tested by Type I and Type II don't change (although the error term will change a bit). However, when the interaction is deleted, the main effects being tested by Type III do change, and it's not easy to predict how they will change. This derives from the fact that in Type I and Type II tests the main effects and the interaction are orthogonal, while in Type III tests they are not (e.g., see Herr and Gaebelein, 1978). Thus, in Type III tests, if the interaction is deleted from the model, even if it is a zero interaction, the effects being tested as main effects will change somewhat unpredictably.
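The arithmetic for the sums of squares in that table is easy enough to check at the console.

(101.5 - 91.5)^2 * 32 / 2        # Type I and Type II trauma SS: 1600
(107.5 - 97.5)^2 * 11.52 / 2     # Type III trauma SS: 576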

I'll point out one more thing. Notice that the row marginal means being tested by Type I and Type II are the weighted means. This is typical for the first factor entered into a Type I test, but is not typical for Type II, where the weighting scheme is usually considerably stranger. Thus, when the design is proportionally balanced, Type I and Type II tests become the same, but that's not because Type I becomes Type II. It's because Type II becomes Type I.

Those were additive main effects. The interaction was zero. It's interesting to see what happens when we add a weak and a strong interaction. Type I and Type II tests again give identical results, so only Type II and Type III are shown. The weak interaction on the left was created by adding 7 points to the mean of the lower right cell. The strong interaction on the right was created by adding 27 points (i.e., an additional 20 points).
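(In terms of the manufactured PEN data frame from above, those modifications would look something like this.)

hit = with(PEN, trauma == "yes" & genetic == "yes")        # the lower right cell, n = 2
PEN.weak = PEN
PEN.weak$psychoticism[hit] = PEN$psychoticism[hit] + 7     # weak: cell mean 115 -> 122
PEN.strong = PEN
PEN.strong$psychoticism[hit] = PEN$psychoticism[hit] + 27  # strong: cell mean 115 -> 142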

ANOVA summary tables (figure): Type II vs. Type III for the weak interaction (left) and the strong interaction (right).

Let's look at the case of the weak interaction first.

We have created this interaction by taking the largest of the four means and making it even larger. Thus, in the ANOVA way of thinking, not only have we created an interaction, but we have also increased the magnitude of the main effects. (Think of the profile plot.) This is reflected in the increased size of the sums of squares. Notice that the Type III sums of squares are still smaller than they were above when there was no interaction and the main effects were smaller. If we drop the tiny little interaction term from the model, as if by magic all three analyses will give the same result.

Type I, Type II, and Type III
            Df Sum Sq Mean Sq F value Pr(>F)  
trauma       1   1832  1831.8   4.619 0.0341 *
genetic      1   2421  2420.6   6.103 0.0152 *
Residuals   97  38471   396.6
Notice that the sums of squares for the main effects haven't changed for the Type I and Type II tests. Why should they? We haven't changed the sizes of the main effects in the data by merely dropping a term from the analysis!

In the case of the strong interaction, the interaction is now of the same order of magnitude as the main effects, but still has not reached statistical significance (because the unbalanced design lacks the power we need to see it). What happens if we drop it?

Type I, Type II, and Type III
            Df Sum Sq Mean Sq F value  Pr(>F)   
trauma       1   2581    2581   6.345 0.01340 * 
genetic      1   3745    3745   9.209 0.00309 **
Residuals   97  39450     407
Once again, I point out that, as large as it is, dropping the interaction term from the analysis does not change the size of the main effects in the data! Well, at least not according to the Type I and Type II tests. The Type III tests are now seeing the main effects as being substantially smaller.

These interactions were created by increasing the values of data points in a cell that had only 2 of the 100 subjects in the study, in a column that had only 10 of 100, and in a row that had only 20 of 100. These two data values determine 1 of our 4 cell means and, therefore, must be considered influential points. (Cook's D=.25 for both points, the highest values by far of any of the Cook's distances. Case 98, which is an outlier in the upper right cell, has a Cook's D=.15.) The response of the Type III analysis to changing these values can politely be described as "exaggerated." (If I wanted to be somewhat less polite, I would use the word "goofy.")

The Type III tests failed to find the main effects when the effects were additive and produced an exaggerated response to the creation of a nonadditive effect by changing the values of two highly influential points. I.e., the tests are not resistant to changes in influential values. This is because, as we'll see, the Type III tests weight all of the cell means equally. Thus, changing the values of two data points in a cell that has only 2, or in a column that has only 10, has a much greater influence on the analysis than changing the values of two data points in a cell that has 72 values would have.

I'm going to commit an R heresy and say that it makes perfect sense (in my humble opinion) to weight the cell means equally when the cell frequencies are meaningless and the cell frequencies are not too different. This might happen in a randomized design with modest and random subject attrition. Weighting the cell means equally does not make sense when the cell frequencies are meaningful and disparate.

Furthermore, changing the model from nonadditive to additive, which has no effect on the size of the main effects in the data, caused the Type III tests to change the evaluation of the main effects considerably, while the Type I and Type II tests changed hardly at all in their evaluation of the main effects.

Notes

Herr and Gaebelein (1978) suggest that dropping the interaction term in a Type III analysis is not justified, because unlike in the Type I and Type II case, the interaction sum of squares is not orthogonal to the main effects sums of squares. "Because the sums of squares for rows and for columns are orthogonal to the interaction sum of squares in EAD [N.B.-Type II], HRC, HCR [N.B.-hierarchical rows then columns, hierarchical columns then rows, Type I], and WTM [N.B.-weighted means analysis for both main effects], it would be possible to pool the interaction sum of squares with error in these analyses if it were felt that there was no interaction. This would not be possible in the STP [N.B.-Type III] analysis, since the interaction sum of squares is not orthogonal to either the sum of squares for rows or that for columns when the cell sizes are unequal" (p. 212). For another take on the analysis of proportional cell data, see Gocka (1973), as well as the various Overall and Spiegel articles, responses to them, and responses to the responses.

  • Gocka, E. F. Regression analysis of proportional cell data. Psychological Bulletin, Vol 80(1), Jul 1973, 25-27.
  • Herr, D. G. and Gaebelein, J. W. Nonorthogonal two-way analysis of variance. Psychological Bulletin, Vol 85(1), Jan 1978, 207-216.
  • Overall, J. E. and Spiegel, D. K. Concerning least squares analysis of experimental data. Psychological Bulletin, Vol 72(5), Nov 1969, 311-322.
  • Overall, J. E. and Spiegel, D. K. Comment on "Regression analysis of proportional cell data." Psychological Bulletin, Vol 80(1), Jul 1973, 28-30.

Another Example: The Nature of the Problem

In the previous section we saw an example of a "proportionally balanced" design, which was unbalanced, but which, nevertheless, had independent factors. (A chi square test on the cell frequencies would give a chi square of 0.) We saw that the Type III tests worked too hard at correcting a confound between the factors that really didn't exist. The following example shows what happens when the factors are clearly confounded.

Many data sets, both real and artificial, have been offered up in various textbooks and journal articles (e.g., the salary and gender bias data and the psychotherapy for depression data in Maxwell & Delaney). The problem is perhaps most straightforwardly illustrated by an artificial data set treated by Howell and McConaughy (1982). These data, to which I have added a fourth variable to be explained below, are shown here. (These data are not available in the data sets file that accompanies these tutorials, so if you want them, copy and paste!)

### begin copying here
HM = read.table(header=T, text="
Stay   Patient Hospital  Age
   2 Obstetric        I 23.0
   2 Obstetric        I 27.8
   2 Obstetric        I 23.7
   3 Obstetric        I 24.1
   3 Obstetric        I 20.2
   3 Obstetric        I 24.6
   3 Obstetric        I 26.1
   4 Obstetric        I 23.2
   4 Obstetric        I 19.3
   4 Obstetric        I 24.2
   2 Obstetric       II 33.3
   2 Obstetric       II 26.5
   2 Obstetric       II 28.2
   3 Obstetric       II 21.4
   4 Obstetric       II 25.0
  20 Geriatric        I 76.3
  21 Geriatric        I 76.2
  21 Geriatric        I 67.2
  20 Geriatric        I 70.7
  19 Geriatric       II 77.6
  22 Geriatric       II 66.4
  22 Geriatric       II 68.0
  21 Geriatric       II 71.0
  20 Geriatric       II 71.8
  20 Geriatric       II 75.9
  23 Geriatric       II 80.3
  22 Geriatric       II 71.3
  21 Geriatric       II 75.9
  21 Geriatric       II 79.4
  20 Geriatric       II 74.1
  21 Geriatric       II 82.6
")
### end copying here and paste into the R Console
At issue is the quality of care received in two hospitals (variable "Hospital" with levels I and II) by two kinds of patients (variable "Patient" with levels Obstetric and Geriatric). The dependent variable is "Stay" (length of stay in the hospital measured in days). Added to these variables from Howell and McConaughy is a fourth variable, "Age" of the patient, created by me in a (hopefully) reasonable semi-random fashion. (Note: I'm not sure whether higher quality of care should be indicated by decreased length of stay or increased length of stay, but that's an argument for another day!)

Even a casual inspection of the data will illustrate the nature of the problem. Geriatric patients stay a lot longer (M = 20.9 days, SD = 1.02) than obstetric patients do (M = 2.9 days, SD = 0.83), and Hospital I cares mostly for obstetric patients (n = 10) and relatively few geriatric patients (n = 4), while for Hospital II it's just the reverse (n = 5 obstetric patients, n = 12 geriatric patients). Thus, "Hospital" is confounded with the kind of "Patient" being routinely seen, and straightforward calculation of mean Stay by Hospital presents a biased picture (M = 8.0 days, SD = 8.24 for Hospital I, versus M = 15.6 days, SD = 8.70 for Hospital II). Admittedly, these summary statistics are of questionable value due to the highly bimodal distribution of data within hospitals, but let's continue naively for a moment, performing a t-test to find that the difference in stay length is statistically significant, t(29) = 2.47, p = .02, two-tailed. The difference is quite large and "unmistakable" by the traditional Cohen's d criterion, d = 0.9.

Clearly, Hospital II is being handed the unpleasant end of the stick in this analysis due to the relatively large percentage of geriatric patients cared for there. If we had no other information than what's in the variables Hospital and Stay, we'd be up a statistical creek at this point. A casual walk through the patient care wards of the two hospitals would surely tell us that we have a problem with confounding, but we'd have no data we could use in an attempt to remove this confound statistically.

Fortunately, we have the Age variable, which we can use as a covariate. The result of a regression analysis with both Age and Hospital entered as predictors is shown next. The Age x Hospital interaction was nonsignificant, p = .6, so as is often done in analysis of covariance (ANCOVA), it has been removed from the analysis.
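The model behind this output can be fit something like the following (a sketch; the original command isn't shown here).

lm.out = lm(Stay ~ Age + Hospital, data=HM)    # additive ANCOVA model, no Age x Hospital term
summary(lm.out)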

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.52563    0.80740  -6.844 1.95e-07 ***
Age          0.35959    0.01612  22.313  < 2e-16 ***
HospitalII  -0.64541    0.80923  -0.798    0.432
We see that each additional year of Age results in (is associated with) a 0.4 day longer Stay, which is significantly different from 0 (H0: no effect of Age), p < .001, but the difference in average length of Stay (0.6 days less in Hospital II) is no longer statistically significant, p = .4. Adjusted mean Stay in each Hospital can be calculated by the usual method of using the regression equation twice and substituting overall mean Age (50.2 years):
adjusted mean Stay for Hospital I = -5.53 + 0.360 * 50.2 = 12.5 days
adjusted mean Stay for Hospital II = -5.53 + 0.360 * 50.2 - 0.65 = 11.9 days
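Equivalently, if you fit the model as sketched above, predict() will do the substitution for you.

newdat = data.frame(Age = mean(HM$Age), Hospital = c("I","II"))
predict(lm.out, newdata=newdat)     # adjusted means: about 12.5 and 11.9 days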
(Note: Before we rush on, I'd like to point out something that may seem almost like an aside at this juncture, but will turn out eventually to be quite important. Note that there are no patients in our sample with an age anywhere near the overall mean of 50.2. The overall mean age falls in the middle of a very dry well in our bimodal distribution of ages. With considerable justification, many of you are probably thinking that by calculating the adjusted means as we just did, we have made an unwarranted extrapolation from our data. Hold that thought!)

Suppose the Age variable was not available, as it was not in the original data from Howell and McConaughy. Now we must rely on a 2x2 unbalanced factorial ANOVA to get at the truth (and perhaps rightly so given the strongly bimodal nature of the Age variable). We begin with an examination of the relevant means and group sizes.
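One way to get those summaries is with tapply() and table() (a sketch).

with(HM, tapply(Stay, list(Patient, Hospital), mean))    # cell means
with(HM, tapply(Stay, list(Patient, Hospital), sd))      # cell standard deviations
with(HM, table(Patient, Hospital))                       # cell sizes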

                     Hospital I    Hospital II
-----------------------------------------------
Geriatric patients     M = 20.5      M = 21.0
                      SD =  0.58    SD =  1.13
                       n =  4        n = 12
-----------------------------------------------
Obstetric patients     M =  3.0      M =  2.6
                      SD =  0.82    SD =  0.89
                       n = 10        n =  5
-----------------------------------------------
We can see that, in this sample, Geriatric patients stay a smidge longer in Hospital II, while Obstetric patients stay a smidge longer in Hospital I. Neither of these differences is likely to be meaningful, much less significant, given the sizes of the standard deviations.

Before we proceed to the inferential tests, however, let me make a few relevant points about the design of this study. First, no subjects have been randomly assigned to any of the conditions of this study. It is an observational study, and both explanatory variables are quasi-independent variables. We might presume that the cell sizes are representative of the sizes of the sampled populations, although there is no guarantee of this and no way to tell from the data. In any event, the cell sizes are not different at random or by accident. Second, this is the simplest factorial design we can have, a 2x2 design with both variables being tested between groups. I.e., both variables will be tested on a single degree of freedom, which is not important in principle but turns out to be relevant in the calculations that follow. Third, we can see that we are unlikely to find a significant interaction in these data, but, as I repeatedly try to hammer into my students, nonsignificant does not mean nonexistent! Perhaps more importantly, there seems (to me) to have been no way to predict a priori whether we should have expected to see an interaction or not.

At long last, here are the results of the three different types of analyses. It is relevant to the Type I tests only that Patient was entered first and Hospital second.

Source                    SS        df      MS          F        p
--------------------------------------------------------------------
Patient: Type I         2510.71      1    2510.71     2801.2   <.001
         Type II        2068.65      1    2068.65     2308.0   <.001
         Type III       2034.96      1    2034.96     2270.4   <.001
Hospital: Types I & II     0.0044    1       0.0044      0.005  .945
          Type III         0.0158    1       0.0158      0.018  .895
Patient x Hospital         1.279     1       1.279       1.427  .243
Error                     24.2      27       0.8963
--------------------------------------------------------------------
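For the record, here is roughly how those numbers can be produced (a sketch; "car" loaded, and remember the contrasts issue for the Type III line).

hm.out = aov(Stay ~ Patient * Hospital, data=HM)
summary(hm.out)            # Type I, Patient entered first
Anova(hm.out)              # Type II
Anova(hm.out, type=3)      # Type III; valid only if the contrasts were set to
                           # contr.sum before hm.out was fitted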

Notice that, because this design is neither balanced nor proportionally balanced, the Type I and Type II tests (on the first factor entered) are now different. The Type II test is now much closer to Type III than it is to Type I. As we did in the previous section, it might be interesting to look at the actual Patient effect being tested by each type (Patient entered first and interaction in the model).

                      Type I        Type II       Type III
-------------------------------------------------------------
Geriatric patients    M = 20.875    M = 20.776    M = 20.750
Obstetric patients    M =  2.867    M =  2.779    M =  2.800
effective row size    n = 15.484    n = 12.773    n = 12.632
SS                    2510.7        2068.6        2035.0
-------------------------------------------------------------

We see that all three tests show large sums of squares and substantial significance (if there is such a thing) in length of Stay by Patient type. Who's surprised? There is no significant effect of Hospital on Stay length, and the interaction is also not significant. In fact, all three types of analysis have yielded similar results.

In this case, this is true ONLY because in the Type I tests we entered Patient first and Hospital second. I would argue, given the hypothesis we are interested in, that no one in his right mind would do those Type I tests the other way around. But, rarely being accused of being in my right mind, I'm going to do it anyway!

> summary(aov(Stay ~ Hospital * Patient, data=HM))
                 Df Sum Sq Mean Sq  F value Pr(>F)    
Hospital          1  442.1   442.1  493.225 <2e-16 ***
Patient           1 2068.6  2068.6 2307.985 <2e-16 ***
Hospital:Patient  1    1.3     1.3    1.427  0.243    
Residuals        27   24.2     0.9                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now the association of Hospital with Stay length is highly significant. What the heck could that possibly mean? Before we find out what the heck it means, I want to show one more analysis, with Age entered first as a numeric covariate, followed by Patient, followed by Hospital, followed by Patient x Hospital interaction. ("WAIT! This is ridiculous! 'Patient' is just a proxy for 'Age'. You should pick one and leave the other out." I disagree.) Here are the Type I tests.
> summary(aov(Stay ~ Age + Patient * Hospital, data=HM))
                 Df Sum Sq Mean Sq  F value   Pr(>F)    
Age               1 2422.2  2422.2 2820.192  < 2e-16 ***
Patient           1   90.4    90.4  105.284 1.24e-10 ***
Hospital          1    0.2     0.2    0.276    0.604    
Patient:Hospital  1    1.0     1.0    1.215    0.280    
Residuals        26   22.3     0.9                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So what's the difference between Type I, Type II, and Type III analyses, and why is Type I so weird? Spoiler alert! It's only the ANOVA people who think Type I is weird. Regression people probably see already what Type I tests are doing.

  • Howell, D. C. and McConaughy, S. H. (1982). Nonorthogonal analysis of variance: Putting the question before the answer. Educational and Psychological Measurement, 42(1), 9-24.

What The Heck Could It Possibly Mean?

The Regression Solution. No! I am not going to show the least squares solution to unbalanced ANOVA. I'll refer you to any advanced textbook on ANOVA for that. Above, I mentioned that in multiple regression analysis it is par for the course to have correlated, and therefore confounded, predictors. How does regression analysis deal with that?

I have this theory about gas mileage of cars. I think it's weight, and weight only, that determines gas mileage. Once weight is controlled (held constant, or however you want to say it), I don't think horsepower or engine displacement matters a bit. I have some data (admittedly from the 1970's) that I'm going to use to test this theory.

> cars = mtcars[, c(1,3,4,6)]     # mtcars is a built-in data set
> summary(cars)
      mpg             disp             hp              wt       
 Min.   :10.40   Min.   : 71.1   Min.   : 52.0   Min.   :1.513  
 1st Qu.:15.43   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:2.581  
 Median :19.20   Median :196.3   Median :123.0   Median :3.325  
 Mean   :20.09   Mean   :230.7   Mean   :146.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :472.0   Max.   :335.0   Max.   :5.424  

> round(cor(cars), 3)             # |r| >= .349 is significant at alpha = .05, two-tailed
        mpg   disp     hp     wt
mpg   1.000 -0.848 -0.776 -0.868
disp -0.848  1.000  0.791  0.888
hp   -0.776  0.791  1.000  0.659
wt   -0.868  0.888  0.659  1.000

> round(summary(lm(mpg ~ wt + hp + disp, data=cars))$coef, 3)   # the regression analysis
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   37.106      2.111  17.579    0.000
wt            -3.801      1.066  -3.565    0.001
hp            -0.031      0.011  -2.724    0.011
disp          -0.001      0.010  -0.091    0.929
Are we done here? It looks like weight is a highly significant predictor of gas mileage, just as I said, but horsepower is also significant. It looks like I was right about displacement.

Wrong! We have not tested my theory as stated! In a standard regression analysis, each predictor is treated as if it is entered LAST into the regression analysis. In this analysis, we are looking at weight AFTER (or controlled for) horsepower and displacement, horsepower after weight and displacement, and displacement after weight and horsepower. That's usually what we want from a regression analysis, but not always, and not this time!

I said "it's all weight." After weight is controlled, horsepower and displacement don't matter. That means look at weight first by itself, and we can see that it is the predictor that has the strongest correlation with mpg, r = -.868, although displacement is not far behind. Take weight out first, and THEN look at horsepower and displacement. That's a hierarchical regression.

Without going to a whole lot of trouble, we can get something close to that as follows.

> print(summary(aov(mpg ~ wt + hp + disp, data=cars)), digits=6)
            Df  Sum Sq Mean Sq  F value     Pr(>F)    
wt           1 847.725 847.725 121.7305 1.0521e-11 ***
hp           1  83.274  83.274  11.9579   0.001758 ** 
disp         1   0.057   0.057   0.0082   0.928507    
Residuals   28 194.991   6.964
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The usual interpretation requires us to have the total sum of squares, which R hasn't provided. There are more than two ways to get it, but here are two.
> SS = function(x)  sum(x^2) - sum(x)^2 / length(x)   # create a function to calculate it
> SS(cars$mpg)
[1] 1126.047
> 847.725 + 83.274 + 0.057 + 194.991                  # sum the SSes in the ANOVA table
[1] 1126.047
The aov() function in R correctly partitions the total sum of squares. And it does SEQUENTIAL tests. So when weight (wt) is entered first (by itself, alone, ignoring hp and disp), the null hypothesis is knocked right out of the park. Weight alone yields an R-squared value of 847.725 / 1126.047 = 0.753 (which is -.868^2). This is highly significant, yielding F = 121.73 on df = 1 and 28, and a p-value so tiny it isn't worth writing out. (NOTE: See the tutorial on Hierarchical Regression for a discussion of whether these tests are being done with the correct error term.)

Entering horsepower (hp) after weight increases R-squared by 83.274 / 1126.047 = 0.074, which doesn't seem like much, but it is significant, F = 11.96 on df = 1 and 28, p = .002. Entering displacement (disp) after weight and horsepower adds virtually nothing to R-squared.

But hold on a minute! Why did horsepower get to go second? Why didn't displacement get to go second?

> print(summary(aov(mpg ~ wt + disp + hp, data=cars)), 6)
            Df  Sum Sq Mean Sq   F value     Pr(>F)    
wt           1 847.725 847.725 121.73047 1.0521e-11 ***
disp         1  31.639  31.639   4.54332   0.041962 *  
hp           1  51.692  51.692   7.42277   0.010971 *  
Residuals   28 194.991   6.964                         
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Entering displacement second (after weight but before horsepower) results in a small but significant increase in R-squared of 31.639 / 1126.047 = 0.028. Then adding horsepower after weight and displacement results in another significant bump in R-squared.

Which should we believe? I didn't specify in my "theory" which should be entered second and third. All I said was after weight, horsepower and displacement add nothing. The only fair thing to do would be to add them in together. In which case, R-squared would be increased by (31.639 + 51.692) / 1126.047 = 0.074. This would be tested on df = 2 and 28, so MS = (31.639 + 51.692) / 2 = 41.6655, giving an F = 41.6655 / 6.964 = 5.983, and...

> pf(5.983, 2, 28, lower=F)
[1] 0.006863456
... p = .007. Looks like I was wrong. Dang! I hate to be wrong!
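
If you'd rather let R do that arithmetic, the same joint test can be run as a model comparison with anova(): the reduced model contains weight only, and the full model adds horsepower and displacement, so the single F test asks whether those two variables add anything jointly after weight. This is just a sketch of the test computed by hand above; it should reproduce F = 5.983 on 2 and 28 df and p = .007.

> anova(lm(mpg ~ wt, data=cars), lm(mpg ~ wt + hp + disp, data=cars))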

This is what the aov() function does. It does sequential or hierarchical tests. In the summary table, an effect is controlled for any effect that occurs ABOVE it in the table, but no effect that occurs BELOW it in the table. In ANOVA lingo, these are Type I tests.

(A final note on cars: At least one of the interactions should be included in this model. I don't know which one. I didn't look. I left out the interactions to keep things simple.)

The ANOVA Interpretation. In the cross-fostering study, we saw these two ANOVA tables.

              Df Sum Sq Mean Sq F value  Pr(>F)   
Litter         3   60.2   20.05   0.370 0.77522   
Mother         3  775.1  258.36   4.763 0.00574 **
Litter:Mother  9  824.1   91.56   1.688 0.12005   
Residuals     45 2440.8   54.24

              Df Sum Sq Mean Sq F value  Pr(>F)   
Mother         3  771.6  257.20   4.742 0.00587 **
Litter         3   63.6   21.21   0.391 0.76000   
Mother:Litter  9  824.1   91.56   1.688 0.12005   
Residuals     45 2440.8   54.24
I glossed over the difference between them, but now it should be obvious. The "main effect of Mother" is significant in both tables with p = .006, but they are NOT the same effect of Mother. The first table shows the effect of Mother after (controlled for, removing, holding constant) Litter. The second one shows the effect of Mother before (ignoring, uncontrolled for) Litter. Which of these do we want? I don't know! What was the hypothesis?

It's easier to decide in the case of the hospital analysis.

                 Df Sum Sq Mean Sq  F value Pr(>F)    
Patient           1 2510.7  2510.7 2801.206 <2e-16 ***
Hospital          1    0.0     0.0    0.005  0.945    
Patient:Hospital  1    1.3     1.3    1.427  0.243    
Residuals        27   24.2     0.9

                 Df Sum Sq Mean Sq  F value Pr(>F)    
Hospital          1  442.1   442.1  493.225 <2e-16 ***
Patient           1 2068.6  2068.6 2307.985 <2e-16 ***
Hospital:Patient  1    1.3     1.3    1.427  0.243    
Residuals        27   24.2     0.9
I think it's pretty obvious that we want the first analysis here. Patient and Hospital were confounded, and we want to look at the Hospital effect after controlling for Patient. After taking out the effect of Patient, Hospital accounts for virtually nothing in terms of variability in Stay length.

Which effect of Patient should we look at? Who cares?! We know what the effect of Patient is! We're just including it as a control. I'm not even going to look at it (unless I have to look at it as part of a significant interaction). I do have an interesting question about Patient, however. If I take out the effect of Age first, is there any variability in Stay length left over for Patient to explain? Or is Patient merely a proxy for Age?

                 Df Sum Sq Mean Sq  F value   Pr(>F)    
Age               1 2422.2  2422.2 2820.192  < 2e-16 ***
Patient           1   90.4    90.4  105.284 1.24e-10 ***
Hospital          1    0.2     0.2    0.276    0.604    
Patient:Hospital  1    1.0     1.0    1.215    0.280    
Residuals        26   22.3     0.9
Taking out the effect of Age first doesn't leave much for anything else to explain. The effect of Patient is a mere wisp of its former self. But it's still significant. If these were real data (they are not), that would suggest to me that there is something about the difference between obstetric and geriatric patients in addition to their age that is determining length of hospital stay. The effect is small though, R-squared = 0.036. (Note: In ANOVA, R-squared and the more traditional eta-squared measure of effect size are the same thing.)

Now, tell me how we would get that information from Type II or Type III tests. Answer: We wouldn't!

To summarize, the following Type I analysis...

DV ~ A * B, or equivalently, DV ~ A + B + A:B

... does these tests. Factor A is entered first and grabs all the variability it can get, uncontrolled for or ignoring B and the A:B interaction. Then factor B is entered and allowed a chance at whatever variability is left. That is, we are testing the effect of B after A, or removing A, or controlling for A, or holding A constant. We are not controlling either of these main effects for the interaction effect. Finally, the interaction is entered, and it is controlled for both A and B effects. Once again, in a Type I summary table, each effect is controlled for everything above it in the table and for nothing below it in the table. It is a series of sequential, nested tests. First A, then A + B, and finally A + B + A:B.

We could do a similar thing as follows.

> Model1 = aov(Stay ~ 1, data=HM)      # no predictors, just an intercept
> Model2 = aov(Stay ~ Patient, data=HM)
> Model3 = aov(Stay ~ Patient + Hospital, data=HM)
> Model4 = aov(Stay ~ Patient + Hospital + Patient:Hospital, data=HM)
> anova(Model1, Model2, Model3, Model4)
Analysis of Variance Table

Model 1: Stay ~ 1
Model 2: Stay ~ Patient
Model 3: Stay ~ Patient + Hospital
Model 4: Stay ~ Patient + Hospital + Patient:Hospital
  Res.Df     RSS Df Sum of Sq         F Pr(>F)    
1     30 2536.19                                  
2     29   25.48  1   2510.71 2801.2056 <2e-16 ***
3     28   25.48  1      0.00    0.0049 0.9447    
4     27   24.20  1      1.28    1.4269 0.2427    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using this method, we might switch Model1 and Model2, depending upon what we are trying to find out, but notice how silly it would be to do the tests in any other order!

Type II Tests. Can't decide what order to put A and B in? Then do Type II tests. Type II tests treat all of the main effects equally, as if each main effect were the last one to be entered. Thus, all of the main effects are controlled for confounding with all of the other main effects, but the main effects are not controlled for the interaction effects.

Type III Tests. Type III tests are the ones that most resemble a standard regression analysis, in that every effect is treated as if it were entered last. Everything in the table is controlled for confounding with everything else in the table. Thus, all of the main effects are treated equally in that all of the main effects are controlled for confounding with all of the other main effects, but in Type III, the main effects are also controlled for confounding with the interactions. I.e., the interactions are also taken out or removed before the main effects are tested. And therein lies a rather vigorous debate!
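
For concreteness, here is one way to request these tests in R, using the Anova() function from the "car" package on the hospital data. Take it as a sketch only; it assumes the HM data frame from the hospital example is still in the workspace, and it switches to sum-to-zero contrasts, which matters for the Type III tests (but is irrelevant to Type II).

> library("car")                                    # for the Anova() function
> options(contrasts=c("contr.sum","contr.poly"))    # effects (sum-to-zero) coding
> Anova(aov(Stay ~ Patient * Hospital, data=HM), type=2)
> Anova(aov(Stay ~ Patient * Hospital, data=HM), type=3)

We'll work out below, by hand, exactly what these two requests are doing to the cell means.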

Terminology Issues. If you go to the literature, you'll discover that the SAS terminology is by no means universal, especially in older pubs before the SAS terminology was developed. The following table shows some of the other common names for these analyses, along with a summary in the last column of what the tests do, where the vertical bar should be read as "given" or "after" or "removing" or "controlling for".

           Overall &    Herr &
SAS        Spiegel      Gaebelein   Others (e.g., Macnaughton)   Tests
------------------------------------------------------------------------------------------------
Type I     Method III   HRC, HCR    hierarchical or sequential   A, B|A, AxB|A & B
Type II    Method II    EAD         higher terms omitted (HTO)   A|B, B|A, AxB|A & B
Type III   Method I     STP         higher terms included (HTI), A|B & AxB, B|A & AxB, AxB|A & B
                                      partial, marginal
------------------------------------------------------------------------------------------------

  • Appelbaum, M. I. and Cramer, E. M. (1974). Some problems in the nonorthogonal analysis of variance. Psychological Bulletin, 81(6), 335-343.
  • Carlson, J. E. and Timm, N. H. (1974). Analysis of nonorthogonal fixed-effects designs. Psychological Bulletin, 81(9), 563-570.
  • Hector, A., von Felten, S., and Schmid, B. (2010). Analysis of variance with unbalanced data: An update for ecology and evolution. Journal of Animal Ecology, 79(2), 308-316.
  • Herr, D. G. and Gaebelein, J. W. (1978). Nonorthogonal two-way analysis of variance. Psychological Bulletin, 85(1), 207-216.
  • Langsrud, O. (2003). ANOVA for unbalanced data: Use Type II instead of Type III sums of squares. Statistics and Computing, 13(2), 163-167.
  • Macnaughton, D. B. (1992). Paper presented at the Joint Statistical Meetings, Boston.
  • Overall, J. E. and Spiegel, D. K. (1969). Concerning least squares analysis of experimental data. Psychological Bulletin, 72(5), 311-322.
  • Overall, J. E. and Spiegel, D. K. (1973). Comment on "Regression analysis of proportional cell data." Psychological Bulletin, 80(1), 28-30.
  • Overall, J. E., Spiegel, D. K., and Cohen, J. (1975). Equivalence of orthogonal and nonorthogonal analysis of variance. Psychological Bulletin, 82(2), 182-186.

How Do You "Control" For Something?

Short answer: you hold it constant. Here is a summary of the hospital data again, except this time I've given sums of squares in the cells instead of standard deviations. (All of the values in this table are exact, i.e., unrounded.) At the bottom of the columns I've added two sets of means, the weighted and the unweighted marginal means.

                     Hospital I    Hospital II
-----------------------------------------------
Geriatric patients     M = 20.5      M = 21.0
                      SS =  1.0     SS = 14.0
                       n =  4        n = 12
-----------------------------------------------
Obstetric patients     M =  3.0      M =  2.6
                      SS =  6.0     SS =  3.2
                       n = 10        n =  5
-----------------------------------------------
weighted means         M = 8         M = 15 10/17
unweighted means       M = 11.75     M = 11.8
If we want to look at the effect of Hospital controlling for Patient, we don't ignore Patient. Rather, we compare hospitals while looking inside of each level of Patient. Geriatric patients stayed half a day longer on average in Hospital II. Obstetric patients stayed four tenths of a day longer in Hospital I. Neither of those differences is likely to mean very much. We have removed the influence of Patient by comparing hospitals only within comparable levels of Patient.

On the other hand, if we just blindly calculate (weighted) means at the bottom of the columns, we are ignoring the Patient factor, treating it as if it doesn't even exist, and then we end up in trouble with the confound. Patients in Hospital II stayed more than seven and a half days longer than patients in Hospital I, on average.

Weighted means are weighted by the cell sizes, and that's what's causing the confounding here, the unequal cell sizes (unbalanced design). What if we calculate column means by weighting the cell means equally? Such means are called unweighted means. (Some people prefer to call them by the more logical name equally weighted means.) By doing this, we remove the effect of the unequal cell sizes and, thereby, remove the confound created by the unbalanced design. As you can see, the unweighted column means are almost identical.

We calculated the unweighted means by making ALL the cell weights equal. Any value will give the same answer, so the easiest way would be to use 1, but just to make our lives interesting, let's use 6.315789. The column means are:

Hospital I: (20.5 * 6.315789 + 3.0 * 6.315789) / (6.315789 + 6.315789) = 11.75
Hospital II: (21.0 * 6.315789 + 2.6 * 6.315789) / (6.315789 + 6.315789) = 11.8
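
If you want R to do this, the unweighted column means can be had directly from the table of cell means. A minimal sketch, again assuming the HM data frame; the colMeans() should come out to 11.75 and 11.8.

> cell.means = with(HM, tapply(Stay, list(Patient, Hospital), mean))   # the 2x2 table of cell means
> colMeans(cell.means)                                                 # unweighted (equally weighted) column means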

We would then use the same weights in calculating the unweighted row means. But to control Hospital for Patient, we don't have to make ALL the cell weights the same and, in fact, it may not be desirable to do so. Here's why. In the calculation of the interaction term in the 2x2 case, all four cell weights will be set to the same value, and this calculation will be done.

SS(AxB) = harm.mean(n) * (M11 - M12 - M21 + M22)^2 / 4
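
As a quick check, plugging the hospital cell means and cell sizes from the table above into this formula reproduces the interaction sum of squares of about 1.28 that appeared in the hospital ANOVA tables. (The harm.mean() function is defined here for convenience; it appears again below.)

> harm.mean = function(x)  1/mean(1/x)
> harm.mean(c(4,12,10,5)) * (20.5 - 21.0 - 3.0 + 2.6)^2 / 4
[1] 1.278947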

Setting all the cell weights equal controls the interaction for the unbalanced condition in BOTH of the main effects, which is what we want. However, when calculating the main effects, setting ALL the cell weights equal not only controls one main effect for the unbalanced condition in the other, but it also controls the main effects for the unbalanced condition in the interaction. We may not want to do this.

In the calculation of the Hospital main effect, to control Hospital for Patient, we do not need to set ALL the cell weights the same; they only need to be the same across each of the rows. That is, the weights for the geriatric patients' cells have to be the same, and the weights for the obstetric patients' cells have to be the same. Once again, as long as they are constant within rows, those weights can be anything, so let's make them 6 in the first row and 6 2/3 in the second row. ("I bet there's a reason you chose those particular values." You know me too well!) In this case, the balanced column means become:

Hospital I: (20.5 * 6 + 3.0 * 6.666667) / (6 + 6.666667) = 11.28947
Hospital II: (21.0 * 6 + 2.6 * 6.666667) / (6 + 6.666667) = 11.31579

Doing it this way controls Hospital for the unbalanced condition of Patient, but does not control Hospital for the unbalanced condition of the interaction. Now, let's find out why I chose those particular values.

WARNING: The following calculations work only in the df=1 case!

In a Type I analysis, if Hospital were entered first into a factorial ANOVA, the sum of squares for Hospital would be calculated from the weighted means, thus ignoring any possible confound with Patient. As we've seen, that's not necessarily a bad thing, depending upon what you hope to find out from your analysis. The calculation would be done this way: take the difference between the weighted column means, square it, and multiply by the harmonic mean of the column sizes divided by 2. (If this calculation sounds odd to you, fiddle around algebraically with the t-test formula a bit. It's exactly what you'd do in the course of calculating t^2 between the two columns.)

> harm.mean = function(x)  1/mean(1/x)      # a function to calculate harmonic means
> (15+10/17 - 8)^2 * harm.mean(c(14,17))/2
[1] 442.0759
No attempt has been made to equalize any cell sizes, so no attempt has been made to eliminate any confound due to a second factor or an interaction.

In a Type III analysis, the sum of squares for Hospital would be calculated from the unweighted column means. (Warning once again: ONLY in the two-group or df=1 case!) Type III tests would control for the confound with the second factor by calculating an effective sample size for the cells and setting each cell size to that value. The effective cell size is just the harmonic mean of the realized cell sizes.

> harm.mean(c(4,10,12,5))                   # hmmmm, where have I seen that number before?
[1] 6.315789
As Maxwell and Delaney (2004) show, the sum of squares can now be obtained by multiplying the squared mean difference by the effective cell size.
> (11.75 - 11.8)^2 * harm.mean(c(4,10,12,5))
[1] 0.01578947
(NOTE: It's the same calculation we did for Type I. If the effective cell size is 6.3, then the effective column size is twice that, and the SS for Hospital is mean difference squared times effective column size divided by 2.)

You've no doubt noticed that the effective cell size is smaller than the arithmetic mean cell size, 7.75. This reflects the fact that an unbalanced design loses power to reject the null hypothesis compared to a balanced design. The more unbalanced the design is, the more power is lost.

The Type II analysis is the most complex, but in the df=1 case, the logic is not hard to follow. For all the gnashing of teeth that has occurred concerning Type II tests, they actually do something quite similar to (or at least in the same vein as) Type III tests. They equalize the cell sizes where they need to be equalized to remove the confounding from a given factor. They just don't set ALL the cell sizes (or more appropriately, weights) to the same value.

If we think of hypotheses as being about population values or parameters, then the point of calculating the marginal means is to estimate these population values. How best to do that? We've already seen two methods. One is to use the unweighted means of the cell means. I.e., we add up the cell means (in a row, for example) and divide by the number of them (the number of columns). Realized cell sizes are ignored (or used to calculate an effective cell size, which is actually quite irrelevant to calculating the unweighted means). Another method is called weighted means, in which the cell means are weighted by the cell sizes. This gives the mean of all the dependent variable values in a given row or column, as if the cross-classification were being ignored. We saw this when we examined the method called Type I. Thus, Type III, using unweighted means, statistically controls the confounding of the two factors created by the unequal realized cell sizes by setting all those cell sizes to an "effective cell size." Type I, using the weighted marginal means (for the first factor entered into the analysis only), does not attempt to control for the confounding, because it does not attempt to equalize the relevant cell sizes. Type III treats each cell mean equally but treats subjects differently. (Subjects in smaller cells count more than subjects in larger cells.) Type I treats each subject equally but treats the cell means differently. (Larger cells count more than smaller cells.)

Type II also attempts to control the confounding but takes a somewhat different approach to estimating the marginal means. Consider the calculation of the mean square for error. It's based on the variances in each of the treatment cells, but it doesn't just add those variances and divide by the number of cells. Rather, a so-called pooled variance is calculated, in which the cell variances are weighted by the realized cell sizes. The idea is that estimates based on larger samples are likely to be more accurate, and so variances based on larger samples are weighted more heavily in the calculation of the estimated population variance (mean square error). The pooled variance will come out somewhat closer in value to the "more accurate" variances in the larger cells than to the "less accurate" variances in the smaller cells. The Type II tests do something similar in estimating the marginal means. The idea is that bigger (and more balanced) rows should have more impact on the column means than smaller (and less balanced) rows do.

Once again, an "effective cell size" is calculated, but in the Type II analysis it is calculated row by row rather than over the whole table (for the column effect, that is). Once again, this is done by calculating harmonic means of the realized cell sizes. In the calculation of a column effect (df=1 only!), the "effective cell sizes" in the rows are set as follows.

> harm.mean(c(4,12))    # for the first row
[1] 6
> harm.mean(c(10,5))    # for the second row
[1] 6.666667            # (for my fraction-impaired students, 6 2/3)
The value is a bit larger for the second row even though the second row has fewer subjects because the cells in the second row are closer to balance. (See Maxwell and Delaney.) These values are then used as weights to calculate the column means from the cell means.
> (20.5 * 6 + 3.0 * 6.666667) / (6 + 6.666667)   # for the first column
[1] 11.28947
> (21.0 * 6 + 2.6 * 6.666667) / (6 + 6.666667)   # for the second column
[1] 11.31579
Once again, as shown by Maxwell and Delaney, the SSs can be calculated directly from the mean differences squared, this time by multiplying by the average of the "effective sample sizes" used to calculate them. (NOTE: Once again, it's the same calculation. If the effective cell size in the first row is 6, and the effective cell size in the second row is 6.67, then the effective column size is the sum of those, and the SS for Hospital is mean difference squared times effective column size divided by 2.)
> (11.28947 - 11.31579)^2 * (6 + 6.666667)/2
[1] 0.004387369
That would also be the Type I sum of squares for Hospital if it were entered second into the model formula. In the df=1 case, the interaction sum of squares, which would be the same for all three analyses, can be calculated from a simple contrast, as shown by Maxwell and Delaney (and given above). It can also be calculated by subtraction in the Type I tests, since the Type I tests correctly partition the total sum of squares (which is just the SS of the dependent variable). The sum of squares for error, also the same for all three types, is calculated by summing the within-cell sums of squares, in this case,
1 + 14 + 6 + 3.2 = 24.2.

For df>1 cases, a least squares regression method is used, and the above calculations no longer work (except for error). But they convey the idea of what the different analyses are attempting to do and how they either do or don't attempt to control for confounds between the factors (and interactions) due to unbalanced cell sizes. The following table summarizes this once again for a somewhat more extreme 2x2 case than the one shown above, and perhaps for that reason one that shows a little better what's happening. (We will discuss these data below.)

           Calculation of the Column Main Effect (Note: Entered 1st in Type I)
------------------------------------------------------------------------------------------
The Cell Sizes:
        self-cutters noncutters
Females      15          65
Males         1          59

Mean Self-Esteem Scores (Rosenberg Scale):
                                             Row Weights
                                  ---------------------------------
                                     Type I
        self-cutters noncutters   col.1 col.2    Type II   Type III
Females     18.93333   22.75385     15    65    24.375     3.639671
Males       28.00000   23.32203      1    59     1.96667   3.639671

Calculation of the Column Means:
Type I
     col.1: (18.93333 * 15 + 28.00000 * 1) / (15 + 1) = 19.5
     col.2: (22.75385 * 65 + 23.32203 * 59) / (65 + 59) = 23.02419
Type II
     col.1: (18.93333 * 24.375 + 28.00000 * 1.96667) / (24.375 + 1.96667) = 19.61025
     col.2: (22.75385 * 24.375 + 23.32203 * 1.96667) / (24.375 + 1.96667) = 22.79627
Type III
     col.1: (18.93333 * 3.639671 + 28.00000 * 3.639671) / (3.639671 + 3.639671) = 23.46667
     col.2: (22.75385 * 3.639671 + 23.32203 * 3.639671) / (3.639671 + 3.639671) = 23.03794

Calculation of Sum of Squares:
Type I:   (19.5 - 23.02419)^2 * harm.mean(c(16,124))/2 = 176.008
Type II:  (19.61025 - 22.79627)^2 * (24.375 + 1.96667)/2 = 133.693 (or Type I entered 2nd)
Type III: (23.46667 - 23.03794)^2 * (3.639671 + 3.639671)/2 = 0.669006
------------------------------------------------------------------------------------------
Notice in this example, the Type I and II tests are giving quite a different answer than the Type III tests. The interaction (profile) plot was given in the second section of this article, if you want to scroll back and have a look. We'll discuss this case in considerable detail near the end of this article.

For the record, in the regression method, which is equivalent to the above in df=1 cases, but must be used in df>1 cases, the sum of squares is calculated by comparing an enhanced model (i.e., a model containing the effect being evaluated) with a reduced model (a model without that effect). The following table summarizes this for the three Types of tests.

Source                 Enhanced            Reduced
---------------------------------------------------------------------------
Factor A: Type I      DV ~ A              DV ~ 1 (i.e., just the intercept)
        : Type II     DV ~ A + B          DV ~ B
        : Type III    DV ~ A + B + AB     DV ~ B + AB
Factor B: Type I      DV ~ A + B          DV ~ A
        : Type II     DV ~ A + B          DV ~ A
        : Type III    DV ~ A + B + AB     DV ~ A + AB
A x B                 DV ~ A + B + AB     DV ~ A + B
---------------------------------------------------------------------------
For example, to get the Type II sum of squares for Hospital in the Howell and McConaughy data, we could do this.
> reduced = lm(Stay ~ Patient, data=HM)
> enhanced = lm(Stay ~ Patient + Hospital, data=HM)
> anova(reduced, enhanced)
Analysis of Variance Table

Model 1: Stay ~ Patient
Model 2: Stay ~ Patient + Hospital
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     29 25.483                           
2     28 25.479  1  0.004386 0.0048 0.9451
When the effect of interest is included in the enhanced model, it is removed from the residual sum of squares. So the difference in the residual sums of squares (RSS) is the sum of squares for the effect. (Note: To get the Type III tests this way, you're going to have to dummy code the IVs, because R will not let you do a model of the type DV ~ A + AB. Furthermore, when you do the dummy coding, use effects, or sum-to-zero, codes, not 0/1 codes. For a detailed example, see this website.)
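
If you'd rather not do the dummy coding by hand, drop1() offers a shortcut. With the contrasts set to sum-to-zero, dropping each term in turn from the full model reproduces the Type III tests. A sketch, once again assuming the HM data frame.

> options(contrasts=c("contr.sum","contr.poly"))   # effects (sum-to-zero) coding
> full = lm(Stay ~ Patient * Hospital, data=HM)
> drop1(full, .~., test="F")                       # each term dropped from the full model: Type III tests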

Type II tests especially have often been cast into the rubbish bin of statistical tests as testing "meaningless" (Howell & McConaughy, 1982) hypotheses about the marginal means. Indeed, when one looks at mathematical expressions of the hypotheses that are tested, one might be inclined to agree! You can find those expressions here, among other places. (Hint: If you're math phobic, don't click that link! You've been warned!) However, what it all boils down to after all the subscripts are accounted for is an attempt to control confounding by equalizing cell sizes where they need to be equalized, and in the case of Type II, NOT where they don't need to be equalized.


I've Heard There's Some Sort of Issue With Interactions

The differences among the three methods have to do with how they attempt to implement statistical control for the confounding that exists among the factors when the design is unbalanced. This is an issue only when it comes to testing the main effects (in a two-factor design). The interaction test (highest order interaction test) is the same in all three methods.

So the issue with interactions is not that they test differently (at least not the highest order interaction). The issue is over how to interpret the main effects when an interaction is in force.

First, when there is a nonzero interaction, should it be removed from the total variability before the main effects are evaluated? (Not removed as in deleted from the model but removed from the main effects.) In other words, should main effects be controlled for interactions? Should main effects ever be treated as if they were entered into the analysis after interactions? (Is Type III analysis ever justified?)

The Type I/II people say, "No! Never! Such a stunt violates the principle of marginality." Marginality says that you should never include a higher-order term in your model unless all the lower order terms that are involved in it are also included. E.g., if we have AxB, then we must also have A and B. We can do DV ~ A + B + A:B, but we can't (or shouldn't) do DV ~ A + A:B. Yet this is exactly what Type III does when it controls main effects for interactions. In the course of running through the underlying math, models such as DV ~ A + A:B are used in the calculations.

How seriously people take the principle of marginality seems to depend upon what type tests they have already decided to do. Type I/II people take it seriously. Type III people rarely mention it. It seems to me that the issue is a practical one. We all know that ANOVA is not a perfect technique that always yields the correct answer, and we all know it is sometimes used in imperfect circumstances. The issue should be, does violating (or not violating!) the principle of marginality damage the integrity of the calculations? The answer appears to be yes, in some circumstances it does (both ways!). So the question should not be, should we allow violations of the principle of marginality? It should be, when are those violations harmful, and when is failing to violate the principle harmful? The issue is dauntingly complex. A partial short answer would seem to be, if the imbalance in the design is mild or minor, it doesn't make much difference. If the imbalance is moderate or severe, it can make a huge difference. We are then faced with the question, which way is "correct?" Which way gives the "right answer?" The principle of marginality is more of a guide or a helpful suggestion than it is a law. When the design is more than mildly unbalanced, it can be a very good guide!

Some people have argued that Type II tests (in particular) should never be used in the presence of a nonzero interaction, because they inflate the false positive rate for the main effects. This and testing "meaningless" hypotheses are probably the two stock criticisms of Type II tests. Other people have argued that in the presence of a nonzero but nonsignificant interaction, Type II tests should be used because they are more powerful at finding the main effects (Langsrud, 2003; Macnaughton, 1998). I find both of these arguments dubious.

First of all, if the interactions are dropped from the model, so that the model becomes additive, then Type II and Type III tests produce identical results. If the nonsignificant interactions are not dropped, then the correct statement is that Type II tests are usually more powerful. Sometimes that is not the case. At any rate, if the tests are more powerful because they are inflating the false positive rate for main effects, then choosing Type II tests based on the "more powerful" argument is dubious. (All of this assumes we are testing traditional main effects that appear in the marginal means.)
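
The first of those statements is easy to check with the hospital data. In the additive model there is no interaction term for the main effects to be adjusted for, so Type II and Type III give the same tests (apart from the intercept line that Type III prints). A sketch, assuming "car" is loaded and HM is still around:

> additive = aov(Stay ~ Patient + Hospital, data=HM)
> Anova(additive, type=2)
> Anova(additive, type=3)       # same tests of Patient and Hospital as type=2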

The counter to the "more powerful" argument is that in the presence of a nonzero interaction, significant or not, Type II tests inflate the false positive rate. Based on Monte Carlo simulations I've done, testing for traditional main or marginal effects, this appears to be true, but only slightly. Based on the size of the inflation I've seen in simulations, I find this argument almost humorous when applied to a technique that is notorious for inflating false positive rates. In a two-factor design, we do three tests -- two main effects and one interaction -- EACH at an alpha of .05. In a three-factor design we do SEVEN tests, EACH at an alpha of .05. In higher order designs, the inflation problem only gets worse. To worry about inflating an alpha level from .05 to .055 by using Type II tests, the typical magnitude of inflation I observed in a 3x3 simulation, strikes me as being just a tad fussy. In 2x2 simulations, there was no inflation.

I hasten to point out that these simulations were done with cases in which there were no marginal or interaction effects of any kind in the population but in which sampling error produced a nonzero interaction in the sample. Spector, Voissem, and Cone (1981) found false positive rates as high as .3 for the main effects in the presence of a significant interaction, using Overall and Spiegel's Method II (Type II analysis in the language of this article). In cases where there were no effects in the (simulated) population, the results of Spector and colleagues agreed closely with my results (minimal if any inflation).

That's certainly a disturbing finding on the face of it. But I'm not surprised. This means that they were simulating cases in which the population contained an interaction but no main effects (otherwise the main effect would not be a false positive). This is the infamous "X-case" on the profile plot that gives students so much of a problem every semester in their research methods classes. It brings us face to face with the issue of whether it is meaningful to examine main effects in the presence of an interaction. Arguments have been made on both sides. Venables (1998) implied that it's "silly." Others (e.g., Howell & McConaughy, 1982) have suggested that it sometimes can be useful.

[image: Interactions]
I find myself leaning towards the "sometimes can be useful" group, especially when the interaction is a small effect, and the marginal effect or effects are large. However, the only case in which an interaction can occur without main effects is in the case of a disordinal interaction (sometimes incorrectly defined as cases when the profile lines on the interaction plot cross). While it MAY sometimes be useful to look at main (marginal) effects in the presence of an interaction, the case of a disordinal interaction is certainly NOT one of those times! A disordinal interaction means the effect of A changes in direction at different levels of B, and the whole notion of a main effect in such cases is, indeed, "silly."

I ran simulations using this proportionally balanced design. Numbers in the table represent cell frequencies or group sizes.

    B
A    B1 B2 B3 B4
  A1  2  2  4  8
  A2  3  3  6 12
  A3  6  6 12 24
In the "infamous X-case," my results agreed with those of Specter et al, but I would have to argue very strongly that in this case the main effects are meaningless. There are no main effects. There may be marginal effects, but the beginning and end of our interest here is going to be in the interaction. The ANOVA will report on "main effects," but that report should be ignored (which will also help to keep down the false positive rate inherent in ANOVA). If the analysis fails to find a statistically significant interaction, then the analysis has failed to find any meaningful effect to be statistically significant.

It is possible to have a two-factor interaction and only one marginal effect, in the case where the profile lines diverge. I simulated this condition in the current design by adding and subtracting respectively 1/2 sigma to the values at B3A1 and B3A3 and 1 sigma to the values at B4A1 and B4A3. In both Type II and III analyses, this interaction effect was found about 54-55% of the time. (Of course this result was the same for both types.) In the Type III analysis the A marginal effect was found about 50% of the time, and a false positive B marginal effect about 5% of the time. In the Type II analysis, the A marginal effect was found about 98% of the time, and a false positive B marginal effect was found 18-20% of the time.

When the interaction was made to go in the opposite direction, i.e., the profile lines diverging more in the direction of the smaller groups (B1 and B2), the interaction was found about 40% of the time by both analyses. On the other hand, the Type III analysis was more successful at finding the A marginal effect than was Type II (52-54% of the time vs. 21% of the time). False positive rates for the B marginal effect remained inflated with the Type II analysis (12-14%) but were not inflated or only slightly inflated with the Type III analysis.

But I tricked myself, didn't I? (As have some similar results published in reputable journals!) By making the lines diverge, I didn't create an ordinal interaction in the data; I created another disordinal interaction. Consider the following profile plots, which are from a study that we'll discuss in some detail below.

[image: Interactions]
By the "lines-cross" definition of a disordinal interaction, the interaction on the left is disordinal, while the interation on the right is not. I've seen the "lines-cross" definition at various places on the Internet and even in a textbook or two, but it's wrong! Both of these graphs show the same means! If the interaction is disordinal on the left, then it's disordinal on the right. The proper definition of a disordinal interaction is one in which the simple effects of at least one of the factors go in opposite directions. The red arrows show the simple effects of Cut.Scar at each of the two levels of Sex.

So once again, the inflation I was seeing in the previous simulation study may be a moot point. If we're seeing a disordinal interaction, we shouldn't even be looking at main effects. So yes, in the presence of a disordinal interaction, false positive rates for marginal effects that shouldn't exist can be substantially inflated in the Type II tests. What about when the interaction is ordinal? Here's the catch. You can't create an ordinal interaction in the above design without creating two marginal effects! You can't have false positives for marginal/main effects in such cases.

Nevertheless, I ran simulations after creating an interaction as follows. In the above design, the A3 profile line was left flat. The A2 profile line was increased by 1/8 sigma at B1, 1/4 sigma at B2, 1/2 sigma at B3, and 3/4 sigma at B4. The A1 profile line was increased by twice those values. The Type II tests found the A marginal effect 93-95% of the time and the B marginal effect 17-18% of the time. The Type III tests did not do as well with the A effect (55-61%) but did better with the B effect (32-35%). Both types found the interaction 17-19% of the time. Then I reversed the direction of the interaction, placing the smaller effects at B4 and the larger effects at B1. The Type II tests found the A marginal effect 37-42% of the time and the B marginal effect 14-18% of the time. The Type III tests did better with both effects, finding the A effect 56-60% of the time and the B effect 25-29% of the time. The interaction was found 15-17% of the time.

This still leaves us to cope with the disturbing case, as in the graphs above, of a disordinal interaction that is not statistically significant. Maybe we should just redouble our efforts to keep all our designs balanced? We can't! If we're taking a random sample from a population of intact groups, we take what we get! And if we don't, if we try subsequently to adjust those group sizes towards balance, so much for our random sample!

  • Howell, D. C. and McConaughy, S. H. (1982). Nonorthogonal analysis of variance: Putting the question before the answer. Educational and Psychological Measurement, 42(1), 9-24.
  • Langsrud, O. (2003). ANOVA for unbalanced data: Use Type II instead of Type III sums of squares. Statistics and Computing, 13(2), 163-167.
  • Macnaughton, D. B. (1992). Paper presented at the Joint Statistical Meetings, Boston.
  • Spector, P. E., Voissem, N. H., and Cone, W. L. (1981). A Monte Carlo study of three approaches to nonorthogonal analysis of variance. Journal of Applied Psychology, 66(5), 535-540.
  • Venables, W. N. (2000). Exegeses on Linear Models. Paper presented to the S-PLUS Users Conference, Washington, D.C., 8th-9th October, 1998.

More on Interactions

Those pesky interactions aren't done with us yet! I want to talk about two issues. 1) When is an interaction not an interaction, and vice versa? 2) Are main effects ever meaningful in the presence of an interaction?

I found it necessary to temper my views on looking at main effects when there is an interaction in force because I'm a psychologist. In my own subfield, physiological psychology, we're stricter about measurement than most psychologists are, but even we do some stuff that hard science would consider pretty soft.

When I was doing my dissertation research, I found it necessary to assess the emotionality of rats. I had several measures, but one was to pick the rat up and count the number of times I was bitten. (I was wearing a chain mail glove. Even grad school isn't that cruel!) You can almost expect one bite from a rat that hasn't been gentled. One bite doesn't mean much. But if the rat continues to gnaw on you, that is one POed rat! I seriously doubt that it is an interval scale of emotionality.

If we're doing an experiment in which we give a subject a list of 30 words, and then some time later ask him or her to recall as many of the words as s/he can, we're not really interested in how many words a person can recall from a list of 30. How often does that come up?! We're interested in how some treatment we've applied has affected the subject's ability to memorize or recall. I doubt number of words recalled is a linear measure of that. Getting the first 6 or 7 is pretty easy. Then it gets harder, and by the time the subject has made it into the upper teens or lower twenties, each additional word recalled is a much more difficult chore. So does this graph show an interaction?

[image: Words Recalled]

Or does it show an artifact of our scale of measurement? And if that's the case, then at least we have an interesting main effect or two to look at.

Ramsey and Schafer, in their very excellent stat book, The Statistical Sleuth, discuss an experiment by Roger Fouts (1973) on American Sign Language learning by chimpanzees. Four chimps were taught the same 10 signs each, and interest was in differences among the chimps, and differences in difficulty of learning among the signs. The measure of how hard the sign was to learn was minutes to a criterion demonstrating that the sign had been mastered.

It's a treatment-by-subjects design, a two-way classification without replication. As such, any variability due to what appears to be an interaction becomes the error term. There was an apparent chimp-by-sign interaction, and Ramsey and Schafer suggested making it go away with a log transform of the DV. What if that interaction was real and not just noise? It was obvious on the minutes scale, but not apparent on the log(minutes) scale. Which is the "real" measure of sign learning difficulty?

In some of the "softer" areas of psychology, measurements are frequently on rating scales. "How much do you like chocolate? 1 means yuck, and 7 means totally addicted." What do the values on that scale mean?

In the soft sciences, we have issues with measurement scales, ceiling effects, floor effects, etc. So when is an interaction REALLY an interaction? And if it looks like one but really isn't, should we then be looking at the main effects?

If the measurement is a "hard" measurement, then those issues go away. If the measurement is a "hard" measurement, should we ever look at main effects in the presence of an interaction? Never say never, so I'll say almost never. There might be some practical value. It could help me make a decision as to which product I should stock in my store, even though the profitability of the candidates is not equally different over months or seasons. But in a scientific setting, if we're attempting to answer questions about what nature is like, and it's a "hard" interaction, then no. If there is an interaction, THAT is THE effect of interest.

Recall the issue we had with the hospital problem when we used the age of the patients as a covariate. The mean age of the patients was 50.2 years, so we used that value to calculate adjusted mean stay length for obstetric and geriatric patients, even though there were no patients in our sample with an age anywhere near that. Was it an unjustified extrapolation? When the explanatory variable is not numeric but a factor, the issue becomes almost comical.

[image: Myowelt]

An interesting discussion occurred at myowelt.blogspot.com concerning an experiment that produced the graph at right. The design was unbalanced, and the researcher wanted to know how to get the same ANOVA result in R as he was getting in SPSS. Having looked into it, he ran smack into this issue of whether main effects should ever be entered into an ANOVA after an interaction, and if there is an interaction (as there is here), do the main effects even make sense. I particularly enjoyed the comment of Ista Zahn. "With respect to the specific analysis you describe here: As I understand it, it really doesn't make sense to talk about main effects for either gender or stereotype threat. The gender main effect here refers to the effect of gender averaged across levels of stereotype threat. In other words, it is telling you the effect of gender at a non-observed value of stereotype threat that is halfway between Yes and No. Likewise, the stereotype threat main effect is giving you the effect of stereotype threat at a value halfway between male and female, which makes even less sense. What I think you want to do is test the effect of stereotype threat separately at male and female, and to test the effect of male vs. female at high and low stereotype threat."

How much better could that be explained? When you have an interaction, you do not look at main effects; you look at simple effects. I.e., you look at the effect of one variable separately at each level of the other variable.

If we have an interaction, we don't need to worry about whether there is inflation in the false positive rate for main effects, because we're not going to be looking at main effects. The problem is not solved, however. There is still the issue of nonsignificant interactions. As I tell my students, nonsignificant does not mean nonexistent. A nonsignificant interaction could, in fact, be a very real interaction. If it's nonsignificant, should we just behave as if it doesn't exist? And how nonsignificant does it have to be before we can do that? This gets into issues of null hypothesis significance testing, and whether there is a meaningful difference between p = .045 and p = .055, and I'm just not going to go there!

  • Fouts, R. S. (1973). Acquisition and testing of gestural signs in four young chimpanzees. Science, 180, 978-980.
  • Ramsey, F. L. and Schafer, D. W. (2002). The Statistical Sleuth: A Course in Methods of Data Analysis (2nd ed.). Pacific Grove, CA: Duxbury, p. 410.

Lessons From A Higher-Order Design

If you think it's bad enough already, then you're really going to find this unpleasant! Restricting our discussion to the simplest possible case of a two-way design makes things seem simpler than they really are! Aitkin (1978) discussed the analysis of a four-way design, and thanks to the good people at the R-Project, we have those data.

> data(quine, package="MASS")
> summary(quine)
 Eth    Sex    Age     Lrn          Days      
 A:69   F:80   F0:27   AL:83   Min.   : 0.00  
 N:77   M:66   F1:46   SL:63   1st Qu.: 5.00  
               F2:40           Median :11.00  
               F3:33           Mean   :16.46  
                               3rd Qu.:22.75  
                               Max.   :81.00
From the help page: "Children from Walgett, New South Wales, Australia, were classified by Culture, Age, Sex and Learner status and the number of days absent from school in a particular school year was recorded." The factors are: Eth - ethnic background (A = Aboriginal, N = not), Sex - gender (F and M), Age - age group (F0 = primary, and F1, F2, and F3 are forms 1, 2 and 3, which are kind of like grade levels in the U.S.), and Lrn - learner status (AL = average learner, SL = slow learner). The DV, Days, is days absent. The data are from Quine (1975). The analysis is also discussed in Venables and Ripley (2002, p. 170).

I don't know what SPSS* or SAS would do with these data, but neither aovIII() nor Anova( , type=3) from the "car" package will produce a meaningful result. I suspect this is due to the empty cells in the design. (*See the footnote at the end of this section for an update on that statement about SPSS.)

> with(quine, table(Sex,Age,Eth,Lrn))
, , Eth = A, Lrn = AL

   Age
Sex F0 F1 F2 F3
  F  4  5  1  9
  M  5  2  7  7

, , Eth = N, Lrn = AL

   Age
Sex F0 F1 F2 F3
  F  4  6  1 10
  M  6  2  7  7

, , Eth = A, Lrn = SL

   Age
Sex F0 F1 F2 F3
  F  1 10  8  0
  M  3  3  4  0

, , Eth = N, Lrn = SL

   Age
Sex F0 F1 F2 F3
  F  1 11  9  0
  M  3  7  3  0
Here are the Type I and Type II results (assuming "car" is loaded for the Type II results).
> summary(aov(Days ~ Eth * Sex * Age * Lrn, data=quine))
                 Df Sum Sq Mean Sq F value  Pr(>F)    
Eth               1   2981  2980.5  14.959 0.00018 ***
Sex               1    279   279.0   1.400 0.23904    
Age               3   2119   706.3   3.545 0.01675 *  
Lrn               1    689   689.3   3.460 0.06538 .  
Eth:Sex           1    191   190.6   0.957 0.33003    
Eth:Age           3   2807   935.8   4.697 0.00391 ** 
Sex:Age           3   2405   801.7   4.024 0.00913 ** 
Eth:Lrn           1      2     1.7   0.008 0.92730    
Sex:Lrn           1    175   175.1   0.879 0.35044    
Age:Lrn           2    574   287.0   1.441 0.24094    
Eth:Sex:Age       3    223    74.4   0.374 0.77224    
Eth:Sex:Lrn       1   1021  1020.6   5.123 0.02544 *  
Eth:Age:Lrn       2    801   400.5   2.010 0.13854    
Sex:Age:Lrn       2    332   166.1   0.834 0.43699    
Eth:Sex:Age:Lrn   2    196    97.8   0.491 0.61338    
Residuals       118  23510   199.2                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> Anova(aov(Days ~ Eth * Sex * Age * Lrn, data=quine))
Anova Table (Type II tests)

Response: Days
                 Sum Sq  Df F value    Pr(>F)    
Eth              2415.5   1 12.1237 0.0006989 ***
Sex               358.9   1  1.8012 0.1821476    
Age              2794.1   3  4.6746 0.0040184 ** 
Lrn              1093.5   1  5.4886 0.0208151 *  
Eth:Sex           137.9   1  0.6923 0.4070599    
Eth:Age          1595.8   3  2.6699 0.0507382 .  
Sex:Age          2796.0   3  4.6778 0.0040024 ** 
Eth:Lrn            20.6   1  0.1033 0.7485176    
Sex:Lrn            17.3   1  0.0869 0.7686597    
Age:Lrn           665.0   2  1.6689 0.1928708    
Eth:Sex:Age       107.6   3  0.1799 0.9098279    
Eth:Sex:Lrn       521.7   1  2.6183 0.1083107    
Eth:Age:Lrn       790.2   2  1.9831 0.1422064    
Sex:Age:Lrn       332.2   2  0.8337 0.4369876    
Eth:Sex:Age:Lrn   195.6   2  0.4908 0.6133802    
Residuals       23510.2 118                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Of course, with four factors, there are 4*3*2*1=24 different Type I results. I'll refrain from showing them all! I had no good reason for using the order of factors I did. It's just the order in which I found them in the data frame. (Note: Both Venables and Ripley and Aitkin, eventually, suggested transforming the DV before doing the analysis. I'll pass on that, because the issues I want to address can be illustrated without it.)

First, let's look at the 4-way interaction. Why is it evaluated on df=2 and not df=1*1*3*1=3? Once again, that has to do with the empty cells. Notice that it's not significant. (The aovIII function returned this same result.)

Second, let's look at the Lrn main effect. In the Type I table, Lrn is entered after Eth, Sex, and Age, and therefore is controlled for possible confounds with all the other factors. The SS = 689.288. What's happening in the Type II table? Why isn't it the same? Aren't Type II main effects controlled for possible confounds with all the other main effects? Yes, but Type II main effects are also controlled for all of the higher order effects that don't include them. Lrn, for example, is not only entered after Eth, Sex, and Age, but also after Eth:Sex, Eth:Age, Sex:Age, and Eth:Sex:Age. In other words, the Type II analysis is very thorough at removing the effects of other variables before evaluating a main effect, and that includes effects of other variables that might be part of an interaction not involving the variable under consideration. In evaluating the Lrn main effect, for example, the Type II analysis removes all variability due to Eth*Sex*Age first.
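
If you want to see what "entered after everything that doesn't contain it" means in model-comparison terms, something like the following should be in the right neighborhood. The change in RSS when Lrn is added to a model already containing Eth, Sex, Age, and all of their interactions ought to correspond to the Type II sum of squares for Lrn (compare only the sums of squares; the F test printed here uses a different error term than the Type II table does, and with the empty cells in this design I wouldn't promise an exact match).

> no.Lrn   = lm(Days ~ Eth * Sex * Age, data=quine)
> with.Lrn = lm(Days ~ Eth * Sex * Age + Lrn, data=quine)
> anova(no.Lrn, with.Lrn)        # the Sum of Sq entry is Lrn adjusted for everything not containing it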

Third, same comment for the interactions. From our discussion of the two-way design, we might be led to expect that the last two-way interaction, Age:Lrn, should come out the same in a Type I vs. Type II analysis. Nope again, and this time I'm not entirely sure why! But something is clearly removed (or not removed!) in the Type II analysis that is not (is!) in the Type I analysis. If you really need to know, I suggest you consult Fox and Weisberg (2011), because if they don't explain it, probably nobody will! (Note: At the website of the University of North Texas, Research and Statistical Support, I found the following statement. "Type II - Calculates the sum of squares of an effect in the model adjusted for all other 'appropriate' effects where an appropriate effect is an effect that does not contain the effect being examined." Does that mean Age:Lrn is also adjusted for Eth:Sex:Age and Eth:Sex:Lrn? I don't know, but that appears to be the case.)

Fortunately for me, for the analysis that I am going to suggest, I don't need to know! But before we get to that, I suspect a lot of people would do here whatever their statistical software allows them to do and would give it no further thought. Bad idea! What's the point of doing a statistical analysis that you don't understand? For example, SAS allows an analysis using what are called Type IV sums of squares, which are intended for cases in which there are empty cells in the design. When there are no empty cells, Type IV defaults to Type III. I would bet my paycheck that a lot of SAS users would just run these data through SAS and take whatever they're given without really understanding what it is. These people should be banned from using statistical software altogether! I'll say the same for people who use SPSS and just take whatever SPSS spits at them without knowing the first thing about what "Type III sums of squares" really means. After all, it's SPSS. How can it be wrong? At one time, I confess, that would have included me!

My stat professor from grad school used to say, if you are running a high-order ANOVA, pray you don't get a high-order interaction, because then you have to explain it! I know some people who don't even look at interactions higher than third order, because who can understand them anyway? I don't know about the soundness of that logic. I'll just say, phew! We don't have a 4-way interaction here! Where do we go next?

I'm going to say, delete the 4-way interaction from the model, and start looking at the 3-ways. I would do that as follows (following Venables and Ripley).

> aov.2ways = aov(Days ~ (Eth + Sex + Age + Lrn)^2, data=quine)
> aov.3ways = aov(Days ~ (Eth + Sex + Age + Lrn)^3, data=quine)
> anova(aov.2ways, aov.3ways)
Analysis of Variance Table

Model 1: Days ~ (Eth + Sex + Age + Lrn)^2
Model 2: Days ~ (Eth + Sex + Age + Lrn)^3
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    128 26083                           
2    120 23706  8    2377.1 1.5041 0.1627
This test compares the model with just 2-way interactions with the model that also includes 3-way interactions. (See the tutorial on Model Formulae if you're confused by this notation.) There is no significant difference, p = .16, suggesting that we can drop the 3-ways and start looking at the 2-ways. Before we do that, let me show you in more detail what's just happened. Here are the Type I tests without the 4-way interaction.
> summary(aov(Days ~ (Eth + Sex + Age + Lrn)^3, data=quine))
             Df Sum Sq Mean Sq F value   Pr(>F)    
Eth           1   2981  2980.5  15.087 0.000169 ***
Sex           1    279   279.0   1.412 0.237015    
Age           3   2119   706.3   3.575 0.016067 *  
Lrn           1    689   689.3   3.489 0.064209 .  
Eth:Sex       1    191   190.6   0.965 0.327946    
Eth:Age       3   2807   935.8   4.737 0.003696 ** 
Eth:Lrn       1     10    10.2   0.052 0.820516    
Sex:Age       3   2397   798.9   4.044 0.008869 ** 
Sex:Lrn       1    175   175.1   0.886 0.348361    
Age:Lrn       2    574   287.0   1.453 0.237989    
Eth:Sex:Age   3    223    74.4   0.377 0.769940    
Eth:Sex:Lrn   1   1021  1020.6   5.166 0.024807 *  
Eth:Age:Lrn   2    801   400.5   2.027 0.136177    
Sex:Age:Lrn   2    332   166.1   0.841 0.433881    
Residuals   120  23706   197.5
And here is what we just did.
> SS = 223 + 1021 + 801 + 332     # SS due to the 3-ways combined
> df = 3 + 1 + 2 + 2              # df for 3-ways (8)
> MS = SS / df                    # MS for 3-ways
> MS / 197.5                      # 3-ways MS divided by error MS = F
[1] 1.50443
> pf(1.50443, df, 120, lower=F)   # get a p-value for F(8, 120)
[1] 0.1626194                     # same as above within rounding error
"Wait a minute!" you're saying. "There was a significant 3-way interaction in the table above. What happened to that?" It was only significant because it was the 2nd interaction in the list, the latter two having not yet been removed/controlled. Let's make it last and see what happens. Type I tests, remember? Order matters!
> summary(aov(Days ~ (Age + Eth + Sex + Lrn)^3, data=quine))
             Df Sum Sq Mean Sq F value   Pr(>F)    
Age           3   2535   845.0   4.278 0.006598 ** 
Eth           1   2720  2720.3  13.770 0.000314 ***
Sex           1    123   122.9   0.622 0.431809    
Lrn           1    689   689.3   3.489 0.064209 .  
Age:Eth       3   2833   944.3   4.780 0.003501 ** 
Age:Sex       3   2332   777.2   3.934 0.010188 *  
Age:Lrn       2    692   345.8   1.751 0.178086    
Eth:Sex       1    280   279.8   1.416 0.236377    
Eth:Lrn       1      0     0.1   0.000 0.985441    
Sex:Lrn       1     18    17.8   0.090 0.764607    
Age:Eth:Sex   3    223    74.4   0.377 0.769940    
Age:Eth:Lrn   2   1311   655.3   3.317 0.039611 *  
Age:Sex:Lrn   2    322   160.8   0.814 0.445564    
Eth:Sex:Lrn   1    522   521.7   2.641 0.106784    
Residuals   120  23706   197.5
The Eth:Sex:Lrn interaction is not significant when it's controlled for all the other 3-ways (i.e., entered as the last 3-way). It's also possible to get the Type II tests here on each of the 3-ways individually (i.e., as if each interaction were the last to be entered), and we don't even have to resort to the "car" package to do it!
> drop1(aov.3ways, test="F")
Single term deletions

Model:
Days ~ (Eth + Sex + Age + Lrn)^3
            Df Sum of Sq   RSS    AIC F value Pr(>F)
<none>                   23706 795.12               
Eth:Sex:Age  3    107.55 23813 789.78  0.1815 0.9088
Eth:Sex:Lrn  1    521.66 24227 796.30  2.6407 0.1068
Eth:Age:Lrn  2    790.21 24496 795.91  2.0000 0.1398
Sex:Age:Lrn  2    332.21 24038 793.15  0.8408 0.4339
These are Type II tests on the 3-way interactions in a model in which the 4-way interaction has already been deleted. Either way we're being told, "Nothing to see here." I also hasten to point out, for Type III fans who may be reading this, that since we've deleted the 4-way interaction from the model, these are also the tests you would get using Type III, exactly! I'm actually quite pleased--and relieved!--by that for a reason I'll point out below. So, without further ado, on to an examination of the 2-way interactions.
> aov.1ways = aov(Days ~ Eth + Sex + Age + Lrn, data=quine)     # i.e., just main effects
> anova(aov.1ways, aov.2ways)
Analysis of Variance Table

Model 1: Days ~ Eth + Sex + Age + Lrn
Model 2: Days ~ (Eth + Sex + Age + Lrn)^2
  Res.Df   RSS Df Sum of Sq      F   Pr(>F)   
1    139 32237                                
2    128 26083 11    6153.8 2.7454 0.003189
Looks like we just hooked a big one! Somewhere among the 2-way interactions there is something important. (Pardon the fishing metaphor, but we are actually doing less fishing here than we would be doing in a traditional ANOVA.)
> drop1(aov.2ways, test="F")      # Type II tests on the 2-ways
Single term deletions

Model:
Days ~ (Eth + Sex + Age + Lrn)^2
        Df Sum of Sq   RSS    AIC F value  Pr(>F)   
<none>               26083 793.07                   
Eth:Sex  1    268.52 26351 792.57  1.3178 0.25314   
Eth:Age  3   2343.93 28427 799.64  3.8342 0.01142 * 
Eth:Lrn  1      0.05 26083 791.07  0.0002 0.98808   
Sex:Age  3   2790.31 28873 801.91  4.5644 0.00451 **
Sex:Lrn  1     17.79 26101 791.17  0.0873 0.76809   
Age:Lrn  2    574.01 26657 792.25  1.4085 0.24828   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These are the Type II (also Type III) tests on the 2-way interactions in a model where the 4-way and 3-way interactions have already been deleted. We have two significant 2-way interactions. Before we take the paring knife to the rest of them though, let me point out something else. (Oh, now we're making a kitchen knife metaphor, are we? Instead of mixing my metaphors, maybe I should have said filleting knife.)

Instead of guiding our model selection using p-values, we could instead be using AIC values. AIC means Akaike Information Criterion. Look at the Type II tests again on the 2-ways, except this time look at the column labeled AIC. The table shows the effect of deleting the 2-way interactions one at a time. Any deletion that decreases the AIC value is a deletion that should be done, according to this criterion. Thus, the p-value criterion and the AIC are telling us to delete the same four 2-way interactions. Nice! However, take a look back at the 3-way interactions. If we were using AIC as our model selection criterion, two of those would have been retained. AIC is not terribly well known in the social sciences, at least not yet, but there are some compelling arguments for using it as a model selection criterion instead of p-values. Something you might watch out for as we social scientists become more sophisticated in our model building. I'll have reason to mention it again in a later section.
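Incidentally, the AIC values in the drop1() tables come from extractAIC(), so you can also compare candidate models directly. A quick sketch, using the objects already fitted above:
> extractAIC(aov.2ways)                            # should reproduce the <none> row
> extractAIC(update(aov.2ways, . ~ . - Eth:Lrn))   # dropping Eth:Lrn lowers the AIC
extractAIC() returns the equivalent degrees of freedom followed by the AIC value.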

Okay, let's fillet the model.

> model = aov(Days ~ Eth + Sex + Age + Lrn + Eth:Age + Sex:Age, data=quine)
> anova(model, aov.2ways)
Analysis of Variance Table

Model 1: Days ~ Eth + Sex + Age + Lrn + Eth:Age + Sex:Age
Model 2: Days ~ (Eth + Sex + Age + Lrn)^2
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    133 27072                           
2    128 26083  5    989.26 0.9709 0.4382
Nothing important/significant has been lost by dropping four of the 2-way interactions, says the p-value criterion. The model with all six 2-ways and the model with four of them deleted are not significantly different, p=.44. Some caution is required here in trying to understand what is being compared in the previous test. If we try to use the same trick we used previously, it won't work.
> print(summary(aov.2ways), digits=6)       # Type I tests
             Df   Sum Sq  Mean Sq  F value     Pr(>F)    
Eth           1  2980.51 2980.509 14.62666 0.00020376 ***
Sex           1   279.01  279.006  1.36920 0.24412282    
Age           3  2118.81  706.270  3.46597 0.01826406 *  
Lrn           1   689.29  689.288  3.38264 0.06820383 .  
Eth:Sex       1   190.60  190.603  0.93537 0.33529457    
Eth:Age       3  2807.28  935.761  4.59219 0.00435371 ** 
Eth:Lrn       1    10.21   10.213  0.05012 0.82320801    
Sex:Age       3  2396.58  798.858  3.92035 0.01023225 *  
Sex:Lrn       1   175.10  175.096  0.85927 0.35568703    
Age:Lrn       2   574.01  287.006  1.40846 0.24828201    
Residuals   128 26082.86  203.772

> SSdeleted = 190.60 + 10.21 + 175.10 + 574.01
> SSdeleted
[1] 949.92
But the test above says we are deleting SS = 989.26, i.e., dumping that into the error term, and not 949.92. What went wrong? The problem is that the 2-ways we're keeping are mixed in with the 2-ways we're deleting. That means some of the 2-ways we're deleting have been adjusted for the 2-ways we're keeping, and some have not. If we want to see in detail what it is we're getting rid of, we have to delete from the bottom of the Type I table upward.
> aov.2ways.rearranged = aov(Days ~ Eth + Sex + Age + Lrn + Eth:Age + Sex:Age +
+      Eth:Sex + Eth:Lrn + Sex:Lrn + Age:Lrn, data=quine)
> print(summary(aov.2ways.rearranged), digits=6)
             Df   Sum Sq  Mean Sq  F value     Pr(>F)    
Eth           1  2980.51 2980.509 14.62666 0.00020376 ***
Sex           1   279.01  279.006  1.36920 0.24412282    
Age           3  2118.81  706.270  3.46597 0.01826406 *  
Lrn           1   689.29  689.288  3.38264 0.06820383 .  
Eth:Age       3  2832.80  944.266  4.63393 0.00412916 ** 
Sex:Age       3  2331.72  777.241  3.81426 0.01171368 *  
Eth:Sex       1   238.49  238.488  1.17036 0.28136035    
Eth:Lrn       1     1.67    1.666  0.00817 0.92810353    
Sex:Lrn       1   175.10  175.096  0.85927 0.35568703    
Age:Lrn       2   574.01  287.006  1.40846 0.24828201    
Residuals   128 26082.86  203.772

> SSdeleted = 574.01 + 175.10 + 1.67 + 238.49
> SSdeleted
[1] 989.27
And the result has been duplicated to within rounding error. Now on to an examination of our final model.
> Anova(model)                    # with "car" loaded
Anova Table (Type II tests)

Response: Days
           Sum Sq  Df F value    Pr(>F)    
Eth        2506.9   1 12.3161 0.0006134 ***
Sex         189.6   1  0.9316 0.3362005    
Age        2645.8   3  4.3328 0.0059960 ** 
Lrn        1104.2   1  5.4248 0.0213601 *  
Eth:Age    2822.4   3  4.6220 0.0041474 ** 
Sex:Age    2331.7   3  3.8184 0.0115675 *  
Residuals 27072.1 133
Sex is the only nonsignificant term here, but we can't delete it because it is involved in one of our significant interactions. Aitkin also deleted the Lrn term from the model, but doing so is not advised by the following test.
> model2 = aov(Days ~ Eth + Sex + Age + Eth:Age + Sex:Age, data=quine)
> anova(model2, model)
Analysis of Variance Table

Model 1: Days ~ Eth + Sex + Age + Eth:Age + Sex:Age
Model 2: Days ~ Eth + Sex + Age + Lrn + Eth:Age + Sex:Age
  Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
1    134 28176                              
2    133 27072  1    1104.2 5.4248 0.02136 *
Notice this is equivalent to the Type II test in the previous table.

If at this point an ignorant journal editor says, "No. I want Type III sums of squares," here they are.

> source("aovIII.R")              # in case you deleted it
> aovIII(Days ~ Eth + Sex + Age + Lrn + Eth:Age + Sex:Age, data=quine)
Single term deletions

Model:
Days ~ Eth + Sex + Age + Lrn + Eth:Age + Sex:Age
        Df Sum of Sq   RSS    AIC F value   Pr(>F)   
<none>               27072 788.51                    
Eth      1    1753.9 28826 795.67  8.6167 0.003926 **
Sex      1     148.0 27220 787.30  0.7269 0.395411   
Age      3    3466.5 30539 800.10  5.6767 0.001090 **
Lrn      1    1104.2 28176 792.34  5.4248 0.021360 * 
Eth:Age  3    2822.4 29895 796.99  4.6220 0.004147 **
Sex:Age  3    2331.7 29404 794.57  3.8184 0.011567 *
And I would have no qualms about letting him try to figure out where that wool over his eyes came from. Notice, by the way, that AIC also has us retaining the Lrn term in this model. Here's a question for you. Why are the Type II and Type III sums of squares the same for Lrn? (Because no interaction in this model involves Lrn, so in both cases Lrn is adjusted for every other term--in effect, entered last.)
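If you'd rather not source aovIII.R, here is an alternative sketch using the "car" package (loaded above). Type III tests are only meaningful with sum-to-zero contrasts, so refit the model with contr.sum and then ask Anova() for type 3; the result should agree with the aovIII() table.
> model.sum = aov(Days ~ Eth + Sex + Age + Lrn + Eth:Age + Sex:Age, data=quine,
+     contrasts=list(Eth="contr.sum", Sex="contr.sum", Age="contr.sum", Lrn="contr.sum"))
> Anova(model.sum, type=3)          # compare to the aovIII() output above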

Here are some diagnostics on the model, suggesting considerable heteroscedasticity. Venables and Ripley's log transform might be a good idea.

> par(mfrow=c(2,2))
> plot(model, c(1,2,4))
[Figure: Quine Diagnostics]
Nevertheless, the interactions are obvious, and in my opinion more interesting, on the Days scale.
> par(mfrow=c(1,2))
> with(quine,interaction.plot(Age,Sex,Days,type="b",pch=1:2,bty="l"))
> with(quine,interaction.plot(Age,Eth,Days,type="b",pch=1:2,bty="l"))
[Figure: Quine Interactions]

Now we can see why talking about main effects in the presence of interactions is inappropriate. It would certainly be an injustice to the Aboriginal kids in F0 and F3 to make a global statement of the sort, "Aboriginal kids miss more days of school, on average, than non-Aboriginal kids." It would also be misleading, even though there is no main effect of Sex, to say, "There is no difference, on average, in attendance by boys vs. girls." This may not be true in any of the age groups.

Out of curiosity, I also plotted the nonsignificant 2-way interactions.

[Figure: Nonsignificant 2-way interactions]

I'll leave you to draw your own conclusions. One question you might have, if you remember something I said above, is how we are justified in looking at these effects on profile plots. After all, judging effects from a profile plot in the way we've all been taught is basically doing a Type III analysis in our heads. And didn't we use Type I and Type II tests to get here? Yes, and this is why I was relieved that, with the higher order interactions deleted, the Type II and Type III tests on the currently highest order interactions are the same. The only question we might have at this point is whether it's justified to drop interactions from a Type III model. In Type III analyses, the various orders of effects (main effects, 2-way interactions, 3-way interactions, etc.) are not orthogonal. Therefore, when you drop a higher order effect, you are necessarily changing the lower order effects. Well, I think we should worry about the lower order effects when we get there. Or, if you're not happy with that answer, then I'll justify dropping the interactions based on the Type II tests. (Note: Seriously though, you're not supposed to drop interactions from Type III tests. Supposedly, the lower order effects are being tested "correctly" in the presence of the higher order effects when Type III tests are used. More on this below.)

Note: Milligan et al. (1987) assessed the robustness of all three methods of analysis (i.e., all "Types") to heterogeneity of variance when the design is unbalanced and found all three to be not at all robust. Performance was worst when cell variances were correlated with cell sizes. "The user is cautioned against collecting such data." This is not a problem unique to unbalanced factorial designs, however. It has been well known in the case of unbalanced single-factor designs for some time. Which doesn't mitigate its seriousness.

A final note: Notice that the model we ended up with, with the four main effects and two interactions, is pretty close to what the Type II tests gave us right from the get-go. So why did I put myself through so much rigmarole? Because I understand this analysis. I do not understand an ANOVA table that has eleven interactions in it. Also, in the original analysis, I was doing 15 tests all at an alpha of .05. Now I'm doing 6, and all but one are significant. Familywise error can take a hike!

  • Aitkin, M. (1978). The analysis of unbalanced cross classifications (with discussion). Journal of the Royal Statistical Society series A 141, 195-223.
  • Fox, J. and Weisberg, H. S. (2011). An R Companion to Applied Regression (2nd ed.). Thousand Oaks, CA: Sage.
  • Milligan, G. W., Wong, D. S., and Thompson, P. A. (1987). Robustness properties of nonorthogonal analysis of variance. Psychological Bulletin, 101(3), 464-470.
  • Quine, S. (1975). Achievement orientation of aboriginal and white Australian adolescents. Unpublished Ph.D. Thesis, Australian National University.
  • Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). New York: Springer.
    Footnote: I had a friend run the Quine data through SPSS version 22 for me. She was able to produce a full set of Type III and Type IV sums of squares. Those results are posted here in a PDF file. Notice the Type III and Type IV tests gave different results. I'm not really sure why, as my entire knowledge of Type IV sums of squares was obtained off a matchbook cover that I picked up in a bar one night. It said, "For a good time call Dolly 555-3434." I have no idea what it means.

Some Notes on the Justifications for Deleting Interactions

First of all, perhaps I should say that "deleting interactions" is a misstatement. They're not actually being deleted, they're being pooled with the error term. The literature on pooling interactions with the error term is almost as confusing as the literature on unbalanced designs!

The primary argument against pooling/deleting (I'll say deleting) is, how do you know when to do it? The following arguments have been made. (Note: That I am aware of. I'm not going to plow through my stack of photocopied journal articles--yes, I'm old school--to find the appropriate references, but these points should NOT be credited to me!)

  1. A significance test that fails to produce p<.05 is not an adequate reason for rejecting the existence of an interaction. (In your face, Neyman and Pearson!)
  2. If you have every reason to believe that an interaction is unimportant, then it shouldn't have been included in the model to begin with.
  3. Deleting interaction terms (i.e., pooling their variability with error variability) reduces the power of the tests for main effects.
  4. In ten-way or fourteen-way ANOVAs, a few significant interactions can get lost in a sea of nonsignificant ones.
  5. Deleting an interaction makes the implicit assumption that its magnitude in the population is zero. This is unlikely to be true.
  6. You may have failed to find an interaction that actually exists because your test lacks power.
  7. Deleting interactions in Type III tests produces an incorrect error term for testing main effects.

To which I would reply:

  1. Correct. As I've said repeatedly in this article, nonsignificant does not mean nonexistent.
  2. Not including it automatically pools it with the error term in any kind of ANOVA that I'm aware of.
  3. If you've decided, for either practical reasons or for theoretical ones, that an interaction "doesn't exist," then the variability associated with it, in your opinion, is to be treated as random error, and random error BELONGS in the error term. You're not reducing your power by pooling those terms, you're artificially inflating your power by not pooling them. On the other hand, if you're not really certain, but you have an interaction that you've just decided "not to make a fuss over," then that's a harder case. Not deleting "just in case" will (perhaps artificially) maximize the power of your main effects tests, while deleting will produce a conservative test.
  4. Seriously? Who does this?
  5. Well, yeah. But let's face it, statistics is a messy business. We make all kinds of assumptions that are false, like homogeneity of variance, for example. How likely is it that homogeneity is exactly true? It doesn't need to be, it just needs to be close enough. I don't claim to be perfect. I'm just trying to figure out how to do the best I can, and if I think deleting an interaction will help me do that, then that interaction is gone! The cardinal sin here would be failing to inform my readers of what I've done and why. If I'm truthful about what terms I've dropped and why, then they can decide for themselves if what I've done is justified.
  6. Whose fault is that?
  7. Gasp! These very same people have sometimes then turned right around and argued that nonsignificant interactions in Type II tests MUST be deleted in order to provide the correct error term for main effects tests. As we've seen, deleting the interactions in Type III and Type II tests produces exactly the same main effects tests. Since I'm reserving Type III tests for mildly unbalanced experimental data (and this is one reason why!), I'll just say, don't delete your interactions in those cases!

The only one of these arguments that I'll take seriously is the first one. As I've already said, just because a significance test fails to produce a p-value that slips below some magical criterion, that's not a good reason for deciding that something doesn't exist. Statistical analysis is an intellectual exercise. Although it can be done without thinking, especially in these days of high powered statistical software, it certainly shouldn't be.

Here are my arguments in favor of deleting interactions that you've decided, for one reason or another, are unimportant.

  1. As we've seen, deleting the interaction terms will often result in Type II and III tests producing the same tests. This eliminates the need for the Type II people and the Type III people to argue. (Okay, just kidding about this one. They're going to argue anyway.)
  2. Reducing a model to the minimal adequate one (Crawley, 2005; see Refs) not only reduces the complexity of the model (parsimony), making it easier to interpret, but also reduces the false positive rate. (Crawley has even been known in some of his examples to combine nonsignificantly different levels of a factor. I don't know that I would go that far. Also, I'm not suggesting this model reduction strategy for data from true experiments, although I'm not not suggesting it either. One of my fellow grad students used to say, "Occam's razor cuts in both directions!" By the way, I also disagree with Crawley's insistence on p=.05 as being the absolute determinant of what stays and what goes.)
  3. It's been suggested, especially in sequential/hierarchical analyses (i.e., Type I and II), that one way to produce publishable significant effects is to add a lot of spurious terms to the model. Many of these terms will have some tiny little effect greater than zero and, therefore, will artificially deplete the error term. The suggestion was not to use these methods of analysis, but I might propose that an alternative would be not to believe models that have a lot of nonsignificant effects in them. If you don't believe an effect is important, dump it into the error term and convince me that your significant effects really are important.
  4. It's been argued that deleting nonsignificant interactions is not only possible with Type I and Type II tests (e.g., Herr and Gaebelein, 1978), but is actually required in order to produce the correct error term for testing main effects (e.g., see Rawlings, 1972; Overall and Spiegel, 1973; Lewis and Keren, 1977). I haven't plowed through the mathematics behind this argument, so I can't pass judgment on it, but it sounds like it might be right. I'll refer you to the cited references for details.
  • Herr, D. G. and Gaebelein, J. W. (1978). Nonorthogonal two-way analysis of variance. Psychological Bulletin, 85(1), 207-216.
  • Lewis, C. and Keren, G. (1977). You can't have your cake and eat it too: Some considerations of the error term. Psychological Bulletin, 84(6), 1150-1154.
  • Overall, J. E. and Spiegel, D. K. (1973). Comments on Rawlings' nonorthogonal analysis of variance. Psychological Bulletin, 79(3), 164-167.
  • Rawlings, R. R. (1972). Note on nonorthogonal analysis of variance. Psychological Bulletin, 77(5), 373-374.

An Educational Example of Nonexperimental Data

(Important preliminary note: As quasi-experimental designs go, this one is not very complex. It is quasi-experimental only in that it does not make use of random assignment to treatment conditions. Since the study looked at intact groups without any further attempt at control, some people might find it more reasonable to call this an observational study.)

A few years ago (Spring 2013), Zach Rizutto, a student in our department, did a very interesting senior research project. Zach administered a survey to CCU students, which collected information on several factors about them. He also administered the Rosenberg Self-Esteem Scale. Most of the questions were included to cleverly disguise his real interest, which was to see if there is a relationship between body cutting and scarification and self-esteem. He collected information on these variables:

  • Age - age in years
  • Sex - gender (F/M)
  • Tattoos - does subject have tattoos?
  • Num.Tattoos - how many tattoos
  • Age.1st.Tat - age at first tattoo
  • Piercings - does subject have piercings?
  • Num.Piercings - how many piercings
  • Age.1st.Pierce - age at first piercing
  • Cut.Scar - does subject engage in self cutting or scarification?
  • Age.1st.Cut - age at first cutting
  • RSES - Rosenberg Self-Esteem Score
Our interest is in Sex, Cut.Scar, and RSES. I no longer have access to the original data, but I helped Zach with his data analysis, and I still have my notes, including R output from the analysis. (FOUND! See the note at the end of this section for how to download it.) Here is some summary information.
> summary(scar)
 Sex    Cut.Scar       RSES      
 F:80   No :124   Min.   : 7.00  
 M:60   Yes: 16   1st Qu.:19.00  
                  Median :23.00  
                  Mean   :22.62  
                  3rd Qu.:27.00  
                  Max.   :30.00  

> with(scar, tapply(RSES, list(Cut.Scar, Sex), mean))
           F        M
No  22.75385 23.32203
Yes 18.93333 28.00000

> with(scar, tapply(RSES, list(Cut.Scar, Sex), var))
           F        M
No  21.81346 25.18761
Yes 24.78095       NA

> with(scar, tapply(RSES, list(Cut.Scar, Sex), length))
     F  M
No  65 59
Yes 15  1

> with(scar, tapply(RSES, Sex, mean))            # weighted means Sex
      F       M 
22.0375 23.4000 

> with(scar, tapply(RSES, Cut.Scar, mean))       # weighted means Cut.Scar
      No      Yes 
23.02419 19.50000

> with(scar, tapply(RSES, list(Cut.Scar,Sex), mean)) -> means
> colMeans(means)                                # unweighted means Sex
       F        M 
20.84359 25.66102 
> rowMeans(means)                                # unweighted means Cut.Scar
      No      Yes 
23.03794 23.46667
[Figure: Profile Plot]

You've seen the interaction plot already, but just as a reminder, here it is again. Zach collected information from 140 students, not a bad sample size considering he had only one semester to complete the study. Out of 140 subjects, he found one male cutter. The whole analysis hinges on how representative that one score is. For the moment, let's assume that our one sampled male cutter represents the population of male cutters.
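If you'd like to redraw it yourself (assuming the scar data frame, which you can download as described at the end of this section), something along these lines should do it. I've put Sex on the horizontal axis, but the other orientation works just as well.
> with(scar, interaction.plot(Sex, Cut.Scar, RSES, type="b", pch=1:2, bty="l"))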

Thus, we can see from looking at the interaction (profile) plot that we have an interaction between Sex and Cut.Scar. We see that there is not much difference in the self-esteem of male and female noncutters, t(122) = 0.653, p = .515, two-tailed. There appears to be a much larger difference in the self-esteem of male and female cutters. Male cutters have higher self-esteem than male noncutters, but female cutters have lower self-esteem than female noncutters. Hence, these simple effects go in opposite directions, and this qualifies as a disordinal interaction, as discussed previously.

However, due to the small sample size of cutters, and the unbalanced nature of the data, the Sex difference at this level of Cut.Scar is also not statistically significant, t(14) = 1.76, p = .0996, two-tailed. The difference of over 9 points is about two standard deviations, so the effect size is very large, and if this condition had been more closely balanced, this would have been a highly significant difference, p = .004. But the population isn't like that, or so we are led to believe from this sample. There appear to be many more female cutters than male cutters, chi-sq = 8.27, df = 1, p = .004, with a Yates correction for continuity.
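That last chi-square test can be reproduced like so (again assuming the scar data frame); for a 2x2 table, chisq.test() applies the Yates correction by default.
> with(scar, chisq.test(table(Cut.Scar, Sex)))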

Here are the Type I and Type III ANOVA summary tables.

Type I Sex first
              Df  Sum Sq  Mean Sq F value   Pr(>F)  
Sex            1   63.65  63.6482 2.70178 0.102546  
Cut.Scar       1  133.69 133.6932 5.67509 0.018591 *       # equivalent to Type II
Sex:Cut.Scar   1   65.72  65.7180 2.78964 0.097176 .
Residuals    136 3203.88  23.5579                   

Type I Cut.Scar first
              Df  Sum Sq  Mean Sq F value   Pr(>F)   
Cut.Scar       1  176.01 176.0083 7.47130 0.007103 **
Sex            1   21.33  21.3332 0.90556 0.342983         # equivalent to Type II
Cut.Scar:Sex   1   65.72  65.7180 2.78964 0.097176 . 
Residuals    136 3203.88  23.5579

Type III
              Df  Sum Sq  Mean Sq F value  Pr(>F)  
Sex            1   84.47    84.47  3.5855 0.06041 .
Cut.Scar       1    0.67     0.67  0.0284 0.86643  
Sex:Cut.Scar   1   65.72    65.72  2.7896 0.09718 .
Residuals    136 3203.88    23.56
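In case you'd like to reproduce those tables, here is a sketch of how they might be obtained. The Type III table assumes the "car" package and sum-to-zero contrasts, as discussed above; the object name aov.sum is just for illustration.
> summary(aov(RSES ~ Sex * Cut.Scar, data=scar))      # Type I, Sex first
> summary(aov(RSES ~ Cut.Scar * Sex, data=scar))      # Type I, Cut.Scar first
> aov.sum = aov(RSES ~ Sex * Cut.Scar, data=scar,
+               contrasts=list(Sex="contr.sum", Cut.Scar="contr.sum"))
> Anova(aov.sum, type=3)                              # Type III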
Both Type I and Type II tests find a significant difference in RSES by Cut.Scar. The Type III test doesn't even come close, not surprising given the tiny difference in the unweighted marginal means. Neither Type I nor Type II finds a significant difference in RSES by Sex, but Type III nearly does, and that makes sense from an examination of the profile plot. The impressive-looking interaction is not significant, but it's not a long way off either. How do we reconcile these disparate reports?

First, let me address the interaction for a minute. I said above that a design loses power the more unbalanced it becomes. Had the cutters been more equally balanced between males and females, would we be looking at a significant interaction here? Maxwell and Delaney (see Refs) show that a 2x2 interaction in a nonorthogonal design can be calculated as follows.

SSAB = harm.mean(n) * (M11 - M12 - M21 + M22)^2 / 4

where harm.mean(n) is the harmonic mean of the four cell sizes and M11, M12, M21, M22 are the four cell means. So...
> harm.mean=function(x) 1/mean(1/x)    # I keep erasing it!
> harm.mean(c(65,59,15,1)) * (22.75385 - 18.93333 - 23.32203 + 28.00000)^2 / 4
[1] 65.7182
If the design had been more balanced with respect to cutters (and assuming homogeneity of variance is preserved), the interaction would test like this.
> harm.mean(c(65,59,8,7)) * (22.75385 - 18.93333 - 23.32203 + 28.00000)^2 / 4
[1] 240.5947
> 240.5947 / 23.56                # get an F
[1] 10.212
> pf(10.212, 1, 136, lower=F)     # get the p-value
[1] 0.001734847
In fact, if we'd had 14 female cutters and 2 male cutters, preserving the means and homogeneity of variance, the SSAB = 119.6238, and F(1,136) = 5.08, p = .03. We are paying a heavy price for having such an unbalanced design, but if this is what nature looks like, then it is a price we have no choice but to pay. (All of this assumes, of course, that our one male cutter is representative of male cutters. The recalculation is NOT, however, making a statistical error with respect to the error term. If I turned some of my female cutters into male cutters, the sum of squares would go down in the Female-Y cell and up in the Male-Y cell. Assuming homogeneity is preserved, the sum of these within-cell SSes would remain close to the same.)
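You can check the 14-and-2 scenario with the same little formula (harm.mean() is defined just above); the rounding will differ slightly from the values quoted in the text.
> SSAB = harm.mean(c(65,59,14,2)) * (22.75385 - 18.93333 - 23.32203 + 28.00000)^2 / 4
> SSAB                              # about 119.62
> pf(SSAB/23.56, 1, 136, lower=F)   # about .03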

I remember suggesting a drop1() on the Type I analysis. I still have the result of that in my notes. (Apparently, I don't throw anything away but the data!!)

> drop1(aov.out, test="F")
Single term deletions

Model:
RSES ~ Sex * Cut.Scar
             Df Sum of Sq    RSS    AIC F value  Pr(>F)  
<none>                    3203.9 446.27                  
Sex:Cut.Scar  1    65.718 3269.6 447.11  2.7896 0.09718 .
AIC is suggesting that the interaction should be retained in the model, and thus, to preserve marginality, both main effects along with it. Nothing is recommended for deletion. If we retain the interaction, then the interaction is the effect of interest, and the main effects become none of our concern. Problem solved?

If we drop the interaction, then both the Type II and Type III tests come out like this.

             Df Sum Sq Mean Sq F value Pr(>F)  
Sex           1     21   21.33   0.894 0.3461
Cut.Scar      1    134  133.69   5.602 0.0193 *
Residuals   137   3270   23.87
Problem solved? I told Zach to talk about the interaction, but to emphasize the influential nature of that one male cutter. Thus, HIS problems were solved. I don't think it solves ours by a long shot! Looking at the interaction plot, we have to wonder how ANY set of tests can tell us that we have a significant main effect of Cut.Scar, but nothing for Sex.

First of all, the Type I tests with Sex entered first (because we wanted to see the effect of Cut.Scar with Sex removed) are telling us there is no significant effect of Sex because there isn't. Sex by itself, ignoring everything else, is not a significant predictor of RSES, t(138) = 1.61, p = .11, two-tailed. This is very similar to the ANOVA result. As we've seen, it's also not significant at either level of Cut.Scar considered alone. The effect means being tested by the Type I test, as well as by the t-test, are 22.0375 and 23.4000. Not much difference there. Is that interesting? Heck yeah, that's interesting! I thought the women would come out a little lower, because they usually do. But they didn't. Good for them! Maybe things are changing here at the good ol' U of CC.

Second, the Type I tests with Cut.Scar entered first are telling us there is a significant effect of Cut.Scar (ignoring Sex) because there is, t(138) = 2.72, p = .007, two-tailed, testing the means 23.02419 and 19.50000. Overall, cutters have lower self-esteem than noncutters. Fewer of those cutters are men, but who was asking? "But what if there were more male cutters?" Well, there aren't!
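Those two t-tests could be reproduced as pooled-variance tests, something like this (scar data frame again; the sign of t depends on the order of the factor levels).
> t.test(RSES ~ Sex, data=scar, var.equal=T)          # |t(138)| near 1.61
> t.test(RSES ~ Cut.Scar, data=scar, var.equal=T)     # |t(138)| near 2.72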

That's the Type III question. What if there were the same number of male cutters as female cutters, and the same number of cutters as noncutters? That would be a very sensible question to ask if these were experimental data with randomly assigned subjects. But they're not. Provided the sampling was done adequately, these cell sizes represent something that's true in nature. There are many more female cutters than male cutters, and many more noncutters than cutters.

"Oh, I see! If I want to see the 'pure effect' of a variable, I look at the first line, and if I want to see the effect of one thing controlled for the other, I look at the second line. I get it now!" Well, just hold up a minute. It's not that simple. The second lines of the Type I tests are, in effect, the Type II tests. Supposedly, the Type II tests on the main effects do control one variable for the other, i.e., the second lines of the Type I tests are "B given A," "B controlled for A," "B after A is removed," etc. But the Type III tests also claim to control B for A, and as we can see, there is a huge difference in what we're being told by the two tests.

I'm going to defer at length to John Fox. Dr. Fox is the creator and maintainer of the "car" package and a major advocate of Type II tests. The following was posted at the R-help mailing list.

(1) What's important in formulating tests are the hypotheses being tested.
These should be sensible and of interest. Take a crossed two-way ANOVA, for
example, for factors A and B. In that context, hypotheses should address the
pattern of cell means.

(2) The null hypothesis of no interaction is that the profiles of cell means
for A (say) across the levels of B are parallel. If the[r]e is no interaction,
then testing equality of any weighted average of the profiles of cell means
across the levels of B will test the null hypothesis of no A main effects.
The most powerful test for the A main effect is the type-II test: for A
after B ignoring the AB interaction; (and continuing) for B after A ignoring
AB; and for AB after A and B. These tests are independent of the contrasts
chosen to represent A and B, which is generally the case when one restricts
consideration to models that conform to the principle of marginality.

(3) If there is interaction, then what one means by main effects is
ambiguous, but one possibility is to formulate the main effects for A in
terms of the marginal means for A averaged across the levels of B. The null
hypothesis of no A main effects is then that the A marginal means are all
equal. This is the type-III test: for A after B and AB; (and continuing) for
B after A and AB; and AB after A and B. Because the tests for the main
effect violate the principle of marginality, to compute these tests properly
requires contrasts that are orthogonal in the row-basis of the model -- in
R, e.g., contr.sum or contr.helmert, but not the default contr.treatment.
The type-III tests have the attraction that they also test for main effects
if the interactions are absent, though they are not maximally powerful in
that circumstance. There's also a serious question about whether one would
be interested in main effects defined as averages over the level of the
other factor when interactions are present. If, however, interactions are
present, then the type-II tests for A and B are not tests of main effects in
a reasonable interpretation of that term.

(4) The type-I tests are sequential: for A ignoring B and their interaction;
for B after A and ignoring the interaction; and for AB after A and B. These
tests compare models that conform to marginality and thus are independent of
the contrasts selected to represent A and B. If A and B are related (as
happens when the cell counts are unequal) then the test for A does not test
the A main effect in any reasonable interpretation of that term -- i.e., as
the partial relationship between the response and A conditional on B.

(5) Other, similar, issues arise in models with factors and covariates.
These are not typically handled reasonably for type-III tests in software
such as SAS and SPSS, which, e.g., test for differences across the levels of
a factor A when a covariate X is 0.

I cited the whole thing because it's an excellent summary of this controversy, but I call particular attention to one statement: "If, however, interactions are present, then the type-II tests for A and B are not tests of main effects in a reasonable interpretation of that term." In other words, Type II tests for main effects DON'T WORK in the presence of interactions. This same point was made in a somewhat less clear-headed way in the last section of this article. Since we have decided that the interaction is important (I've decided that), we shouldn't be looking at main effects in the Type II tests. We shouldn't be looking at them in the Type III tests either, but at least the claim has been made (in other sources) that the Type III tests test them "correctly" in the presence of an interaction.

"So are the Type III tests correct here after all?" Nope! "Should we have pooled the interaction with the error term if we want to look at the main effects?" Certainly not! An interaction that is considered important should never be dumped into the error term!

Two points. (I'm getting behind in my "points" here!) First, I disagree with Dr. Fox that the test on A (first factor entered) in the Type I tests "does not test the A main effect in any reasonable interpretation of that term." Sure it does! And I hope I've already demonstrated that. The A effect does not have to be "the partial relationship between the response and A conditional on B." A ignoring B is a perfectly reasonable definition of an effect. It all depends on what you're interested in looking for, i.e., what your hypotheses are. (Okay, let's not get tangled in terminology here. If you want to reserve the term "main effect" for "A conditional on B," fine! But A ignoring B is a perfectly reasonable definition of an EFFECT of A, and it's one that might very well be interesting.)

Second, the question of whether or not Type III tests test main effects "correctly" in the presence of an interaction once again depends upon what you think a main effect is. If by "main effect" you mean an effect that answers the question, "What if there were the same number of male cutters as female cutters, and the same number of cutters as noncutters?" then sure, Type III tests are reasonable. But is that a reasonable question? Nature just isn't like that. Why would anyone want to pretend it is? Let me point out, it's not our analysis that is creating the confound between Sex and Cut.Scar. These effects are confounded in nature! Artificially removing the confound by some statistical trick seems to me to be doing a disservice to the phenomenon we are investigating, all in the interest of making our statistical analysis "easier to interpret."

Before I leave this point, let me continue beating up on the Type III tests for just a minute. Some people have attempted to define Type III tests into correctness by defining main effects as the difference between unweighted marginal means. I'll go along with that as long as they are restricting themselves to the discussion of experimental data, but they don't make that distinction (in the sources I've read). They would have us believe that this is always the definition of a main effect, and if that's so, then there is no such thing as a main effect in unbalanced quasi-experimental data, because unweighted marginal means don't supply meaningful information in those cases. Another bogus argument sometimes made in favor of Type III is that they must be correct because they give the same result as a regression analysis does. True, provided the regression analysis is done with sum-to-zero contrasts, but so what? This is another "fixed" argument, because the two analyses do exactly the same thing internally and test the same hypotheses on unweighted marginal means. So, duh! Of course they produce the same result. And if the unweighted marginal means don't give us useful or meaningful information, then they are both wrong. (Try the regression with treatment contrasts and see what you get.)
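Here is a sketch of that parenthetical suggestion (scar data frame again). The drop1() idiom with scope=.~. is one common way to get single-term deletions for every term, main effects included. With sum-to-zero contrasts those deletions should reproduce the Type III table; with the default treatment contrasts the "main effect" lines become tests of simple effects at the reference levels, which is a different thing entirely.
> lm.sum = lm(RSES ~ Sex * Cut.Scar, data=scar,
+             contrasts=list(Sex="contr.sum", Cut.Scar="contr.sum"))
> drop1(lm.sum, scope=.~., test="F")      # should agree with the Type III table
> lm.trt = lm(RSES ~ Sex * Cut.Scar, data=scar)       # default treatment contrasts
> drop1(lm.trt, scope=.~., test="F")      # the "main effects" are now something else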

Let's get rid of the interaction, and I don't mean dump it into the error term, I mean get rid of it. We've been agonizing over that one male cutter and whether or not he is representative of all male cutters. Now we're going to assume he's not. Now we're going to assume that male cutters are more like female cutters. I did that by changing the one male cutter's score from 28 to 23 (leaving some reasonable random error in the interaction term). Here is the analysis.

Type I Sex first
              Df  Sum Sq  Mean Sq F value Pr(>F)   
Sex            1   56.10  56.1006 2.38139 0.1251   
Cut.Scar       1  166.86 166.8578 7.08288 0.0087 **
Sex:Cut.Scar   1   11.14  11.1368 0.47274 0.4929   
Residuals    136 3203.88  23.5579

Type I Cut.Scar first
              Df  Sum Sq  Mean Sq F value Pr(>F)   
Cut.Scar       1  208.61 208.6065 8.85505 0.0035 **
Sex            1   14.35  14.3519 0.60922 0.4364   
Cut.Scar:Sex   1   11.14  11.1368 0.47274 0.4929  
Residuals    136 3203.88  23.5579

Type III
              Df  Sum Sq  Mean Sq F value Pr(>F)   
Sex            1   19.55    19.55  0.8297 0.3640
Cut.Scar       1   15.62    15.62  0.6628 0.4170
Sex:Cut.Scar   1   11.14    11.14  0.4727 0.4929
Residuals    136 3203.88    23.5579
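If you want to reproduce these tables yourself, one way to make the change is sketched below (there is only one male cutter in the scar data frame, so his row is easy to find). Then rerun the same aov() calls as before, substituting scar2 for scar.
> scar2 = scar
> scar2$RSES[scar2$Sex == "M" & scar2$Cut.Scar == "Yes"] = 23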
The interaction is now nowhere near being significant, so here are the Type II tests with the interaction deleted (pooled).
             Df  Sum Sq  Mean Sq  F value  Pr(>F)   
Sex           1   14.35  14.3519   0.6116  0.4355
Cut.Scar      1  166.86 166.8578   7.1102  0.0086 **
Residuals   137 3215.01  23.4672
We could have predicted that--only a very slight change due to dumping one more degree of freedom and a small amount of variability into the error term. The error mean square actually got a bit smaller, as happens when a dropped term has an F-value near or less than one. Gratifyingly, the effect sums of squares did not change, because deleting a term from the statistical model does not change the size of the effects in the data.

But the Type III test is still telling us there is no main effect of Cut.Scar, while the Type II test is telling us it is highly significant. (It is now a moot point that the Type III tests become the same as the Type II tests if the interaction is deleted, as it is SUPPOSEDLY true that the Type III tests are calculating the main effects "correctly" in the presence of the interaction term, and NOT OTHERWISE.)

So if the interaction term had been obviously nothing but random error, the situation wouldn't have been any clearer regarding the discrepancy between the Type II and Type III tests. In fact, it might actually have been a little worse.

Let's go back to the original data.

We know everything we need to know. Time to stop waffling about and make a decision as to how these data should be analyzed! First, the Type III tests are clearly wrong, or at the very best, clearly uninteresting. Okay, they found the same interaction as the other Types did, and if all we're going to look at is the interaction, then doing the Type III tests is not a disaster. But the Type III tests make an assumption about the population that is CLEARLY FALSE, namely that all these groups are the same size.

With experimental data, that is a reasonable assumption, maybe even a necessary assumption. Indeed, with experimental data, one might well ask, "Dude! Why aren't all your groups the same size?" If we were to ask that same question of these data, the answer would be, "Because they're not!"

"Yeah, but isn't it necessary to make that assumption to remove the confounding..." NO, it is not, and I've already demonstrated that! (Okay, you may have a tiny little point here, and I'll come back to it.) But for the moment, let me continue hammering this point. Type III tests treat all the cell means equally--perfectly reasonable for experimental data. But with data from intact groups, that means Type III tests treat the subjects unequally, in that subjects from smaller cells are more important to the analysis than are subjects from larger cells. Let me ask you this. Do you consider that one male cutter to be 59 times as important as any of the male noncutters, or 65 times as important as any of the female noncutters? No? Then why would you treat him as such in your statistical analysis? If you had to bet your hard earned money on which of those cell means most accurately reflects what is true in the corresponding population, would you pick a cell that has one subject in it, or a cell that has 65 subjects in it? Okay, this is an extreme case, but how is the logic any different with less extreme cases? I'm not even going to wait for your answer!

That leaves us to choose between the Type I and Type II tests. Type I/II tests would be the clear choice for complicated cases such as the analysis of the Quine data, because they allow model simplification, and model simplification is good for reasons already listed. Type III tests supposedly do not allow model simplification, although as we've seen, Type III tests become Type II tests when the model is simplified. So if you've stumbled into using Type III tests with nonexperimental data, SIMPLIFY THAT MODEL! But what about a simple little two-factor case like this one?

Once again if our sole interest is in the interaction, flip a coin. Both Types find the same interaction. If we can safely assume that the interaction doesn't exist, then once again it depends on what we want to find out, i.e., what hypotheses we want to test. If all we're interested in is A|B and B|A, then Type II is the way to go. If we're interested in finding out something about what A is like in the population regardless of how it crosses with B, then Type I is our test. If we want both of those things, then do them both, or run the Type I tests twice. All of these hypotheses make sense in the absence of an interaction.

What if we have a case like this one where the interaction is a little shaky, but we don't want to ignore it, or don't feel safe ignoring it? I could say go out and find some more male cutters and find out what male cutters are like. That would certainly be ideal. But we've run out of grant money, and our tenure decision is imminent, and we really need to get this into press. (The real world, unfortunately, just ain't an ideal place!)

The purpose of Type II and Type III tests is to find an interaction, and in the ABSENCE of an interaction, to find main effects. If we're uncertain about the interaction, then I can't see how either of them is useful. Thus, the only correct tests here--and this is once again MY HUMBLE OPINION--are the Type I tests. Furthermore, I may have advised Zach incorrectly. If the Type I tests could only have been done with one order of factors, then perhaps Cut.Scar should have been entered first. After all, the original question was, what is the relationship of self-cutting to self esteem? We found in our sample that self esteem was significantly lower in cutters than in noncutters. Question answered. However, we've also found that there may be a confound with Sex, in that it's possible, although highly uncertain at this point, that the effect is different in males than it is in females. This issue can only be resolved through further research.

Type I Cut.Scar first
              Df  Sum Sq  Mean Sq F value   Pr(>F)     inc. in R^2
Cut.Scar       1  176.01 176.0083 7.47130 0.007103 **     0.0508
Sex            1   21.33  21.3332 0.90556 0.342983        0.0062
Cut.Scar:Sex   1   65.72  65.7180 2.78964 0.097176 .      0.0190
Residuals    136 3203.88  23.5579
             139 3466.94  ### I wish R would print this!

Type I Sex first
              Df  Sum Sq  Mean Sq F value   Pr(>F)     inc. in R^2
Sex            1   63.65  63.6482 2.70178 0.102546        0.0184
Cut.Scar       1  133.69 133.6932 5.67509 0.018591 *      0.0386
Sex:Cut.Scar   1   65.72  65.7180 2.78964 0.097176 .      0.0190
Residuals    136 3203.88  23.5579
             139 3466.94

However, the best thing to do here would be to do the Type I tests using both orders of factors. Entering Cut.Scar second directly addresses the issue of the possible confound, i.e., if we take Sex out first, does the effect of Cut.Scar go away? We can't really say. It all hangs on how representative that one male cutter is. If we triple our sample size and find the same effects, then allowing for a similar variance in the Male-Yes cell, the analysis would look like this.

              Df   Sum Sq Mean Sq  F value     Pr(>F)      inc. in R^2
Cut.Scar       1   528.02 528.025 22.73513 2.5758e-06 ***     0.0505
Sex            1    64.00  63.999  2.75562   0.097667 .       0.0061
Cut.Scar:Sex   1   197.15 197.154  8.48885   0.003766 **      0.0189
Residuals    416  9661.63  23.225
             419 10450.80

              Df   Sum Sq Mean Sq  F value     Pr(>F)    
Sex            1   190.94 190.945  8.22149  0.0043502 **      0.0183
Cut.Scar       1   401.08 401.080 17.26926 3.9398e-05 ***     0.0384
Sex:Cut.Scar   1   197.15 197.154  8.48885  0.0037660 **      0.0189
Residuals    416  9661.63  23.225
             419 10450.80
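Here is a hypothetical reconstruction of that tripled data set, in case you want to play along: stack three copies of the original data frame, then replace the three Male-Yes scores with any values having a mean of 28 and a variance of 25 (I used 23, 28, and 33; the particular values don't change the sums of squares).
> scar3 = scar[rep(1:nrow(scar), 3), ]
> scar3$RSES[scar3$Sex == "M" & scar3$Cut.Scar == "Yes"] = c(23, 28, 33)
> summary(aov(RSES ~ Cut.Scar * Sex, data=scar3))
> summary(aov(RSES ~ Sex * Cut.Scar, data=scar3))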
What does it mean? The second analysis was done on the first data set tripled in size, except in cell Male-Yes, where the mean remained the same, but the scores were varied enough to give a variance of 25. This increased the total SS and error SS by a little more than 3 times, but not surprisingly, the effects SSes are exactly 3 times larger, and the effect sizes are almost the same. The interaction is now clearly significant, so would the interpretation be any different?

I don't see why it should be. We do have the option now of treating this like any other ANOVA problem and looking solely at the interaction by examining simple effects (see below). But it seems to me that we lose something by treating it that way.

I had a "theory" (or hypothesis I suppose) about how this study was going to turn out, and I had it as soon as Zach explained to me what he wanted to do. Over the many years I've worked here, I've seen a lot of senior research projects that used self esteem scores as the DV. I've never seen many of them turn up very much in the way of interesting results, but the one effect you can usually count on is an effect of gender. Young women (at least here at CCU) typically score a little lower on self esteem measures, usually by a few points on the average. And I expected that the vast majority of cutters would be young women. So I thought it might be true that Zach would see "an effect of Cut.Scar on self esteem," but I thought part, if not in fact a good bit, of that effect would be due to the gender confound.

That's why I think the Type I analysis is the correct one here. First we run it entering Cut.Scar first, or perhaps Cut.Scar only (it doesn't make much difference), and we see that there is an effect of Cut.Scar. It's a highly significant effect, but it's not a particularly large effect. Aha! That's because it's actually an effect of Sex, I thought. So if we take Sex out before looking at Cut.Scar (i.e., enter Sex first), what happens to the effect of Cut.Scar? It gets a bit smaller, but it doesn't go away. It's still significant.

Why couldn't we do that with Type II tests? Because Type II tests would not allow us to see how the effect of Cut.Scar changes when Sex is entered first (as compared to second). Also, if we're suspicious of that Sex effect when Sex is entered first (p=.1), we can see that it shrinks considerably when the effect of Cut.Scar is taken out (entered first).

Do we get anything else with the Type I tests that we wouldn't get with Type II/III? We can see that Sex and Cut.Scar are related to (confounded with) one another. If they weren't, flipping the order in which they are entered would have no effect. The tests would come out the same, as they did in the proportionally balanced design. Is there any way to tell how strongly they are related from the two Type I tables? Probably, but I don't know what it is. All I can tell you is, the greater the change caused by flipping, the more important the relationship is to understanding the phenomenon under investigation.

I'll admit that Type I tests can be confusing when you have a high order design, such as we saw in the Quine data. But for simple two-factor designs like this, it seems to me that you get a lot more out of the Type I tests than you do out of the Type II or III tests, if you know what to look for.

What about that marginally significant interaction? I doubt that's real, but let's suppose for a moment it is. Should we then even be looking at the main effects? I'm going to say something a little heretical here, but it makes sense to me. These are not main effects in the traditional sense of the words (in the "John Fox sense"). I've already argued that, in the presence of an interaction, traditional main effects are largely uninterpretable. But these are not "traditional main effects" in the sense that ANOVA people want them to be, as John Fox pointed out. These are wholly independent sequential effects.

Even in the table resulting from the tripled sample size, where the interaction is highly significant, we are still getting information from the test on A (Cut.Scar). In the population, there is a relationship between self esteem and self-cutting. The interaction doesn't replace that effect, it adds information to it. If the A effect is not a main effect in the traditional sense, then don't call it a main effect. But it's still an effect. The effect of Cut.Scar is not as large if we control it for Sex, but it's still there. And, in addition to all that, there is a Cut.Scar-by-Sex interaction. (I admit, I'm having a bit of a problem with the B effect.)

Type I tests seem to me to be less like traditional ANOVA--and I'm sure the ANOVA people will agree!--and more like hierarchical regression (see above) and Mediation Analysis (see below), both techniques that have been important in trying to understand observational data.

Three more points:

  1. "What about the inflation we were talking about above?" Not relevant, I don't think. Because we're no longer talking about traditional ANOVA main effects. (I could be wrong. It's been known to happen!)
  2. "Entering B after A in Type I (or II) tests does not remove ALL variability due to A from B." It does not. The AxB interaction is partly due to A, and it has not been removed at this stage. I'm sure there are situations where it is desirable to do so (true experiments), but I'm not sure this is one of them. You might say that the effect of B after A is B + AxB. But without B, there is no AxB. If we remove AxB from A or B, what the heck is it that we end up looking at?
  3. "You say it's undesirable to look at the data as if all groups were the same size, but that's exactly what you do when you look at AxB." Yes, because by that stage of the analysis, it's necessary. We've already looked at A and B. Now we want to see what AxB adds to that. That means A and B have to be taken out entirely from AxB.

NOTE: I found Zach's data! It was on an old computer that I no longer use, but fortunately still have. (Never throw anything away!) You can get it like this (type or copy and paste these lines).

file = "http://ww2.coastal.edu/kingw/statistics/R-tutorials/text/scar.csv"
scar = read.csv(file)
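
Incidentally, with the data in hand, the question raised above of how strongly Sex and Cut.Scar are related can be looked at directly. Here is a minimal sketch (assuming the two variables are named Sex and Cut.Scar, as above):

> with(scar, table(Sex, Cut.Scar))        # crosstab showing the confound
> with(scar, chisq.test(Sex, Cut.Scar))   # a rough test of association (check the expected counts first)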


A Final Note on Self-Cutting and Self Esteem

Are we sure we have the right dependent variable here? Maybe it would be more interesting to consider Cut.Scar as a binomial response variable, and enter Sex and RSES as explanatory variables into a Logistic Regression. Just sayin'.
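
If you wanted to pursue that, a minimal sketch might look like the following (this assumes Cut.Scar is a two-level factor, e.g. yes/no, and that Sex and RSES are named as above; adjust to taste):

> scar$Cut.Scar = factor(scar$Cut.Scar)                            # binomial response as a factor
> summary(glm(Cut.Scar ~ Sex + RSES, data=scar, family=binomial))  # logistic regression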


Final Examples (I Promise!)

ToothGrowth Revisited

Recall the analysis we did on a built-in data set called ToothGrowth in the Factorial ANOVA tutorial. Tooth (odontoblast) growth in guinea pigs was assessed following vitamin C supplementation. Two levels of supplement ("supp") and three levels of dose ("dose") were used. We assume subjects were randomly assigned to treatment conditions, because there is no reason they shouldn't have been. Here is the ANOVA summary table from the original balanced analysis.

            Df  Sum Sq Mean Sq F value    Pr(>F)
supp         1  205.35  205.35  15.572 0.0002312
dose         2 2426.43 1213.22  92.000 < 2.2e-16
supp:dose    2  108.32   54.16   4.107 0.0218603
Residuals   54  712.11   13.19
Suppose some mysterious and random malady caused us to lose 10% of our subjects. That's a substantial amount of subject attrition. Let's see what effect it has on the analysis. So that we might all get the same results, I'll use the function set.seed() to seed the random number generator.
> set.seed(12345)                 # this should lose the same subjects for everyone
> i = sample(x=1:60, size=54)     # pick out the 54 surviving subjects
> TG = ToothGrowth[sort(i),]      # get data on the survivors
> TG$fdose = factor(TG$dose, labels=c("low","mod","high"))      # make a dose factor
> with(TG, table(supp, fdose))    # check balance
    fdose
supp low mod high
  OJ  10  10    9
  VC   9   7    9
One group in particular was hit pretty hard, losing 30% of its subjects. This causes the interaction plot to change a little, but not really a whole lot. It appears we are still looking at the same effects.
> par(mfrow=c(1,2))
> with(ToothGrowth, interaction.plot(dose,supp,len, type="b", pch=1:2, bty="l"))
> title(main="Original")
> with(TG, interaction.plot(fdose,supp,len, type="b", pch=1:2, bty="l"))
> title(main="10% Attrition")
Of course, we wouldn't be able to make this comparison in the "real world," but it's interesting to look at nevertheless. A more motivated person than myself might run some simulations here to see what the extreme cases might look like.
[Figure: side-by-side interaction plots, "Original" and "10% Attrition"]
As these cell frequencies are random and meaningless with respect to the treatment conditions, I would still choose the Type III tests, although with this much attrition, I'd be a little nervous about it.
> source("aovIII.R")              # provided you have this in your working directory
> aovIII(len ~ supp * fdose, data=TG)
Single term deletions

Model:
len ~ supp * fdose
           Df Sum of Sq     RSS    AIC F value    Pr(>F)    
<none>                   633.68 144.98                      
supp        1    169.71  803.40 155.79 12.8554 0.0007857 ***
fdose       2   2216.41 2850.10 222.17 83.9442 < 2.2e-16 ***
supp:fdose  2    141.96  775.65 151.89  5.3767 0.0078163 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We caught a bit of a break here with the significant interaction. Since that's the effect of interest now, it's not so important how the main effects were tested, but just out of curiosity, let's look at the Type II table anyway (which I got by the trick of running the Type I tests twice with the factors reversed).
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  164.3   164.3  12.445 0.00093 ***
fdose        2 2215.8  1107.9  83.921 < 2e-16 ***
supp:fdose   2  142.0    71.0   5.377 0.00782 ** 
Residuals   48  633.7    13.2   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
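(Incidentally, a less tedious way to get the Type II tests is the Anova() function in the "car" package, which we'll meet again below. Assuming the package is installed, something like this should give the same F tests:)

> library("car")                              # for the Anova() function
> Anova(aov(len ~ supp * fdose, data=TG))     # Type II tests are the default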
The differences are small (one might even be inclined to say trivial), so even in the presence of a highly significant interaction, and with a more than minor amount of subject attrition, the Type II and Type III tests are performing just about the same. I'd cite the Type III tests here, but it's a moot point. It's the simple effects we should now be interested in.

The question is over how to test the simple effects. It seems to me that, if we are working on the pretense of equal cell weights when we do the omnibus ANOVA, then we should do the same when we test the simple effects. I don't know of any way in R to wholly automate this. Perhaps that's for the best! Perhaps we need to do a little thinking for ourselves.

I assume the following would be correct, and I'm sure if it isn't that I will soon be hearing from someone who will set me straight. You'll know as soon as I do!

> SS = function(x) sum(x^2) - sum(x)^2 / length(x)    # recreate the SS function if necessary
> harm.mean = function(x) 1/mean(1/x)                 # and the harmonic mean function
> with(TG, tapply(len, list(supp,fdose), mean)) -> means        # table of group means
> with(TG, tapply(len, list(supp,fdose), length)) -> sizes      # table of group sizes
> apply(means, 1, SS)        # unnecessary, but just to show you what it does
       OJ        VC 
 84.61276 171.21989
> apply(sizes, 1, harm.mean) # ditto
      OJ       VC 
9.642857 8.217391 
> apply(means, 1, SS) * apply(sizes, 1, harm.mean)    # simple effects sums of squares
       OJ        VC 
 815.9087 1406.9808
These would be the sums of squares for the simple effects of dose at supp=OJ and supp=VC. These SSes would be calculated around the unweighted marginal means. Next we'll get the mean squares and the F-values by using the omnibus error term. This assumes homogeneity of variance has not been violated.
> apply(means, 1, SS) * apply(sizes, 1, harm.mean) / 2          # mean squares
      OJ       VC 
407.9544 703.4904 
> apply(means, 1, SS) * apply(sizes, 1, harm.mean) / 2 / 13.2   # F
      OJ       VC 
30.90563 53.29473
We can do the same thing on margin 2 of the tables to get the simple effects tests on supp.
> apply(means, 2, SS) * apply(sizes, 2, harm.mean)              # SS and MS since df=1
     low      mod     high 
118.7898 183.2681   4.2050 
> apply(means, 2, SS) * apply(sizes, 2, harm.mean) / 13.2       # F
       low        mod       high 
 8.9992243 13.8839445  0.3185606
Following an example in Howell's textbook (see Refs), we could present these results in a summary table as follows.
Source               df      SS        MS       F      p
------------------------------------------------------------
Supplement
   supp at dose=low   1    118.79    118.79    9.00   .004
   supp at dose=mod   1    183.27    183.27   13.88  <.001
   supp at dose=high  1      4.21      4.21    0.32   .575
Dose
   dose at supp=OJ    2    815.91    407.95   30.91  <.001
   dose at supp=VC    2   1406.98    703.49   53.29  <.001

Error                48    633.68     13.20
------------------------------------------------------------
Note: The exact same thing could be done with a balanced design. Using harm.mean would be unnecessary, but it wouldn't matter, since the harmonic mean of the group sizes equals their ordinary mean when all the groups are the same size. (An aside: One has to wonder if we're looking at an interaction caused by a ceiling effect here. After all, how long can an odontoblast get?)
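
In case you're wondering, the p-values in that table come from the F distribution, which pf() will supply. A couple of examples using the values calculated above:

> pf(8.9992243, df1=1, df2=48, lower.tail=FALSE)     # supp at dose=low, about .004
> pf(30.90563, df1=2, df2=48, lower.tail=FALSE)      # dose at supp=OJ, well below .001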

Loneliness

Parris Claytor (Psyc 497 Fall 2011) collected data from two surveys that she administered via an online survey system (Sona). One survey was the Embarrassability Scale (Modigliani, A. 1968. Embarrassment and embarrassability. Sociometry, 31(3), 313-326), which measures susceptibility to being embarrassed. The other was the Emotional-Social Loneliness Inventory (Vincenzi, H. and Grabosky, F. 1987. Measuring the emotional/social aspects of loneliness and isolation. Journal of Social Behavior & Personality, 2(2), pt.2, 257-270), which measures emotional and social isolation (a score for each). She was interested in seeing if the scores for embarrassability were positively correlated with those for social isolation and emotional isolation. Her data can be retrieved as follows:

> file = "http://ww2.coastal.edu/kingw/statistics/R-tutorials/text/loneliness.csv"
> lone = read.csv(file)
> dim(lone)        # note: one case was deleted due to missing values
[1] 112   5
> summary(lone)
   embarrass        socialiso     emotiso        fembarrass   fsocialiso
 Min.   : 32.00   Min.   : 0   Min.   : 0.00   high   :28   high   :24  
 1st Qu.: 52.75   1st Qu.: 2   1st Qu.: 5.75   highmod:26   highmod:22  
 Median : 65.00   Median : 6   Median :11.00   low    :28   low    :33  
 Mean   : 64.90   Mean   : 7   Mean   :12.07   lowmod :30   lowmod :33  
 3rd Qu.: 76.25   3rd Qu.:10   3rd Qu.:16.00                            
 Max.   :111.00   Max.   :31   Max.   :40.00
The first three variables, the scores from the scales mentioned above, were (or will be) used in the tutorial on Simple Mediation Analysis. In that tutorial, our interest was in how emotional isolation (emotiso) is created by embarrassability (embarrass) and social isolation (socialiso). I'm going to pretend that I'm concerned about minor distributional problems in the two explanatory variables and convert them to factors. I did that by cutting them at the quartiles, the results being in "fembarrass" and "fsocialiso". A crosstabulation of those two variables will show the unbalanced condition of this design, which is due to the fact that these two variables are naturally related to each other.
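(In case you're curious how such quartile-based factors might be created from the raw scores, something like the following would do it. This is purely illustrative, since the factors are already in the data file, and I'm not claiming it reproduces them exactly.)

> qs = quantile(lone$embarrass, probs=seq(0, 1, .25))     # quartile cutpoints
> femb = cut(lone$embarrass, breaks=qs, include.lowest=TRUE, labels=c("low","lowmod","highmod","high"))
> table(femb)                                             # check the group sizes

Now the promised crosstabulation.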
> with(lone, table(fembarrass, fsocialiso))
          fsocialiso
fembarrass high highmod low lowmod
   high      12       7   2      7
   highmod    5       6   8      7
   low        2       3  15      8
   lowmod     5       6   8     11
The relationship ("confound") between these two variables is pretty obvious from this table. High embarrassability tends to go with high social isolation and low embarrassability with low social isolation.

Any of the three types of tests will show that the interaction is not significant but also is not trivial (F > 1).

                      Df Sum Sq Mean Sq F value   Pr(>F)   eta-sq
fembarrass:fsocialiso  9    512    56.9   1.670  0.10677    0.072
Attempting to draw an interaction plot finally made me realize the levels of the two factors need to be rearranged (they're ordinal), and after screwing that up once and having to reload the data, I finally got a respectable interaction plot.
> lone$fsocialiso = factor(lone$fsocialiso, levels=c("low","lowmod","highmod","high"))
> lone$fembarrass = factor(lone$fembarrass, levels=c("low","lowmod","highmod","high"))
> with(lone, interaction.plot(fsocialiso,fembarrass,emotiso,type="b",pch=1:4,bty="l"))
[Figure: interaction plot of emotiso by fsocialiso and fembarrass]
Examination of the interaction plot, and use of drop1(), convinced me that the interaction is both theoretically and statistically (by AIC) unimportant, so I decided to delete it and work with the additive model.
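A sketch of that drop1() step, in case you want to see it (from the full model, drop1() considers only the interaction term):

> aov.full = aov(emotiso ~ fembarrass * fsocialiso, data=lone)    # full factorial model
> drop1(aov.full, test="F")                                       # AIC and F test for the interaction only

Now, the additive model.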
> aov.mod1 = aov(emotiso ~ fembarrass + fsocialiso, data=lone)
> summary(aov.mod1)
             Df Sum Sq Mean Sq F value   Pr(>F)    
fembarrass    3    535   178.2   4.948  0.00298 ** 
fsocialiso    3   2756   918.8  25.507 1.79e-12 ***
Residuals   105   3782    36.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> aov.mod2 = aov(emotiso ~ fsocialiso + fembarrass, data=lone)
> summary(aov.mod2)
             Df Sum Sq Mean Sq F value   Pr(>F)    
fsocialiso    3   3278  1092.6  30.332 3.33e-14 ***
fembarrass    3     13     4.4   0.123    0.946    
Residuals   105   3782    36.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now I'll point out that, if we had done Type II tests, we'd be looking at this table right now.
             Df Sum Sq Mean Sq F value   Pr(>F)    
fembarrass    3     13     4.4   0.123    0.946
fsocialiso    3   2756   918.8  25.507 1.79e-12 ***
Residuals   105   3782    36.0
NOTE: Or this one if you don't believe in dropping interactions. It doesn't matter.
                      Df Sum Sq Mean Sq F value   Pr(>F)    
fembarrass             3     13     4.4   0.130    0.942
fsocialiso             3   2756   918.8  26.973 9.72e-13 ***
fembarrass:fsocialiso  9    512    56.9   1.670  0.10677    
Residuals             96   3270    34.1
And what would be lost? If I were reporting "main effects," this is the table that I would report (i.e., the first one, since I'm a dropper), so in that sense, nothing has been lost by just going straight to the Type II tests. I'll also show you this.
> drop1(aov.mod1, test="F")
Single term deletions

Model:
emotiso ~ fembarrass + fsocialiso
           Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>                  3782.3 408.19                      
fembarrass  3     13.29 3795.6 402.59   0.123    0.9463    
fsocialiso  3   2756.46 6538.7 463.50  25.507 1.787e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> drop1(aov.mod2, test="F")
Single term deletions

Model:
emotiso ~ fsocialiso + fembarrass
           Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>                  3782.3 408.19                      
fsocialiso  3   2756.46 6538.7 463.50  25.507 1.787e-12 ***
fembarrass  3     13.29 3795.6 402.59   0.123    0.9463    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The drop1() function reports the Type II tests no matter which order the factors are in, and in either case, AIC is saying "fembarrass" is not making an important contribution to the model. (Note: This does not work from the full model, i.e., with the interaction included, because drop1() respects marginality and will test only the interaction term in that case.)

I can't help pointing out though that, in the Type I tests, when "fembarrass" is entered first it is significant, and when it's entered second, it is not. When "fembarrass" is entered second, "fsocialiso" has gobbled up almost all of the variability once attributed to "fembarrass". I can't help wondering why. Hmmmm, let's think about that. Clearly, there is a relationship between "fembarrass" and "fsocialiso". Whatever could it be?

Is it just an annoying confound of the kind we wish to get rid of when the data are experimental? Or is it nature trying to tell us something? If you've already read the Simple Mediation Analysis tutorial, then you already know my answer. But if you haven't, think about it a moment.

Got it? Okay, here it is. I think embarrassability causes social isolation. When a person is easily embarrassed, he tends to avoid social situations where embarrassment might occur. Social isolation then causes a sense of emotional isolation (or loneliness). I'm proposing the following causal chain: "embarrass" -> "socialiso" -> "emotiso".

If I'd been thinking traditional analysis of variance all along, and of course that means, with these data, Type II tests, I would have missed that entirely! Unless, of course, I'd stopped to think. And once I did that, I would have realized that Type I tests are the tests I want here.

If my theory is true, then the Type I tests will show the pattern we have just seen. When the immediate cause is entered first, it "sucks all the life" out of the distant cause. When the distant cause is entered first, it claims the part of the variability that it creates by acting through the immediate cause. The result is very similar to what we would see in a mediation analysis. The data are observational, of course, so the data alone cannot prove the hypothesized causal links. We need additional theoretical justification for that.

But we have all the necessary relationships we would need to support the "causal chain" theory. The direct link between fembarrass and emotiso is strong, from fembarrass entered first (or entered only--it doesn't much matter). The indirect link between fembarrass and emotiso is weak to nonexistent, from fembarrass entered after fsocialiso. The link between fsocialiso and emotiso is strong even after fembarrass is removed, i.e., entered first. And finally, the link between fembarrass and fsocialiso is established by the fact that their effects change when the order in which they are entered is reversed.
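
For the curious, the same set of relationships can be sketched with the original numeric scores using ordinary regressions, in the spirit of the mediation tutorial (a minimal sketch, not a full mediation analysis):

> summary(lm(emotiso ~ embarrass, data=lone))              # the total effect of embarrassability
> summary(lm(socialiso ~ embarrass, data=lone))            # embarrassability -> social isolation
> summary(lm(emotiso ~ embarrass + socialiso, data=lone))  # the direct effect, controlling the mediator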

Despoliation of Movies

You can hardly breathe these days without being exposed to advertising. TV has become all but unwatchable due to advertising, and yet we sit there glued to the set. The advertisers have a hook in our noses, and they know it. Better yet, for them, they have a hook in our kids' noses. Of course people complain about it, but they still buy the products, and that's all the advertisers care about. They don't care whether we like it or not as long as we're good little sheep spending our money. Even movies, which used to be more about the artistry, are now just another venue for advertising.

Park and Berger (2010) studied the effectiveness of product placements in movies by genre of film. They wanted to know how well viewers recalled products they saw in movies depending upon whether the movie was an action flick, a comedy, or a drama. The data can be obtained at the University of Florida Department of Statistics website, and can be read directly into R as follows.

> file = "http://www.stat.ufl.edu/~winner/data/movie_brand_rec.dat"
> movies = read.table(file, col.names=c("gender","genre","recall"))
> head(movies)
  gender genre recall
1      1     1      1
2      1     1      1
3      1     1      1
4      1     1      1
5      1     1      2
6      1     1      2
> summary(movies)
     gender          genre           recall     
 Min.   :1.000   Min.   :1.000   Min.   :0.000  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000  
 Median :1.000   Median :2.000   Median :4.000  
 Mean   :1.314   Mean   :1.912   Mean   :3.964  
 3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000  
 Max.   :2.000   Max.   :3.000   Max.   :6.000  
> dim(movies)
[1] 137   3
(Note: In case that page disappears, as webpages are wont to do, I've also put the data here.)

Now we have some work to do to get these data prepared. Specifically, we need to make "gender" and "genre" into factors.

> movies$gender = factor(movies$gender, labels=c("female","male"))
> movies$genre = factor(movies$genre, labels=c("action","comedy","drama"))
> summary(movies)
    gender      genre        recall     
 female:94   action:53   Min.   :0.000  
 male  :43   comedy:43   1st Qu.:3.000  
             drama :41   Median :4.000  
                         Mean   :3.964  
                         3rd Qu.:5.000  
                         Max.   :6.000  
> with(movies, table(gender,genre))         # check for balance
        genre
gender   action comedy drama
  female     39     33    22
  male       14     10    19
The response, "recall", is a score based on the number of products recalled. Viewers were mostly college aged and were randomly assigned to the "genre" condition (although not to gender, of course). So far as I can tell from looking at the article, the researchers analyzed for recall by genre and recall by gender separately, but did not look at the data as a factorial design (or at least did not report that they did). We shall do so.

A good place to begin would be with an interaction plot.

> with(movies, interaction.plot(genre,gender,recall, type="b", pch=1:2, bty="l"))
I doubt we're going to see any interaction. It looks like we have two main effects. The researchers reported a significant effect of genre but no significant effect of gender.
[Figure: interaction plot of recall by genre and gender]

We should have already thought about this, but we didn't, so let's think at long last about how we're going to analyze this. The researchers did random assignment to genre and still managed to get a design way out of balance on that variable. On gender, which might be considered a blocking variable in this case, they apparently made no attempt to even get close to balance. Thus, we have a design which, for lack of a better word, I'll call "seriously" unbalanced.

Given the cautions we've had about heterogeneity of variance in unbalanced designs, we would do well to check.

> with(movies, tapply(recall, list(gender,genre), var))
         action   comedy    drama
female 2.202429 1.234848 1.965368
male   1.412088 1.600000 1.912281
Looks okay. Now, to the ANOVA. I can't think of any reason why either of these variables would be logically prior to the other. That would suggest Type II or Type III tests, and given the unbalanced condition of the data, I'd go with Type II. (Type III might be a consideration because subjects were randomly assigned to genre and crossed with a blocking variable.) I'm curious about something though. Looking at the interaction plot and size of those variances, I'm a little surprised the researchers didn't find a significant difference in recall due to gender.
> t.test(recall ~ gender, data=movies, var.eq=T)

	Two Sample t-test

data:  recall by gender
t = -1.9967, df = 135, p-value = 0.04787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.982932294 -0.004697592
sample estimates:
mean in group female   mean in group male 
            3.808511             4.302326
I suspect we'd see the same thing if we enter gender first into a Type I analysis. And since it only takes a few seconds to check...
> summary(aov(recall ~ gender * genre, data=movies))
              Df Sum Sq Mean Sq F value Pr(>F)  
gender         1   7.19   7.195   4.068 0.0457 *
genre          2  11.59   5.793   3.276 0.0409 *
gender:genre   2   0.38   0.189   0.107 0.8988  
Residuals    131 231.66   1.768                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What's going on? (Note: If we delete the interaction, that effect is only going to become "more significant," due to dropping two more degrees of freedom into error but only a very tiny amount of variability.) The researchers presented a one-way ANOVA table to show the effect of genre, but they did not specify the test they did on gender, just that it was nonsignificant. Obviously, it wasn't a t-test (or equivalent one-way ANOVA). Let's flip the variables in our Type I tests and see what happens. Once again, it takes only a few seconds. It takes me longer to copy and paste it to this page than it does to do the test in R!
> summary(aov(recall ~ genre * gender, data=movies))
              Df Sum Sq Mean Sq F value Pr(>F)  
genre          2  14.38   7.191   4.066 0.0193 *
gender         1   4.40   4.399   2.488 0.1171  
genre:gender   2   0.38   0.189   0.107 0.8988  
Residuals    131 231.66   1.768                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I suspect this is why. They wanted a bigger F-value and a smaller, less marginal, p-value. In effect, they reported Type I tests with genre entered first. (The one-way ANOVA result is F(2,134) = 4.08, p = .019.) I don't see the justification for that in this case. I can see minor arguments for doing the Type I tests either way around, but I don't have any prior reason for doing that, and I don't think the arguments for it are very strong. Since the interaction is dead in the water, that means we are looking at main effects. We should probably delete the interaction term to look at the main effects, but as that makes hardly any difference at all in this example, I'll leave it.
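
(If you want to verify that it makes hardly any difference, a quick check would be something like this, which tests each main effect adjusted for the other in the additive model:)

> drop1(aov(recall ~ genre + gender, data=movies), test="F")     # each main effect adjusted for the other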

What we want in this case are traditional ANOVA main effects, i.e., what has been referred to as the "full benefit" of our factorial design. We want to see what the effect of genre is after we remove the confound with gender, and we want to see the effect of gender after we remove the confound with genre.

How do we know that? Is there the possibility of a natural relationship between genre and gender that we want to take into account? Maybe if the researchers had let subjects choose the movies they wanted to watch, that would be the case. In fact, I suspect there would be a strong relationship between gender and the kinds of movies men and women choose to watch, but that's not how the study was set up. In this case, any relationship we see between genre and gender is an undesirable confound. I would not hesitate to choose Type II tests.

> library("car")
> aov.out = aov(recall ~ genre * gender, data=movies)
> Anova(aov.out)
Anova Table (Type II tests)

Response: recall
              Sum Sq  Df F value  Pr(>F)  
genre         11.587   2  3.2761 0.04089 *
gender         4.399   1  2.4877 0.11715  
genre:gender   0.378   2  0.1068 0.89879  
Residuals    231.658 131                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus, the correct report is that the effect of gender with genre removed (controlled for genre) is nonsignificant. The effect of genre with gender removed is F(2,131) = 3.28, p = .04, and not F = 4.08 as reported in the article, which is the effect of genre still confounded with gender. I.e., part of the genre effect reported in the article is actually a gender effect.

I have to wonder why this design wasn't balanced. What we essentially have here is a treatment by blocks design, with subjects randomly assigned to treatments but blocked by gender. Yet there are more than twice as many females as males. I get that if they took whoever volunteered, and their volunteers were coming from a pool of, say, psychology majors. Psych majors are generally predominantly female. But I'm sure these researchers had no intention of restricting their ability to generalize these results to psychology majors. Therefore, representing the population proportions in the cell frequencies is moot. Okay, so they took who they could get (volunteers), but they randomly assigned to genre. Why isn't the design at least balanced within genders (proportional balance)?

The more I'm thinking about this, the more I'm thinking we want to analyze as if the design were balanced. With this degree of imbalance, however, I would do that very cautiously. Nevertheless, let's give it a shot just to see what happens.

> ### Step 0: make sure the "car" package has been loaded
> ### Step 1: reset the contrasts
> options(contrasts=c("contr.sum","contr.poly"))
> ### Step 2: do the anova with the sum-to-zero contrasts
> aov.outIII = aov(recall ~ genre * gender, data=movies)
> ### Step 3: get the Type III tests
> Anova(aov.outIII, type=3)
Anova Table (Type III tests)

Response: recall
              Sum Sq  Df   F value  Pr(>F)    
(Intercept)  1831.75   1 1035.8317 < 2e-16 ***
genre          11.62   2    3.2854 0.04053 *  
gender          4.56   1    2.5786 0.11073    
genre:gender    0.38   2    0.1068 0.89879    
Residuals     231.66 131                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The differences are trivial. But if they were not trivial, I'd trust the Type II tests long before I'd trust the Type III tests.
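
One housekeeping note: after getting the Type III tests, set the contrasts back to R's defaults so that later analyses in the same session aren't silently affected.

> ### Step 4: restore the default contrasts when you're done
> options(contrasts=c("contr.treatment","contr.poly"))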

  • Park, D-J. and Berger, B.K. (2010). Brand placement in movies: The effect of film genre on viewer recognition. Journal of Promotion Management, 16, 428-444.

created 2016 February 28; updated 2016 March 25