This is a short tutorial on writing model formulae for ANOVA and regression analyses. It will be linked to from those tutorials, but you are welcome to read it just for kicks if you'd like.
R functions such as aov() and glm() use a formula
interface to specify the variables to be included in the analysis. The formula
determines the model that will be built and tested by the R procedure. The
basic format of such a formula is...
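As a minimal sketch of that format, the response goes on the left of the tilde and the explanatory variables go on the right. The names y and x here are placeholders of my own choosing, not variables from any particular data set.

```r
# Hypothetical names: y is the response, x is an explanatory variable.
# A formula is an ordinary R object; y and x need not exist yet.
f <- y ~ x

class(f)       # the object's class is "formula"
all.vars(f)    # the variable names used, response first
f[[2]]         # left-hand side of the tilde (the response)
f[[3]]         # right-hand side of the tilde (the explanatory part)
```

The tilde can be read as "is modeled by" or "is a function of."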
A basic regression analysis would be formulated this way...
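Here is a small sketch of a simple regression using lm(). The data are simulated purely for illustration, and the names dat, x, y, and fit_reg are assumptions of mine.

```r
# Simulated data (an assumption for illustration): y depends linearly on x.
set.seed(1)
dat <- data.frame(x = 1:20)
dat$y <- 2 + 3 * dat$x + rnorm(20)

# Response on the left of the tilde, predictor on the right.
fit_reg <- lm(y ~ x, data = dat)

coef(fit_reg)      # estimated intercept and slope
summary(fit_reg)   # the usual regression table
```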
You may have noticed already that some formula structures can be specified in
more than one way.
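One way to check that two formulas specify the same model is to ask R which terms each one expands to. The sketch below does this for three spellings of a two-factor model with an interaction; y, u, and v are placeholder names.

```r
# Three ways of writing "both main effects plus their interaction".
# terms() expands a formula; "term.labels" lists the resulting model terms.
labels_star  <- attr(terms(y ~ u * v), "term.labels")
labels_long  <- attr(terms(y ~ u + v + u:v), "term.labels")
labels_power <- attr(terms(y ~ (u + v)^2), "term.labels")

labels_star
identical(labels_star, labels_long)
identical(labels_star, labels_power)
```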
The nature of the variables--binary, categorical (factors), numerical--will
determine the nature of the analysis. For example, if "u" and "v" are factors...
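To illustrate how the type of a variable changes the analysis, the sketch below fits the same simulated predictor twice, once as a number and once as a factor. The names u_num, u, dat, fit_num, and fit_fac are assumptions made for this example.

```r
# Simulated data (an assumption for illustration): three levels, four cases each.
set.seed(2)
dat <- data.frame(u_num = rep(c(1, 2, 3), each = 4))
dat$y <- 10 + 2 * dat$u_num + rnorm(12)
dat$u <- factor(dat$u_num)   # the same information, stored as a factor

# Numeric predictor: a regression with a single slope.
fit_num <- lm(y ~ u_num, data = dat)

# Factor predictor: a one-way ANOVA-style model, one coefficient per
# non-reference level.
fit_fac <- lm(y ~ u, data = dat)

anova(fit_num)   # 1 df for the predictor
anova(fit_fac)   # 2 df for the predictor (3 levels minus 1)
```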
A Note On The Error() Term
Probably the most confusing thing about the R formula interface is the Error() term. Fortunately, this Error term is unnecessary in designs that are between subjects or completely randomized. In factorial designs with factors completely crossed, for example, the error term is not used. If any subject, say Fred, can be assigned at random to any cell in the design, then we have a completely randomized design, and no Error term is needed. That's because there is one and only one way to calculate error, which is the variability of scores (subjects) within the treatment cells. R will figure it out.
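A minimal sketch of a completely randomized two-factor design follows. The data are simulated, and the names dat, A, B, score, and fit_between are assumptions of mine. No Error() term appears in the formula.

```r
# Simulated data (an assumption for illustration): a 2 x 3 between-subjects
# design with 5 different subjects in each cell.
set.seed(3)
dat <- expand.grid(rep = 1:5,
                   A = factor(c("a1", "a2")),
                   B = factor(c("b1", "b2", "b3")))
dat$score <- rnorm(nrow(dat), mean = 50, sd = 10)

# Error comes from variability within cells, so no Error() term is needed.
fit_between <- aov(score ~ A * B, data = dat)
summary(fit_between)
```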
The Error term becomes necessary when there is some restriction on randomization, e.g., when the design includes such features as nesting, blocking, repeated measures, or within subjects factors. In a randomized block design, for example, Fred will fall in one and only one of the blocks. Within that block, he'll be assigned at random to a treatment condition, but there is no chance that he will be assigned to the block "people from Pittsburgh" if he lives in Cleveland.
The problem arises because of the notation that R uses. It differs from traditional statistical notation. Keppel (1973), when discussing nesting, makes an analogy to the case of subjects in a single-factor between subjects design. If the factor is A, and A has k levels, Fred will be assigned to one of those levels, as will all other subjects. Subjects (thought of as a factor with n levels) is not crossed with A. The error term for the ANOVA is calculated from subjects within treatment levels, and Keppel denotes this variability component as S/A, subjects within A, as do most other books on experimental design with which I am familiar. In R's notation, it would be the other way around.
In a single-factor repeated measures design, on the other hand, Keppel uses AxS to denote the error component of variability, because that's what it is. Subjects, thought of as another factor, is crossed with the treatment, A, and it is the treatment-by-subjects interaction that constitutes error. In R, the error term would be Error(S/A), which would be read as treatments within subjects.
I'm not really sure I can help you out much here if you're used to the older notation, because R's notation is confusing. I can, however, tell you basically how the error term works. The error term is often going to be Error(factor1/factor2/factor3/...), with the understanding that subjects may be one of those factors. The factors are listed in order from most inclusive to least inclusive.
Let's say you have a nested design: townships within counties within states. To recognize this nesting in the Error term you would use Error(states/counties/townships). If you have gardens within fields within plots in a split plot design, the Error is Error(plots/fields/gardens). So far so good. The confusion arises (at least mine does) with repeated measures designs.
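To see what the slash does, you can ask R to expand a nested expression into its terms. This sketch uses the states/counties/townships example; nothing is fitted, so no data are needed.

```r
# Expand the nesting operator: each term sits inside the one before it.
nested <- attr(terms(~ states/counties/townships), "term.labels")
nested
```

Each successive label adds one more level of nesting, which is the "most inclusive to least inclusive" ordering.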
In repeated measures designs, various effects are going to be seen within subjects. For example, in a single-factor design, the effect of A is seen within subjects. That is, in principle at least, the effect can be seen within Fred and within each and every subject, because each subject is measured at each of the k levels of A. Thus, A (think of it as the effect of A now) is within S, and the error term is Error(S/A), or Error(subject/treatment), where "subject" is the name of the subject identifier in your data frame. Subjects encompass, or engulf, the treatment, like counties engulf townships.
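Here is a sketch of a single-factor repeated measures analysis on simulated data. The names dat, subject, A, score, and fit_rm are assumptions for illustration. Note that the subject identifier is stored as a factor.

```r
# Simulated data (an assumption for illustration): 6 subjects, each measured
# once at each of the 3 levels of A. Long format, one row per measurement.
set.seed(4)
dat <- expand.grid(subject = factor(1:6),
                   A = factor(c("a1", "a2", "a3")))
dat$score <- rnorm(nrow(dat), mean = 50, sd = 10)

# A is seen within subjects, so the error term is Error(subject/A).
fit_rm <- aov(score ~ A + Error(subject/A), data = dat)

summary(fit_rm)
names(fit_rm)   # the error strata R has set up
```

The effect of A is tested in the subject:A stratum, which is the treatment-by-subjects interaction.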
In a two-factor design with measures repeated on both factors (A and B), all of the effects can, in principle, be seen within Fred. Fred has a score in every cell of the design table, and so we can see the main effect of A, the main effect of B, and the AxB interaction all within Fred (and every other subject). Thus, the error term is Error(subject/(A+B+A:B)), listing all of the effects seen within subjects after the slash. Equivalently, in R's notation, that could be written as Error(subject/(A*B)). The second set of parentheses in the "denominator" of the error term is absolutely essential even when no ambiguity could arise. I'm not sure why.
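The sketch below fits such a design on simulated data, then compares how R expands the term with and without the inner parentheses. The names dat, subject, A, B, score, and fit_rm2 are assumptions for illustration. One likely explanation for the parentheses is ordinary operator precedence: R parses subject/A*B as (subject/A)*B, which is a different set of terms.

```r
# Simulated data (an assumption for illustration): 5 subjects, each measured
# once in every cell of a 2 x 3 design.
set.seed(5)
dat <- expand.grid(subject = factor(1:5),
                   A = factor(c("a1", "a2")),
                   B = factor(c("b1", "b2", "b3")))
dat$score <- rnorm(nrow(dat), mean = 50, sd = 10)

# All effects are seen within subjects.
fit_rm2 <- aov(score ~ A * B + Error(subject/(A * B)), data = dat)
summary(fit_rm2)

# Compare the expansions with and without the inner parentheses.
with_parens    <- attr(terms(~ subject/(A * B)), "term.labels")
without_parens <- attr(terms(~ subject/A * B), "term.labels")
with_parens      # every term is nested within subject
without_parens   # B now also appears on its own, outside subject
```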
In a two-factor design in which only A is tested within subjects and B is tested between subjects, the A effect can be seen within each subject, but the B effect, and therefore also the AxB interaction, can only be seen between two different subjects. We would have to look at data from both Fred and Sam to see B and AxB. Thus, the error term is Error(subject/A).
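A sketch of this mixed design on simulated data follows. The names dat, subject, A, B, score, and fit_mixed are assumptions for illustration. Each subject belongs to only one level of B but is measured at every level of A.

```r
# Simulated data (an assumption for illustration): 6 subjects, 3 in each level
# of the between-subjects factor B, each measured at all 3 levels of A.
set.seed(6)
dat <- expand.grid(subject = factor(1:6),
                   A = factor(c("a1", "a2", "a3")))
dat$B <- factor(ifelse(as.integer(dat$subject) <= 3, "b1", "b2"))
dat$score <- rnorm(nrow(dat), mean = 50, sd = 10)

# Only A is seen within subjects, so only A follows the slash.
fit_mixed <- aov(score ~ A * B + Error(subject/A), data = dat)

summary(fit_mixed)
names(fit_mixed)   # the error strata R has set up
```

In the summary, B should appear in the subject stratum, while A and the A:B interaction should appear in the subject:A stratum.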
I'll leave further discussion of this to within the individual tutorials, where we can have some concrete examples.
revised 2016 January 13; updated 2016 February 10