R Tutorials by William B. King, Ph.D.
| Table of Contents | Function Reference | Function Finder | R Project |

MODEL FORMULAE


This is a short tutorial on writing model formulae for ANOVA and regression analyses. It will be linked to from those tutorials, but you are welcome to read it just for kicks if you'd like.

R functions such as aov(), lm(), and glm() use a formula interface to specify the variables to be included in the analysis. The formula determines the model that will be built and tested by the R procedure. The basic format of such a formula is...

response variable ~ explanatory variables

The tilde (the symbol between the response and explanatory variables) should be read "is modeled by" or "is modeled as a function of." The trick is in how the explanatory variables are specified.

A basic regression analysis would be formulated this way...

y ~ x

...where "x" is the explanatory variable or IV, and "y" is the response variable or DV. Additional explanatory variables would be added in as follows:

y ~ x + z

...which would make this a multiple regression with two predictors. This raises a critical issue that must be understood to get model formulae correct. Symbols used as mathematical operators in other contexts do not have their usual mathematical meaning inside model formulae. The following table lists the meaning of these symbols when used in a formula.
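To make this concrete, here is a short sketch of both formulas in action. The data are simulated and the names "x", "z", and "y" are just placeholders:

```r
## Simulated data for illustration only
set.seed(42)
x <- rnorm(20)
z <- rnorm(20)
y <- 2 + 3 * x - z + rnorm(20)

fit1 <- lm(y ~ x)        # simple regression: y modeled by x
fit2 <- lm(y ~ x + z)    # multiple regression: two predictors

coef(fit1)               # intercept and one slope
coef(fit2)               # intercept plus one slope per predictor
```

Notice that "+" here means "also include z as a predictor," not "add x and z together."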

symbol   example             meaning
------   -------             -------
+        + x                 include this variable
-        - x                 delete this variable
:        x : z               include the interaction between these variables
*        x * z               include these variables and the interactions between them
/        x / z               nesting: include z nested within x
|        x | z               conditioning: include x given z
^        (u + v + w + z)^3   include these variables and all interactions up to three way
poly     poly(x, 3)          polynomial regression: orthogonal polynomials
Error    Error(a/b)          specify an error term
I        I(x*z)              as is: include a new variable consisting of these
                             variables multiplied; I(x^2) means include this
                             variable squared, etc. In other words, I( ) isolates
                             the mathematical operations inside it.
1        - 1                 intercept: delete the intercept (regress through the origin)
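The I( ) entry in the table is worth a demonstration, since it trips up almost everyone at first. In a formula, ^ is a formula operator, so a quadratic term must be protected with I( ). The data below are simulated for illustration:

```r
## Simulated data for illustration only
set.seed(1)
x <- rnorm(30)
y <- 1 + x + 0.5 * x^2 + rnorm(30)

## y ~ x^2 is NOT a quadratic: in a formula, ^ means "interactions up
## to", so with a single variable it collapses to just x.
fit.wrong <- lm(y ~ x^2)
length(coef(fit.wrong))       # 2: intercept and x only

## I(x^2) isolates the arithmetic, giving a true quadratic term.
fit.quad <- lm(y ~ x + I(x^2))
length(coef(fit.quad))        # 3: intercept, x, and x squared
```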

You may have noticed already that some formula structures can be specified in more than one way.

y ~ u + v + w + u:v + u:w + v:w + u:v:w
y ~ u * v * w
y ~ (u + v + w)^3


All three of these specify a model in which the variables "u", "v", "w", and all the interactions between them are included. Any of these formats...

y ~ u + v + w + u:v + u:w + v:w
y ~ u * v * w - u:v:w
y ~ (u + v + w)^2


...would delete the three way interaction but include the two way interactions.
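You can verify such equivalences yourself by asking R to expand a formula with terms(). The following sketch compares the term labels of the three two-way formulations:

```r
## Three ways of writing the same model
f1 <- y ~ u + v + w + u:v + u:w + v:w
f2 <- y ~ u * v * w - u:v:w
f3 <- y ~ (u + v + w)^2

## Helper: extract and sort the expanded model terms
lab <- function(f) sort(attr(terms(f), "term.labels"))

identical(lab(f1), lab(f2))   # TRUE
identical(lab(f1), lab(f3))   # TRUE
```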

The nature of the variables--binary, categorical (factors), numerical--will determine the nature of the analysis. For example, if "u" and "v" are factors...

y ~ u + v

...dictates an analysis of variance (without the interaction term). If "u" and "v" are numerical, the same formula would dictate a multiple regression. If "u" is numerical and "v" is a factor, then an analysis of covariance is dictated.
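A small sketch makes the point. The data frame and variable names below are invented; the only thing that changes between the two fits is the class of the predictors:

```r
## Simulated data for illustration only
set.seed(7)
d <- data.frame(u = factor(rep(c("a", "b"), each = 10)),
                v = factor(rep(c("lo", "hi"), times = 10)),
                y = rnorm(20))

## u and v are factors, so this is a two-way ANOVA (no interaction)
m.aov <- aov(y ~ u + v, data = d)
summary(m.aov)

## Recode u and v as numeric, and the identical formula becomes a
## multiple regression
d2 <- transform(d, u = as.numeric(u), v = as.numeric(v))
m.reg <- lm(y ~ u + v, data = d2)
coef(m.reg)
```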

A Note On The Error() Term

Probably the most confusing thing about the R formula interface is the Error() term. Fortunately, this Error term is unnecessary in designs that are between subjects or completely randomized. In factorial designs with factors completely crossed, for example, the error term is not used. If any subject, say Fred, can be assigned at random to any cell in the design, then we have a completely randomized design, and no Error term is needed. That's because there is one and only one way to calculate error, which is the variability of scores (subjects) within the treatment cells. R will figure it out.

The Error term becomes necessary when there is some restriction on randomization, e.g., when the design includes such features as nesting, blocking, repeated measures, or within subjects factors. In a randomized block design, for example, Fred will fall in one and only one of the blocks. Within that block, he'll be assigned at random to a treatment condition, but there is no chance that he will be assigned to the block "people from Pittsburgh" if he lives in Cleveland.

The problem arises because of the notation that R uses. It differs from traditional statistical notation. Keppel (1973), when discussing nesting, makes an analogy to the case of subjects in a single-factor between subjects design. If the factor is A, and A has k levels, Fred will be assigned to one of those levels, as will all other subjects. Subjects (thought of as a factor with n levels) is not crossed with A. The error term for the ANOVA is calculated from subjects within treatment levels, and Keppel denotes this variability component as S/A, subjects within A, as do most other books on experimental design with which I am familiar. In R's notation, it would be the other way around.

In a single-factor repeated measures design, on the other hand, Keppel uses AxS to denote the error component of variability, because that's what it is. Subjects, thought of as another factor, is crossed with the treatment, A, and it is the treatment-by-subjects interaction that constitutes error. In R, the error term would be Error(S/A), which would be read as treatments within subjects.

I'm not really sure I can help you out much here, if you're used to the older notation, because R's notation is confusing. I'll tell you basically how the error term works, however. The error term is often going to be Error(factor1/factor2/factor3/...), with the understanding that subjects may be one of those factors. The order in which the factors are listed is from most inclusive to least inclusive.

Let's say you have a nested design: townships within counties within states. To recognize this nesting in the Error term you would use Error(states/counties/townships). If you have gardens within fields within plots in a split plot design, the Error is Error(plots/fields/gardens). So far so good. The confusion arises (at least mine does) with repeated measures designs.

In repeated measures designs, various effects are going to be seen within subjects. For example, in a single-factor design, the effect of A is seen within subjects. That is, in principle at least, the effect can be seen within Fred and within each and every subject, because each subject is measured at each of the k levels of A. Thus, A (think of it as the effect of A now) is within S, and the error term is Error(S/A), or Error(subject/treatment), where "subject" is the name of the subject identifier in your data frame. Subjects encompass, or engulf, the treatment, like counties engulf townships.
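A sketch of that single-factor repeated measures analysis, with an invented data frame whose columns are subject, treatment, and score:

```r
## Simulated repeated measures data: 6 subjects, each measured at
## all 3 levels of treatment
set.seed(3)
d <- expand.grid(subject   = factor(1:6),
                 treatment = factor(c("t1", "t2", "t3")))
d$score <- rnorm(nrow(d)) + as.numeric(d$treatment)

## Error(subject/treatment): the treatment effect is seen within subjects
rm.fit <- aov(score ~ treatment + Error(subject/treatment), data = d)
summary(rm.fit)
```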

In a two-factor design with measures repeated on both factors (A and B), all of the effects can, in principle, be seen within Fred. Fred has a score in every cell of the design table, and so we can see the main effect of A, the main effect of B, and the AxB interaction all within Fred (and every other subject). Thus, the error term is Error(subject/(A+B+A:B)), listing all of the effects seen within subjects after the slash. Equivalently, in R's notation, that could be written as Error(subject/(A*B)). The second set of parentheses in the "denominator" of the error term is essential, because / binds more tightly than +, so without them subject/A+B+A:B would be parsed as (subject/A)+B+A:B, which is not the intended error structure.

In a two-factor design in which only A is tested within subjects and B is tested between subjects, the A effect can be seen within each subject, but the B effect, and therefore also the AxB interaction, can only be seen between two different subjects. We would have to look at data from both Fred and Sam to see B and AxB. Thus, the error term is Error(subject/A).
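Here is a sketch of that mixed design. All of the names (subject, A, B, score) are invented for the example:

```r
## Simulated mixed design: A within subjects, B between subjects.
## Subjects 1-4 are in group b1, subjects 5-8 in group b2.
set.seed(9)
d <- expand.grid(subject = factor(1:8),
                 A       = factor(c("a1", "a2")))
d$B <- factor(ifelse(as.numeric(d$subject) <= 4, "b1", "b2"))
d$score <- rnorm(nrow(d))

## Only A appears after the slash, because only A is seen within subjects
mix.fit <- aov(score ~ A * B + Error(subject/A), data = d)
summary(mix.fit)
```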

I'll leave further discussion of this to within the individual tutorials, where we can have some concrete examples.

  • Keppel, G. (1973). Design and Analysis: A Researcher's Handbook. Englewood Cliffs, NJ: Prentice-Hall.

revised 2016 January 13; updated 2016 February 10