This is a short tutorial on writing model formulae for ANOVA and regression
analyses. It will be linked to from those tutorials, but you are welcome to read
it just for kicks if you'd like.
R functions such as aov(), lm(), and glm() use a formula
interface to specify the variables to be included in the analysis. The formula
determines the model that will be built (and tested) by the R procedure. The
basic format of such a formula is...
response variable ~ explanatory variables
The tilde should be read "is modeled by" or "is modeled as a function of." The
trick is in how the explanatory variables are given.
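As a minimal sketch of the interface, the following uses R's built-in "cars" data set (an illustrative choice, not one taken from this tutorial) to show that a formula is an ordinary R object that can be stored and then handed to a modeling function. The name "stopping" is made up for the example.

```r
# Illustrative only: `cars` ships with R and has columns `speed` and `dist`.
stopping <- dist ~ speed      # read: "dist is modeled as a function of speed"

class(stopping)               # "formula"
all.vars(stopping)            # "dist" "speed"

# Hand the stored formula to lm(); `data` says where to look up the variables.
fit <- lm(stopping, data = cars)
names(coef(fit))              # "(Intercept)" "speed"
```

The response always goes to the left of the tilde and the explanatory variables to the right, whichever modeling function receives the formula.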
A basic regression analysis would be formulated this way...
y ~ x
...where "x" is the explanatory variable or IV, and "y" is the response variable
or DV. Additional explanatory variables would be added in as follows...
y ~ x + z
...which would make this a multiple regression with two predictors. This raises
a critical issue that must be understood to get model formulae correct. Symbols
used as mathematical operators in other contexts do not have their usual
mathematical meaning inside model formulae. The following table lists the
meaning of these symbols when used in a formula.
| Symbol | Example | Meaning |
|---|---|---|
| + | + x | include this variable |
| - | - x | delete this variable |
| : | x : z | include the interaction between these variables |
| * | x * z | include these variables and the interactions between them |
| / | x / z | nesting: include z nested within x |
| \| | x \| z | conditioning: include x given z |
| ^ | (u + v + w)^3 | include these variables and all interactions up to three way |
| poly | poly(x,3) | polynomial regression: orthogonal polynomials |
| Error | Error(a/b) | specify the error term |
| I | I(x*z) | as is: include a new variable consisting of these variables multiplied |
| 1 | - 1 | intercept: delete the intercept (regress through the origin) |
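One way to check what a formula symbol does is to ask terms() how R expands it. The sketch below does this for a few of the operators in the table. The names "y", "x", and "z" are placeholders, "expand" is a hypothetical helper written for this example, and no data are needed because terms() only parses the formula.

```r
# Hypothetical helper: the labels of the terms R would fit for a formula.
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ x + z)      # "x" "z"          two separate terms, not a sum
expand(y ~ x * z)      # "x" "z" "x:z"    main effects plus their interaction
expand(y ~ x / z)      # "x" "x:z"        z nested within x
expand(y ~ I(x * z))   # "I(x * z)"       one term: the arithmetic product
expand(y ~ x + z - z)  # "x"              minus removes a term

# The intercept is tracked separately: 1 = present, 0 = removed.
attr(terms(y ~ x), "intercept")       # 1
attr(terms(y ~ x - 1), "intercept")   # 0
```

Comparing the x * z and I(x * z) lines shows the difference between the formula meaning of a symbol and its arithmetic meaning.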
You may have noticed already that some formula structures can be specified in
more than one way...
y ~ u + v + w + u:v + u:w + v:w + u:v:w
y ~ u * v * w
y ~ (u + v + w)^3
All three of these specify a model in which the variables "u", "v", "w", and all
the interactions between them are included. Any of these formats...
y ~ u + v + w + u:v + u:w + v:w
y ~ u * v * w - u:v:w
y ~ (u + v + w)^2
...would delete the three-way interaction.
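The equivalence of these spellings can be checked without fitting anything. The following is a small sketch that compares the expanded term labels. "term_labels" is a hypothetical helper name, and "y", "u", "v", and "w" are just the placeholder names used above.

```r
# Hypothetical helper: the labels of the terms R would fit for a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

# Three spellings of the full model.
full_a <- term_labels(y ~ u + v + w + u:v + u:w + v:w + u:v:w)
full_b <- term_labels(y ~ u * v * w)
full_c <- term_labels(y ~ (u + v + w)^3)
setequal(full_a, full_b) && setequal(full_b, full_c)   # TRUE

# Three spellings of the model without the three-way interaction.
two_a <- term_labels(y ~ u + v + w + u:v + u:w + v:w)
two_b <- term_labels(y ~ u * v * w - u:v:w)
two_c <- term_labels(y ~ (u + v + w)^2)
setequal(two_a, two_b) && setequal(two_b, two_c)       # TRUE
"u:v:w" %in% two_b                                     # FALSE
```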
The nature of the variables--binary, categorical (factors), numerical--will
determine the nature of the analysis. For example, if "u" and "v" are factors...
y ~ u + v
...dictates an analysis of variance (without the interaction term). If "u" and
"v" are numerical, the same formula would dictate a multiple regression. If "u"
is numerical and "v" is a factor, then an analysis of covariance is dictated.
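To see how the variable types drive the analysis, the sketch below applies the same formula to three made-up data frames and inspects the design matrix that R builds for each. All data here are simulated for illustration, the data frame names are hypothetical, and the column names shown assume R's default treatment contrasts.

```r
f <- y ~ u + v                  # the same formula throughout

set.seed(42)                    # simulated, purely illustrative data
n    <- 30
y    <- rnorm(n)
grp3 <- factor(rep(c("a", "b", "c"), each = 10))   # three-level factor
grp2 <- factor(rep(c("p", "q"), times = 15))       # two-level factor

d_anova  <- data.frame(y = y, u = grp3,     v = grp2)      # both factors
d_regr   <- data.frame(y = y, u = rnorm(n), v = rnorm(n))  # both numerical
d_ancova <- data.frame(y = y, u = rnorm(n), v = grp2)      # one of each

# Factors are expanded into indicator columns; numerical variables get
# a single slope column each.
colnames(model.matrix(f, data = d_anova))    # "(Intercept)" "ub" "uc" "vq"
colnames(model.matrix(f, data = d_regr))     # "(Intercept)" "u" "v"
colnames(model.matrix(f, data = d_ancova))   # "(Intercept)" "u" "vq"
```

The formula never changes. Only the classes of "u" and "v" in the data decide whether the fitted model is an ANOVA, a multiple regression, or an ANCOVA.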
That ought to do it for now. Specific examples will appear in the tutorials
devoted to specific analyses.
revised 2013 June 22