LOG LINEAR ANALYSIS
Preliminaries
Model Formulae
If you have not yet read the Model Formulae
tutorial, reading it before you proceed might help you with the model formulas
used in this one.
A Little Theory
Most of these tutorials have spared the theory in favor of just showing
"how to do it" in R. My assumption has been that discussion of theory is
either unnecessary or has been covered elsewhere. Log linear analysis didn't
exist (or was in its infancy) when I took my grad courses in stats, and that may
also be true for some of you, so just a little theory might be in order. For a
more detailed discussion, I refer you to the excellent Chapter 17 of Howell
(2007). (Note: This chapter has been removed from the current edition of
Howell's textbook, but it may still be available at
his website.)
We will examine a data set called "Titanic", which is a built-in data set
describing the outcome of the Titanic sinking in 1912. The data object is a
four-dimensional table.
> data(Titanic)
> dimnames(Titanic)
$Class
[1] "1st" "2nd" "3rd" "Crew"
$Sex
[1] "Male" "Female"
$Age
[1] "Child" "Adult"
$Survived
[1] "No" "Yes"
As you can see, there are four categorical variables crosstabulated in this
table. They are "Class" of the passengers (i.e., class of ticket they held)
in the rows of the table (1st, 2nd,
3rd, Crew), "Sex" of the passengers in the columns (Male, Female), "Age" of the
passengers in the layers or strata (Child, Adult), and whether or not the
passengers "Survived" in the 4th dimension of the table (No, Yes). There are...
> margin.table(Titanic)
[1] 2201
...cases in the table. These were the data originally collected by the British
Board of Trade in its inquiry into the sinking. (For the interested, there is
updated and additional data available at the Titanic Encyclopedia
website.)
Log linear analysis allows us to look for relationships among the variables
in a multiway contingency table like this one. Like its older sibling, chi
square analysis for two-way tables, log linear analysis assumes all the
observations represented in the table (not the variables, the individual
observations or subjects) are independent (could be a problem in this data set)
and the cell-by-cell expected frequencies will be sufficiently high, generally
five or more. As long as all EFs are one or more, and no more than 20% of them
are below five, the accuracy of the procedure will not be seriously
jeopardized, but power will take a nose dive! (These are essentially the same
assumptions made by the Pearson chi-square test.)
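For a two-way table, the EFs are easy to inspect, because chisq.test()
calculates them as a byproduct. A quick sketch, using the Sex by Survived
margin we are about to create:
> chisq.test(margin.table(Titanic, c(2,4)))$expected   # output not shown
All four EFs here are in the hundreds, so no worries on that score.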
Let's look at the data collapsed over "Class" and "Age".
> margin.table(Titanic, c(2,4)) # show dimensions 2 and 4, collapse across others
Survived
Sex No Yes
Male 1364 367
Female 126 344
We see that women were much more likely to survive than men. The odds ratio in
favor of survival if you were a woman vs. a man on the Titanic was...
> (344/126) / (367/1364)
[1] 10.14697
...better than 10:1 ("ten to one"). (Odds of a female surviving: 344/126. Odds
of a male surviving: 367/1364. Odds ratio is the ratio of these two figures.)
The relative likelihood (also called the relative risk) of a female vs. a male
surviving was...
> (344/(344+126)) / (367/(367+1364))
[1] 3.452165
...which is just the proportion of females who survived divided by the
proportion of males who survived. So females were almost 3.5 times as likely
to survive. This is the language of log linear analysis. A Pearson chi-square
test on this table would no doubt reveal a highly significant relationship
(called an "interaction" in this context) between these two factors.
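If you'd like to verify that right now, the ordinary Pearson test is a
one-liner (output not shown; the statistic, about 456.87 on 1 df, will
reappear below for comparison):
> chisq.test(margin.table(Titanic, c(2,4)), correct=F)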
Chi square is the basic building block of log linear analysis, but in a
slightly different form called likelihood ratio chi square. The advantage of
this is that likelihood ratio chi squares are additive. That is, the chi
squares derived from simpler effects add together to produce the chi squares
derived from more complex effects (or "models," I should say). R (to my
knowledge) does not include a built-in likelihood ratio chi square test for
two-way tables (but see below), as that is not the usual procedure. There is
nothing stopping us from writing one, however, so here is a script.
### begin copying script here
likelihood.test = function(x) {
    chi.out = chisq.test(x, correct=F)      # do a Pearson chi square test first
    OF = chi.out$observed                   # the observed frequencies
    EF = chi.out$expected                   # the expected frequencies
    LR = 2 * sum(OF * log(OF/EF))           # the likelihood ratio chi square
    df = chi.out$parameter                  # degrees of freedom (same as Pearson's)
    p = 1 - pchisq(LR, df)                  # p-value
    out = c(LR, df, p, chi.out$statistic)   # the output vector
    names(out) = c("LR-chisq","df","p-value","Pears-chisq")
    round(out, 4)                           # done!
}
### end copying script here and paste into R Console
Highlight and copy this script and then paste it into your R Console window at
the prompt. If necessary, hit the Enter key at the end. (It won't hurt if you do
this anyway.) This will create a function in your workspace called likelihood.test(). It's used this way.
> sex.survived = margin.table(Titanic, c(2,4)) # create the contingency table
> likelihood.test(sex.survived)
LR-chisq df p-value Pears-chisq
434.4688 1.0000 0.0000 456.8742
The output consists of the likelihood ratio chi-square statistic, the degrees of
freedom associated with it (same as for Pearson's), the p-value, and for
comparison the Pearson's chi-square value. (Pardon the formatting of the
output, but it's a quick and dirty script, and we won't need it for long.)
We are attempting to explain (model) the frequencies in the contingency
table. In a traditional chi square test (of either type), our model (null
hypothesis) consists of
an effect for each factor but no interaction between them. If the test allows us
to reject the null hypothesis, then we conclude that this model is not adequate to
explain the cell frequencies, and that to do so we must assume the factors are
interacting (are not independent or are related in the traditional language of
chi square).
In a traditional two-way chi square test, a simpler model would usually not
make sense, but just to give you an idea, suppose we proposed the following
model (i.e., null hypothesis): all the cell frequencies are equal. That is to
say, there is not only NOT an interaction between the two variables, but there
are also no effects of either factor. There is nothing to stop us from testing
this null hypothesis using a traditional chi square test, and we would do it by
setting all the EFs to 2201/4=550.25. (Such a test would cost us only 1 df, by
the way, since our only restriction on the data is that the EFs all add to N.
So on a 2 × 2 contingency table, the test would have df=3.) In
the language of log linear analysis, this all-frequencies-equal model is called
the "null model."
At the other extreme, we could posit a model that includes not only simple
effects for each factor but also all possible interactions between them. Such a
model would explain (predict, model) the cell frequencies perfectly, would
result in a chi square statistic of zero on zero degrees of freedom, and would
have no explanatory value at all. Such a model, in the language of log linear
analysis, is called the "saturated model."
End of theory. If you want a deeper explanation than this fairly superficial
one, see Howell (2007).
Log Linear Analysis
Our goal is to explain the observed frequencies in the "Titanic" table with
as simple a model as possible. Here is what we have to work with.
EFFECTS | WHAT THEY MEAN
Class | there are more passengers in some classes than in others
Sex | there are more passengers of one sex than of the other
Age | there are more passengers of one age group than of the other
Survived | more passengers died than survived (or vice versa)
Class × Sex | Class and Sex are not independent
Class × Age | Class and Age are not independent
Class × Survived | Class and Survived are not independent
Sex × Age | Sex and Age are not independent
Sex × Survived | Sex and Survived are not independent
Age × Survived | Age and Survived are not independent
Class × Sex × Age | there is a three-way interaction between these factors
Class × Sex × Survived | there is a three-way interaction between these factors
Class × Age × Survived | there is a three-way interaction between these factors
Sex × Age × Survived | there is a three-way interaction between these factors
Class × Sex × Age × Survived | there is a four-way interaction between these factors
Notice we are not treating any of the factors differently, i.e., as response
vs. explanatory variables. We could. We're just not, at this time. It's
crucial that you keep this in mind. There is a natural tendency here to see
"Survived" as a response variable that we are trying to explain. We are NOT (at
this time). We are trying to model (explain, predict, account for) the
frequencies in the contingency table.
Begin like this.
> summary(Titanic)
Number of cases in table: 2201
Number of factors: 4
Test for independence of all factors:
Chisq = 1637.4, df = 25, p-value = 0
Chi-squared approximation may be incorrect
You have just done a log linear analysis on the model that includes only the
simple effects, i.e., the effects of the factors without any interactions
between them, the first four lines of the above table. This model (null
hypothesis) is rejected. Somewhere in these data, there are interactions we must
find to explain the cell frequencies adequately. (The warning in the last line
tells us we have EFs below 5. This is the only time you'll see this warning, so
it's always a good idea to start off this way, in my opinion.)
Several functions in R can be used to fit log linear models. They include
loglin() in the "stats" library (loaded by default),
loglm() in the "MASS" library, and
glm() in the "stats" library. Let's experiment a
little.
> library("MASS") # load MASS so we can use loglm()
> loglm( ~ Sex + Survived, data=sex.survived)
Call:
loglm(formula = ~ Sex + Survived, data = sex.survived)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 434.4688 1 0
Pearson 456.8742 1 0
We are looking at the 2D table "Sex" by "Survived" that we created above, we
are putting in only
the simple effects of individual factors, and we are getting out the same
result as we got when we used my
likelihood.test()
function. In other words, we've used log linear modeling to do a chi-square
test of independence on this two-way contingency table. The model with just the
simple effects is inadequate (null hypothesis rejected, p virtually 0). There
must be an interaction between these two variables (we would propose).
Now let's do the same for the full data table.
> loglm(~ Class + Sex + Age + Survived, data=Titanic)
Call:
loglm(formula = ~Class + Sex + Age + Survived, data = Titanic)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 1243.663 25 0
Pearson 1637.445 25 0
The result is the same one we got with the
summary(Titanic) command.
Essentially, we've done a four-way chi-square test of independence, and we
reject the null hypothesis of independence. Somewhere in there are factors that
are interacting with each other to produce the observed cell frequencies. The
saturated model, on the other hand, produces nothing useful, as it always
explains (predicts, models) the observed frequencies completely.
> loglm(~ Class * Sex * Age * Survived, data=Titanic)
Call:
loglm(formula = ~Class * Sex * Age * Survived, data = Titanic)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 0 0 1
Pearson NaN 0 1
Null hypothesis (i.e., the saturated model) retained (p virtually 1). The
"truth" (we hope) lies somewhere between these two extremes.
Let's try it without the four-way interaction.
> loglm(~ Class * Sex * Age * Survived - Class:Sex:Age:Survived, data=Titanic)
Call:
loglm(formula = ~Class * Sex * Age * Survived - Class:Sex:Age:Survived,
data = Titanic)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 0.0002728865 3 0.9999988
Pearson NaN 3 NaN
The likelihood ratio chi-square test (the Pearson test failed) shows this
model predicts cell frequencies that are not significantly different from those
actually observed. The four-way interaction is expendable in our model, because
when we "expended" it, we could not reject the model that doesn't have it. It does
not help us understand (model) the observed frequencies.
On the other hand...
> loglm(~ Class + Sex + Age + Survived + Age:Survived, data=Titanic)
Call:
loglm(formula = ~Class + Sex + Age + Survived + Age:Survived,
data = Titanic)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 1224.103 24 0
Pearson 1596.846 24 0
...the model with all factors plus the Age × Survived
interaction DOES predict frequencies (EFs) that are significantly different from
those actually observed. We need to add more interaction terms to get at the
"real truth" of the matter (to create an adequate statistical model of what
happened on the Titanic).
Testing a Specific Hypothesis
Suppose we began with the hypothesis that gender was related to survival on
the Titanic. From the two-way (bivariate) chi-square test we did above, this
appears to be true. However, "Sex" is also related to "Class" (p virtually zero
by chi-square test) and "Age" (p virtually zero), and both "Class" and "Age" are
significantly related to "Survived". So "Class" and "Age" might be confounds in
the relationship between "Sex" and "Survived". Like a multiple regression for
numeric variables, log linear analysis will allow us to tease apart these
effects.
If we remove all the interaction terms that involve both "Sex" and
"Survived", and the model still fits the observed frequencies adequately, then
we can conclude that gender and survival were unrelated.
> sat.model = loglm(~ Class * Sex * Age * Survived, data=Titanic)
> sat.model
Call:
loglm(formula = ~Class * Sex * Age * Survived, data = Titanic)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 0 0 1
Pearson NaN 0 1
> ### already knew this...
> model2 = update(sat.model, ~.-(Class:Sex:Age:Survived+Sex:Age:Survived+
+ Class:Sex:Survived+Sex:Survived))
> model2
Call:
loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age +
Sex:Age + Class:Survived + Age:Survived + Class:Sex:Age +
Class:Age:Survived, data = Titanic)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 436.2715 8 0
Pearson NaN 8 NaN
The model does not adequately fit the observed frequencies unless we include
some sort of "Sex" by "Survived" interaction. Notice the use of update() here to
get the new model. We could just have retyped the model formula with the
factors we wanted to include into a new loglm()
analysis, but update() is often a bit more
convenient. The syntax reads "update
sat.model by using all the same predictors (dot means "same") except subtract out
(minus) the following terms." Let's do one with only the two-way
interactions involving both "Sex" and "Survived".
> model3 = update(sat.model, ~.-(Class:Age:Sex:Survived+Sex:Age:Survived+
+ Class:Sex:Survived))
> model3
Call:
loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age +
Sex:Age + Class:Survived + Sex:Survived + Age:Survived +
Class:Sex:Age + Class:Age:Survived, data = Titanic)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 76.90406 7 5.884182e-14
Pearson NaN 7 NaN
Nope! This model is also rejected. Apparently, the "Sex" by "Survived"
interaction is conditioned on (dependent upon, changes form with) another one
of the factors. I.e., there is an important three-way interaction involving
"Sex", "Survived", and one of the other variables.
As you can well imagine, this could be a tedious and error-prone way to
construct a log linear model. R provides a way to automate model building.
The step() Function
Although I am generally hostile to computer-automated statistical model
building, in this circumstance I am willing to make an exception.
> step(sat.model, direction="backward")
Start: AIC=64
~Class * Sex * Age * Survived
Df AIC
- Class:Sex:Age:Survived 3 58
64
Step: AIC=58
~Class + Sex + Age + Survived + Class:Sex + Class:Age + Sex:Age +
Class:Survived + Sex:Survived + Age:Survived + Class:Sex:Age +
Class:Sex:Survived + Class:Age:Survived + Sex:Age:Survived
Df AIC
- Sex:Age:Survived 1 57.685
58.000
- Class:Sex:Age 3 61.783
- Class:Age:Survived 3 89.263
- Class:Sex:Survived 3 117.013
Step: AIC=57.69
~Class + Sex + Age + Survived + Class:Sex + Class:Age + Sex:Age +
Class:Survived + Sex:Survived + Age:Survived + Class:Sex:Age +
Class:Sex:Survived + Class:Age:Survived
Df AIC
57.685
- Class:Sex:Age 3 71.953
- Class:Age:Survived 3 95.899
- Class:Sex:Survived 3 126.904
Call:
loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age +
Sex:Age + Class:Survived + Sex:Survived + Age:Survived +
Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived,
data = Titanic, evaluate = FALSE)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 1.685479 4 0.7933536
Pearson NaN 4 NaN
The procedure here is to build the saturated model ("sat.model") and then
subject it to what might be loosely called "backward regression" using the
step() function with "direction=" set to
"backward" (other possibilites
are "forward" and "both"). This will simplify the model using a criterion called
Akaike's Information Criterion (AIC). Smaller values of AIC are better. In this
case, R has arrived at a model in which only two interactions have been
removed, the four-way and Sex:Age:Survived. It appears the relationship between
"Sex" and "Survived" is conditioned on "Class". Let's check this out...
> margin.table(Titanic, c(2,4,1))
, , Class = 1st
Survived
Sex No Yes
Male 118 62
Female 4 141
, , Class = 2nd
Survived
Sex No Yes
Male 154 25
Female 13 93
, , Class = 3rd
Survived
Sex No Yes
Male 422 88
Female 106 90
, , Class = Crew
Survived
Sex No Yes
Male 670 192
Female 3 20
The odds ratios are easily enough calculated by hand from these tables.
> ### odds ratio 1st class
> (141/4) / (62/118)
[1] 67.08871
> ### odds ratio 2nd class
> (93/13) / (25/154)
[1] 44.06769
> ### odds ratio 3rd class
> (90/106) / (88/422)
[1] 4.071612
> ### odds ratio crew
> (20/3) / (192/670)
[1] 23.26389
In all classes the odds of a female surviving were better than the odds of a
male surviving, but this was especially true in first class, and only
somewhat true in third class.
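If you'd rather let R do that arithmetic, an apply() over the Class dimension
should reproduce all four odds ratios at once. A sketch (the anonymous
function assumes the layout printed above, with males in row 1 and "No" in
column 1 of each subtable; sex.surv.class is just a name I made up):
> sex.surv.class = margin.table(Titanic, c(2,4,1))   # Sex x Survived x Class
> apply(sex.surv.class, 3, function(m) (m[2,2]/m[2,1]) / (m[1,2]/m[1,1]))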
Getting More Information From The Model
Let's say we want to go with the model produced by
step() as our
model of what happened on the Titanic. First, I'll store the model in a data
object (and I will copy and paste the command to produce the model from the
output of step(), because it's unlikely I'd
type all that correctly).
> loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age +
+ Sex:Age + Class:Survived + Sex:Survived + Age:Survived +
+ Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived,
+ data = Titanic) -> step.model
Following are some of the extractor functions available for getting information
from the model object (see the help page for a complete list).
> print(step.model) # same as just typing step.model
Call:
loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age +
Sex:Age + Class:Survived + Sex:Survived + Age:Survived +
Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived,
data = Titanic)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 1.685479 4 0.7933536
Pearson NaN 4 NaN
> fitted(step.model) # the EFs produced by the model
Re-fitting to get fitted values
, , Age = Child, Survived = No
Sex
Class Male Female
1st 0.00000 0.00000
2nd 0.00000 0.00000
3rd 37.43281 14.56719
Crew 0.00000 0.00000
, , Age = Adult, Survived = No
Sex
Class Male Female
1st 118.0000 4.0000
2nd 154.0000 13.0000
3rd 384.5672 91.4328
Crew 670.0000 3.0000
, , Age = Child, Survived = Yes
Sex
Class Male Female
1st 5.00000 1.00000
2nd 10.98493 13.01507
3rd 10.56718 16.43282
Crew 0.00000 0.00000
, , Age = Adult, Survived = Yes
Sex
Class Male Female
1st 57.00000 140.00000
2nd 14.02291 79.97709
3rd 77.43281 73.56719
Crew 192.00000 20.00000
> resid(step.model) # the standardized residuals
Re-fitting to get frequencies and fitted values
, , Age = Child, Survived = No
Sex
Class Male Female
1st 0.000000e+00 0.000000e+00
2nd 0.000000e+00 0.000000e+00
3rd -4.020602e-01 6.208006e-01
Crew 0.000000e+00 0.000000e+00
, , Age = Adult, Survived = No
Sex
Class Male Female
1st 0.000000e+00 0.000000e+00
2nd 0.000000e+00 0.000000e+00
3rd 1.239264e-01 -2.555637e-01
Crew 0.000000e+00 0.000000e+00
, , Age = Child, Survived = Yes
Sex
Class Male Female
1st 2.142148e-08 -4.552313e-08
2nd 4.546268e-03 -4.178434e-03
3rd 7.221252e-01 -6.159455e-01
Crew 0.000000e+00 0.000000e+00
, , Age = Adult, Survived = Yes
Sex
Class Male Female
1st -6.825888e-08 6.182584e-08
2nd -6.118921e-03 2.561368e-03
3rd -2.779358e-01 2.820973e-01
Crew 0.000000e+00 0.000000e+00
Ordinarily, you can also get regression coefficients with coef(step.model),
but that failed in this case, possibly because some of the fitted frequencies
are zero, and the log linear parameters involve logs of the EFs. In any event,
I don't see much point in getting the coefficients, as they are only useful
for calculating EFs, which we already have. After all, it's not like we are
going to come up with a new level of "Survived" for which we need to make
predictions!
It's worth noting that our EFs contain several zeros. Our model is in danger
of being inaccurate for this reason.
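A quick way to count the zero cells (fitted() will print its re-fitting
message again):
> sum(fitted(step.model) == 0)   # 8 zero EFs, counting from the table above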
Using glm() For Log Linear Modeling
We can also do log linear analysis using the
glm() function, which is
the function used for generalized linear models. For this we need a data frame
rather than a contingency table, but this is easy enough to get.
> ti = as.data.frame(Titanic)
> ti # it's a lot easier to look at the data this way too!
Class Sex Age Survived Freq
1 1st Male Child No 0
2 2nd Male Child No 0
3 3rd Male Child No 35
4 Crew Male Child No 0
5 1st Female Child No 0
6 2nd Female Child No 0
7 3rd Female Child No 17
8 Crew Female Child No 0
9 1st Male Adult No 118
10 2nd Male Adult No 154
11 3rd Male Adult No 387
12 Crew Male Adult No 670
13 1st Female Adult No 4
14 2nd Female Adult No 13
15 3rd Female Adult No 89
16 Crew Female Adult No 3
17 1st Male Child Yes 5
18 2nd Male Child Yes 11
19 3rd Male Child Yes 13
20 Crew Male Child Yes 0
21 1st Female Child Yes 1
22 2nd Female Child Yes 13
23 3rd Female Child Yes 14
24 Crew Female Child Yes 0
25 1st Male Adult Yes 57
26 2nd Male Adult Yes 14
27 3rd Male Adult Yes 75
28 Crew Male Adult Yes 192
29 1st Female Adult Yes 140
30 2nd Female Adult Yes 80
31 3rd Female Adult Yes 76
32 Crew Female Adult Yes 20
The log linear analysis is then produced this way.
> glm.model = glm(Freq ~ Class * Age * Sex * Survived, data=ti, family=poisson)
This is the saturated model. The extractor function
summary(glm.model) would
produce a wealth of puzzling output, but the extractor function I like
best is...
> anova(glm.model, test="Chisq")
Analysis of Deviance Table
Model: poisson, link: log
Response: Freq
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL 31 4953.1
Class 3 475.8 28 4477.3 8.326e-103
Age 1 2183.6 27 2293.8 0.0
Sex 1 768.3 26 1525.4 4.169e-169
Survived 1 281.8 25 1243.7 3.078e-63
Class:Age 3 148.3 22 1095.3 6.048e-32
Class:Sex 3 412.6 19 682.7 4.126e-89
Age:Sex 1 6.1 18 676.6 1.363e-02
Class:Survived 3 180.9 15 495.7 5.634e-39
Age:Survived 1 25.6 14 470.2 4.237e-07
Sex:Survived 1 353.6 13 116.6 7.053e-79
Class:Age:Sex 3 4.0 10 112.6 0.3
Class:Age:Survived 3 35.7 7 76.9 8.825e-08
Class:Sex:Survived 3 75.2 4 1.7 3.253e-16
Age:Sex:Survived 1 1.7 3 4.237e-10 0.2
Class:Age:Sex:Survived 3 0.0 0 4.463e-10 1.0
Deviance is a fancy term for the likelihood ratio chi-square value. The null
model (all frequencies equal) is at the top. Clearly, the null model is not
sufficient to describe this data set, as adding in subsequent terms produces
significant reductions in deviance. At each step down the table, another
predictor is added in, and this removes the amount of Deviance shown in that
column, leaving the amount in the Resid. Dev. column. The p-value
in the last column is based on the change in Deviance (column two) and the
change in Df (column one) produced by adding that term into the model. Here we
are looking for a significant reduction in deviance, indicating the new
term helped in predicting cell (observed) frequencies.
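One caution: because the terms enter sequentially, the deviance credited to
each term depends on where it sits in the formula. To test each term as though
it were entered last, drop1() can be used instead, though by marginality it
will only offer to drop the highest-order terms currently in the model:
> drop1(glm.model, test="Chisq")   # output not shown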
This output suggests we might want to investigate three terms for possible
elimination: the four-way interaction and two three-way interactions. Recall
that step() retained the Class:Age:Sex interaction.
> anova(update(glm.model,.~.-(Class:Age:Sex:Survived+Age:Sex:Survived
+ +Class:Age:Sex)),test="Chisq")
Analysis of Deviance Table
Model: poisson, link: log
Response: Freq
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL 31 4953.1
Class 3 475.8 28 4477.3 8.326e-103
Age 1 2183.6 27 2293.8 0.0
Sex 1 768.3 26 1525.4 4.169e-169
Survived 1 281.8 25 1243.7 3.078e-63
Class:Age 3 148.3 22 1095.3 6.048e-32
Class:Sex 3 412.6 19 682.7 4.126e-89
Age:Sex 1 6.1 18 676.6 1.363e-02
Class:Survived 3 180.9 15 495.7 5.634e-39
Age:Survived 1 25.6 14 470.2 4.237e-07
Sex:Survived 1 353.6 13 116.6 7.053e-79
Class:Age:Survived 3 29.2 10 87.4 2.024e-06
Class:Sex:Survived 3 65.4 7 22.0 4.066e-14
> 1 - pchisq(22,df=7)
[1] 0.002540414
With this model we end up with a residual deviance of 22 on 7 degrees of
freedom. This is not an adequate model of the observed frequencies (p = .0025).
One or more of those deleted terms should be put back.
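Putting Class:Age:Sex back in (the term step() retained) should repair the
fit. Reading from the first deviance table above, that model leaves a residual
deviance of about 1.7 on 4 df, in agreement with the step() model. A sketch
(glm.model2 is just a name I made up):
> glm.model2 = update(glm.model, .~.-(Class:Age:Sex:Survived+Age:Sex:Survived))
> 1 - pchisq(deviance(glm.model2), df.residual(glm.model2))   # p approx .79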
Other extractor functions for the glm model can be seen on the help page for
the glm() function.
Visualizing the Analysis
One of the better ways of trying to visualize the results of a log linear
analysis is to use a mosaic plot.
> mosaicplot(Titanic, shade=T) # output not shown
I have not shown the output, as it does not display well for a table this
complex except on a large computer screen. The space I have available here to
reproduce the plot would not do it justice. When "shade=" is set to TRUE, an
extended mosaic plot is produced, which allows you to visualize the
standardized residuals of a log linear analysis on the table (multi-way test of
independence, I assume), both by color and by outline of the blocks on the
plot. Negative residuals (OFs too low) have dashed borders and are shaded red.
Positive residuals (OFs too high) have solid borders and are shaded blue. A key
on the right side shows the size of the residuals by color shading.
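The plot is much more legible for a margin of the table. If you want something
that fits comfortably on a normal screen, a two-way version is a reasonable
sketch:
> mosaicplot(margin.table(Titanic, c(2,4)), shade=T)   # output not shown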
Final Note of Caution
Log linear models are calculated using an iterative algorithm. The method
might be considered "computer intensive" in this respect. Different software
packages may program the algorithm differently, or even use a different
algorithm. Therefore, results from different software packages may not match up
exactly. Hopefully, they will be close!
revised 2016 February 10