R Tutorials--Log Linear Analysis

LOG LINEAR ANALYSIS

Preliminaries

Model Formulae

If you have not yet read the Model Formulae tutorial, reading it before you proceed might help you with parts of this one, model-formula-wise that is.

A Little Theory

Most of these tutorials have spared the theory in favor of just showing "how to do it" in R. My assumption has been that discussion of theory is either unnecessary or has been covered elsewhere. Log linear analysis didn't exist (or was in its infancy) when I took my grad courses in stats, and that may also be true for some of you, so just a little theory might be in order. For a more detailed discussion, I refer you to the excellent Chapter 17 of Howell (2007). (Note: This chapter has been removed from the current edition of Howell's textbook, but it may still be available at his website.)

We will examine a data set called "Titanic", which is a built-in data set describing the outcome of the Titanic sinking in 1912. The data object is a four-dimensional table.

> data(Titanic)
> dimnames(Titanic)
$Class
[1] "1st"  "2nd"  "3rd"  "Crew"

$Sex
[1] "Male"   "Female"

$Age
[1] "Child" "Adult"

$Survived
[1] "No"  "Yes"

As you can see, there are four categorical variables crosstabulated in this table. They are "Class" of the passengers (i.e., class of ticket they held) in the rows of the table (1st, 2nd, 3rd, Crew), "Sex" of the passengers in the columns (Male, Female), "Age" of the passengers in the layers or strata (Child, Adult), and whether or not the passengers "Survived" in the 4th dimension of the table (No, Yes). There are...

> margin.table(Titanic)
[1] 2201

...cases in the table. These were the data originally collected by the British Board of Trade in its inquiry into the sinking. (For the interested, there is updated and additional data available at the Titanic Encyclopedia website.)

Log linear analysis allows us to look for relationships among the variables in a multiway contingency table like this one. Like it's older sibling, chi square analysis for two-way tables, log linear analysis assumes all the observations represented in the table (not the variables, the individual observations or subjects) are independent (could be a problem in this data set) and the cell-by-cell expected frequencies will be sufficiently high, generally five or more. As long as all EFs are one or more, and no more than 20% of them are below five, the accuracy of the procedure will not be seriously jeopardized, but power will take a nose dive! (These are essentially the same assumptions made by the Pearson chi-square test.)

Let's look at the data collapsed over "Class" and "Age".

> margin.table(Titanic, c(2,4))        # show dimensions 2 and 4, collapse across others
        Survived
Sex        No  Yes
  Male   1364  367
  Female  126  344

We see that women were much more likely to survive than men. The odds ratio in favor of survival if you were a woman vs. a man on the Titanic was...

> (344/126) / (367/1364)
[1] 10.14697

...better than 10:1 ("ten to one"). (Odds of a female surviving: 344/126. Odds of a male surviving: 367/1364. Odds ratio is the ratio of these two figures.) The relative likelihood (or likelihood ratio) of a female vs. a male surviving was...

> (344/(344+126)) / (367/(367+1364))
[1] 3.452165

...which is just the proportion of females who survived divided by the proportion of males who survived. So females were almost 3.5 times as likely to survive. This is the language of log linear analysis. A Pearson chi-square test on this table would no doubt reveal a highly significant relationship (called an "interaction" in this context) between these two factors.

Chi square is the basic building block of log linear analysis, but in a slightly different form called likelihood ratio chi square. The advantage of this is that likelihood ratio chi squares are additive. That is, the chi squares derived from simplier effects add together to produce the chi squares derived from more complex effects (or "models," I should say). R (to my knowledge) does not incorporate a likelihood ratio chi square test for two-way tables (but see below), as it is not the normal procedure, but there is nothing stopping us from using a likelihood ratio test on two-way tables, so here is a script.

### begin copying script here
likelihood.test = function(x) {
nrows = dim(x)[1]                      # no. of rows in contingency table
ncols = dim(x)[2]                      # no. of cols in contingency table
chi.out = chisq.test(x,correct=F)      # do a Pearson chi square test
table = chi.out[[6]]                   # get the OFs
ratios = chi.out[[6]]/chi.out[[7]]     # calculate OF/EF ratios
sum = 0                                # storage for the test statistic
for (i in 1:nrows) {
     for (j in 1:ncols) {
          sum = sum + table[i,j]*log(ratios[i,j])
     }
}
sum = 2 * sum                          # the likelihood ratio chi square
df = chi.out[[2]]                      # degrees of freedom
p = 1 - pchisq(sum,df)                 # p-value
out = c(sum, df, p, chi.out[[1]])      # the output vector
names(out) = c("LR-chisq","df","p-value","Pears-chisq")
round(out,4)                           # done!
}
### end copying script here and paste into R Console

Highlight and copy this script and then paste it into your R Console window at the prompt. If necessary, hit the Enter key at the end. (It won't hurt if you do this anyway.) This will create a function in your workspace called likelihood.test(). It's used this way.

> sex.survived = margin.table(Titanic, c(2,4))   # create the contingency table
> likelihood.test(sex.survived)
   LR-chisq          df     p-value Pears-chisq 
   434.4688      1.0000      0.0000    456.8742

The output consists of the likelihood ratio chi-square statistic, the degrees of freedom associated with it (same as for Pearson's), the p-value, and for comparison the Pearson's chi-square value. (Pardon the formatting of the output, but it's a quick and dirty script, and we won't need it for long.)

We are attempting to explain (model) the frequencies in the contingency table. In a traditional chi square test (of either type), our model (null hypothesis) consists of an effect for each factor but no interaction between them. If the test allows us to reject the null hypothesis, then we conclude that this model is not adequate to explain the cell frequencies, and that to do so we must assume the factors are interacting (are not independent or are related in the traditional language of chi square).

In a traditional two-way chi square test, a simpler model would usually not make sense, but just to give you an idea, suppose we proposed the following model (i.e., null hypothesis): all the cell frequencies are equal. That is to say, there is not only NOT an interaction between the two variables, but there are also no effects of either factor. There is nothing to stop us from testing this null hypothesis using a traditional chi square test, and we would do it by setting all the EFs to 2201/4=550.25. (Such a test would cost us only 1 df, by the way, since our only restriction on the data is that the EFs all add to N. So on a 2 × 2 contingency table, the test would have df=3.) In the language of log linear analysis, this all-frequencies-equal model is called the "null model."

At the other extreme, we could posit a model that includes not only simple effects for each factor but also all possible interactions between them. Such a model would explain (predict, model) the cell frequencies perfectly, would result in a chi square statistic of zero on zero degrees of freedom, and would have no explanatory value at all. Such a model, in the language of log linear analysis, is called the "saturated model."

End of theory. If you want a deeper explanation than this fairly superficial one, see Howell (2007).

Log Linear Analysis

Our goal is to explain the observed frequencies in the "Titanic" table with as simple a model as possible. Here is what we have to work with.

EFFECTS	WHAT THEY MEAN
Class	there are more passengers in some classes than in others
Sex	there are more passengers of one sex than of the other
Age	there are more passengers of one age group than of the other
Survived	more passengers lived or died than the alternative
Class × Sex	Class and Sex are not independent
Class × Age	Class and Age are not independent
Class × Survived	Class and Survived are not independent
Sex × Age	Sex and Age are not independent
Sex × Survived	Sex and Survived are not independent
Age × Survived	Age and Survived are not independent
Class × Sex × Age	there is a three-way interaction between these factors
Class × Sex × Survived	there is a three-way interaction between these factors
Class × Age × Survived	there is a three-way interaction between these factors
Sex × Age × Survived	there is a three-way interaction between these factors
Class × Sex × Age × Survived	there is a four-way interaction between these factors

Notice we are not treating any of the factors differently, i.e., as response vs. explanatory variables. We could. We're just not, at this time. It's crucial that you keep this in mind. There is a natural tendency here to see "Survived" as a response variable that we are trying to explain. We are NOT (at this time). We are trying to model (explain, predict, account for) the frequencies in the contingency table.

Begin like this.

> summary(Titanic)
Number of cases in table: 2201 
Number of factors: 4 
Test for independence of all factors:
        Chisq = 1637.4, df = 25, p-value = 0
        Chi-squared approximation may be incorrect

You have just done a log linear analysis on the model that includes only the simple effects, i.e., the effects of the factors without any interactions between them, the first four lines of the above table. This model (null hypothesis) is rejected. Somewhere in these data, there are interactions we must find to explain the cell frequencies adequately. (The warning in the last line tells us we have EFs below 5. This is the only time you'll see this warning, so it's always a good idea to start off this way, in my opinion.)

Several functions in R can be used to fit log linear models. They include loglin() in the "stats" library (loaded by default), loglm() in the "MASS" library, and glm() in the "stats" library. Let's experiment a little.

> library("MASS")                      # load MASS so we can use loglm()
> loglm( ~ Sex + Survived, data=sex.survived)
Call:
loglm(formula = ~ Sex + Survived, data = sex.survived)

Statistics:
                      X^2 df P(> X^2)
Likelihood Ratio 434.4688  1        0
Pearson          456.8742  1        0

We are looking at the 2D table "Sex" by "Survived" that we created above, we are putting in only the simple effects of individual factors, and we are getting out the same result as we got when we used my likelihood.test() function. In other words, we've used log linear modeling to do a chi-square test of independence on this two-way contingency table. The model with just the simple effects is inadequate (null hypotheis rejected, p virtually 0). There must be an interaction between these two variables (we would propose).

Now let's do the same for the full data table.

> loglm(~ Class + Sex + Age + Survived, data=Titanic)
Call:
loglm(formula = ~Class + Sex + Age + Survived, data = Titanic)

Statistics:
                      X^2 df P(> X^2)
Likelihood Ratio 1243.663 25        0
Pearson          1637.445 25        0

The result is the same one we got with the summary(Titanic) command. Essentially, we've done a four-way chi-square test of independence, and we reject the null hypothesis of independence. Somewhere in there are factors that are interacting with each other to produce the observed cell frequencies. The saturated model, on the other hand, produces nothing useful, as it always explains (predicts, models) the observed frequencies completely.

> loglm(~ Class * Sex * Age * Survived, data=Titanic)
Call:
loglm(formula = ~Class * Sex * Age * Survived, data = Titanic)

Statistics:
                 X^2 df P(> X^2)
Likelihood Ratio   0  0        1
Pearson          NaN  0        1

Null hypothesis (i.e., the saturated model) retained (p virtually 1). The "truth" (we hope) lies somewhere between these two extremes.

Let's try it without the fourway interaction.

> loglm(~ Class * Sex * Age * Survived - Class:Sex:Age:Survived, data=Titanic)
Call:
loglm(formula = ~Class * Sex * Age * Survived - Class:Sex:Age:Survived, 
    data = Titanic)

Statistics:
                          X^2 df  P(> X^2)
Likelihood Ratio 0.0002728865  3 0.9999988
Pearson                   NaN  3       NaN

The likelihood ratio chi-square test (the Pearson test failed) shows this model predicts cell frequencies that are not significantly different from those actually observed. The fourway interaction is expendable in our model, because when we "expended" it, we could not reject the model that doesn't have it. It does not help us understand (model) the observed frequencies.

On the other hand...

> loglm(~ Class + Sex + Age + Survived + Age:Survived, data=Titanic)
Call:
loglm(formula = ~Class + Sex + Age + Survived + Age:Survived, 
    data = Titanic)

Statistics:
                      X^2 df P(> X^2)
Likelihood Ratio 1224.103 24        0
Pearson          1596.846 24        0

...the model with all factors plus the Age × Survived interaction DOES predict frequencies (EFs) that are significantly different from those actually observed. We need to add more interaction terms to get at the "real truth" of the matter (to create an adequate statistical model of what happened on the Titanic).

Testing a Specific Hypothesis

Suppose we began with the hypothesis that gender was related to survival on the Titanic. From the two-way (bivariate) chi-square test we did above, this appears to be true. However, "Sex" is also related to "Class" (p virtually zero by chi-square test) and "Age" (p virtually zero), and both "Class" and "Age" are significantly related to "Survived". So "Class" and "Age" might be confounds in the relationship between "Sex" and "Survived". Like a multiple regression for numeric variables, log linear analysis will allow us to tease apart these effects.

If we remove all the interaction terms that involve both "Sex" and "Survived", and the model still fits the observed frequencies adequately, then we can conclude that gender and survival were unrelated.

> sat.model = loglm(~ Class * Sex * Age * Survived, data=Titanic)
> sat.model
Call:
loglm(formula = ~Class * Sex * Age * Survived, data = Titanic)

Statistics:
                 X^2 df P(> X^2)
Likelihood Ratio   0  0        1
Pearson          NaN  0        1
> ### already knew this...
> model2 = update(sat.model, ~.-(Class:Sex:Age:Survived+Sex:Age:Survived+
+                                Class:Sex:Survived+Sex:Survived))
> model2
Call:
loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age + 
    Sex:Age + Class:Survived + Age:Survived + Class:Sex:Age + 
    Class:Age:Survived, data = Titanic)

Statistics:
                      X^2 df P(> X^2)
Likelihood Ratio 436.2715  8        0
Pearson               NaN  8      NaN

The model does not adequately fit the observed frequencies unless we include some sort of "Sex" by "Survived" interaction. Notice the use of update here to get the new model. We could just have retyped the model formula with the factors we wanted to include into a new loglm() analysis, but update() is often a bit more convenient. The syntax reads "update sat.model by using all the same predictors (dot means "same") except subtract out (minus) the following terms." Let's do one with only the two-way interactions involving both "Sex" and "Survived".

> model3 = update(sat.model, ~.-(Class:Age:Sex:Survived+Sex:Age:Survived+
+                                Class:Sex:Survived))
> model3
Call:
loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age + 
    Sex:Age + Class:Survived + Sex:Survived + Age:Survived + 
    Class:Sex:Age + Class:Age:Survived, data = Titanic)

Statistics:
                      X^2 df     P(> X^2)
Likelihood Ratio 76.90406  7 5.884182e-14
Pearson               NaN  7          NaN

Nope! This model is also rejected. Apparently, the "Sex" by "Survived" interaction is conditioned on (dependent upon, changes form with) another one of the factors. I.e., there is an important three-way interaction involving "Sex", "Survived", and one of the other variables.

As you can well imagine, this could be a tedious and error prone way to construct a log linear model. R provides a way to automate model building.

The step() Function

Although I am generally hostile to computer-automated statistical model building, in this circumstance I am willing to make an exception.

> step(sat.model, direction="backward")
Start:  AIC=64
~Class * Sex * Age * Survived

                         Df AIC
- Class:Sex:Age:Survived  3  58
                       64

Step:  AIC=58
~Class + Sex + Age + Survived + Class:Sex + Class:Age + Sex:Age + 
    Class:Survived + Sex:Survived + Age:Survived + Class:Sex:Age + 
    Class:Sex:Survived + Class:Age:Survived + Sex:Age:Survived

                     Df     AIC
- Sex:Age:Survived    1  57.685
                   58.000
- Class:Sex:Age       3  61.783
- Class:Age:Survived  3  89.263
- Class:Sex:Survived  3 117.013

Step:  AIC=57.69
~Class + Sex + Age + Survived + Class:Sex + Class:Age + Sex:Age + 
    Class:Survived + Sex:Survived + Age:Survived + Class:Sex:Age + 
    Class:Sex:Survived + Class:Age:Survived

                     Df     AIC
                   57.685
- Class:Sex:Age       3  71.953
- Class:Age:Survived  3  95.899
- Class:Sex:Survived  3 126.904
Call:
loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age + 
    Sex:Age + Class:Survived + Sex:Survived + Age:Survived + 
    Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived, 
    data = Titanic, evaluate = FALSE)

Statistics:
                      X^2 df  P(> X^2)
Likelihood Ratio 1.685479  4 0.7933536
Pearson               NaN  4       NaN

The procedure here is to build the saturated model ("sat.model") and then subject it to what might be loosely called "backward regression" using the step() function with "direction=" set to "backward" (other possibilites are "forward" and "both"). This will simplify the model using a criterion called Akaike's Information Criterion (AIC). Smaller values of AIC are better. In this case, R has arrived at a model in which only two interactions have been removed, the four-way and Sex:Age:Survived. It appears the relationship between "Sex" and "Survived" is conditioned on "Class". Let's check this out...

> margin.table(Titanic, c(2,4,1))
, , Class = 1st

        Survived
Sex       No Yes
  Male   118  62
  Female   4 141

, , Class = 2nd

        Survived
Sex       No Yes
  Male   154  25
  Female  13  93

, , Class = 3rd

        Survived
Sex       No Yes
  Male   422  88
  Female 106  90

, , Class = Crew

        Survived
Sex       No Yes
  Male   670 192
  Female   3  20

The odds ratios are easily enough calculated by hand from these tables.

> ### odds ratio 1st class
> (141/4) / (62/118)
[1] 67.08871
> ### odds ratio 2nd class
> (93/13) / (25/154)
[1] 44.06769
> ### odds ratio 3rd class
> (90/106) / (88/422)
[1] 4.071612
> ### odds ratio crew
> (20/3) / (192/670)
[1] 23.26389

In all classes the odds of a female surviving were better than the odds of a male surviving, but this was especially true in first class, and only somewhat true in third class.

Getting More Information From The Model

Let's say we want to go with the model produced by step() as our model of what happened on the Titanic. First, I'll store the model in a data object (and I will copy and paste the command to produce the model from the output of step(), because it's unlikely I'd type all that correctly).

> loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age + 
+     Sex:Age + Class:Survived + Sex:Survived + Age:Survived + 
+     Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived, 
+     data = Titanic) -> step.model

Following are some of the extractor functions available for getting information from the model object (see the help page for a complete list).

> print(step.model)                    # same as just typing step.model
Call:
loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age + 
    Sex:Age + Class:Survived + Sex:Survived + Age:Survived + 
    Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived, 
    data = Titanic)

Statistics:
                      X^2 df  P(> X^2)
Likelihood Ratio 1.685479  4 0.7933536
Pearson               NaN  4       NaN

> fitted(step.model)                   # the EFs produced by the model
Re-fitting to get fitted values
, , Age = Child, Survived = No

      Sex
Class      Male   Female
  1st   0.00000  0.00000
  2nd   0.00000  0.00000
  3rd  37.43281 14.56719
  Crew  0.00000  0.00000

, , Age = Adult, Survived = No

      Sex
Class      Male  Female
  1st  118.0000  4.0000
  2nd  154.0000 13.0000
  3rd  384.5672 91.4328
  Crew 670.0000  3.0000

, , Age = Child, Survived = Yes

      Sex
Class      Male   Female
  1st   5.00000  1.00000
  2nd  10.98493 13.01507
  3rd  10.56718 16.43282
  Crew  0.00000  0.00000

, , Age = Adult, Survived = Yes

      Sex
Class       Male    Female
  1st   57.00000 140.00000
  2nd   14.02291  79.97709
  3rd   77.43281  73.56719
  Crew 192.00000  20.00000

> resid(step.model)                    # the standardized residuals
Re-fitting to get frequencies and fitted values
, , Age = Child, Survived = No

      Sex
Class           Male        Female
  1st   0.000000e+00  0.000000e+00
  2nd   0.000000e+00  0.000000e+00
  3rd  -4.020602e-01  6.208006e-01
  Crew  0.000000e+00  0.000000e+00

, , Age = Adult, Survived = No

      Sex
Class           Male        Female
  1st   0.000000e+00  0.000000e+00
  2nd   0.000000e+00  0.000000e+00
  3rd   1.239264e-01 -2.555637e-01
  Crew  0.000000e+00  0.000000e+00

, , Age = Child, Survived = Yes

      Sex
Class           Male        Female
  1st   2.142148e-08 -4.552313e-08
  2nd   4.546268e-03 -4.178434e-03
  3rd   7.221252e-01 -6.159455e-01
  Crew  0.000000e+00  0.000000e+00

, , Age = Adult, Survived = Yes

      Sex
Class           Male        Female
  1st  -6.825888e-08  6.182584e-08
  2nd  -6.118921e-03  2.561368e-03
  3rd  -2.779358e-01  2.820973e-01
  Crew  0.000000e+00  0.000000e+00

Ordinarily, you can also get regression coefficients-- coef(step.model) --but that failed in this case. I'm not sure why. In any event, I don't see much point in getting the coefficients, as they are only useful for calculating EFs, which we already have. After all, it's not like we are going to come up with a new level of "Survived" for which we need to make predictions!

It's worth noting that our EFs contain several zeros. Our model is in danger of being inaccurate for this reason.

Using glm() For Log Linear Modeling

We can also do log linear analysis using the glm() function, which is the function used for generalized linear models. For this we need a data frame rather than a contingency table, but this is easy enough to get.

> ti = as.data.frame(Titanic)
> ti                                   # it's a lot easier to look at the data this way too!
   Class    Sex   Age Survived Freq
1    1st   Male Child       No    0
2    2nd   Male Child       No    0
3    3rd   Male Child       No   35
4   Crew   Male Child       No    0
5    1st Female Child       No    0
6    2nd Female Child       No    0
7    3rd Female Child       No   17
8   Crew Female Child       No    0
9    1st   Male Adult       No  118
10   2nd   Male Adult       No  154
11   3rd   Male Adult       No  387
12  Crew   Male Adult       No  670
13   1st Female Adult       No    4
14   2nd Female Adult       No   13
15   3rd Female Adult       No   89
16  Crew Female Adult       No    3
17   1st   Male Child      Yes    5
18   2nd   Male Child      Yes   11
19   3rd   Male Child      Yes   13
20  Crew   Male Child      Yes    0
21   1st Female Child      Yes    1
22   2nd Female Child      Yes   13
23   3rd Female Child      Yes   14
24  Crew Female Child      Yes    0
25   1st   Male Adult      Yes   57
26   2nd   Male Adult      Yes   14
27   3rd   Male Adult      Yes   75
28  Crew   Male Adult      Yes  192
29   1st Female Adult      Yes  140
30   2nd Female Adult      Yes   80
31   3rd Female Adult      Yes   76
32  Crew Female Adult      Yes   20

The log linear analysis is then produced this way.

> glm.model = glm(Freq ~ Class * Age * Sex * Survived, data=ti, family=poisson)

This is the saturated model. The extractor function summary(glm.model) would produce a wealth of puzzling output, but the extractor function I like best is...

> anova(glm.model, test="Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: Freq

Terms added sequentially (first to last)


                       Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                                      31     4953.1           
Class                   3    475.8        28     4477.3 8.326e-103
Age                     1   2183.6        27     2293.8        0.0
Sex                     1    768.3        26     1525.4 4.169e-169
Survived                1    281.8        25     1243.7  3.078e-63
Class:Age               3    148.3        22     1095.3  6.048e-32
Class:Sex               3    412.6        19      682.7  4.126e-89
Age:Sex                 1      6.1        18      676.6  1.363e-02
Class:Survived          3    180.9        15      495.7  5.634e-39
Age:Survived            1     25.6        14      470.2  4.237e-07
Sex:Survived            1    353.6        13      116.6  7.053e-79
Class:Age:Sex           3      4.0        10      112.6        0.3
Class:Age:Survived      3     35.7         7       76.9  8.825e-08
Class:Sex:Survived      3     75.2         4        1.7  3.253e-16
Age:Sex:Survived        1      1.7         3  4.237e-10        0.2
Class:Age:Sex:Survived  3      0.0         0  4.463e-10        1.0

Deviance is a fancy term for the likelihood ratio chi-square value. The null model (all frequencies equal) is at the top. Clearly, the null model is not sufficient to describe this data set, as adding in subsequent terms produces significant reductions in deviance. At each step down the table, another predictor is added in, and this removes the amount of Deviance shown in that column, leaving the amount in the Resid. Dev. column. The p-value in the last column is based on the change in Deviance (column two) and the change in Df (column one) produced by adding that term into the model. Here we are looking for a significant reduction in deviance, indicating the new term helped in predicting cell (observed) frequencies.

This output suggests we might want to investigate three terms for possible elimation: the four-way interaction and two three-way interactions. Recall that step() retained the Class:Age:Sex interaction.

> anova(update(glm.model,.~.-(Class:Age:Sex:Survived+Age:Sex:Survived
+                             +Class:Age:Sex)),test="Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: Freq

Terms added sequentially (first to last)


                   Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                                  31     4953.1           
Class               3    475.8        28     4477.3 8.326e-103
Age                 1   2183.6        27     2293.8        0.0
Sex                 1    768.3        26     1525.4 4.169e-169
Survived            1    281.8        25     1243.7  3.078e-63
Class:Age           3    148.3        22     1095.3  6.048e-32
Class:Sex           3    412.6        19      682.7  4.126e-89
Age:Sex             1      6.1        18      676.6  1.363e-02
Class:Survived      3    180.9        15      495.7  5.634e-39
Age:Survived        1     25.6        14      470.2  4.237e-07
Sex:Survived        1    353.6        13      116.6  7.053e-79
Class:Age:Survived  3     29.2        10       87.4  2.024e-06
Class:Sex:Survived  3     65.4         7       22.0  4.066e-14

> 1 - pchisq(22,df=7)
[1] 0.002540414

With this model we end up with a residual deviance of 22 on 7 degrees of freedom. This is not an adequate model of the observed frequencies (p = .0025). One or more of those deleted terms should be put back.

Other extractor functions for the glm model can be seen on the help page for the glm() function.

Visualizing the Analysis

One of the better ways of trying to visualize the results of a log linear analysis is to use a mosaic plot.

> mosaicplot(Titanic, shade=T)         # output not shown

I have not shown the output, as it does not display well for a table this complex except on a large computer screen. The space I have available here to reproduce the plot would not do it justice. When "shade=" is set to TRUE, an extended mosaic plot is produced, which allows you to visualize the standardized residuals of a log linear analysis on the table (multi-way test of independence, I assume), both by color and by outline of the blocks on the plot. Negative residuals (OFs too low) have dashed borders and are shaded red. Positive residuals (OFs too high) have solid borders and are shaded blue. A key on the right side shows the size of the residuals by color shading.

Final Note of Caution

Log linear models are calculated using an iterative algorithm. The method might be considered "computer intensive" in this respect. Different software packages may program the algorithm differently, or even use a different algorithm. Therefore, results from different software packages may not match up exactly. Hopefully, they will be close!

revised 2016 February 10