LOG LINEAR ANALYSIS Preliminaries Model Formulae If you have not yet read the Model Formulae tutorial, reading it before you proceed might help you with parts of this one, model-formula-wise that is. A Little Theory Most of these tutorials have spared the theory in favor of just showing "how to do it" in R. My assumption has been that discussion of theory is either unnecessary or has been covered elsewhere. Log linear analysis didn't exist (or was in its infancy) when I took my grad courses in stats, and that may also be true for some of you, so just a little theory might be in order. For a more detailed discussion, I refer you to the excellent Chapter 17 of Howell (2007). (Note: This chapter has been removed from the current edition of Howell's textbook, but it may still be available at his website.) We will examine a data set called "Titanic", which is a built-in data set
describing the outcome of the Titanic sinking in 1912. The data object is a
four-dimensional table.
> data(Titanic) > dimnames(Titanic) $Class [1] "1st" "2nd" "3rd" "Crew" $Sex [1] "Male" "Female" $Age [1] "Child" "Adult" $Survived [1] "No" "Yes"As you can see, there are four categorical variables crosstabulated in this table. They are "Class" of the passengers (i.e., class of ticket they held) in the rows of the table (1st, 2nd, 3rd, Crew), "Sex" of the passengers in the columns (Male, Female), "Age" of the passengers in the layers or strata (Child, Adult), and whether or not the passengers "Survived" in the 4th dimension of the table (No, Yes). There are... > margin.table(Titanic) [1] 2201...cases in the table. These were the data originally collected by the British Board of Trade in its inquiry into the sinking. (For the interested, there is updated and additional data available at the Titanic Encyclopedia website.) Log linear analysis allows us to look for relationships among the variables in a multiway contingency table like this one. Like it's older sibling, chi square analysis for two-way tables, log linear analysis assumes all the observations represented in the table (not the variables, the individual observations or subjects) are independent (could be a problem in this data set) and the cell-by-cell expected frequencies will be sufficiently high, generally five or more. As long as all EFs are one or more, and no more than 20% of them are below five, the accuracy of the procedure will not be seriously jeopardized, but power will take a nose dive! (These are essentially the same assumptions made by the Pearson chi-square test.) Let's look at the data collapsed over "Class" and "Age".
> margin.table(Titanic, c(2,4)) # show dimensions 2 and 4, collapse across others Survived Sex No Yes Male 1364 367 Female 126 344We see that women were much more likely to survive than men. The odds ratio in favor of survival if you were a woman vs. a man on the Titanic was... > (344/126) / (367/1364) [1] 10.14697...better than 10:1 ("ten to one"). (Odds of a female surviving: 344/126. Odds of a male surviving: 367/1364. Odds ratio is the ratio of these two figures.) The relative likelihood (or likelihood ratio) of a female vs. a male surviving was... > (344/(344+126)) / (367/(367+1364)) [1] 3.452165...which is just the proportion of females who survived divided by the proportion of males who survived. So females were almost 3.5 times as likely to survive. This is the language of log linear analysis. A Pearson chi-square test on this table would no doubt reveal a highly significant relationship (called an "interaction" in this context) between these two factors. Chi square is the basic building block of log linear analysis, but in a
slightly different form called likelihood ratio chi square. The advantage of
this is that likelihood ratio chi squares are additive. That is, the chi
squares derived from simplier effects add together to produce the chi squares
derived from more complex effects (or "models," I should say). R (to my
knowledge) does not incorporate a likelihood ratio chi square test for two-way
tables (but see below), as it is not the normal procedure, but there is nothing
stopping us from using a likelihood ratio test on two-way tables, so here is
a script.
### begin copying script here likelihood.test = function(x) { nrows = dim(x)[1] # no. of rows in contingency table ncols = dim(x)[2] # no. of cols in contingency table chi.out = chisq.test(x,correct=F) # do a Pearson chi square test table = chi.out[[6]] # get the OFs ratios = chi.out[[6]]/chi.out[[7]] # calculate OF/EF ratios sum = 0 # storage for the test statistic for (i in 1:nrows) { for (j in 1:ncols) { sum = sum + table[i,j]*log(ratios[i,j]) } } sum = 2 * sum # the likelihood ratio chi square df = chi.out[[2]] # degrees of freedom p = 1 - pchisq(sum,df) # p-value out = c(sum, df, p, chi.out[[1]]) # the output vector names(out) = c("LR-chisq","df","p-value","Pears-chisq") round(out,4) # done! } ### end copying script here and paste into R ConsoleHighlight and copy this script and then paste it into your R Console window at the prompt. If necessary, hit the Enter key at the end. (It won't hurt if you do this anyway.) This will create a function in your workspace called likelihood.test(). It's used this way. > sex.survived = margin.table(Titanic, c(2,4)) # create the contingency table > likelihood.test(sex.survived) LR-chisq df p-value Pears-chisq 434.4688 1.0000 0.0000 456.8742The output consists of the likelihood ratio chi-square statistic, the degrees of freedom associated with it (same as for Pearson's), the p-value, and for comparison the Pearson's chi-square value. (Pardon the formatting of the output, but it's a quick and dirty script, and we won't need it for long.) We are attempting to explain (model) the frequencies in the contingency table. In a traditional chi square test (of either type), our model (null hypothesis) consists of an effect for each factor but no interaction between them. If the test allows us to reject the null hypothesis, then we conclude that this model is not adequate to explain the cell frequencies, and that to do so we must assume the factors are interacting (are not independent or are related in the traditional language of chi square). In a traditional two-way chi square test, a simpler model would usually not make sense, but just to give you an idea, suppose we proposed the following model (i.e., null hypothesis): all the cell frequencies are equal. That is to say, there is not only NOT an interaction between the two variables, but there are also no effects of either factor. There is nothing to stop us from testing this null hypothesis using a traditional chi square test, and we would do it by setting all the EFs to 2201/4=550.25. (Such a test would cost us only 1 df, by the way, since our only restriction on the data is that the EFs all add to N. So on a 2 × 2 contingency table, the test would have df=3.) In the language of log linear analysis, this all-frequencies-equal model is called the "null model." At the other extreme, we could posit a model that includes not only simple effects for each factor but also all possible interactions between them. Such a model would explain (predict, model) the cell frequencies perfectly, would result in a chi square statistic of zero on zero degrees of freedom, and would have no explanatory value at all. Such a model, in the language of log linear analysis, is called the "saturated model." End of theory. If you want a deeper explanation than this fairly superficial one, see Howell (2007). Log Linear Analysis Our goal is to explain the observed frequencies in the "Titanic" table with as simple a model as possible. Here is what we have to work with.
Notice we are not treating any of the factors differently, i.e., as response vs. explanatory variables. We could. We're just not, at this time. It's crucial that you keep this in mind. There is a natural tendency here to see "Survived" as a response variable that we are trying to explain. We are NOT (at this time). We are trying to model (explain, predict, account for) the frequencies in the contingency table. Begin like this.
> summary(Titanic) Number of cases in table: 2201 Number of factors: 4 Test for independence of all factors: Chisq = 1637.4, df = 25, p-value = 0 Chi-squared approximation may be incorrectYou have just done a log linear analysis on the model that includes only the simple effects, i.e., the effects of the factors without any interactions between them, the first four lines of the above table. This model (null hypothesis) is rejected. Somewhere in these data, there are interactions we must find to explain the cell frequencies adequately. (The warning in the last line tells us we have EFs below 5. This is the only time you'll see this warning, so it's always a good idea to start off this way, in my opinion.) Several functions in R can be used to fit log linear models. They include
loglin() in the "stats" library (loaded by default),
loglm() in the "MASS" library, and
glm() in the "stats" library. Let's experiment a
little.
> library("MASS") # load MASS so we can use loglm() > loglm( ~ Sex + Survived, data=sex.survived) Call: loglm(formula = ~ Sex + Survived, data = sex.survived) Statistics: X^2 df P(> X^2) Likelihood Ratio 434.4688 1 0 Pearson 456.8742 1 0We are looking at the 2D table "Sex" by "Survived" that we created above, we are putting in only the simple effects of individual factors, and we are getting out the same result as we got when we used my likelihood.test() function. In other words, we've used log linear modeling to do a chi-square test of independence on this two-way contingency table. The model with just the simple effects is inadequate (null hypotheis rejected, p virtually 0). There must be an interaction between these two variables (we would propose). Now let's do the same for the full data table.
> loglm(~ Class + Sex + Age + Survived, data=Titanic) Call: loglm(formula = ~Class + Sex + Age + Survived, data = Titanic) Statistics: X^2 df P(> X^2) Likelihood Ratio 1243.663 25 0 Pearson 1637.445 25 0The result is the same one we got with the summary(Titanic) command. Essentially, we've done a four-way chi-square test of independence, and we reject the null hypothesis of independence. Somewhere in there are factors that are interacting with each other to produce the observed cell frequencies. The saturated model, on the other hand, produces nothing useful, as it always explains (predicts, models) the observed frequencies completely. > loglm(~ Class * Sex * Age * Survived, data=Titanic) Call: loglm(formula = ~Class * Sex * Age * Survived, data = Titanic) Statistics: X^2 df P(> X^2) Likelihood Ratio 0 0 1 Pearson NaN 0 1Null hypothesis (i.e., the saturated model) retained (p virtually 1). The "truth" (we hope) lies somewhere between these two extremes. Let's try it without the fourway interaction.
> loglm(~ Class * Sex * Age * Survived - Class:Sex:Age:Survived, data=Titanic) Call: loglm(formula = ~Class * Sex * Age * Survived - Class:Sex:Age:Survived, data = Titanic) Statistics: X^2 df P(> X^2) Likelihood Ratio 0.0002728865 3 0.9999988 Pearson NaN 3 NaNThe likelihood ratio chi-square test (the Pearson test failed) shows this model predicts cell frequencies that are not significantly different from those actually observed. The fourway interaction is expendable in our model, because when we "expended" it, we could not reject the model that doesn't have it. It does not help us understand (model) the observed frequencies. On the other hand...
> loglm(~ Class + Sex + Age + Survived + Age:Survived, data=Titanic) Call: loglm(formula = ~Class + Sex + Age + Survived + Age:Survived, data = Titanic) Statistics: X^2 df P(> X^2) Likelihood Ratio 1224.103 24 0 Pearson 1596.846 24 0...the model with all factors plus the Age × Survived interaction DOES predict frequencies (EFs) that are significantly different from those actually observed. We need to add more interaction terms to get at the "real truth" of the matter (to create an adequate statistical model of what happened on the Titanic). Testing a Specific Hypothesis Suppose we began with the hypothesis that gender was related to survival on the Titanic. From the two-way (bivariate) chi-square test we did above, this appears to be true. However, "Sex" is also related to "Class" (p virtually zero by chi-square test) and "Age" (p virtually zero), and both "Class" and "Age" are significantly related to "Survived". So "Class" and "Age" might be confounds in the relationship between "Sex" and "Survived". Like a multiple regression for numeric variables, log linear analysis will allow us to tease apart these effects. If we remove all the interaction terms that involve both "Sex" and
"Survived", and the model still fits the observed frequencies adequately, then
we can conclude that gender and survival were unrelated.
> sat.model = loglm(~ Class * Sex * Age * Survived, data=Titanic) > sat.model Call: loglm(formula = ~Class * Sex * Age * Survived, data = Titanic) Statistics: X^2 df P(> X^2) Likelihood Ratio 0 0 1 Pearson NaN 0 1 > ### already knew this... > model2 = update(sat.model, ~.-(Class:Sex:Age:Survived+Sex:Age:Survived+ + Class:Sex:Survived+Sex:Survived)) > model2 Call: loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age + Sex:Age + Class:Survived + Age:Survived + Class:Sex:Age + Class:Age:Survived, data = Titanic) Statistics: X^2 df P(> X^2) Likelihood Ratio 436.2715 8 0 Pearson NaN 8 NaNThe model does not adequately fit the observed frequencies unless we include some sort of "Sex" by "Survived" interaction. Notice the use of update here to get the new model. We could just have retyped the model formula with the factors we wanted to include into a new loglm() analysis, but update() is often a bit more convenient. The syntax reads "update sat.model by using all the same predictors (dot means "same") except subtract out (minus) the following terms." Let's do one with only the two-way interactions involving both "Sex" and "Survived". > model3 = update(sat.model, ~.-(Class:Age:Sex:Survived+Sex:Age:Survived+ + Class:Sex:Survived)) > model3 Call: loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age + Sex:Age + Class:Survived + Sex:Survived + Age:Survived + Class:Sex:Age + Class:Age:Survived, data = Titanic) Statistics: X^2 df P(> X^2) Likelihood Ratio 76.90406 7 5.884182e-14 Pearson NaN 7 NaNNope! This model is also rejected. Apparently, the "Sex" by "Survived" interaction is conditioned on (dependent upon, changes form with) another one of the factors. I.e., there is an important three-way interaction involving "Sex", "Survived", and one of the other variables. As you can well imagine, this could be a tedious and error prone way to construct a log linear model. R provides a way to automate model building. The step() Function Although I am generally hostile to computer-automated statistical model
building, in this circumstance I am willing to make an exception.
> step(sat.model, direction="backward") Start: AIC=64 ~Class * Sex * Age * Survived Df AIC - Class:Sex:Age:Survived 3 58The procedure here is to build the saturated model ("sat.model") and then subject it to what might be loosely called "backward regression" using the step() function with "direction=" set to "backward" (other possibilites are "forward" and "both"). This will simplify the model using a criterion called Akaike's Information Criterion (AIC). Smaller values of AIC are better. In this case, R has arrived at a model in which only two interactions have been removed, the four-way and Sex:Age:Survived. It appears the relationship between "Sex" and "Survived" is conditioned on "Class". Let's check this out... > margin.table(Titanic, c(2,4,1)) , , Class = 1st Survived Sex No Yes Male 118 62 Female 4 141 , , Class = 2nd Survived Sex No Yes Male 154 25 Female 13 93 , , Class = 3rd Survived Sex No Yes Male 422 88 Female 106 90 , , Class = Crew Survived Sex No Yes Male 670 192 Female 3 20The odds ratios are easily enough calculated by hand from these tables. > ### odds ratio 1st class > (141/4) / (62/118) [1] 67.08871 > ### odds ratio 2nd class > (93/13) / (25/154) [1] 44.06769 > ### odds ratio 3rd class > (90/106) / (88/422) [1] 4.071612 > ### odds ratio crew > (20/3) / (192/670) [1] 23.26389In all classes the odds of a female surviving were better than the odds of a male surviving, but this was especially true in first class, and only somewhat true in third class. Getting More Information From The Model Let's say we want to go with the model produced by
step() as our
model of what happened on the Titanic. First, I'll store the model in a data
object (and I will copy and paste the command to produce the model from the
output of step(), because it's unlikely I'd
type all that correctly).
> loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age + + Sex:Age + Class:Survived + Sex:Survived + Age:Survived + + Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived, + data = Titanic) -> step.modelFollowing are some of the extractor functions available for getting information from the model object (see the help page for a complete list). > print(step.model) # same as just typing step.model Call: loglm(formula = ~Class + Sex + Age + Survived + Class:Sex + Class:Age + Sex:Age + Class:Survived + Sex:Survived + Age:Survived + Class:Sex:Age + Class:Sex:Survived + Class:Age:Survived, data = Titanic) Statistics: X^2 df P(> X^2) Likelihood Ratio 1.685479 4 0.7933536 Pearson NaN 4 NaN > fitted(step.model) # the EFs produced by the model Re-fitting to get fitted values , , Age = Child, Survived = No Sex Class Male Female 1st 0.00000 0.00000 2nd 0.00000 0.00000 3rd 37.43281 14.56719 Crew 0.00000 0.00000 , , Age = Adult, Survived = No Sex Class Male Female 1st 118.0000 4.0000 2nd 154.0000 13.0000 3rd 384.5672 91.4328 Crew 670.0000 3.0000 , , Age = Child, Survived = Yes Sex Class Male Female 1st 5.00000 1.00000 2nd 10.98493 13.01507 3rd 10.56718 16.43282 Crew 0.00000 0.00000 , , Age = Adult, Survived = Yes Sex Class Male Female 1st 57.00000 140.00000 2nd 14.02291 79.97709 3rd 77.43281 73.56719 Crew 192.00000 20.00000 > resid(step.model) # the standardized residuals Re-fitting to get frequencies and fitted values , , Age = Child, Survived = No Sex Class Male Female 1st 0.000000e+00 0.000000e+00 2nd 0.000000e+00 0.000000e+00 3rd -4.020602e-01 6.208006e-01 Crew 0.000000e+00 0.000000e+00 , , Age = Adult, Survived = No Sex Class Male Female 1st 0.000000e+00 0.000000e+00 2nd 0.000000e+00 0.000000e+00 3rd 1.239264e-01 -2.555637e-01 Crew 0.000000e+00 0.000000e+00 , , Age = Child, Survived = Yes Sex Class Male Female 1st 2.142148e-08 -4.552313e-08 2nd 4.546268e-03 -4.178434e-03 3rd 7.221252e-01 -6.159455e-01 Crew 0.000000e+00 0.000000e+00 , , Age = Adult, Survived = Yes Sex Class Male Female 1st -6.825888e-08 6.182584e-08 2nd -6.118921e-03 2.561368e-03 3rd -2.779358e-01 2.820973e-01 Crew 0.000000e+00 0.000000e+00Ordinarily, you can also get regression coefficients-- coef(step.model) --but that failed in this case. I'm not sure why. In any event, I don't see much point in getting the coefficients, as they are only useful for calculating EFs, which we already have. After all, it's not like we are going to come up with a new level of "Survived" for which we need to make predictions! It's worth noting that our EFs contain several zeros. Our model is in danger of being inaccurate for this reason. Using glm() For Log Linear Modeling We can also do log linear analysis using the
glm() function, which is
the function used for generalized linear models. For this we need a data frame
rather than a contingency table, but this is easy enough to get.
> ti = as.data.frame(Titanic) > ti # it's a lot easier to look at the data this way too! Class Sex Age Survived Freq 1 1st Male Child No 0 2 2nd Male Child No 0 3 3rd Male Child No 35 4 Crew Male Child No 0 5 1st Female Child No 0 6 2nd Female Child No 0 7 3rd Female Child No 17 8 Crew Female Child No 0 9 1st Male Adult No 118 10 2nd Male Adult No 154 11 3rd Male Adult No 387 12 Crew Male Adult No 670 13 1st Female Adult No 4 14 2nd Female Adult No 13 15 3rd Female Adult No 89 16 Crew Female Adult No 3 17 1st Male Child Yes 5 18 2nd Male Child Yes 11 19 3rd Male Child Yes 13 20 Crew Male Child Yes 0 21 1st Female Child Yes 1 22 2nd Female Child Yes 13 23 3rd Female Child Yes 14 24 Crew Female Child Yes 0 25 1st Male Adult Yes 57 26 2nd Male Adult Yes 14 27 3rd Male Adult Yes 75 28 Crew Male Adult Yes 192 29 1st Female Adult Yes 140 30 2nd Female Adult Yes 80 31 3rd Female Adult Yes 76 32 Crew Female Adult Yes 20The log linear analysis is then produced this way. > glm.model = glm(Freq ~ Class * Age * Sex * Survived, data=ti, family=poisson)This is the saturated model. The extractor function summary(glm.model) would produce a wealth of puzzling output, but the extractor function I like best is... > anova(glm.model, test="Chisq") Analysis of Deviance Table Model: poisson, link: log Response: Freq Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 31 4953.1 Class 3 475.8 28 4477.3 8.326e-103 Age 1 2183.6 27 2293.8 0.0 Sex 1 768.3 26 1525.4 4.169e-169 Survived 1 281.8 25 1243.7 3.078e-63 Class:Age 3 148.3 22 1095.3 6.048e-32 Class:Sex 3 412.6 19 682.7 4.126e-89 Age:Sex 1 6.1 18 676.6 1.363e-02 Class:Survived 3 180.9 15 495.7 5.634e-39 Age:Survived 1 25.6 14 470.2 4.237e-07 Sex:Survived 1 353.6 13 116.6 7.053e-79 Class:Age:Sex 3 4.0 10 112.6 0.3 Class:Age:Survived 3 35.7 7 76.9 8.825e-08 Class:Sex:Survived 3 75.2 4 1.7 3.253e-16 Age:Sex:Survived 1 1.7 3 4.237e-10 0.2 Class:Age:Sex:Survived 3 0.0 0 4.463e-10 1.0Deviance is a fancy term for the likelihood ratio chi-square value. The null model (all frequencies equal) is at the top. Clearly, the null model is not sufficient to describe this data set, as adding in subsequent terms produces significant reductions in deviance. At each step down the table, another predictor is added in, and this removes the amount of Deviance shown in that column, leaving the amount in the Resid. Dev. column. The p-value in the last column is based on the change in Deviance (column two) and the change in Df (column one) produced by adding that term into the model. Here we are looking for a significant reduction in deviance, indicating the new term helped in predicting cell (observed) frequencies. This output suggests we might want to investigate three terms for possible
elimation: the four-way interaction and two three-way interactions. Recall
that step() retained the Class:Age:Sex interaction.
> anova(update(glm.model,.~.-(Class:Age:Sex:Survived+Age:Sex:Survived + +Class:Age:Sex)),test="Chisq") Analysis of Deviance Table Model: poisson, link: log Response: Freq Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 31 4953.1 Class 3 475.8 28 4477.3 8.326e-103 Age 1 2183.6 27 2293.8 0.0 Sex 1 768.3 26 1525.4 4.169e-169 Survived 1 281.8 25 1243.7 3.078e-63 Class:Age 3 148.3 22 1095.3 6.048e-32 Class:Sex 3 412.6 19 682.7 4.126e-89 Age:Sex 1 6.1 18 676.6 1.363e-02 Class:Survived 3 180.9 15 495.7 5.634e-39 Age:Survived 1 25.6 14 470.2 4.237e-07 Sex:Survived 1 353.6 13 116.6 7.053e-79 Class:Age:Survived 3 29.2 10 87.4 2.024e-06 Class:Sex:Survived 3 65.4 7 22.0 4.066e-14 > 1 - pchisq(22,df=7) [1] 0.002540414With this model we end up with a residual deviance of 22 on 7 degrees of freedom. This is not an adequate model of the observed frequencies (p = .0025). One or more of those deleted terms should be put back. Other extractor functions for the glm model can be seen on the help page for the glm() function. Visualizing the Analysis One of the better ways of trying to visualize the results of a log linear
analysis is to use a mosaic plot.
> mosaicplot(Titanic, shade=T) # output not shownI have not shown the output, as it does not display well for a table this complex except on a large computer screen. The space I have available here to reproduce the plot would not do it justice. When "shade=" is set to TRUE, an extended mosaic plot is produced, which allows you to visualize the standardized residuals of a log linear analysis on the table (multi-way test of independence, I assume), both by color and by outline of the blocks on the plot. Negative residuals (OFs too low) have dashed borders and are shaded red. Positive residuals (OFs too high) have solid borders and are shaded blue. A key on the right side shows the size of the residuals by color shading. Final Note of Caution Log linear models are calculated using an iterative algorithm. The method might be considered "computer intensive" in this respect. Different software packages may program the algorithm differently, or even use a different algorithm. Therefore, results from different software packages may not match up exactly. Hopefully, they will be close! revised 2016 February 10 |