| Table of Contents | Function Reference | Function Finder | R Project | LOGISTIC REGRESSION Preliminaries Model Formulae You will need to know a bit about Model Formulae to understand this tutorial. Odds, Odds Ratios, and Logit When you go to the track, how do you know which horse to bet on? You look at the odds. In the program, you may see the odds for your horse, Sea Brisket, are 8 to 1, which are the odds AGAINST winning. This means in nine races Sea Brisket would be expected to win 1 and lose 8. In probability terms, Sea Brisket has a probability of winning of 1/9, or 0.111. But the odds of winning are 1:8, 1/8, or 0.125. Odds are actually the ratio of two probabilities. ``` p(one outcome) p(success) p odds = -------------------- = ----------- = ---, where q = 1 - p p(the other outcome) p(failure) q``` So for Sea Brisket, odds(winning) = (1/9)/(8/9) = 1/8. Notice that odds have these properties: If p(success) = p(failure), then odds(success) = 1 (or 1 to 1, or 1:1). If p(success) < p(failure), then odds(success) < 1. If p(success) > p(failure), then odds(success) > 1. Unlike probability, which cannot exceed 1, there is no upper bound on odds. The natural log of odds is called the logit, or logit transformation, of p: logit(p) = loge(p/q). Logit is sometimes called "log odds." Because of the properties of odds given in the list above, the logit has these properties: If odds(success) = 1, then logit(p) = 0. If odds(success) < 1, then logit(p) < 0. If odds(success) > 1, then logit(p) > 0. The logit transform fails if p = 0. Logistic regression is a method for fitting a regression curve,  y = f(x), when y consists of proportions or probabilities, or binary coded (0,1--failure, success) data. When the response is a binary (dichotomous) variable, and x is numeric, logistic regression fits a logistic curve to the relationship between x and y. The logistic curve looks like the graph at right. It is an S-shaped or sigmoid curve, often used to model population growth, survival from a disease, the spread of a disease (or something disease-like, such as a rumor), and so on. The logistic function is... y = [exp(b0 + b1x)] / [1 + exp(b0 + b1x)] Logistic regression fits b0 and b1, the regression coefficients (which were 0 and 1, respectively, for the graph above). It should have already struck you that this curve is not linear. However, the point of the logit transform is to make it linear. logit(y) = b0 + b1x Hence, logistic regression is linear regression on the logit transform of y, where y is the proportion (or probability) of success at each value of x. However, you should avoid the temptation to do a traditional least-squares regression at this point, as neither the normality nor the homoscedasticity assumption will be met. The odds ratio might best be illustrated by returning to our horse race. Suppose in the same race, Seattle Stew is given odds of 2 to 1 against, which is to say, two expected loses for each expected win. Seattle Stew's odds of winning are 1/2, or 0.5. How much better is this than the winning odds for Sea Brisket? The odds ratio tells us: 0.5 / 0.125 = 4.0. The odds of Seattle Stew winning are four times the odds of Sea Brisket winning. Be careful not to say "times as likely to win," which would not be correct. The probability (likelihood, chance) of Seattle Stew winning is 1/3 and for Sea Brisket is 1/9, resulting in a likelihood ratio of 3.0. Seattle Stew is three times more likely to win than is Sea Brisket. Logistic Regression: One Numeric Predictor In the "MASS" library there is a data set called "menarche" (Milicer, H. and Szczotka, F., 1966, Age at Menarche in Warsaw girls in 1965, Human Biology, 38, 199-203), in which there are three variables: "Age" (average age of age homogeneous groups of girls), "Total" (number of girls in each group), and "Menarche" (number of girls in the group who have reached menarche). ```> library("MASS") > data(menarche) > str(menarche) 'data.frame': 25 obs. of 3 variables: \$ Age : num 9.21 10.21 10.58 10.83 11.08 ... \$ Total : num 376 200 93 120 90 88 105 111 100 93 ... \$ Menarche: num 0 0 0 2 2 5 10 17 16 29 ... > summary(menarche) Age Total Menarche Min. : 9.21 Min. : 88.0 Min. : 0.00 1st Qu.:11.58 1st Qu.: 98.0 1st Qu.: 10.00 Median :13.08 Median : 105.0 Median : 51.00 Mean :13.10 Mean : 156.7 Mean : 92.32 3rd Qu.:14.58 3rd Qu.: 117.0 3rd Qu.: 92.00 Max. :17.58 Max. :1049.0 Max. :1049.00 > plot(Menarche/Total ~ Age, data=menarche)``` From the graph at right, it appears a logistic fit is called for here. The fit would be done this way. `> glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age, family=binomial(logit), data=menarche)` Numerous explanation are in order! First, glm() is the function used to do generalized linear models. With "family=" set to "binomial" with a "logit" link, glm() produces a logistic regression. Because we are using glm() with binomial errors in the response variable, the ordinary assumptions of least squares linear regression (normality and homoscedasticity) don't apply. Second, our data frame does not contain a row for every case (i.e., every girl upon whom data were collected). Therefore, we do not have a binary (0,1) coded response variable. No problem! If we feed glm() a table (or matrix) in which the first column is number of successes and the second column is number of failures, R will take care of the coding for us. In the above analysis, we made that table on the fly inside the model formula by binding "Menarche" and "Total − Menarche" into the columns of a table (matrix) using cbind(). Let's look at how closely the fitted values from our logistic regression match the observed values. ```> plot(Menarche/Total ~ Age, data=menarche) > lines(menarche\$Age, glm.out\$fitted, type="l", col="red") > title(main="Menarche Data with Fitted Logistic Regression Line")``` I'm impressed! The numerical results are extracted like this. ```> summary(glm.out) Call: glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit), data = menarche) Deviance Residuals: Min 1Q Median 3Q Max -2.0363 -0.9953 -0.4900 0.7780 1.3675 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -21.22639 0.77068 -27.54 <2e-16 *** Age 1.63197 0.05895 27.68 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 3693.884 on 24 degrees of freedom Residual deviance: 26.703 on 23 degrees of freedom AIC: 114.76 Number of Fisher Scoring iterations: 4``` The following requests also produce useful results: glm.out\$coef, glm.out\$fitted, glm.out\$resid, glm.out\$effects, and anova(glm.out). Recall that the response variable is log odds, so the coefficient of "Age" can be interpreted as "for every one year increase in age the odds of having reached menarche increase by exp(1.632) = 5.11 times." To evaluate the overall performance of the model, look at the null deviance and residual deviance near the bottom of the print out. Null deviance shows how well the response is predicted by a model with nothing but an intercept (grand mean). This is essentially a chi square value on 24 degrees of freedom, and indicates very little fit (a highly significant difference between fitted values and observed values). Adding in our predictors--just "Age" in this case--decreased the deviance by 3667 points on 1 degree of freedom. Again, this is interpreted as a chi square value and indicates a highly significant decrease in deviance. The residual deviance is 26.7 on 23 degrees of freedom. We use this to test the overall fit of the model by once again treating this as a chi square value. A chi square of 26.7 on 23 degrees of freedom yields a p-value of 0.269. The null hypothesis (i.e., the model) is not rejected. The fitted values are not significantly different from the observed values. Logistic Regression: Multiple Numerical Predictors During the Fall semester of 2005, two students in our program--Rachel Mullet and Lauren Garafola--did a senior research project in which they studied a phenomenon called Inattentional Blindness (IB). IB refers to situations in which a person fails to see an obvious stimulus right in front of his eyes. (For details see this website.) In their study, Rachel and Lauren had subjects view an online video showing college students passing basketballs to each other, and the task was to count the number of times students in white shirts passed the basketball. During the video, a person in a black gorilla suit walked though the picture in a very obvious way. At the end of the video, subjects were asked if they saw the gorilla. Most did not! Rachel and Lauren hypothesized that IB could be predicted from performance on the Stroop Color Word test. This test produces three scores: "W" (word alone, i.e., a score derived from reading a list of color words such as red, green, black), "C" (color alone, in which a score is derived from naming the color in which a series of Xs are printed), and "CW" (the Stroop task, in which a score is derived from the subject's attempt to name the color in which a color word is printed when the word and the color do not agree). The data are in the following table, in which the response, "seen", is coded as 0=no and 1=yes. ```gorilla = read.table(header=T, text=" seen W C CW 0 126 86 64 0 118 76 54 0 61 66 44 0 69 48 32 0 57 59 42 0 78 64 53 0 114 61 41 0 81 85 47 0 73 57 33 0 93 50 45 0 116 92 49 0 156 70 45 0 90 66 48 0 120 73 49 0 99 68 44 0 113 110 47 0 103 78 52 0 123 61 28 0 86 65 42 0 99 77 51 0 102 77 54 0 120 74 53 0 128 100 56 0 100 89 56 0 95 61 37 0 80 55 36 0 98 92 51 0 111 90 52 0 101 85 45 0 102 78 51 1 100 66 48 1 112 78 55 1 82 84 37 1 72 63 46 1 72 65 47 1 89 71 49 1 108 46 29 1 88 70 49 1 116 83 67 1 100 69 39 1 99 70 43 1 93 63 36 1 100 93 62 1 110 76 56 1 100 83 36 1 106 71 49 1 115 112 66 1 120 87 54 1 97 82 41 ")``` We might begin like this. ```> cor(gorilla) # a correlation matrix seen W C CW seen 1.00000000 -0.03922667 0.05437115 0.06300865 W -0.03922667 1.00000000 0.43044418 0.35943580 C 0.05437115 0.43044418 1.00000000 0.64463361 CW 0.06300865 0.35943580 0.64463361 1.00000000``` Or like this. ```> with(gorilla, tapply(W, seen, mean)) 0 1 100.40000 98.89474 > with(gorilla, tapply(C, seen, mean)) 0 1 73.76667 75.36842 > with(gorilla, tapply(CW, seen, mean)) 0 1 46.70000 47.84211``` The Stroop scale scores are moderately positively correlated with each other, but none of them appears to be related to the "seen" response variable, at least not to any impressive extent. There doesn't appear to be much here to look at. Let's have a go at it anyway. Since the response is a binomial variable, a logistic regression can be done as follows. ```> glm.out = glm(seen ~ W * C * CW, family=binomial(logit), data=gorilla) > summary(glm.out) Call: glm(formula = seen ~ W * C * CW, family = binomial(logit), data = gorilla) Deviance Residuals: Min 1Q Median 3Q Max -1.8073 -0.9897 -0.5740 1.2368 1.7362 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.323e+02 8.037e+01 -1.646 0.0998 . W 1.316e+00 7.514e-01 1.751 0.0799 . C 2.129e+00 1.215e+00 1.753 0.0797 . CW 2.206e+00 1.659e+00 1.329 0.1837 W:C -2.128e-02 1.140e-02 -1.866 0.0621 . W:CW -2.201e-02 1.530e-02 -1.439 0.1502 C:CW -3.582e-02 2.413e-02 -1.485 0.1376 W:C:CW 3.579e-04 2.225e-04 1.608 0.1078 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 65.438 on 48 degrees of freedom Residual deviance: 57.281 on 41 degrees of freedom AIC: 73.281 Number of Fisher Scoring iterations: 5 > anova(glm.out, test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: seen Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 48 65.438 W 1 0.075 47 65.362 0.784 C 1 0.310 46 65.052 0.578 CW 1 0.106 45 64.946 0.745 W:C 1 2.363 44 62.583 0.124 W:CW 1 0.568 43 62.015 0.451 C:CW 1 1.429 42 60.586 0.232 W:C:CW 1 3.305 41 57.281 0.069``` Two different extractor functions have been used to see the results of our analysis. What do they mean? The first gives us what amount to regression coefficients with standard errors and a z-test, as we saw in the single variable example above. None of the coefficients are significantly different from zero (but a few are close). The deviance was reduced by 8.157 points on 7 degrees of freedom, for a p-value of... ```> 1 - pchisq(8.157, df=7) [1] 0.3189537``` Overall, the model appears to have performed poorly, showing no significant reduction in deviance (no significant difference from the null model). The second print out shows the same overall reduction in deviance, from 65.438 to 57.281 on 7 degrees of freedom. In this print out, however, the reduction in deviance is shown for each term, added sequentially first to last. Of note is the three-way interaction term, which produced a nearly significant reduction in deviance of 3.305 on 1 degree of freedom (p = 0.069). In the event you are encouraged by any of this, the following graph might be revealing. ```> plot(glm.out\$fitted) > abline(v=30.5,col="red") > abline(h=.3,col="green") > abline(h=.5,col="green") > text(15,.9,"seen = 0") > text(40,.9,"seen = 1")``` I'll leave it up to you to interpret this on your own time. (Also, try running the anova() tests on a model in which CW has been entered first.) Logistic Regression: Categorical Predictors Let's re-examine the "UCBAdmissions" data set, which we looked at in a previous tutorial. ```> ftable(UCBAdmissions, col.vars="Admit") Admit Admitted Rejected Gender Dept Male A 512 313 B 353 207 C 120 205 D 138 279 E 53 138 F 22 351 Female A 89 19 B 17 8 C 202 391 D 131 244 E 94 299 F 24 317``` The data are from 1973 and show admissions by gender to the top six grad programs at the University of California, Berkeley. Looked at as a two-way table, there appears to be a bias against admitting women. ```> dimnames(UCBAdmissions) \$Admit [1] "Admitted" "Rejected" \$Gender [1] "Male" "Female" \$Dept [1] "A" "B" "C" "D" "E" "F" > margin.table(UCBAdmissions, c(2,1)) Admit Gender Admitted Rejected Male 1198 1493 Female 557 1278``` However, there are also relationships between "Gender" and "Dept" as well as between "Dept" and "Admit", which means the above relationship may be confounded by "Dept" (or "Dept" might be a lurking variable, in the language of traditional regression analysis). Perhaps a logistic regression with the binomial variable "Admit" as the response can tease these variables apart. If there is a way to conveniently get that flat table into a data frame (without splitting an infinitive), I don't know it. So I had to do this. (I'll leave off the command prompts so you can copy and paste it.) ```### begin copying here ucb.df = data.frame(gender=rep(c("Male","Female"),c(6,6)), dept=rep(LETTERS[1:6],2), yes=c(512,353,120,138,53,22,89,17,202,131,94,24), no=c(313,207,205,279,138,351,19,8,391,244,299,317)) ### end copying here and paste into the R Console > ucb.df gender dept yes no 1 Male A 512 313 2 Male B 353 207 3 Male C 120 205 4 Male D 138 279 5 Male E 53 138 6 Male F 22 351 7 Female A 89 19 8 Female B 17 8 9 Female C 202 391 10 Female D 131 244 11 Female E 94 299 12 Female F 24 317``` Once again, we do not have a binary coded response variable, so the last two columns of this data frame will have to be bound into the columns of a table (matrix) to serve as the response in the model formula. ```> mod.form = "cbind(yes,no) ~ gender * dept" # mind the quotes here! > glm.out = glm(mod.form, family=binomial(logit), data=ucb.df)``` I used a trick here of storing the model formula in a data object, and then entering the name of this object into the glm() function. That way, if I made a mistake in the model formula (or want to run an alternative model), I have only to edit the "mod.form" object to do it. Let's see what we have found. ```> options(show.signif.stars=F) # turn off significance stars (optional) > anova(glm.out, test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: cbind(yes, no) Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 11 877.06 gender 1 93.45 10 783.61 4.167e-22 dept 5 763.40 5 20.20 9.547e-163 gender:dept 5 20.20 0 -2.265e-14 1.144e-03``` This is a saturated model, meaning we have used up all our degrees of freedom, and there is no residual deviance left over at the end. Saturated models always fit the data perfectly. In this case, it appears the saturated model is required to explain the data adequately. If we leave off the interaction term, for example, we will be left with a residual deviance of 20.2 on 5 degrees of freedom, and the model will be rejected (p = .001144). It appears all three terms are making a significant contribution to the model. How they are contributing appears if we use the other extractor. ```> summary(glm.out) Call: glm(formula = mod.form, family = binomial(logit), data = ucb.df) Deviance Residuals: [1] 0 0 0 0 0 0 0 0 0 0 0 0 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.5442 0.2527 6.110 9.94e-10 genderMale -1.0521 0.2627 -4.005 6.21e-05 deptB -0.7904 0.4977 -1.588 0.11224 deptC -2.2046 0.2672 -8.252 < 2e-16 deptD -2.1662 0.2750 -7.878 3.32e-15 deptE -2.7013 0.2790 -9.682 < 2e-16 deptF -4.1250 0.3297 -12.512 < 2e-16 genderMale:deptB 0.8321 0.5104 1.630 0.10306 genderMale:deptC 1.1770 0.2996 3.929 8.53e-05 genderMale:deptD 0.9701 0.3026 3.206 0.00135 genderMale:deptE 1.2523 0.3303 3.791 0.00015 genderMale:deptF 0.8632 0.4027 2.144 0.03206 (Dispersion parameter for binomial family taken to be 1) Null deviance: 8.7706e+02 on 11 degrees of freedom Residual deviance: -2.2649e-14 on 0 degrees of freedom AIC: 92.94 Number of Fisher Scoring iterations: 3``` These are the regression coefficients for each predictor in the model, with the base level of each factor being suppressed. Remember, we are predicting log odds, so to make sense of these coefficients, they need to be "antilogged". ```> exp(-1.0521) # antilog of the genderMale coefficient [1] 0.3492037 > 1/exp(-1.0521) [1] 2.863658``` This shows that men were actually at a significant disadvantage when department and the interaction are controlled. The odds of a male being admitted were only 0.35 times the odds of a female being admitted. The reciprocal of this turns it on its head. All else being equal, the odds of female being admitted were 2.86 times the odds of a male being admitted. (Any excitement we may be experiencing over these "main effects" should be tempered by the fact that we have a significant interaction that needs interpreting.) Each coefficient compares the corresponding predictor to the base level. So... ```> exp(-2.2046) [1] 0.1102946``` ...the odds of being admitted to department C were only about 1/9th the odds of being admitted to department A, all else being equal. If you want to compare, for example, department C to department D, do this. ```> exp(-2.2046) / exp(-2.1662) # C:A / D:A leaves C:D [1] 0.962328``` All else equal, the odds of being admitted to department C were 0.96 times the odds of being admitted to department D. To be honest, I'm not sure I'm comfortable with the interaction in this model. It seems to me the last few calculations we've been doing would make more sense without it. You might want to examine the interaction, and if you think it doesn't merit inclusion, run the model again without it. Try doing that entering "dept" first. ```> mod.form="cbind(yes,no) ~ dept + gender" > glm.out=glm(mod.form, family=binomial(logit), data=ucb.df) > anova(glm.out, test="Chisq")``` Statistics are nice, but in the end it's what makes sense that should rule the day. revised 2016 February 19 | Table of Contents | Function Reference | Function Finder | R Project |