
LOGISTIC REGRESSION


Preliminaries

Model Formulae

You will need to know a bit about Model Formulae to understand this tutorial.

Odds, Odds Ratios, and Logit

When you go to the track, how do you know which horse to bet on? You look at the odds. In the program, you may see the odds for your horse, Sea Brisket, are 8 to 1, which are the odds AGAINST winning. This means in nine races Sea Brisket would be expected to win 1 and lose 8. In probability terms, Sea Brisket has a probability of winning of 1/9, or 0.111. But the odds of winning are 1/8, or 0.125. Odds are actually the ratio of two probabilities...

odds = p(one outcome) / p(the other outcome) = p(success) / p(failure) = p/q, where q = 1 - p
So for Sea Brisket, odds(winning) = (1/9)/(8/9) = 1/8. Notice that odds have these properties:

  • If p(success) = p(failure), then odds(success) = 1 (or 1 to 1, or 1:1).
  • If p(success) < p(failure), then odds(success) < 1.
  • If p(success) > p(failure), then odds(success) > 1.
  • Unlike probability, which cannot exceed 1, there is no upper bound on odds.
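A quick way to see these properties is to turn some probabilities into odds in R (just a sketch; the probabilities are made up for illustration)...
> p = c(1/9, 0.5, 0.9)                 # some probabilities of success
> p / (1 - p)                          # the corresponding odds, p/q
[1] 0.125 1.000 9.000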
[Figure: the logistic curve]

The natural log of odds is called the logit, or logit transformation, of p: logit(p) = log(p/q), where log is the natural (base e) logarithm. Logit is sometimes called "log odds." Because of the properties of odds given in the list above, the logit has these properties:

  • If odds(success) = 1, then logit(p) = 0.
  • If odds(success) < 1, then logit(p) < 0.
  • If odds(success) > 1, then logit(p) > 0.
  • The logit transform is undefined if p = 0 or p = 1 (that is, when the odds are 0 or infinite).
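If you'd like to play with the logit yourself, a couple of throwaway helper functions will do it (a sketch; glm( ), used below, handles this transformation internally, so these are only for exploration)...
> logit = function(p) log(p / (1 - p))           # log odds of p
> inv.logit = function(x) exp(x) / (1 + exp(x))  # back from log odds to p
> logit(c(0.1, 0.5, 0.9))
[1] -2.197225  0.000000  2.197225
> inv.logit(logit(0.25))
[1] 0.25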

Logistic regression is a method for fitting a regression curve, y = f(x), when y consists of proportions or probabilities, or binary coded (0,1--failure,success) data. When the response is a binary (dichotomous) variable, and x is numerical, logistic regression fits a logistic curve to the relationship between x and y. The logistic curve looks like the graph at right. It is an S-shaped or sigmoid curve, often used to model population growth, survival from a disease, the spread of a disease (or something disease-like, such as a rumor), and so on. The logistic function is...

y = [exp(b0 + b1x)] / [1 + exp(b0 + b1x)]

Logistic regression fits b0 and b1, the regression coefficients (which were 0 and 1, respectively, for the graph above). It should have already struck you that this curve is not linear. However, the point of the logit transform is to make it linear...

logit(y) = b0 + b1x

Hence, logistic regression is linear regression on the logit transform of y, where y is the proportion (or probability) of success at each value of x. However, you should avoid the temptation to do a traditional least-squares regression at this point, as neither the normality nor the homoscedasticity assumption will be met.
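If you want to draw the S-shaped curve for yourself, something like the following sketch will do it, using b0 = 0 and b1 = 1, the values mentioned above for the graph...
> b0 = 0; b1 = 1                                  # coefficients used for the graph
> x = seq(-6, 6, by=0.1)
> y = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))       # the logistic function
> plot(x, y, type="l", main="A Logistic Curve")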

Odds ratio might best be illustrated by returning to our horse race. Suppose in the same race Seattle Stew is given odds of 2 to 1, which is to say, two expected losses for each expected win. Seattle Stew's odds of winning are 1/2, or 0.5. How much better is this than the winning odds for Sea Brisket? The odds ratio tells us: 0.5 / 0.125 = 4.0. The odds of Seattle Stew winning are four times the odds of Sea Brisket winning. Be careful not to say "times as likely to win," which would not be correct. The probability (likelihood, chance) of Seattle Stew winning is 1/3 and for Sea Brisket is 1/9, resulting in a likelihood ratio of 3.0. Seattle Stew is three times as likely to win as Sea Brisket.
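The same arithmetic can be done in R, if only to keep the two ratios straight (the horses are fictional, of course)...
> odds.brisket = (1/9) / (8/9)         # Sea Brisket's odds of winning
> odds.stew    = (1/3) / (2/3)         # Seattle Stew's odds of winning
> odds.stew / odds.brisket             # odds ratio
[1] 4
> (1/3) / (1/9)                        # ratio of probabilities
[1] 3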


Logistic Regression: One Numerical Predictor

In the "MASS" library there is a data set called "menarche" (Milicer, H. and Szczotka, F., 1966, Age at Menarche in Warsaw girls in 1965, Human Biology, 38, 199-203), in which there are three variables: "Age" (average age of age homogeneous groups of girls), "Total" (number of girls in each group), and "Menarche" (number of girls in the group who have reached menarche)... Menarche Graph

> library("MASS")
> data(menarche)
> str(menarche)
'data.frame':   25 obs. of  3 variables:
 $ Age     : num   9.21 10.21 10.58 10.83 11.08 ...
 $ Total   : num  376 200 93 120 90 88 105 111 100 93 ...
 $ Menarche: num  0 0 0 2 2 5 10 17 16 29 ...
> summary(menarche)
      Age            Total           Menarche      
 Min.   : 9.21   Min.   :  88.0   Min.   :   0.00  
 1st Qu.:11.58   1st Qu.:  98.0   1st Qu.:  10.00  
 Median :13.08   Median : 105.0   Median :  51.00  
 Mean   :13.10   Mean   : 156.7   Mean   :  92.32  
 3rd Qu.:14.58   3rd Qu.: 117.0   3rd Qu.:  92.00  
 Max.   :17.58   Max.   :1049.0   Max.   :1049.00
> plot(Menarche/Total ~ Age, data=menarche)
From the graph at right, it appears a logistic fit is called for here. The fit would be done this way...
> glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
+               family=binomial(logit), data=menarche)
Numerous explanations are in order! First, glm( ) is the function used to fit generalized linear models, and it will be explained more completely in another tutorial. With "family=" set to "binomial" and a "logit" link, glm( ) produces a logistic regression. Because we are using glm( ) with binomial errors in the response variable, the ordinary assumptions of least squares linear regression (normality and homoscedasticity) don't apply. Second, our data frame does not contain a row for every case (i.e., every girl upon whom data were collected). Therefore, we do not have a binary (0,1) coded response variable. No problem! If we feed glm( ) a table (or matrix) in which the first column is the number of successes and the second column is the number of failures, R will take care of the coding for us. In the above analysis, we made that table on the fly inside the model formula by binding "Menarche" and "Total - Menarche" into the columns of a table using cbind( ).
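(As an aside, R will also accept the response as a proportion of successes if you supply the group sizes via "weights=". The following call is a sketch of that equivalent formulation; it should fit the same model as the cbind( ) version above.)
> glm.alt = glm(Menarche/Total ~ Age, weights=Total,
+               family=binomial(logit), data=menarche)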

Let's look at how closely the fitted values from our logistic regression match the observed values...

> plot(Menarche/Total ~ Age, data=menarche)
> lines(menarche$Age, glm.out$fitted, type="l", col="red")
> title(main="Menarche Data with Fitted Logistic Regression Line")
[Figure: Menarche data with fitted logistic regression curve]
I don't know about you, but I'm impressed! The numerical results are extracted like this...
> summary(glm.out)

Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit), 
    data = menarche)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0363  -0.9953  -0.4900   0.7780   1.3675  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -21.22639    0.77068  -27.54   <2e-16 ***
Age           1.63197    0.05895   27.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3693.884  on 24  degrees of freedom
Residual deviance:   26.703  on 23  degrees of freedom
AIC: 114.76

Number of Fisher Scoring iterations: 4
The following requests also produce useful results: glm.out$coef, glm.out$fitted, glm.out$resid, glm.out$effects, and anova(glm.out).

Recall that the fitted model is linear in the log odds, so the coefficient of "Age" can be interpreted as "for every one year increase in age, the odds of having reached menarche are multiplied by exp(1.632) = 5.11."
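That antilog can be pulled straight from the fitted model object instead of being typed in by hand...
> exp(coef(glm.out)["Age"])            # should come out to roughly 5.11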

To evaluate the overall performance of the model, look at the null deviance and residual deviance near the bottom of the print out. Null deviance shows how well the response is predicted by a model with nothing but an intercept (grand mean). This is essentially a chi square value on 24 degrees of freedom, and indicates very little fit (a highly significant difference between fitted values and observed values). Adding in our predictors--just "Age" in this case--decreased the deviance by 3667 points on 1 degree of freedom. Again, this is interpreted as a chi square value and indicates a highly significant decrease in deviance. The residual deviance is 26.7 on 23 degrees of freedom. We use this to test the overall fit of the model by once again treating this as a chi square value. A chi square of 26.7 on 23 degrees of freedom yields a p-value of 0.269. The null hypothesis (i.e., the model) is not rejected. The fitted values are not significantly different from the observed values.
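Those chi square calculations are easy to verify for yourself; the p-values quoted above come from statements like these...
> 1 - pchisq(26.703, df=23)                   # overall fit: p = 0.269, model not rejected
> 1 - pchisq(3693.884 - 26.703, df=24 - 23)   # effect of Age: p is essentially 0
> anova(glm.out, test="Chisq")                # the same tests in one table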


Logistic Regression: Multiple Numerical Predictors

During the Fall semester of 2005, two students in our program--Rachel Mullet and Lauren Garafola--did a senior research project in which they studied a phenomenon called Inattentional Blindness (IB). IB refers to situations in which a person fails to see an obvious stimulus right in front of his eyes. (For details see this website.) In their study, Rachel and Lauren had subjects view an online video showing college students passing basketballs to each other, and the task was to count the number of times students in white shirts passed the basketball. During the video, a person in a black gorilla suit walked through the picture in a very obvious way. At the end of the video, subjects were asked if they saw the gorilla. Most did not!

Rachel and Lauren hypothesized that IB could be predicted from performance on the Stroop Color Word test. This test produces three scores: "W" (word alone, i.e., a score derived from reading a list of color words such as red, green, black), "C" (color alone, in which a score is derived from naming the color in which a series of Xs are printed), and "CW" (the Stroop task, in which a score is derived from the subject's attempt to name the color in which a color word is printed when the word and the color do not agree). The data are in the following table, in which the response, "seen", is coded as 0=no and 1=yes...

   seen   W   C CW
1     0 126  86 64
2     0 118  76 54
3     0  61  66 44
4     0  69  48 32
5     0  57  59 42
6     0  78  64 53
7     0 114  61 41
8     0  81  85 47
9     0  73  57 33
10    0  93  50 45
11    0 116  92 49
12    0 156  70 45
13    0  90  66 48
14    0 120  73 49
15    0  99  68 44
16    0 113 110 47
17    0 103  78 52
18    0 123  61 28
19    0  86  65 42
20    0  99  77 51
21    0 102  77 54
22    0 120  74 53
23    0 128 100 56
24    0 100  89 56
25    0  95  61 37
26    0  80  55 36
27    0  98  92 51
28    0 111  90 52
29    0 101  85 45
30    0 102  78 51
31    1 100  66 48
32    1 112  78 55
33    1  82  84 37
34    1  72  63 46
35    1  72  65 47
36    1  89  71 49
37    1 108  46 29
38    1  88  70 49
39    1 116  83 67
40    1 100  69 39
41    1  99  70 43
42    1  93  63 36
43    1 100  93 62
44    1 110  76 56
45    1 100  83 36
46    1 106  71 49
47    1 115 112 66
48    1 120  87 54
49    1  97  82 41
To get them into R, try this first...
> file = "http://ww2.coastal.edu/kingw/statistics/R-tutorials/text/gorilla.csv"
> read.csv(file) -> gorilla
> str(gorilla)
'data.frame':   49 obs. of  4 variables:
 $ seen: int  0 0 0 0 0 0 0 0 0 0 ...
 $ W   : int  126 118 61 69 57 78 114 81 73 93 ...
 $ C   : int  86 76 66 48 59 64 61 85 57 50 ...
 $ CW  : int  64 54 44 32 42 53 41 47 33 45 ...
If that doesn't work (and it should), try copying and pasting this script into R at the command prompt...
### Begin copying here.
gorilla = data.frame(rep(c(0,1),c(30,19)),
            c(126,118,61,69,57,78,114,81,73,93,116,156,90,120,99,113,103,123,
              86,99,102,120,128,100,95,80,98,111,101,102,100,112,82,72,72,
              89,108,88,116,100,99,93,100,110,100,106,115,120,97),
            c(86,76,66,48,59,64,61,85,57,50,92,70,66,73,68,110,78,61,65,
              77,77,74,100,89,61,55,92,90,85,78,66,78,84,63,65,71,46,70,
              83,69,70,63,93,76,83,71,112,87,82),
            c(64,54,44,32,42,53,41,47,33,45,49,45,48,49,44,47,52,28,42,51,54,
              53,56,56,37,36,51,52,45,51,48,55,37,46,47,49,29,49,67,39,43,36,
              62,56,36,49,66,54,41))
colnames(gorilla) = c("seen","W","C","CW")
str(gorilla)
### End copying here.
And if that doesn't work, well, you know what you have to do!

We might begin like this...

> cor(gorilla)                         ### a correlation matrix
            seen           W          C         CW
seen  1.00000000 -0.03922667 0.05437115 0.06300865
W    -0.03922667  1.00000000 0.43044418 0.35943580
C     0.05437115  0.43044418 1.00000000 0.64463361
CW    0.06300865  0.35943580 0.64463361 1.00000000
...or like this...
> with(gorilla, tapply(W, seen, mean))
        0         1 
100.40000  98.89474 
> with(gorilla, tapply(C, seen, mean))
       0        1 
73.76667 75.36842 
> with(gorilla, tapply(CW, seen, mean))
       0        1 
46.70000 47.84211
The Stroop scale scores are moderately positively correlated with each other, but none of them appears to be related to the "seen" response variable, at least not to any impressive extent. There doesn't appear to be much here to look at. Let's have a go at it anyway.

Since the response is a binomial variable, a logistic regression can be done as follows...

> glm.out = glm(seen ~ W * C * CW, family=binomial(logit), data=gorilla)
> summary(glm.out)

Call:
glm(formula = seen ~ W * C * CW, family = binomial(logit), data = gorilla)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8073  -0.9897  -0.5740   1.2368   1.7362  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -1.323e+02  8.037e+01  -1.646   0.0998 .
W            1.316e+00  7.514e-01   1.751   0.0799 .
C            2.129e+00  1.215e+00   1.753   0.0797 .
CW           2.206e+00  1.659e+00   1.329   0.1837  
W:C         -2.128e-02  1.140e-02  -1.866   0.0621 .
W:CW        -2.201e-02  1.530e-02  -1.439   0.1502  
C:CW        -3.582e-02  2.413e-02  -1.485   0.1376  
W:C:CW       3.579e-04  2.225e-04   1.608   0.1078  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 65.438  on 48  degrees of freedom
Residual deviance: 57.281  on 41  degrees of freedom
AIC: 73.281

Number of Fisher Scoring iterations: 5

> anova(glm.out, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: seen

Terms added sequentially (first to last)


       Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                      48     65.438          
W       1    0.075        47     65.362     0.784
C       1    0.310        46     65.052     0.578
CW      1    0.106        45     64.946     0.745
W:C     1    2.363        44     62.583     0.124
W:CW    1    0.568        43     62.015     0.451
C:CW    1    1.429        42     60.586     0.232
W:C:CW  1    3.305        41     57.281     0.069
Two different extractor functions have been used to see the results of our analysis. What do they mean?

The first gives us what amount to regression coefficients with standard errors and a z-test, as we saw in the single variable example above. None of the coefficients are significantly different from zero (but a few are close). The deviance was reduced by 8.157 points on 7 degrees of freedom, for a p-value of...

> 1 - pchisq(8.157, df=7)
[1] 0.3189537
Overall, the model appears to have performed poorly, showing no significant reduction in deviance (no significant difference from the null model).

[Figure: fitted values from the gorilla model, plotted case by case]

The second print out shows the same overall reduction in deviance, from 65.438 to 57.281 on 7 degrees of freedom. In this print out, however, the reduction in deviance is shown for each term, added sequentially first to last. Of note is the three-way interaction term, which produced a nearly significant reduction in deviance of 3.305 on 1 degree of freedom (p = 0.069).

In the event you are encouraged by any of this, the following graph might be revealing...

> plot(glm.out$fitted)
> abline(v=30.5,col="red")
> abline(h=.3,col="green")
> abline(h=.5,col="green")
> text(15,.9,"seen = 0")
> text(40,.9,"seen = 1")

I'll leave it up to you to interpret this on your own time.
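If you'd like a head start, one thing to try (just a sketch of the idea) is to convert the fitted probabilities into predicted categories at a cutoff and cross-tabulate them against the observed responses; the 0.5 and 0.3 cutoffs correspond to the green lines in the graph...
> table(observed=gorilla$seen, predicted=as.numeric(glm.out$fitted > 0.5))
> table(observed=gorilla$seen, predicted=as.numeric(glm.out$fitted > 0.3))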


Logistic Regression: Categorical Predictors

Let's re-examine the "UCBAdmissions" data set, which we looked at in a previous tutorial...

> ftable(UCBAdmissions, col.vars="Admit")
            Admit Admitted Rejected
Gender Dept                        
Male   A               512      313
       B               353      207
       C               120      205
       D               138      279
       E                53      138
       F                22      351
Female A                89       19
       B                17        8
       C               202      391
       D               131      244
       E                94      299
       F                24      317
The data are from 1973 and show admissions by gender to the six largest graduate departments at the University of California, Berkeley. Looked at as a two-way table, there appears to be a bias against admitting women...
> dimnames(UCBAdmissions)
$Admit
[1] "Admitted" "Rejected"

$Gender
[1] "Male"   "Female"

$Dept
[1] "A" "B" "C" "D" "E" "F"

> margin.table(UCBAdmissions, c(2,1))
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278
However, there are also relationships between "Gender" and "Dept" as well as between "Dept" and "Admit", which means the above relationship may be confounded by "Dept" (or "Dept" might be a lurking variable, in the language of traditional regression analysis). Perhaps a logistic regression with the binomial variable "Admit" as the response can tease these variables apart.
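Those two subsidiary relationships are easy to confirm from the three-way table itself...
> margin.table(UCBAdmissions, c(3,2))          # Dept by Gender: who applied where
> margin.table(UCBAdmissions, c(3,1))          # Dept by Admit: how selective each dept was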

If there is a way to conveniently get that flat table into a data frame (without splitting an infinitive), I don't know it. So I had to do this...

> ucb.df = data.frame(gender=rep(c("Male","Female"),c(6,6)),
+                     dept=rep(LETTERS[1:6],2),
+                     yes=c(512,353,120,138,53,22,89,17,202,131,94,24),
+                     no=c(313,207,205,279,138,351,19,8,391,244,299,317))
> ucb.df
   gender dept yes  no
1    Male    A 512 313
2    Male    B 353 207
3    Male    C 120 205
4    Male    D 138 279
5    Male    E  53 138
6    Male    F  22 351
7  Female    A  89  19
8  Female    B  17   8
9  Female    C 202 391
10 Female    D 131 244
11 Female    E  94 299
12 Female    F  24 317
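(For the record, one possible shortcut is sketched below, using base R's as.data.frame( ) and reshape( ); I haven't polished it, so check the resulting column names and row order against the hand-built version above.)
> ucb.wide = reshape(as.data.frame(UCBAdmissions), idvar=c("Gender","Dept"),
+                    timevar="Admit", direction="wide")
> colnames(ucb.wide) = c("gender","dept","yes","no")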
Once again, we do not have a binary coded response variable, so the last two columns of this data frame will have to be bound into the columns of a table to serve as the response in the model formula...
> mod.form = "cbind(yes,no) ~ gender * dept"
> glm.out = glm(mod.form, family=binomial(logit), data=ucb.df)
I used a trick here of storing the model formula in a data object, and then entering the name of this object into the glm( ) function. That way, if I made a mistake in the model formula (or want to run an alternative model), I have only to edit the "mod.form" object to do it.

Let's see what we have found...

> options(show.signif.stars=F)         # turn off significance stars (optional)
> anova(glm.out, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: cbind(yes, no)

Terms added sequentially (first to last)


            Df Deviance Resid. Df  Resid. Dev   P(>|Chi|)
NULL                           11      877.06            
gender       1    93.45        10      783.61   4.167e-22
dept         5   763.40         5       20.20  9.547e-163
gender:dept  5    20.20         0  -2.265e-14   1.144e-03
This is a saturated model, meaning we have used up all our degrees of freedom, and there is no residual deviance left over at the end. Saturated models always fit the data perfectly. In this case, it appears the saturated model is required to explain the data adequately. If we leave off the interaction term, for example, we will be left with a residual deviance of 20.2 on 5 degrees of freedom, and the model will be rejected (p = .001144). It appears all three terms are making a significant contribution to the model.
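That rejection can be confirmed the same way we tested fit before...
> 1 - pchisq(20.2, df=5)               # p = .001144, so the no-interaction model is rejected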

How they contribute can be seen by using the other extractor...

> summary(glm.out)

Call:
glm(formula = mod.form, family = binomial(logit), data = ucb.df)

Deviance Residuals: 
 [1]  0  0  0  0  0  0  0  0  0  0  0  0

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)        1.5442     0.2527   6.110 9.94e-10
genderMale        -1.0521     0.2627  -4.005 6.21e-05
deptB             -0.7904     0.4977  -1.588  0.11224
deptC             -2.2046     0.2672  -8.252  < 2e-16
deptD             -2.1662     0.2750  -7.878 3.32e-15
deptE             -2.7013     0.2790  -9.682  < 2e-16
deptF             -4.1250     0.3297 -12.512  < 2e-16
genderMale:deptB   0.8321     0.5104   1.630  0.10306
genderMale:deptC   1.1770     0.2996   3.929 8.53e-05
genderMale:deptD   0.9701     0.3026   3.206  0.00135
genderMale:deptE   1.2523     0.3303   3.791  0.00015
genderMale:deptF   0.8632     0.4027   2.144  0.03206

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  8.7706e+02  on 11  degrees of freedom
Residual deviance: -2.2649e-14  on  0  degrees of freedom
AIC: 92.94

Number of Fisher Scoring iterations: 3
These are the regression coefficients for each predictor in the model, with the base level of each factor (female gender, department A) absorbed into the intercept. Remember, we are predicting log odds, so to make sense of these coefficients, they need to be "antilogged"...
> exp(-1.0521)                         # antilog of the genderMale coefficient
[1] 0.3492037
> 1/exp(-1.0521)
[1] 2.863658
This shows that men were actually at a significant disadvantage when department and the interaction are controlled. The odds of a male being admitted were only 0.35 times the odds of a female being admitted. The reciprocal of this turns it on its head. All else being equal, the odds of a female being admitted were 2.86 times the odds of a male being admitted.

Each coefficient compares the corresponding predictor to the base level. So...

> exp(-2.2046)
[1] 0.1102946
...the odds of being admitted to department C were only about 1/9th the odds of being admitted to department A, all else being equal. If you want to compare, for example, department C to department D, do this...
> exp(-2.2046) / exp(-2.1662)          # C:A / D:A leaves C:D
[1] 0.962328
All else equal, the odds of being admitted to department C were 0.96 times the odds of being admitted to department D. (To be honest, I'm not sure I'm comfortable with the interaction in this model. You might want to examine the interaction, and if you think it doesn't merit inclusion, run the model again without it. Statistics are nice, but in the end it's what makes sense that should rule the day.)
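Should you decide the interaction doesn't belong, refitting is a one-line change to "mod.form" (a sketch; I'll leave the output, and what it says about gender, for you to examine)...
> mod.form = "cbind(yes,no) ~ gender + dept"   # additive model, interaction dropped
> glm.out2 = glm(mod.form, family=binomial(logit), data=ucb.df)
> summary(glm.out2)
> exp(coef(glm.out2)["genderMale"])            # gender odds ratio with dept controlled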

