R Tutorials--Chi Square Test of Independence

CHI SQUARE TEST OF INDEPENDENCE

Syntax

From the help page, the syntax of the chisq.test() function is...

chisq.test(x, y = NULL, correct = TRUE,
           p = rep(1/length(x), length(x)), rescale.p = FALSE,
           simulate.p.value = FALSE, B = 2000)

This function is used for both the goodness of fit test and the test of independence, and which test it does depends upon what kind of data you feed it. If "x" is a numeric vector or a one-dimensional table of numerical values, a goodness of fit test will be done (or attempted), treating "x" as a vector of observed frequencies. If "x" is a 2-D table, array, or matrix, then it is assumed to be a contingency table of frequencies, and a test of independence will be done.

Ignore "y". (See below for when to use y.) The "correct=TRUE" option applies the Yates continuity correction when "x" is a 2x2 table. Set this to FALSE if the correction is not desired. For the goodness of fit test, set "p" equal to the null hypothesized proportions or probabilies for each of the categories represented in the vector "x". These must add exactly to 1. For the test of independence, the p= option is irrelevant, as the expected frequencies are calculated from the observed frequencies, x.

Textbook Problems

The following textbook-like problem uses data from Hand et al. (1994).

       Senie et al. (1981) investigated the relationship between age and
       frequency of breast self-examination in a sample of women (Senie,
       R. T., Rosen, P. P., Lesser, M. L., and Kinne, D. W. Breast self-
       examinations and medical examination relating to breast cancer
       stage. American Journal of Public Health, 71, 583-590.)
       A summary of the results is presented in the following table:

                    Frequency of breast self-examination
                    ------------------------------------
       Age          Monthly      Occasionally      Never
       -------------------------------------------------
       under 45         91            90             51
       45 - 59         150           200            155
       60 and over     109           198            172
       ------------------------------------------------
       From Hand et al., page 307, table 368.

The data have already been tabled for us in most textbook problems. We just have to get the data into an R data object. There are several ways to do this.

> row1 = c(91,90,51)                   # or col1 = c(91,150,109)
> row2 = c(150,200,155)                # and col2 = c(90,200,198)
> row3 = c(109,198,172)                # and col3 = c(51,155,172)
> data.table = rbind(row1,row2,row3)   # and data.table = cbind(col1,col2,col3)
> data.table
     [,1] [,2] [,3]
row1   91   90   51
row2  150  200  155
row3  109  198  172
> chisq.test(data.table)

        Pearson's Chi-squared test

data:  data.table 
X-squared = 25.086, df = 4, p-value = 4.835e-05

If the data are available in an electronic document, like this one, it can be entered into R using the scan() function.

> ls()
[1] "data.table" "row1"       "row2"       "row3"      
> rm(list=ls())                             # Clean up a bit first!
> the.data = scan()
1:   91            90             51
4:  150           200            155
7:  109           198            172
10: 
Read 9 items
> the.data                                  # scan produces a vector
[1]  91  90  51 150 200 155 109 198 172
> data.matrix = matrix(the.data, byrow=T, nrow=3)
> data.matrix                               # which we convert to a matrix
     [,1] [,2] [,3]
[1,]   91   90   51
[2,]  150  200  155
[3,]  109  198  172
> chisq.test(data.matrix)$statistic         # keeping the output brief
X-squared 
 25.08597

The data table can be copied and pasted one row at a time into the scan() function. The result will be a vector in which the data are entered row-wise. A matrix is then created using the matrix() function, which expects the data vector to contain column-wise data, i.e., 91, 150, 109, 90, ..., 172. This behavior can be changed with the "byrow=T" option. The "nrow=" option specifies how many rows to create in the matrix.

If the problem asks for further data summarization, you will probably want to give names to the rows and columns of the data table first, since this will make R output easier to read and especially make barplots much easier to construct.

> dimnames(data.matrix) = list(Age=c("lt.45",45.to.59","ge.60"),
Error: unexpected symbol in "dimnames(data.matrix) = list(Age=c("lt.45",45.to.59"

And here, in all fairness to those who would prefer a hunt-and-click interface, I suppose I should interpose a comment. This error drove me nuts for a good ten minutes or more. It even prompted me to go back and look at a previous tutorial on data entry. To use a command line interface, you do have to have an eye for minutiae. Do you see the mistake? It took me awhile to spot it! I left out an open quote. The tip off should have been that R stopped outputting the error message at the closed quote that had no matching open quote. Let's see if I can get it right this time. (The old days! I was a relative newbie then. After years of R-ing, I know what to look for now. I would have spotted that missing quote in a twinkling. That's my story and I'm sticking to it!)

> dimnames(data.matrix) = list(Age=c("lt.45","45.to.59","ge.60"),
+                              Freq=c("Monthly","Occasionally","Never"))
> data.matrix
          Freq
Age        Monthly Occasionally Never
  lt.45         91           90    51
  45.to.59     150          200   155
  ge.60        109          198   172
> ### At last!
> ### Examine marginal distributions...
> addmargins(data.matrix)
          Freq
Age        Monthly Occasionally Never  Sum
  lt.45         91           90    51  232
  45.to.59     150          200   155  505
  ge.60        109          198   172  479
  Sum          350          488   378 1216
> ### Examine conditional distributions...
> prop.table(data.matrix, 1)                # relative to margin 1, the row margin
          Freq
Age          Monthly Occasionally     Never
  lt.45    0.3922414    0.3879310 0.2198276
  45.to.59 0.2970297    0.3960396 0.3069307
  ge.60    0.2275574    0.4133612 0.3590814
> ### And finally a barplot...
> barplot(data.matrix, beside=T)           # output not shown
> ### Okay, a better barplot... (you could type this into a script window)
> barplot(t(data.matrix), beside=T, legend=T, ylim=c(0,250),         # t() for transpose
+         ylab="Observed frequencies in sample",
+         main="Frequency of Breast Self-Examination by Age")

In a future tutorial, we'll learn how to move the legend to a more appealing location on the graph.

Data From a Table Object

You may remember the following data object, which was used in a couple previous tutorials.

> UCBAdmissions                                  # output not shown
> ftable(UCBAdmissions, row.vars=c("Admit"))     # a more compact version
         Gender Male                     Female                    
         Dept      A   B   C   D   E   F      A   B   C   D   E   F
Admit                                                              
Admitted         512 353 120 138  53  22     89  17 202 131  94  24
Rejected         313 207 205 279 138 351     19   8 391 244 299 317
> round(prop.table(ftable(UCBAdmissions, row.vars=c("Admit")),2),2)
         Gender Male                          Female                         
         Dept      A    B    C    D    E    F      A    B    C    D    E    F
Admit                                                                        
Admitted        0.62 0.63 0.37 0.33 0.28 0.06   0.82 0.68 0.34 0.35 0.24 0.07
Rejected        0.38 0.37 0.63 0.67 0.72 0.94   0.18 0.32 0.66 0.65 0.76 0.93

And don't think for a moment it didn't take me a little trial and error to get the last version of the data table!

Suppose we want a chi square test of Admit by Gender in Dept. B only. I'll begin by extracting just those data (although that is technically unnecessary).

> deptB = UCBAdmissions[,,2]           # all rows, all cols, but only of layer 2
> deptB
          Gender
Admit      Male Female
  Admitted  353     17
  Rejected  207      8
> chisq.test(deptB)

        Pearson's Chi-squared test with Yates' continuity correction

data:  deptB 
X-squared = 0.0851, df = 1, p-value = 0.7705

> chisq.test(deptB)$expected
          Gender
Admit          Male    Female
  Admitted 354.1880 15.811966
  Rejected 205.8120  9.188034

No significant relationship is found between Admit and Gender in Dept. B, as might have been expected from examination of the data table. I also took a look at the expected frequencies, just because I could, although it's not so crucial to do so when the null hypothesis is retained. Notice that Yate's correction for continuity is applied by default when a 2x2 table is entered into the chi square procedure. This behavior can be turned off with the option "correct=F".

> chisq.test(deptB, correct=F)

        Pearson's Chi-squared test

data:  deptB 
X-squared = 0.2537, df = 1, p-value = 0.6145

Data From a Data Frame

The data frame "survey" in the "MASS" package contains responses from 237 statistics students to a series of questions, some of which resulted in categorical answers. For one item, the students were asked to fold their arms, and the recorded result was which arm was on top: right, left, or neither. Let's see how this variable relates to the sex of the student.

> data(survey, package="MASS")
> attach(survey)
> table(Sex, Fold)
        Fold
Sex      L on R Neither R on L
  Female     48       6     64
  Male       50      12     56

In this case, the two vectors of interest, "Sex" and "Fold", both contain categorical values on a case by case basis. You could table them first (as we just did), store this table in a data object, and enter the name of this data object into the chisq.test() function, but in cases like these, R allows a simpler syntax.

> chisq.test(x=Sex, y=Fold)            # simply chisq.test(Sex, Fold) will do

        Pearson's Chi-squared test

data:  Sex and Fold 
X-squared = 2.5741, df = 2, p-value = 0.2761

As long as the two data vectors, X and Y, are categorical, and they are arranged such that X_i corresponds on a case by case basis to Y_i, you need only to enter the names of the two data vectors into the chi square test function. R will do the implied crosstabulation for you.

> detach(survey)                       # Don't forget!

Alternatives When the EFs Are Small

The following data are from a Stanford University study of the effectiveness of the antidepressant Celexa in the treatment of compulsive shopping. These data were found in Verzani (2005, p.262f), and the analysis here is similar to his.

                     outcome
                 -----------------
       treat     worse same better
       ---------------------------
       Celexa      2     3     7
       placebo     2     8     2
       ---------------------------

> freqs = c(2, 2, 3, 8, 7, 2)               # entered "down the columns" this time
> data.matrix = matrix(freqs, nrow=2)       # fills down columns by default
> dimnames(data.matrix) = list(treatment=c("Celexa","placebo"),
+                              outcome=c("worse","same","better"))
> data.matrix
         outcome
treatment worse same better
  Celexa      2    3      7
  placebo     2    8      2

> chisq.test(data.matrix)

        Pearson's Chi-squared test

data:  data.matrix 
X-squared = 5.0505, df = 2, p-value = 0.08004

Warning message:
In chisq.test(data.matrix) : Chi-squared approximation may be incorrect

> chisq.test(data.matrix)$expected
         outcome
treatment worse same better
  Celexa      2  5.5    4.5
  placebo     2  5.5    4.5
Warning message:
In chisq.test(data.matrix) : Chi-squared approximation may be incorrect

The chi square procedure produces a warning message (somewhat cryptic) when any of the expected frequencies fall below 5. In this case, most of them do, which should lead us to be concerned about the accuracy of any p-value calculated from the chi-squared distribution. Since the p-value is close to .05, we might want to try a more accurate method. What can we do?

There are two choices: a Fisher Exact Test, and a p-value calculated by Monte Carlo simulation. Let's look at the Fisher Exact Test first.

You may be thinking, "Um, doesn't the Fisher Exact Test work only for 2x2 tables?" Yes, and when a 2x2 table is supplied, R will give you the exact p-value calculated from the hypergeometric distribution. However, in R, the Fisher Exact Test has been extended to work with larger tables, provided the obtained frequencies are not too large. You should see the help page for the details of how this has been done.

> fisher.test(data.matrix)

        Fisher's Exact Test for Count Data

data:  data.matrix 
p-value = 0.07303
alternative hypothesis: two.sided

The Fisher Exact Test suggests that the chi square procedure gave us a fairly accurate result in spite of the low expected frequencies. And don't get your feathers in a ruffle here, because although there is an "alternative=" option for this test, it works ONLY for 2x2 tables, in which case R calculates and tests the odds ratio for significance. You might want to investigate the Fisher Exact Test further on your own, as there is additional output when a 2x2 table is supplied.

The other alternative is to simulate the sampling distribution of the test statistic (in this case, chi squared) using Monte Carlo methods. (To find out more about this, read the Resampling Techniques tutorial.) Fortunately, the chisq.test() function incorporates an option that automates this complex procedure for us.

> chisq.test(data.matrix, simulate.p.value=T, B=999)

        Pearson's Chi-squared test with simulated p-value (based on 999
        replicates)

data:  data.matrix 
X-squared = 5.0505, df = NA, p-value = 0.111

The "simulate.p.value=T" option (default value is FALSE) does the Monte Carlo simulation using "B=999" (default value is B=2000) replicates. It doesn't look like we can fish up a signficant relationship here!

revised 2016 January 27