| Table of Contents | Function Reference | Function Finder | R Project | CHI SQUARE TEST OF INDEPENDENCE Syntax From the help page, the syntax of the chisq.test() function is... ```chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)``` This function is used for both the goodness of fit test and the test of independence, and which test it does depends upon what kind of data you feed it. If "x" is a numeric vector or a one-dimensional table of numerical values, a goodness of fit test will be done (or attempted), treating "x" as a vector of observed frequencies. If "x" is a 2-D table, array, or matrix, then it is assumed to be a contingency table of frequencies, and a test of independence will be done. Ignore "y". (See below for when to use y.) The "correct=TRUE" option applies the Yates continuity correction when "x" is a 2x2 table. Set this to FALSE if the correction is not desired. For the goodness of fit test, set "p" equal to the null hypothesized proportions or probabilies for each of the categories represented in the vector "x". These must add exactly to 1. For the test of independence, the p= option is irrelevant, as the expected frequencies are calculated from the observed frequencies, x. Textbook Problems The following textbook-like problem uses data from Hand et al. (1994). ``` Senie et al. (1981) investigated the relationship between age and frequency of breast self-examination in a sample of women (Senie, R. T., Rosen, P. P., Lesser, M. L., and Kinne, D. W. Breast self- examinations and medical examination relating to breast cancer stage. American Journal of Public Health, 71, 583-590.) A summary of the results is presented in the following table: Frequency of breast self-examination ------------------------------------ Age Monthly Occasionally Never ------------------------------------------------- under 45 91 90 51 45 - 59 150 200 155 60 and over 109 198 172 ------------------------------------------------ From Hand et al., page 307, table 368.``` The data have already been tabled for us in most textbook problems. We just have to get the data into an R data object. There are several ways to do this. ```> row1 = c(91,90,51) # or col1 = c(91,150,109) > row2 = c(150,200,155) # and col2 = c(90,200,198) > row3 = c(109,198,172) # and col3 = c(51,155,172) > data.table = rbind(row1,row2,row3) # and data.table = cbind(col1,col2,col3) > data.table [,1] [,2] [,3] row1 91 90 51 row2 150 200 155 row3 109 198 172 > chisq.test(data.table) Pearson's Chi-squared test data: data.table X-squared = 25.086, df = 4, p-value = 4.835e-05``` If the data are available in an electronic document, like this one, it can be entered into R using the scan() function. ```> ls() [1] "data.table" "row1" "row2" "row3" > rm(list=ls()) # Clean up a bit first! > the.data = scan() 1: 91 90 51 4: 150 200 155 7: 109 198 172 10: Read 9 items > the.data # scan produces a vector [1] 91 90 51 150 200 155 109 198 172 > data.matrix = matrix(the.data, byrow=T, nrow=3) > data.matrix # which we convert to a matrix [,1] [,2] [,3] [1,] 91 90 51 [2,] 150 200 155 [3,] 109 198 172 > chisq.test(data.matrix)\$statistic # keeping the output brief X-squared 25.08597``` The data table can be copied and pasted one row at a time into the scan() function. The result will be a vector in which the data are entered row-wise. A matrix is then created using the matrix() function, which expects the data vector to contain column-wise data, i.e., 91, 150, 109, 90, ..., 172. This behavior can be changed with the "byrow=T" option. The "nrow=" option specifies how many rows to create in the matrix. If the problem asks for further data summarization, you will probably want to give names to the rows and columns of the data table first, since this will make R output easier to read and especially make barplots much easier to construct. ```> dimnames(data.matrix) = list(Age=c("lt.45",45.to.59","ge.60"), Error: unexpected symbol in "dimnames(data.matrix) = list(Age=c("lt.45",45.to.59"``` And here, in all fairness to those who would prefer a hunt-and-click interface, I suppose I should interpose a comment. This error drove me nuts for a good ten minutes or more. It even prompted me to go back and look at a previous tutorial on data entry. To use a command line interface, you do have to have an eye for minutiae. Do you see the mistake? It took me awhile to spot it! I left out an open quote. The tip off should have been that R stopped outputting the error message at the closed quote that had no matching open quote. Let's see if I can get it right this time. (The old days! I was a relative newbie then. After years of R-ing, I know what to look for now. I would have spotted that missing quote in a twinkling. That's my story and I'm sticking to it!) ```> dimnames(data.matrix) = list(Age=c("lt.45","45.to.59","ge.60"), + Freq=c("Monthly","Occasionally","Never")) > data.matrix Freq Age Monthly Occasionally Never lt.45 91 90 51 45.to.59 150 200 155 ge.60 109 198 172 > ### At last! > ### Examine marginal distributions... > addmargins(data.matrix) Freq Age Monthly Occasionally Never Sum lt.45 91 90 51 232 45.to.59 150 200 155 505 ge.60 109 198 172 479 Sum 350 488 378 1216 > ### Examine conditional distributions... > prop.table(data.matrix, 1) # relative to margin 1, the row margin Freq Age Monthly Occasionally Never lt.45 0.3922414 0.3879310 0.2198276 45.to.59 0.2970297 0.3960396 0.3069307 ge.60 0.2275574 0.4133612 0.3590814 > ### And finally a barplot... > barplot(data.matrix, beside=T) # output not shown > ### Okay, a better barplot... (you could type this into a script window) > barplot(t(data.matrix), beside=T, legend=T, ylim=c(0,250), # t() for transpose + ylab="Observed frequencies in sample", + main="Frequency of Breast Self-Examination by Age")``` In a future tutorial, we'll learn how to move the legend to a more appealing location on the graph. Data From a Table Object You may remember the following data object, which was used in a couple previous tutorials. ```> UCBAdmissions # output not shown > ftable(UCBAdmissions, row.vars=c("Admit")) # a more compact version Gender Male Female Dept A B C D E F A B C D E F Admit Admitted 512 353 120 138 53 22 89 17 202 131 94 24 Rejected 313 207 205 279 138 351 19 8 391 244 299 317 > round(prop.table(ftable(UCBAdmissions, row.vars=c("Admit")),2),2) Gender Male Female Dept A B C D E F A B C D E F Admit Admitted 0.62 0.63 0.37 0.33 0.28 0.06 0.82 0.68 0.34 0.35 0.24 0.07 Rejected 0.38 0.37 0.63 0.67 0.72 0.94 0.18 0.32 0.66 0.65 0.76 0.93``` And don't think for a moment it didn't take me a little trial and error to get the last version of the data table! Suppose we want a chi square test of Admit by Gender in Dept. B only. I'll begin by extracting just those data (although that is technically unnecessary). ```> deptB = UCBAdmissions[,,2] # all rows, all cols, but only of layer 2 > deptB Gender Admit Male Female Admitted 353 17 Rejected 207 8 > chisq.test(deptB) Pearson's Chi-squared test with Yates' continuity correction data: deptB X-squared = 0.0851, df = 1, p-value = 0.7705 > chisq.test(deptB)\$expected Gender Admit Male Female Admitted 354.1880 15.811966 Rejected 205.8120 9.188034``` No significant relationship is found between Admit and Gender in Dept. B, as might have been expected from examination of the data table. I also took a look at the expected frequencies, just because I could, although it's not so crucial to do so when the null hypothesis is retained. Notice that Yate's correction for continuity is applied by default when a 2x2 table is entered into the chi square procedure. This behavior can be turned off with the option "correct=F". ```> chisq.test(deptB, correct=F) Pearson's Chi-squared test data: deptB X-squared = 0.2537, df = 1, p-value = 0.6145``` Data From a Data Frame The data frame "survey" in the "MASS" package contains responses from 237 statistics students to a series of questions, some of which resulted in categorical answers. For one item, the students were asked to fold their arms, and the recorded result was which arm was on top: right, left, or neither. Let's see how this variable relates to the sex of the student. ```> data(survey, package="MASS") > attach(survey) > table(Sex, Fold) Fold Sex L on R Neither R on L Female 48 6 64 Male 50 12 56``` In this case, the two vectors of interest, "Sex" and "Fold", both contain categorical values on a case by case basis. You could table them first (as we just did), store this table in a data object, and enter the name of this data object into the chisq.test() function, but in cases like these, R allows a simpler syntax. ```> chisq.test(x=Sex, y=Fold) # simply chisq.test(Sex, Fold) will do Pearson's Chi-squared test data: Sex and Fold X-squared = 2.5741, df = 2, p-value = 0.2761``` As long as the two data vectors, X and Y, are categorical, and they are arranged such that Xi corresponds on a case by case basis to Yi, you need only to enter the names of the two data vectors into the chi square test function. R will do the implied crosstabulation for you. `> detach(survey) # Don't forget!` Alternatives When the EFs Are Small The following data are from a Stanford University study of the effectiveness of the antidepressant Celexa in the treatment of compulsive shopping. These data were found in Verzani (2005, p.262f), and the analysis here is similar to his. ``` outcome ----------------- treat worse same better --------------------------- Celexa 2 3 7 placebo 2 8 2 --------------------------- > freqs = c(2, 2, 3, 8, 7, 2) # entered "down the columns" this time > data.matrix = matrix(freqs, nrow=2) # fills down columns by default > dimnames(data.matrix) = list(treatment=c("Celexa","placebo"), + outcome=c("worse","same","better")) > data.matrix outcome treatment worse same better Celexa 2 3 7 placebo 2 8 2 > chisq.test(data.matrix) Pearson's Chi-squared test data: data.matrix X-squared = 5.0505, df = 2, p-value = 0.08004 Warning message: In chisq.test(data.matrix) : Chi-squared approximation may be incorrect > chisq.test(data.matrix)\$expected outcome treatment worse same better Celexa 2 5.5 4.5 placebo 2 5.5 4.5 Warning message: In chisq.test(data.matrix) : Chi-squared approximation may be incorrect``` The chi square procedure produces a warning message (somewhat cryptic) when any of the expected frequencies fall below 5. In this case, most of them do, which should lead us to be concerned about the accuracy of any p-value calculated from the chi-squared distribution. Since the p-value is close to .05, we might want to try a more accurate method. What can we do? There are two choices: a Fisher Exact Test, and a p-value calculated by Monte Carlo simulation. Let's look at the Fisher Exact Test first. You may be thinking, "Um, doesn't the Fisher Exact Test work only for 2x2 tables?" Yes, and when a 2x2 table is supplied, R will give you the exact p-value calculated from the hypergeometric distribution. However, in R, the Fisher Exact Test has been extended to work with larger tables, provided the obtained frequencies are not too large. You should see the help page for the details of how this has been done. ```> fisher.test(data.matrix) Fisher's Exact Test for Count Data data: data.matrix p-value = 0.07303 alternative hypothesis: two.sided``` The Fisher Exact Test suggests that the chi square procedure gave us a fairly accurate result in spite of the low expected frequencies. And don't get your feathers in a ruffle here, because although there is an "alternative=" option for this test, it works ONLY for 2x2 tables, in which case R calculates and tests the odds ratio for significance. You might want to investigate the Fisher Exact Test further on your own, as there is additional output when a 2x2 table is supplied. The other alternative is to simulate the sampling distribution of the test statistic (in this case, chi squared) using Monte Carlo methods. (To find out more about this, read the Resampling Techniques tutorial.) Fortunately, the chisq.test() function incorporates an option that automates this complex procedure for us. ```> chisq.test(data.matrix, simulate.p.value=T, B=999) Pearson's Chi-squared test with simulated p-value (based on 999 replicates) data: data.matrix X-squared = 5.0505, df = NA, p-value = 0.111``` The "simulate.p.value=T" option (default value is FALSE) does the Monte Carlo simulation using "B=999" (default value is B=2000) replicates. It doesn't look like we can fish up a signficant relationship here! revised 2016 January 27 | Table of Contents | Function Reference | Function Finder | R Project |