From the help page, the syntax of the chisq.test( ) function is...
chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)This function is used for both the goodness of fit test and the test of independence, and which it does depends upon what kind of data you feed it. If "x" is a numerical vector or a one-dimensional table of numerical values, a goodness of fit test will be done (or attempted), treating "x" as a vector of observed frequencies. If "x" is a 2-D table, array, or matrix, then it is assumed to be a contingency table of frequencies, and a test of independence will be done. Ignore "y". (See the help page if you must know.) The "correct=T" option applies the Yates continuity correction when "x" is a 2x2 table. Set this to FALSE if the correction is not desired. For the goodness of fit test, set "p" equal to the null hypothesized proportions or probabilies for each of the categories represented in the vector "x". In the examples, I'll also illustrate how this vector can be set to the expected frequencies. The default is equal frequencies or probabilities. The remaining options can be ignored.
"Professor Bumblefuss takes a random sample of students enrolled in Statistics 101 at ABC University. He finds the following: there are 25 freshman in the sample, 32 sophomores, 18 juniors, and 20 seniors. Test the null hypothesis that freshman, sophomores, juniors, and seniors are equally represented among students signed up for Stat 101." This is a goodness of fit test with equal expected frequencies. The "p"
vector does not need to be specified, since equal frequencies is the default...
> chisq.test(c(25,32,18,20)) Chi-squared test for given probabilities data: c(25, 32, 18, 20) X-squared = 4.9158, df = 3, p-value = 0.1781You could also have begun by assigning the observed frequencies to a vector, and then have used the vector name as "x"... > ofs <- c(25,32,18,20) > chisq.test(ofs) Chi-squared test for given probabilities data: ofs X-squared = 4.9158, df = 3, p-value = 0.1781Either way, the null hypothesis cannot be rejected at alpha = .05. "Professor Iconoclast argues that Bumblefuss is wrong, and that the number of freshman and sophomores enrolled is each twice the number of juniors and the number of seniors. Test Iconoclast's hypothesis from these same data." Now the expected frequencies are no longer equal, so a "p" vector must be
specified. A little algebra leads us to the expected probabilities of 1/3, 1/3,
1/6, and 1/6. Since these numbers are awkward to enter as decimal numbers, they are best entered as fractions, letting R do the arithmetic for us...
> null.probs <- c(1/3,1/3,1/6,1/6) > chisq.test(ofs, p=null.probs) Chi-squared test for given probabilities data: ofs X-squared = 2.8, df = 3, p-value = 0.4235A warning: since R does not expect the "p" vector to come second inside the chisq.test( ) function, it MUST be given its correct name, "p=". It doesn't appear that we can reject Iconoclast's null hypothesis either. The chi-square distribution is a continuous probability distribution, which
is being used here as an approximation to the discrete case. To be an accurate
approximation, the expected frequencies must be sufficiently large. R will
issue a warning message if any of the EFs fall below 5, but it's good practice
to check them nevertheless. This leads us to a device we will use often in
future tutorials--storing the output of an R procedure to a data object and
then extracting the information we want. Usually, the printed output of an R
procedure is only a small part of what R has calculated. Here's how to see
more...
> chisq.test(ofs, p=null.probs) -> results > results Chi-squared test for given probabilities data: ofs X-squared = 2.8, df = 3, p-value = 0.4235 > results$expected [1] 31.66667 31.66667 15.83333 15.83333 > results$residuals [1] -1.18469776 0.05923489 0.54451008 1.04713477The "results" object now contains a list of the stuff R has calculated. To see what is in the list, see the section headed "Value" on the help page for the chisq.test( ) function. The query "results$expected" showed the expected frequencies under the null hypothesis. The query "results$residuals" showed the unsquared chi values in each cell, allowing us to see where the biggest deviations are from frequencies the null predicted. One more example will demonstrate another trick for specifying expected
frequencies. "Last year, the same study was done on the Stat 101 class, and
frequencies observed were: 30 freshman, 28 sophomores, 28 juniors, and 11
seniors. Test to see if this year's results matched last years, to within the
limits of random sampling error." It's an awkward sample size, N = 97,
so a good way to handle the expected probabilities is like this...
> chisq.test(ofs, p=c(30,28,28,11)/97) Chi-squared test for given probabilities data: ofs X-squared = 12.5575, df = 3, p-value = 0.005698Looks like we're way off from what was observed last year.
The object entered into chisq.test( ) for "x=" must be a vector of
observed frequencies. If the data are in some other form, a vector must be
extracted somehow. I will use the built-in data set "HairEyeColor" to
demonstrate. This is a 3-D table that crosstabulates hair color, eye color, and
sex for 592 statistics students. Suppose we are interested only in eye color...
> dimnames(HairEyeColor) $Hair [1] "Black" "Brown" "Red" "Blond" $Eye [1] "Brown" "Blue" "Hazel" "Green" $Sex [1] "Male" "Female" > margin.table(HairEyeColor, 2) -> eyes > eyes Eye Brown Blue Hazel Green 220 215 93 64Suppose a certain physiologist proposes that 50% of people should have brown eyes, 25% blue eyes, 15% hazel eyes, and 10% green eyes. (I don't know why. I made these numbers up!) Does the sample at hand agree with the physiologist's hypothesis... > chisq.test(eyes, p=(.5, .25, .15, .1)) Error: unexpected ',' in "chisq.test(eyes, p=(.5," > > # Okay, you have to remember how to create a vector! Sheesh! > > chisq.test(eyes, p=c(.5, .25, .15, .1)) Chi-squared test for given probabilities data: eyes X-squared = 50.4324, df = 3, p-value = 6.462e-11The physiologist's hypothesis is not supported.
If a data frame contains categorical data in one variable, it can be
extracted into a 1-D frequency table using the table( ) function...
> data(survey,package="MASS") > str(survey) 'data.frame': 237 obs. of 12 variables: $ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ... $ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ... $ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ... $ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ... $ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ... $ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ... $ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ... $ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ... $ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ... $ Height: num 173 178 NA 160 165 ... $ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ... $ Age : num 18.2 17.6 16.9 20.3 23.7 ... > table(survey$Smoke) Heavy Never Occas Regul 11 189 19 17 > table(survey$Smoke)->smokers > smokers Heavy Never Occas Regul 11 189 19 17These are the responses of 237 statistics students at the University of Adelaide to a number of questions. The "Smoke" variable contains their responses to a question about smoking habits, and resulted in each student being classified into one of four categories: Never, Occasional, Regular, and Heavy. We have extracted these data and have placed them into a 1-D table of frequencies called "smokers". This table can now be subjected to any reasonable (or even unreasonable) goodness of fit test. For example, we can test the null hypothesis that 70% of statistics students are nonsmokers, and that the other 30% are divided equally among the remaining categories. The only thing we have to be careful of is that our null probabilities are given in the same order as the frequencies are given in the "smokers" table... > chisq.test(smokers, p=c(.1, .7, .1, .1)) Chi-squared test for given probabilities data: smokers X-squared = 12.8983, df = 3, p-value = 0.004862 > chisq.test(smokers, p=c(.1, .7, .1, .1))$resid Heavy Never Occas Regul -2.593669 1.851706 -0.946895 -1.358588The hypothesis is rejected. It appears in particular that we have very much overestimated the percentage of "Heavy" smokers. Two things. One, the last part of this example shows that we do not need to store the output of the procedure to request residuals. All we need to do is attach "$resid" on the end of the command. We could have done the same with "$expected" or any other element of the outputted results. Also demonstrated is the convenient fact that, if we are too lazy to type the entire word "residuals," we can use just enough of it that R is able to recognize what we are asking for. |