R Tutorials Top Banner

COUNTS AND PROPORTIONS


Binomial and Poisson Data

Count data--data derived from counting things--are often treated as if they are assumed to be binomial distributed or Poisson distributed. The binomial model applies when the counts are derived from independent Bernoulli trials in which the probability of a "success" is known, and the number of trials (i.e., the maximum possible value of the count) is also known. The classic example is coin tossing. We toss a coin 100 times and count the number of times the coin lands heads side up. The maximum possible count is 100, and the probability of adding one to the count (a "success") on each trial is known to be 0.5 (assuming the coin is fair). Trials are independent; i.e., previous outcomes do not influence the probability of success on current or future trials. Another requirement is that the probability of success remain constant over trials.

Poisson counts are often assumed to occur when the maximum possible count is not known, but is assumed to be large, and the probability of adding one to the count at each moment (trials are often ill defined) is also unknown, but is assumed to be small. A example may help.

Suppose we are counting traffic fatalities during a given month. The maximum possible count is quite high, we can imagine, but in advance we can't say exactly (or even approximately) what it might be. Furthermore, the probability of a fatal accident at any given moment in time is unknown but small. This sounds like it might be a Poisson distributed variable. But is it?

The built in data set "Seatbelts" allows us to have a look at such a count variable. "Seatbelts" is not a data frame, it's a time series, so extracting the counts of fatalities will take a bit of trickery...

> deaths = as.vector(Seatbelts[,1])    # All rows of column 1 extracted.
> length(deaths)
[1] 192
> mean(deaths)
[1] 122.8021
> var(deaths)
[1] 644.1386
The vector "deaths" now contains the monthly number of traffic deaths in Great Britain during the 192 months from January 1969 through December 1984. We also have our first indication this is not a Poisson distributed variable. In the Poisson distribution, the mean and the variance are the same. Let's continue nevertheless.

Next we'll look at a histogram of the "deaths" vector, and plotted over top of that we will put the Poisson distribution with mean = 122.8...

> hist(deaths, breaks=15, prob=T, ylim=c(0,.04))
> lines(x<-seq(65,195,10),dpois(x,lambda=mean(deaths)),lty=2,col="red")
And I'm not gonna lie to you. That took some trial and error! The histogram function asks for 15 break points, turns on density plotting, and sets the y-axis to go from 0 to .04. The poisson density function was plotted using the lines( ) function. The x-values were generated by using seq( ) and stored into "x" on the fly. The y-values were generated using the dpois( ) function. We also requested a dashed, red line. Examination of the figure shows what we suspected. The empirical distribution does not match the theoretical Poisson distribution.

Deaths histogram
QQPlot
> sim.dist = rpois(192, lambda=mean(deaths))
> qqplot(deaths, sim.dist, main="Q-Q Plot")
> abline(a=0, b=1, lty=2, col="red")

Finally, you may recall (if you read that tutorial) the qqnorm( ) function can be used to check a distribution for normality. The qqplot( ) function will compare any two distributions to see if they have the same shape. If they do, the plotted points will fall along a straight line. The plot to the right doesn't look too terribly bad until we realize these two distributions should not only both be poisson, they should also both have the same mean (or lambda value). Thus, the points should fall along a line with intercept 0 and slope 1. The moral of the story: just because count data sound like they might fit a certain distribution doesn't mean they will. R provides a number of mechanisms for checking this.

We could have saved ourselves a lot of trouble by looking at a time series plot to begin with. I will not reproduce it, but the command for doing so is below. In the plot, we see the data are strongly cyclical and, therefore, that the individual elements of the vector should not be considered independent counts. In fact, there is a yearly cycle in the number of traffic deaths. (How would you show that?)

> plot(Seatbelts[,1])                  # Not shown.
Furthermore, a scatterplot of "deaths" against its index values appears to show that the probability of dying in a traffic accident is decreasing over the duration of the record...
> scatter.smooth(1:192, deaths)        # Not shown.
Lot's of problems here!


The Binomial Test

Suppose we set up a classic card-guessing test for ESP using a 25-card deck of Zener cards, which consists of 5 cards each of 5 different symbols. If the null hypothesis is correct (H0: no ESP) and the subject is just guessing at random, then we should expect pN correct guesses from N independent Bernoulli trials on which the probability of a success (correct guess) is p = 0.2. Suppose our subject gets 9 correct guesses. Is this out of line with what we should expect just by random chance?

A number of proportion tests could be applied here, but the sample size is fairly small (just 25 guesses), so an exact binomial test is our best choice...

> binom.test(x=9, n=25, p=.2)

        Exact binomial test

data:  9 and 25 
number of successes = 9, number of trials = 25, p-value = 0.07416
alternative hypothesis: true probability of success is not equal to 0.2 
95 percent confidence interval:
 0.1797168 0.5747937 
sample estimates:
probability of success 
                  0.36
Oh, too bad! Assuming we set alpha at the traditional value of .05, we fail to reject the null hypothesis with an obtained p-value of .074. The 95% confidence interval tells us this subject's true rate of correct guessing is best approximated as being between 0.18 and 0.57. This incorporates the null value of 0.2, so once again, we must regard the results as being consistent with the null hypothesis.

We might argue at this point that we should have done a one-tailed test. (A two-tailed test is the default.) Of course, this decision should be made in advance, but if the subject is displaying evidence of ESP, we would expect his success rate to be not just different from chance but greater than chance. To take this into account in the test, we need to set the "alternative=" option. Choices are "less", "greater", and "two.sided" (the default)...

> binom.test(x=9, n=25, p=.2, alternative="greater")

        Exact binomial test

data:  9 and 25 
number of successes = 9, number of trials = 25, p-value = 0.04677
alternative hypothesis: true probability of success is greater than 0.2 
95 percent confidence interval:
 0.2023778 1.0000000 
sample estimates:
probability of success 
                  0.36
The one-tailed test allows the null to be rejected at alpha=.05. The confidence interval says the subject is guessing with a success rate of at least 0.202. The confidence level can also be set by changing the "conf.level=" option to any reasonable value less than 1. The default value is .95.


The Single-Sample Proportion Test

The subject keeps guessing because, of course, we'd like to see this above chance performance repeated. He has now made 400 passes through the deck for a total of 10,000 independent guesses. He has guessed correctly 2,022 times. What should we conclude?

An exact binomial test is probably not the best choice here as the sample size is now very large. We'll substitute a single-sample proportion test...

> prop.test(x=2022, n=10000, p=.2, alternative="greater")

        1-sample proportions test with continuity correction

data:  2022 out of 10000, null probability 0.2 
X-squared = 0.2889, df = 1, p-value = 0.2955
alternative hypothesis: true p is greater than 0.2 
95 percent confidence interval:
 0.1956252 1.0000000 
sample estimates:
     p 
0.2022
The syntax is exactly the same. Notice the proportion test calculates a chi-squared statistic. The traditional z-test of a proportion is not implemented in R, but the two tests are exactly equivalent. Notice also a correction for continuity is applied. If you don't want it, set the "correct=" option to FALSE. The default value is TRUE. (This value must be set to FALSE to make the test mathematically equivalent to the uncorrected z-test of a proportion.)


Two-Sample Proportions Test

A random sample of 428 adults from Myrtle Beach reveals 128 smokers. A random sample of 682 adults from San Francisco reveals 170 smokers. Is the proportion of adult smokers in Myrtle Beach different from that in San Francisco?

> prop.test(x=c(128,170), n=c(428,682),
+           alternative="two.sided",
+           conf.level=.99)

        2-sample test for equality of proportions with continuity correction

data:  c(128, 170) out of c(428, 682) 
X-squared = 3.0718, df = 1, p-value = 0.07966
alternative hypothesis: two.sided 
99 percent confidence interval:
 -0.02330793  0.12290505 
sample estimates:
   prop 1    prop 2 
0.2990654 0.2492669
Don't be upset by the fact that I typed this on multiple lines by hitting the Enter key at convenient spots. Being neat is optional! The two-proportions test also does a chi-square test with continuity correction, which is mathematically equivalent to the traditional z-test with correction. Enter "hits" or successes into the first vector ("x"), the sample sizes into the second vector ("n"), and set options as you like. To turn off the continuity correction, set "correct=F". I set the alternative to two-sided, but this was unnecessary as two-sided is the default. I also set the confidence level for the confidence interval to 99% to illustrate this option. I made up these data, by the way.

R incorporates a function for calculating the power of a 2-proportions test. The syntax is illustrated here from the help page...

power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05, power = NULL,
                alternative = c("two.sided", "one.sided"),
                strict = FALSE)
The value of n should be set to the sample size per group, p1 and p2 to the group probabilities or proportions of successes, and power to the desired power. One and only one of these options must be passed as NULL, and R will calculate it from the others. In the example above, what sample sizes should we have if we want a power of 90%?
> power.prop.test(p1=.299, p2=.249, sig.level=.05, power=.9,
+                 alternative="two.sided")

     Two-sample comparison of proportions power calculation 

              n = 1670.065
             p1 = 0.299
             p2 = 0.249
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

 NOTE: n is number in *each* group
We would need 1,670 subjects in each group.


Multiple Proportions Test

The two-proportions test generalizes directly to a multiple proportions test. The example from the help page should suffice to illustrate...

> example(prop.test)

> ## Data from Fleiss (1981), p. 139.
> ## H0: The null hypothesis is that the four populations from which
> ##     the patients were drawn have the same true proportion of smokers.
> ## A:  The alternative is that this proportion is different in at
> ##     least one of the populations.
> 
> smokers  <- c( 83, 90, 129, 70 )
> patients <- c( 86, 93, 136, 82 )
> prop.test(smokers, patients)

        4-sample test for equality of proportions without continuity
        correction

data:  smokers out of patients 
X-squared = 12.6004, df = 3, p-value = 0.005585
alternative hypothesis: two.sided 
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.9651163 0.9677419 0.9485294 0.8536585
If you run this example, you will notice that I have deleted some of the output from the simpler proportion tests.


revised 2010 August 6
Return to the Table of Contents