R Tutorials--Counts and Proportions

COUNTS AND PROPORTIONS

Binomial and Poisson Data

Count data--data derived from counting things--are often treated as if they are assumed to be binomial distributed or Poisson distributed. The binomial model applies when the counts are derived from independent Bernoulli trials in which the probability of a "success" is known, and the number of trials (i.e., the maximum possible value of the count) is also known. The classic example is coin tossing. We toss a coin 100 times and count the number of times the coin lands heads side up. The maximum possible count is 100, and the probability of adding one to the count (a "success") on each trial is known to be 0.5 (assuming the coin is fair). Trials are independent; i.e., previous outcomes do not influence the probability of success on current or future trials. Another requirement is that the probability of success remain constant over trials.

Poisson counts are often assumed to occur when the maximum possible count is not known, but is assumed to be large, and the probability of adding one to the count at each moment (trials are often ill defined) is also unknown, but is assumed to be small. A example may help.

Suppose we are counting traffic fatalities during a given month. The maximum possible count is quite high, we can imagine, but in advance we can't say exactly (or even approximately) what it might be. Furthermore, the probability of a fatal accident at any given moment in time is unknown but small. This sounds like it might be a Poisson distributed variable. But is it?

The built in data set "Seatbelts" allows us to have a look at such a count variable. "Seatbelts" is not a data frame, it's a time series, so extracting the counts of fatalities will take a bit of trickery.

> Seatbelts                                 some output not shown
         DriversKilled drivers front rear   kms PetrolPrice VanKilled law
Jan 1969           107    1687   867  269  9059  0.10297181        12   0
Feb 1969            97    1508   825  265  7685  0.10236300         6   0
Mar 1969           102    1507   806  319  9963  0.10206249        12   0
Apr 1969            87    1385   814  407 10955  0.10087330         8   0
...
> deaths = as.vector(Seatbelts[,1])         # all rows of column 1 extracted as a vector
> length(deaths)
[1] 192
> mean(deaths)
[1] 122.8021
> var(deaths)
[1] 644.1386

The vector "deaths" now contains the monthly number of traffic deaths in Great Britain during the 192 months from January 1969 through December 1984. We also have our first indication this is not a Poisson distributed variable. In the Poisson distribution, the mean and the variance have the same value. Let's continue nevertheless.

Next we'll look at a histogram of the "deaths" vector, and plotted over top of that we will put the Poisson distribution with mean = 122.8.

> hist(deaths, breaks=15, prob=T, ylim=c(0,.04))
> lines(x<-seq(65,195,10),dpois(x,lambda=mean(deaths)),lty=2,col="red")

And I'm not gonna lie to you. That took some trial and error! The histogram function asks for 15 break points, turns on density plotting, and sets the y-axis to go from 0 to .04. The poisson density function was plotted using the lines() function. The x-values were generated by using seq() and stored into "x" on the fly. The y-values were generated using the dpois() function. We also requested a dashed, red line. Examination of the figure shows what we suspected. The empirical distribution does not match the theoretical Poisson distribution.

> sim.dist = rpois(192, lambda=mean(deaths))
> qqplot(deaths, sim.dist, main="Q-Q Plot")
> abline(a=0, b=1, lty=2, col="red")

Finally, you may recall (if you read that tutorial) that the qqnorm() function can be used to check a distribution for normality. The qqplot() function will compare any two distributions to see if they have the same shape. If they do, the plotted points will fall along a straight line. The plot to the right doesn't look too terribly bad until we realize these two distributions should not only both be poisson, they should also both have the same mean (or lambda value). Thus, the points should fall along a line with intercept 0 and slope 1. The moral of the story: just because count data sound like they might fit a certain distribution doesn't mean they will. R provides a number of mechanisms for checking this.

We could have saved ourselves a lot of trouble by looking at a time series plot to begin with. I will not reproduce it, but the command for doing so is below. In the plot, we see the data are strongly cyclical and, therefore, that the individual elements of the vector should not be considered independent counts. In fact, there is a yearly cycle in the number of traffic deaths.

> plot(Seatbelts[,1])                  # Not shown.

Furthermore, a scatterplot of "deaths" against its index values appears to show that the probability of dying in a traffic accident is decreasing over the duration of the record.

> scatter.smooth(1:192, deaths)        # Not shown.

So there are lot's of problems with the poisson model here!

The Binomial Test

Suppose we set up a classic card-guessing test for ESP using a 25-card deck of Zener cards, which consists of 5 cards each of 5 different symbols. If the null hypothesis is correct (H₀: no ESP) and the subject is just guessing at random, then we should expect pN correct guesses from N independent Bernoulli trials on which the probability of a success (correct guess) is p = 0.2. Suppose our subject gets 9 correct guesses. Is this out of line with what we should expect just by random chance?

A number of proportion tests could be applied here, but the sample size is fairly small (just 25 guesses), so an exact binomial test is our best choice.

> binom.test(x=9, n=25, p=.2)

        Exact binomial test

data:  9 and 25 
number of successes = 9, number of trials = 25, p-value = 0.07416
alternative hypothesis: true probability of success is not equal to 0.2 
95 percent confidence interval:
 0.1797168 0.5747937 
sample estimates:
probability of success 
                  0.36

Oh, too bad! Assuming we set alpha at the traditional value of .05, we fail to reject the null hypothesis with an obtained p-value of .074. The 95% confidence interval tells us this subject's true rate of correct guessing is best approximated as being between 0.18 and 0.57. This incorporates the null value of 0.2, so once again, we must regard the results as being consistent with the null hypothesis.

We might argue at this point that we should have done a one-tailed test. (A two-tailed test is the default.) Of course, this decision should be made in advance, but if the subject is displaying evidence of ESP, we would expect his success rate to be not just different from chance but greater than chance. To take this into account in the test, we need to set the "alternative=" option. Choices are "less" than expected, "greater" than expected, and "two.sided" (the default).

> binom.test(x=9, n=25, p=.2, alternative="greater")

        Exact binomial test

data:  9 and 25 
number of successes = 9, number of trials = 25, p-value = 0.04677
alternative hypothesis: true probability of success is greater than 0.2 
95 percent confidence interval:
 0.2023778 1.0000000 
sample estimates:
probability of success 
                  0.36

The one-tailed test allows the null to be rejected at alpha=.05. The confidence interval says the subject is guessing with a success rate of at least 0.202. The confidence level can also be set by changing the "conf.level=" option to any reasonable value less than 1. The default value is .95.

The Single-Sample Proportion Test

The subject keeps guessing because, of course, we'd like to see this above chance performance repeated. He has now made 400 passes through the deck for a total of 10,000 independent guesses. He has guessed correctly 2,022 times. What should we conclude?

An exact binomial test is probably not the best choice here as the sample size is now very large. We'll substitute a single-sample proportion test.

> prop.test(x=2022, n=10000, p=.2, alternative="greater")

        1-sample proportions test with continuity correction

data:  2022 out of 10000, null probability 0.2 
X-squared = 0.2889, df = 1, p-value = 0.2955
alternative hypothesis: true p is greater than 0.2 
95 percent confidence interval:
 0.1956252 1.0000000 
sample estimates:
     p 
0.2022

The syntax is exactly the same. Notice the proportion test calculates a chi-squared statistic. The traditional z-test of a proportion is not implemented in R, but the two tests are exactly equivalent. Notice also a correction for continuity is applied. If you don't want it, set the "correct=" option to FALSE. The default value is TRUE. (This option must be set to FALSE to make the test mathematically equivalent to the uncorrected z-test of a proportion.)

Two-Sample Proportions Test

A random sample of 428 adults from Myrtle Beach reveals 128 smokers. A random sample of 682 adults from San Francisco reveals 170 smokers. Is the proportion of adult smokers in Myrtle Beach different from that in San Francisco?

> prop.test(x=c(128,170), n=c(428,682), alternative="two.sided", conf.level=.99)

        2-sample test for equality of proportions with continuity correction

data:  c(128, 170) out of c(428, 682) 
X-squared = 3.0718, df = 1, p-value = 0.07966
alternative hypothesis: two.sided 
99 percent confidence interval:
 -0.02330793  0.12290505 
sample estimates:
   prop 1    prop 2 
0.2990654 0.2492669

The two-proportions test also does a chi-square test with continuity correction, which is mathematically equivalent to the traditional z-test with correction. Enter "hits" or successes into the first vector ("x"), the sample sizes into the second vector ("n"), and set options as you like. To turn off the continuity correction, set "correct=F". I set the alternative to two-sided, but this was unnecessary as two-sided is the default. I also set the confidence level for the confidence interval to 99% to illustrate this option. I made up these data, by the way.

R incorporates a function for calculating the power of a 2-proportions test. The syntax is illustrated here from the help page.

power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05, power = NULL,
                alternative = c("two.sided", "one.sided"),
                strict = FALSE)

The value of n should be set to the sample size per group, p1 and p2 to the group probabilities or proportions of successes, and power to the desired power. One and only one of these options must be passed as NULL, and R will calculate it from the others. In the example above, what sample sizes should we have if we want a power of 90%?

> power.prop.test(p1=.299, p2=.249, sig.level=.05, power=.9,
+                 alternative="two.sided")

     Two-sample comparison of proportions power calculation 

              n = 1670.065
             p1 = 0.299
             p2 = 0.249
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

 NOTE: n is number in *each* group

We would need 1,670 subjects in each group.

Multiple Proportions Test

The two-proportions test generalizes directly to a multiple proportions test. The example from the help page should suffice to illustrate.

> example(prop.test)

> ## Data from Fleiss (1981), p. 139.
> ## H0: The null hypothesis is that the four populations from which
> ##     the patients were drawn have the same true proportion of smokers.
> ## A:  The alternative is that this proportion is different in at
> ##     least one of the populations.
> 
> smokers  <- c( 83, 90, 129, 70 )
> patients <- c( 86, 93, 136, 82 )
> prop.test(smokers, patients)

        4-sample test for equality of proportions without continuity
        correction

data:  smokers out of patients 
X-squared = 12.6004, df = 3, p-value = 0.005585
alternative hypothesis: two.sided 
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.9651163 0.9677419 0.9485294 0.8536585

If you run this example, you will notice that I have deleted some of the output from the simpler proportion tests.

revised 2016 January 27