R Tutorials Top Banner

RESAMPLING TECHNIQUES


Caveat emptor

Resampling techniques depend upon repeated (re)randomization or simulation of a sample. Computers do not generate random numbers. Since an algorithm is used to produce the results of a function like runif( ), these results are technically referred to as pseudorandom. I've done a few casual tests of the R random number generator, and it seems to be very good, but I'm not an expert on pseudorandom number generators. So I will begin with a warning: computer intensive resampling is only as good as your pseudorandom number generator.


Monte Carlo Estimation of Power

Although the term "resampling" is often used to refer to any repeated random or pseudorandom sampling simulation, when the "resampling" is done from a known theoretical distribution, the correct term is "Monte Carlo" simulation. I will use such a scheme here to demonstrate how power can be estimated for the two-sample t-test using Student's sleep data...

> data(sleep)
> str(sleep)
'data.frame':   20 obs. of  2 variables:
 $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
 $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
> attach(sleep)
> tapply(extra, group, mean)
   1    2 
0.75 2.33 
> tapply(extra, group, sd)
       1        2 
1.789010 2.002249 
> tapply(extra, group, length)
 1  2 
10 10 
> t.test(extra~group)

        Welch Two Sample t-test

data:  extra by group 
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -3.3654832  0.2054832 
sample estimates:
mean in group 1 mean in group 2 
           0.75            2.33
> power.t.test(n=10, delta=(2.33-.75), sd=1.9, sig.level=.05,
+              type="two.sample", alternative="two.sided")

     Two-sample t test power calculation 

              n = 10
          delta = 1.58
             sd = 1.9
      sig.level = 0.05
          power = 0.4208457
    alternative = two.sided

 NOTE: n is number in *each* group
There is the traditional analysis. R claims this t-test has a power of 0.42.

Supposedly, we have a 42% chance of finding a two-tailed significant difference between two samples of 10 chosen from normal populations with a common standard deviation of 1.9 (the pooled sd of the two samples) but with a true difference in means of 1.58. To do the same calculation by simulation, we will use the rnorm( ) function to draw the samples, the t.test( ) function to get a p-value, and then we will simply look to see what percentage of those p-values are less than alpha=.05 after running this procedure a goodly number of times. But first, a couple explanations.

What we are about to do is, perhaps, best done with a script. That way, if something goes wrong, we don't have to re-enter multiple lines into the R console. However, we will live dangerously for the sake of illustration. To get this to work with minimal effort on our part--always a good thing--we will need to get R to do the same set of calculations many times. This is done with a loop, and in R, loops are most often done using for( ). The syntax is "for (counter in 1 to number of simulations)". The statements to be looped through repeatedly then follow enclosed in curly braces. I feel a bit like I'm trying to describe an elephant to someone who's never seen one before. Perhaps it would be best to just show you the elephant...

> R = 999
> alpha = numeric(R)
> for (i in 1:R) {
+ group1 = rnorm(10, mean=.75, sd=1.9)
+ group2 = rnorm(10, mean=2.33, sd=1.9)
+ alpha[i] = t.test(group1,group2)$p.value
+ }
The first line defines the number of random simulations we want and stores this value in the semi-official data object that R typically uses for such things, namely R. We've done 999 sims, because that's the number that is normally done. The second line creates a numeric vector with 999 elements, which will hold the results of each simulation. The third line begins our for loop: "for i (the counter) going from 1 to 999", do the following stuff. I.e., do the following stuff 999 times, keeping track by incrementing the counter, i, at each step. Group 1 is generated. Group 2 is generated. The t-test is done, and the p-value is extracted and stored in "alpha", our designated storage place. In each simulation, the storage occurs in the ith position of the alpha vector. Now all that remains is to determine how many of those values are less than .05 ...
> mean(alpha<.05)
[1] 0.3983984
To do this, we played a nice little trick. Remember, "alpha<.05" generates a logical vector of TRUEs and FALSEs, in which TRUEs count as ones and FALSEs count as zeros. We took the mean of that vector, which is to say, we calculated by a bit of trickery the proportion of TRUEs. Our simulation tells us we have about a 40% chance of rejecting the null hypothesis under these conditions, pretty close to what we found above. Your results will differ, of course, because we are using a randomizing procedure after all.


Permutation Tests

Oops! I forgot to clean up. So I will use the sleep data to illustrate our next topic as well, which is permutation tests. If you were more on the ball than I am and detached the "sleep" data frame, reattach it now.

The logic behind a permutation test is straightforward. It says, "Okay, these are the twenty (in this example) scores we got, and the way they are divided up between the two groups here is one possible permutation of them. There are, in fact...

> choose(20,10)
[1] 184756
...possible permutations (technically combinations--but same idea) of these data into two groups of ten, and each is equally likely to occur if we were to pick one at random out of a hat. Most of these permutations would give no or little difference between the group means, but a few of them would give large differences. How extreme is the obtained case?" In other words, the logic of the permutation test is quite similar to the logic of the t-test. If the obtained case is in the most extreme 5% of possible results, then we reject the null hypothesis of no difference between the means (assuming we were looking at differences between means). The advantage of the permutation test is it does not make any assumption about normality, and in fact, doesn't make any assumption at all about parent distributions. Furthermore, a permutation test is generally more powerful than a "traditional" nonparametric test.

The disadvantage of a permutation test is the number of permutations that must be generated. Even small samples, as we have here, will choke a typical desktop computer with repetitive calculations. Therefore, instead of generating all possible permutations, we generate only a random sample of them. This procedure is often called a "randomization test" instead of a permutation test. Let's see it, and then I'll explain...

> ls()
[1] "alpha"  "group1" "group2" "i"      "R"      "sleep" 
> rm(alpha,group1,group2,i,R)
> # A little cleaning up never hurts anybody!
> R = 999
> scores = extra
> t.values = numeric(R)
> for (i in 1:R) {
+ index = sample(1:20, size=10, replace=F)
+ group1 = scores[index]
+ group2 = scores[-index]
+ t.values[i] = t.test(group1,group2)$statistic
+ }
First, we cleaned out some objects we wanted to use again and that "we" forgot to clean out when we were done with them last time. Then we set the number of sims (replications, resamplings) to 999. Then we made a copy of the data, which was really unnecessary. Then we established a storage vector for the results of the sims. Then we began looping. Inside the loop, we took a random sample of the vector 1 to 20, without replacement, and stored that in an object called "index". These index values were then used to extract the first resampled group of scores. The second group of scores was extracted using a trick, which in words goes, "for group two take all the scores that are not indexed by the values in index." In other words, put the remaining scores in group two. Then we did the t-test and stored the test statistic (t-value). It now remains to compare those simulated t-values to the one we obtained above when we did the actual t-test...
> t.values = abs(t.values)             # for a two-tailed test
> mean(t.values<=1.8608)               # using the same trick we used above
[1] 0.0920921
The resampling scheme tells us that a little more than 9% of the sims gave us a result as extreme or more extreme than the obtained case. This is the p-value resulting from the randomization test. We can see that it is pretty close to the p-value obtained in the actual t-test above (.0794). The difference may be nothing more than random noise (the standard error is .009), or we may be seeing that the t-test was a bit too generous here, perhaps due to nonnormality.

And this time, let's not forget...

> detach(sleep)
> rm(list=ls())

Bootstrap Resampling

Bootstrap resampling is similar to the above randomization procedure, except the resampling is done WITH replacement. The idea is to consider the pooled sample to be a mini-population of scores. Presumably, we were sampling from a larger parent distribution, but we know nothing about it for certain. The only information we have about the parent distribution is the sample we've obtained from it. Therefore, we will calculate a sampling distribution for the test statistic by resampling from our mini-population WITH replacement, calculating the desired statistic, and taking a look at what we get...

> with(sleep, t.test(extra~group)$statistic)     # for reference
        t 
-1.860813 
> scores = sleep$extra                           # the data
> R = 999                                        # the number of replicates
> t.values = numeric(R)                          # storage for the results
> for (i in 1:R) {
+ group1 = sample(scores, size=10, replace=T)
+ group2 = sample(scores, size=10, replace=T)
+ t.values[i] = t.test(group1,group2)$statistic
+ }
The only real difference here is that we have resampled with "replace=T", thus resampling with replacement from the pooled data set. The sampling distribution of the t-statistic can be visualized in a histogram, and I will plot the obtained t-value as a point on the x-axis...
> hist(t.values, breaks=20)
> points(-1.8608,0, pch=16)
Sampling dist. of t
The simulation p-value is obtained as previously...
> mean(t.values<=-1.8608)              # one-tailed (lower)
[1] 0.04204204
> t.values = abs(t.values)             # two-tailed
> mean(t.values>=1.8608)
[1] 0.08508509
The result is very similar to the randomization method result.

Note: R contains a library called "boot", the main function within which is boot( ). Supposedly, this automates the whole resampling procedure. However, I have found it almost impossible to use, and certainly more difficult than the above methods. All the examples I've seen are much more complex than the problems above and (I might add) poorly explained. I've yet to get the function to produce a meaningful result. When I do, there will be a revision here. If you wish to give it a try, read the help page first, then you might try Canty (2002) and Rizzo (2008). And good luck to you. I hope you have better luck than I have!


Bootstrapped Confidence Intervals

Bootstrap resampling is useful for estimating confidence intervals from samples when the sample is from an unknown (and clearly nonnormal) distribution. The data set "crabs" in the MASS package supplies an example. We will look at the carapace length of blue crabs...

> data(crabs, package="MASS")
> cara = crabs$CL[crabs$sp=="B"]
> summary(cara)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14.70   24.85   30.10   30.06   34.60   47.10 
> length(cara)
[1] 100
> qqnorm(cara)
Hmmmm, that could be normal, but who can say for sure? So to get a 95% CI for the mean, we will use a bootstrap scheme rather than methods based on a normal distribution. The procedure is to take a large number of bootstrap samples and to examine the desired statistic...
> R = 999
> boot.means = numeric(R)
> for (i in 1:R) {
+ boot.sample = sample(cara, 100, replace=T)
+ boot.means[i] = mean(boot.sample)
+ }
> quantile(boot.means, c(.025,.975))
    2.5%    97.5% 
28.70265 31.44220
The bootstrapped CI is reasonably close to the one obtained by traditional methods...
> mean(cara)-1.96*sd(cara)/sqrt(length(cara))
[1] 28.70507
> mean(cara)+1.96*sd(cara)/sqrt(length(cara))
[1] 31.41093
Extension to more complex problems is straightforward.


Update: The boot( ) Function

I finally found a more-or-less sensible explanation of the boot( ) function in Crawley (2007). I can't claim anything near "complete understanding", but I'm working on it. Meanwhile, here is a heavily commented example showing how to bootstrap a confidence interval...

> library(boot)
> ### The syntax is: boot(data= , statistic= , R= )
> ### Send it the name of a vector or data frame for "data". "R" is the
> ### number of replications you want. The tricky part is "statistic=",
> ### where you have to send it a function you've written. If you're
> ### going to get fancy, you might want to read the Writing Your Own
> ### Functions tutorial first.
> data(crabs, package="MASS")
> ### Not necessary if you haven't yet cleaned up from the previous section.
> cara = crabs$CL[crabs$sp=="B"]
> ### The carapace lengths of 100 blue crabs. See previous section.
> ### And now we have to write a function to calculate the means of these...
> the.means = function(cara, i) {mean(cara[i])}
> ### Finally, we run the bootstrap...
> boot(data=cara, statistic=the.means, R=999)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = cara, statistic = the.means, R = 999)


Bootstrap Statistics :
    original     bias    std. error
t1*   30.058 0.02926827   0.6820252
You MUST write your own function to calculate the statistic you are bootstrapping, even if that function is already built into R. If it is a built-in function, then writing your own function is a simple matter, as you witnessed above. You must also include the index i, which R uses internally to do the bootstrap. I don't know why. I'm led to believe this makes the boot( ) function much more versatile. Finally, the output is interpreted as so. The value "original" is the mean of the original data vector "cara". The value "bias" gives the difference between "original" and the mean of the bootstrapped values for this statistic. Clearly, "bias" ought to be close to zero. The value "std. error" is the standard deviation of the bootstrapped means. To get a confidence interval, the output of boot( ) should be stored...
> boot(data=cara, statistic=the.means, R=999) -> boot.out
> quantile(boot.out$t, c(.025,.975))
    2.5%    97.5% 
28.73275 31.42255
The result is similar, but not identical, to the result we got in the previous section.

Bootstrapping is more usefully applied to statistics like the median, for which there is no formula for CIs when the distribution is not normal...

> the.medians = function(cara, i) {median(cara[i])}
> boot(data=cara, statistic=the.medians, R=999) -> boot.out2
> boot.out2

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = cara, statistic = the.medians, R = 999)


Bootstrap Statistics :
    original     bias    std. error
t1*     30.1 0.09814815    1.477237

> quantile(boot.out2$t, c(.025,.5,.975))
 2.5%   50% 97.5% 
 27.7  30.1  32.4
I still can't get it to do anything more complex than a simple, one-variable confidence interval though, so obviously some R guru needs to write some COMPREHENSIBLE examples using this procedure to do things that ordinary people might actually want to do!

> detach("package:boot")

Bootstrapping a Oneway ANOVA

Let's do something a little more complex than a confidence interval! Yesterday, I wrote the tutorial on Oneway ANOVA, and in that, I used the data frame "InsectSprays" for an example. However, "InsectSprays" has some severe problems with normality and homogeneity of variance, which means the theoretical F-distribution may not apply. How can we get a sampling distribution that does apply? First, let's use Monte Carlo simulation to reproduce the theoretical distribution...

> data(InsectSprays)                   # look at ?InsectSprays to get the skinny
> with(InsectSprays,tapply(count,spray,mean))
        A         B         C         D         E         F 
14.500000 15.333333  2.083333  4.916667  3.500000 16.666667 
> with(InsectSprays,tapply(count,spray,var))
        A         B         C         D         E         F 
22.272727 18.242424  3.901515  6.265152  3.000000 38.606061 
> with(InsectSprays,tapply(count,spray,length))
 A  B  C  D  E  F 
12 12 12 12 12 12 
> summary(aov(count~spray, data=InsectSprays))
            Df  Sum Sq Mean Sq F value    Pr(>F)    
spray        5 2668.83  533.77  34.702 < 2.2e-16 ***
Residuals   66 1015.17   15.38
Those are some ugly variances! The help page for these data recommend a data transform. Let's assume the pooled variance of 15.38 applies to all groups and get the F-distribution as follows (copy and paste this script into R)...
meanstar = mean(InsectSprays$count)
sdstar = sqrt(15.38)
simspray = InsectSprays$spray
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
groupA = rnorm(12, mean=meanstar, sd=sdstar)
groupB = rnorm(12, mean=meanstar, sd=sdstar)
groupC = rnorm(12, mean=meanstar, sd=sdstar)
groupD = rnorm(12, mean=meanstar, sd=sdstar)
groupE = rnorm(12, mean=meanstar, sd=sdstar)
groupF = rnorm(12, mean=meanstar, sd=sdstar)
simcount = c(groupA,groupB,groupC,groupD,groupE,groupF)
simdata = data.frame(simcount,simspray)
Fstar[i] = oneway.test(simcount~simspray, var.equal=T, data=simdata)$statistic
}
Go get a snack here! Well, that took awhile, didn't it? "Fstar" should now contain our simulated F-distribution for df1=5 and df2=66 degrees of freedom...
> hist(Fstar, prob=T)
> x=seq(.25,5.25,.5)
> points(x,y=df(x,5,66),type="b",col="red")
Fstar
Not too bad!

Now let's do the same, except this time we will use the InsectSprays data to get the Fstar-distribution. We will center all our groups on the same mean (zero), but we will leave the variance and shape of the individual group distributions undisturbed (copy and paste this script into R)...

rm(list=ls())
data(InsectSprays)
meanstar = with(InsectSprays, tapply(count,spray,mean))
grpA = InsectSprays$count[InsectSprays$spray=="A"] - meanstar[1]
grpB = InsectSprays$count[InsectSprays$spray=="B"] - meanstar[2]
grpC = InsectSprays$count[InsectSprays$spray=="C"] - meanstar[3]
grpD = InsectSprays$count[InsectSprays$spray=="D"] - meanstar[4]
grpE = InsectSprays$count[InsectSprays$spray=="E"] - meanstar[5]
grpF = InsectSprays$count[InsectSprays$spray=="F"] - meanstar[6]
simspray = InsectSprays$spray
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=12, replace=T)
groupB = sample(grpB, size=12, replace=T)
groupC = sample(grpC, size=12, replace=T)
groupD = sample(grpD, size=12, replace=T)
groupE = sample(grpE, size=12, replace=T)
groupF = sample(grpF, size=12, replace=T)
simcount = c(groupA,groupB,groupC,groupD,groupE,groupF)
simdata = data.frame(simcount,simspray)
Fstar[i] = oneway.test(simcount~simspray, var.equal=T, data=simdata)$statistic
}
Wait for it! (That took about a minute and 20 seconds on my computer.) We now have a bootstrapped "F"-distribution in "Fstar" based on equal means (the null hypothesis), but normality and homogeneity are no longer assumed. Let's see what it looks like...
> max(Fstar)
[1] 10.54805
> hist(Fstar, breaks=seq(0,11,.5), ylim=c(0,.7), prob=T)
> x=seq(.25,6.75,.5)
> points(x,y=df(x,5,66),type="b",col="red")
Fstar
It's a bit different from the theoretical distribution in that it appears to be heavier in the tails...
> qf(.95,5,66)
[1] 2.353809
> quantile(Fstar,.95)
     95% 
2.790884
The critical value of F(5,66) at alpha=.05 is 2.35, but from the bootstrapped Fstar distribution it appears to be 2.79. Either way, Fobt=34.7 allows the null to be rejected. It might have been a different story though if the effect had been a marginal one.

If there is an easier way to do that using the boot( ) function, I'd love to hear about it. I can't get it to work!


Return to the Table of Contents