
RESAMPLING TECHNIQUES
Caveat emptor
Resampling techniques depend upon repeated (re)randomization or simulation
of a sample. Computers do not generate random numbers. Since an algorithm is
used to produce the results of a function like runif( ), these results are
technically referred to as pseudorandom. I've done a few casual tests of the R
random number generator, and it seems to be very good, but I'm not an expert on
pseudorandom number generators. So I will begin with a warning: computer
intensive resampling is only as good as your pseudorandom number generator.
Monte Carlo Estimation of Power
Although the term "resampling" is often used to refer to any repeated random
or pseudorandom sampling simulation, when the "resampling" is done from a known
theoretical distribution, the correct term is "Monte Carlo" simulation. I will
use such a scheme here to demonstrate how power can be estimated for the
two-sample t-test using Student's sleep data...
> data(sleep)
> str(sleep)
'data.frame': 20 obs. of 2 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
> attach(sleep)
> tapply(extra, group, mean)
1 2
0.75 2.33
> tapply(extra, group, sd)
1 2
1.789010 2.002249
> tapply(extra, group, length)
1 2
10 10
> t.test(extra~group)
Welch Two Sample t-test
data: extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean in group 1 mean in group 2
0.75 2.33
> power.t.test(n=10, delta=(2.33-.75), sd=1.9, sig.level=.05,
+ type="two.sample", alternative="two.sided")
Two-sample t test power calculation
n = 10
delta = 1.58
sd = 1.9
sig.level = 0.05
power = 0.4208457
alternative = two.sided
NOTE: n is number in *each* group
There is the traditional analysis. R claims this t-test has a power of 0.42.
Supposedly, we have a 42% chance of finding a two-tailed significant
difference between two samples of 10 chosen from normal populations with a
common standard deviation of 1.9 (the pooled sd of the two samples) but with a
true difference in means of 1.58. To do the same calculation by simulation, we
will use the rnorm( ) function to draw the samples, the t.test( )
function to get a p-value, and then we will simply look to see what percentage
of those p-values are less than alpha=.05 after running this procedure a goodly
number of times. But first, a couple explanations.
What we are about to do is, perhaps, best done with a script. That way, if
something goes wrong, we don't have to re-enter multiple lines into the R
console. However, we will live dangerously for the sake of illustration. To
get this to work with minimal effort on our part--always a good thing--we will
need to get R to do the same set of calculations many times. This is done with
a loop, and in R, loops are most often done using for( ). The syntax is
"for (counter in 1 to number of simulations)". The statements to be looped
through repeatedly then follow enclosed in curly braces. I feel a bit like I'm
trying to describe an elephant to someone who's never seen one before. Perhaps
it would be best to just show you the elephant...
> R = 999
> alpha = numeric(R)
> for (i in 1:R) {
+ group1 = rnorm(10, mean=.75, sd=1.9)
+ group2 = rnorm(10, mean=2.33, sd=1.9)
+ alpha[i] = t.test(group1,group2)$p.value
+ }
The first line defines the number of random simulations we want and stores this
value in the semi-official data object that R typically uses for such things,
namely R. We've done 999 sims, because that's the number that is normally done.
The second line creates a numeric vector with 999 elements, which will hold the
results of each simulation. The third line begins our for loop: "for i (the
counter) going from 1 to 999", do the following stuff. I.e., do the following
stuff 999 times, keeping track by incrementing the counter, i, at each
step. Group 1 is generated. Group 2 is generated. The t-test is done, and the
p-value is extracted and stored in "alpha", our designated storage place. In
each simulation, the storage occurs in the ith position of the alpha
vector. Now all that remains is to determine how many of those values are less
than .05 ...
> mean(alpha<.05)
[1] 0.3983984
To do this, we played a nice little trick. Remember, "alpha<.05" generates a
logical vector of TRUEs and FALSEs, in which TRUEs count as ones and FALSEs
count as zeros. We took the mean of that vector, which is to say, we calculated
by a bit of trickery the proportion of TRUEs. Our simulation tells us we have
about a 40% chance of rejecting the null hypothesis under these conditions,
pretty close to what we found above. Your results will differ, of course,
because we are using a randomizing procedure after all.
Permutation Tests
Oops! I forgot to clean up. So I will use the sleep data to illustrate our
next topic as well, which is permutation tests. If you were more on the ball
than I am and detached the "sleep" data frame, reattach it now.
The logic behind a permutation test is straightforward. It says, "Okay, these
are the twenty (in this example) scores we got, and the way they are divided up
between the two groups here is one possible permutation of them. There are, in
fact...
> choose(20,10)
[1] 184756
...possible permutations (technically combinations--but same idea) of these
data into two groups of ten, and each is equally likely to occur if we were to
pick one at random out of a hat. Most of these permutations would give no or
little difference between the group means, but a few of them would give large
differences. How extreme is the obtained case?" In other words, the logic of
the permutation test is quite similar to the logic of the t-test. If the
obtained case is in the most extreme 5% of possible results, then we reject the
null hypothesis of no difference between the means (assuming we were looking at
differences between means). The advantage of the permutation test is it does
not make any assumption about normality, and in fact, doesn't make any
assumption at all about parent distributions. Furthermore, a permutation test
is generally more powerful than a "traditional" nonparametric test.
The disadvantage of a permutation test is the number of permutations that
must be generated. Even small samples, as we have here, will choke a typical
desktop computer with repetitive calculations. Therefore, instead of generating
all possible permutations, we generate only a random sample of them. This
procedure is often called a "randomization test" instead of a permutation test.
Let's see it, and then I'll explain...
> ls()
[1] "alpha" "group1" "group2" "i" "R" "sleep"
> rm(alpha,group1,group2,i,R)
> # A little cleaning up never hurts anybody!
> R = 999
> scores = extra
> t.values = numeric(R)
> for (i in 1:R) {
+ index = sample(1:20, size=10, replace=F)
+ group1 = scores[index]
+ group2 = scores[-index]
+ t.values[i] = t.test(group1,group2)$statistic
+ }
First, we cleaned out some objects we wanted to use again and that "we" forgot
to clean out when we were done with them last time. Then we set the number of
sims (replications, resamplings) to 999. Then we made a copy of the data, which
was really unnecessary. Then we established a storage vector for the results of
the sims. Then we began looping. Inside the loop, we took a random sample of the
vector 1 to 20, without replacement, and stored that in an object called
"index". These index values were then used to extract the first resampled group
of scores. The second group of scores was extracted using a trick, which in
words goes, "for group two take all the scores that are not indexed by the
values in index." In other words, put the remaining scores in group two. Then we
did the t-test and stored the test statistic (t-value). It now remains to
compare those simulated t-values to the one we obtained above when we did the
actual t-test...
> t.values = abs(t.values) # for a two-tailed test
> mean(t.values<=1.8608) # using the same trick we used above
[1] 0.0920921
The resampling scheme tells us that a little more than 9% of the sims gave us a
result as extreme or more extreme than the obtained case. This is the p-value
resulting from the randomization test. We can see that it is pretty close to the
p-value obtained in the actual t-test above (.0794). The difference may be
nothing more than random noise (the standard error is .009), or we may be seeing
that the t-test was a bit too generous here, perhaps due to nonnormality.
And this time, let's not forget...
> detach(sleep)
> rm(list=ls())
Bootstrap Resampling
Bootstrap resampling is similar to the above randomization procedure, except
the resampling is done WITH replacement. The idea is to consider the pooled
sample to be a mini-population of scores. Presumably, we were sampling from a
larger parent distribution, but we know nothing about it for certain. The only
information we have about the parent distribution is the sample we've obtained
from it. Therefore, we will calculate a sampling distribution for the test
statistic by resampling from our mini-population WITH replacement, calculating
the desired statistic, and taking a look at what we get...
> with(sleep, t.test(extra~group)$statistic) # for reference
t
-1.860813
> scores = sleep$extra # the data
> R = 999 # the number of replicates
> t.values = numeric(R) # storage for the results
> for (i in 1:R) {
+ group1 = sample(scores, size=10, replace=T)
+ group2 = sample(scores, size=10, replace=T)
+ t.values[i] = t.test(group1,group2)$statistic
+ }
The only real difference here is that we have resampled with "replace=T", thus
resampling with replacement from the pooled data set. The sampling distribution
of the t-statistic can be visualized in a histogram, and I will plot the
obtained t-value as a point on the x-axis...
> hist(t.values, breaks=20)
> points(-1.8608,0, pch=16)
The simulation p-value is obtained as previously...
> mean(t.values<=-1.8608) # one-tailed (lower)
[1] 0.04204204
> t.values = abs(t.values) # two-tailed
> mean(t.values>=1.8608)
[1] 0.08508509
The result is very similar to the randomization method result.
Note: R contains a library called "boot", the main function within which is
boot( ). Supposedly, this automates the whole resampling procedure.
However, I have found it almost impossible to use, and certainly more difficult
than the above methods. All the examples I've seen are much more complex than
the problems above and (I might add) poorly explained. I've yet to get the
function to produce a meaningful result. When I do, there will be a revision
here. If you wish to give it a try, read the help page first, then you might
try Canty (2002) and Rizzo (2008). And good luck to you. I hope you have better
luck than I have!
Bootstrapped Confidence Intervals
Bootstrap resampling is useful for estimating confidence intervals from
samples when the sample is from an unknown (and clearly nonnormal) distribution.
The data set "crabs" in the MASS package supplies an example. We will look at
the carapace length of blue crabs...
> data(crabs, package="MASS")
> cara = crabs$CL[crabs$sp=="B"]
> summary(cara)
Min. 1st Qu. Median Mean 3rd Qu. Max.
14.70 24.85 30.10 30.06 34.60 47.10
> length(cara)
[1] 100
> qqnorm(cara)
Hmmmm, that could be normal, but who can say for sure? So to get a 95% CI for
the mean, we will use a bootstrap scheme rather than methods based on a normal
distribution. The procedure is to take a large number of bootstrap samples and
to examine the desired statistic...
> R = 999
> boot.means = numeric(R)
> for (i in 1:R) {
+ boot.sample = sample(cara, 100, replace=T)
+ boot.means[i] = mean(boot.sample)
+ }
> quantile(boot.means, c(.025,.975))
2.5% 97.5%
28.70265 31.44220
The bootstrapped CI is reasonably close to the one obtained by traditional
methods...
> mean(cara)-1.96*sd(cara)/sqrt(length(cara))
[1] 28.70507
> mean(cara)+1.96*sd(cara)/sqrt(length(cara))
[1] 31.41093
Extension to more complex problems is straightforward.
Update: The boot( ) Function
I finally found a more-or-less sensible explanation of the boot( )
function in Crawley (2007). I can't claim anything near "complete
understanding", but I'm working on it. Meanwhile, here is a heavily commented
example showing how to bootstrap a confidence interval...
> library(boot)
> ### The syntax is: boot(data= , statistic= , R= )
> ### Send it the name of a vector or data frame for "data". "R" is the
> ### number of replications you want. The tricky part is "statistic=",
> ### where you have to send it a function you've written. If you're
> ### going to get fancy, you might want to read the Writing Your Own
> ### Functions tutorial first.
> data(crabs, package="MASS")
> ### Not necessary if you haven't yet cleaned up from the previous section.
> cara = crabs$CL[crabs$sp=="B"]
> ### The carapace lengths of 100 blue crabs. See previous section.
> ### And now we have to write a function to calculate the means of these...
> the.means = function(cara, i) {mean(cara[i])}
> ### Finally, we run the bootstrap...
> boot(data=cara, statistic=the.means, R=999)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = cara, statistic = the.means, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 30.058 0.02926827 0.6820252
You MUST write your own function to calculate the statistic you are
bootstrapping, even if that function is already built into R. If it is a
built-in function, then writing your own function is a simple matter, as you
witnessed above. You must also include the index i, which R uses
internally to do the bootstrap. I don't know why. I'm led to believe this makes
the boot( ) function much more versatile. Finally, the output is
interpreted as so. The value "original" is the mean of the original data
vector "cara". The value "bias" gives the difference between "original" and the
mean of the bootstrapped values for this statistic. Clearly, "bias" ought to be
close to zero. The value "std. error" is the standard deviation of the
bootstrapped means. To get a confidence interval, the output of boot( )
should be stored...
> boot(data=cara, statistic=the.means, R=999) -> boot.out
> quantile(boot.out$t, c(.025,.975))
2.5% 97.5%
28.73275 31.42255
The result is similar, but not identical, to the result we got in the previous
section.
Bootstrapping is more usefully applied to statistics like the median, for
which there is no formula for CIs when the distribution is not normal...
> the.medians = function(cara, i) {median(cara[i])}
> boot(data=cara, statistic=the.medians, R=999) -> boot.out2
> boot.out2
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = cara, statistic = the.medians, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 30.1 0.09814815 1.477237
> quantile(boot.out2$t, c(.025,.5,.975))
2.5% 50% 97.5%
27.7 30.1 32.4
I still can't get it to do anything more complex than a simple, one-variable
confidence interval though, so obviously some R guru needs to write some
COMPREHENSIBLE examples using this procedure to do things that ordinary people
might actually want to do!
> detach("package:boot")
Bootstrapping a Oneway ANOVA
Let's do something a little more complex than a confidence interval!
Yesterday, I wrote the tutorial on Oneway ANOVA, and
in that, I used the data frame "InsectSprays" for an example. However,
"InsectSprays" has some severe problems with normality and homogeneity of
variance, which means the theoretical F-distribution may not apply. How can we
get a sampling distribution that does apply? First, let's use Monte Carlo
simulation to reproduce the theoretical distribution...
> data(InsectSprays) # look at ?InsectSprays to get the skinny
> with(InsectSprays,tapply(count,spray,mean))
A B C D E F
14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
> with(InsectSprays,tapply(count,spray,var))
A B C D E F
22.272727 18.242424 3.901515 6.265152 3.000000 38.606061
> with(InsectSprays,tapply(count,spray,length))
A B C D E F
12 12 12 12 12 12
> summary(aov(count~spray, data=InsectSprays))
Df Sum Sq Mean Sq F value Pr(>F)
spray 5 2668.83 533.77 34.702 < 2.2e-16 ***
Residuals 66 1015.17 15.38
Those are some ugly variances! The help page for these data recommend a data
transform. Let's assume the pooled variance of 15.38 applies to all groups and
get the F-distribution as follows (copy and paste this script into R)...
meanstar = mean(InsectSprays$count)
sdstar = sqrt(15.38)
simspray = InsectSprays$spray
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
groupA = rnorm(12, mean=meanstar, sd=sdstar)
groupB = rnorm(12, mean=meanstar, sd=sdstar)
groupC = rnorm(12, mean=meanstar, sd=sdstar)
groupD = rnorm(12, mean=meanstar, sd=sdstar)
groupE = rnorm(12, mean=meanstar, sd=sdstar)
groupF = rnorm(12, mean=meanstar, sd=sdstar)
simcount = c(groupA,groupB,groupC,groupD,groupE,groupF)
simdata = data.frame(simcount,simspray)
Fstar[i] = oneway.test(simcount~simspray, var.equal=T, data=simdata)$statistic
}
Go get a snack here! Well, that took awhile, didn't it? "Fstar" should now
contain our simulated F-distribution for df1=5 and df2=66 degrees of freedom...
> hist(Fstar, prob=T)
> x=seq(.25,5.25,.5)
> points(x,y=df(x,5,66),type="b",col="red")
Not too bad!
Now let's do the same, except this time we will use the InsectSprays data to
get the Fstar-distribution. We will center all our groups on the same mean
(zero), but we will leave the variance and shape of the individual group
distributions undisturbed (copy and paste this script into R)...
rm(list=ls())
data(InsectSprays)
meanstar = with(InsectSprays, tapply(count,spray,mean))
grpA = InsectSprays$count[InsectSprays$spray=="A"] - meanstar[1]
grpB = InsectSprays$count[InsectSprays$spray=="B"] - meanstar[2]
grpC = InsectSprays$count[InsectSprays$spray=="C"] - meanstar[3]
grpD = InsectSprays$count[InsectSprays$spray=="D"] - meanstar[4]
grpE = InsectSprays$count[InsectSprays$spray=="E"] - meanstar[5]
grpF = InsectSprays$count[InsectSprays$spray=="F"] - meanstar[6]
simspray = InsectSprays$spray
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=12, replace=T)
groupB = sample(grpB, size=12, replace=T)
groupC = sample(grpC, size=12, replace=T)
groupD = sample(grpD, size=12, replace=T)
groupE = sample(grpE, size=12, replace=T)
groupF = sample(grpF, size=12, replace=T)
simcount = c(groupA,groupB,groupC,groupD,groupE,groupF)
simdata = data.frame(simcount,simspray)
Fstar[i] = oneway.test(simcount~simspray, var.equal=T, data=simdata)$statistic
}
Wait for it! (That took about a minute and 20 seconds on my computer.) We now
have a bootstrapped "F"-distribution in "Fstar" based on equal means (the null
hypothesis), but normality and homogeneity are no longer assumed. Let's see what
it looks like...
> max(Fstar)
[1] 10.54805
> hist(Fstar, breaks=seq(0,11,.5), ylim=c(0,.7), prob=T)
> x=seq(.25,6.75,.5)
> points(x,y=df(x,5,66),type="b",col="red")
It's a bit different from the theoretical distribution in that it appears to be
heavier in the tails...
> qf(.95,5,66)
[1] 2.353809
> quantile(Fstar,.95)
95%
2.790884
The critical value of F(5,66) at alpha=.05 is 2.35, but from the bootstrapped
Fstar distribution it appears to be 2.79. Either way, Fobt=34.7
allows the null to be rejected. It might have been a different story though if
the effect had been a marginal one.
If there is an easier way to do that using the boot( ) function, I'd
love to hear about it. I can't get it to work!
Return to the Table of Contents
|