RESAMPLING TECHNIQUES Caveat emptor Resampling techniques depend upon repeated (re)randomization or simulation of a sample. Computers do not generate random numbers. Since an algorithm is used to produce the results of a function like runif(), these results are technically referred to as pseudorandom. I've done a few casual tests of the R random number generator, and it seems to be very good, but I'm not an expert on pseudorandom number generators. So I will begin with a warning: Computer intensive resampling is only as good as your pseudorandom number generator. Further Notes I am scratching the surface of resampling techniques in this tutorial. I don't claim to have extensive knowledge, so the examples here will be fairly simple, just to illustrate what resampling is and how it can be done in R. Speaking of "how it can be done in R," those of you who may have read this tutorial before it's latest revision have had an earful (eyeful?) of my opinion of the boot() function and its help page. FINALLY, someone has written a good, comprehensible explanation of the function (Zieffler, Harring, and Long, 2011, see refs). I've also found some good examples online. I will try to do it a little more justice in this revision. Concerning the "sleep" data set, which is used here: These data are actually paired scores (see the help page), although they have often been used as an example for unpaired tests. For the sake of illustrating how things work, it doesn't really make much difference, but I just wanted to let you know that I know these are paired scores. We'll use them both ways. I will illustrate three basic techniques in this tutorial.
Monte Carlo Estimation of Power Although the term "resampling" is often used to refer to any repeated random
or pseudorandom sampling simulation, when the "resampling" is done from a known
theoretical distribution, the correct term is "Monte Carlo" simulation. I will
use such a scheme here to demonstrate how power can be estimated for the
two-sample t-test using Student's sleep data.
> data(sleep) > str(sleep) 'data.frame': 20 obs. of 3 variables: $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ... $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ... $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ... > attach(sleep) > tapply(extra, group, mean) 1 2 0.75 2.33 > tapply(extra, group, sd) 1 2 1.789010 2.002249 > tapply(extra, group, length) 1 2 10 10The experiment is on the effect of a soporific drug. All subjects were measured during a baseline (no-drug) period and again during the drug trial, and the "extra" variable is how much additional sleep time they got over baseline during the drug trial. Group 1 is the control (placebo) condition, and Group 2 is the treatment (drug) condition. To make it easier to keep track of this, I'm going to rename the levels of the "group" factor. > detach(sleep) # NEVER modify an attached data frame! > sleep$group = factor(sleep$group, labels=c("control","treatment")) > summary(sleep) extra group ID Min. :-1.600 control :10 1 :2 1st Qu.:-0.025 treatment:10 2 :2 Median : 0.950 3 :2 Mean : 1.540 4 :2 3rd Qu.: 3.400 5 :2 Max. : 5.500 6 :2 (Other):8 > with(sleep, tapply(extra, group, mean)) # just a quick check control treatment 0.75 2.33 > attach(sleep) # looks like I got it right, so reattachHere is the t-test (treating the scores as if they were from independent samples). > t.test(extra ~ group, var.eq=T) Two Sample t-test data: extra by group t = -1.8608, df = 18, p-value = 0.07919 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.363874 0.203874 sample estimates: mean in group control mean in group treatment 0.75 2.33We'll use these results as if they were from a pilot study. We now want to know what we have to do to get a publishable result here (i.e., p < .05), assuming our pilot data reasonably represent nature. We didn't quite reach statistical significance with n = 10 per group. How many subjects/group should we use to get a power of 80%? The power.t.test() function gives us a way to
get some idea of the answer. The function wants to be fed several arguments:
n = the number of subjects/group, delta = the anticipated difference between
the means, sd = the anticipated pooled standard deviation of the two groups,
sig.level = the significance level we desire for the test, type = the type of
t-test we will use ("two.sample" is the default), and alternative = the nature
of the alternative hypothesis ("two.sided" is the default). Exactly one of those
arguments should be set to NULL, and the function will estimate it for us from
the other values.
> power.t.test(n=NULL, delta=1.58, sd=1.9, sig.level=.05, power=.8) Two-sample t test power calculation n = 23.70004 delta = 1.58 sd = 1.9 sig.level = 0.05 power = 0.8 alternative = two.sided NOTE: n is number in *each* groupWe are being told to use 24 subjects/group. We've made the usual t-test assumptions of sampling from normal distributions and equal variances. But the variances were not quite equal in our samples. The SDs were 1.8 and 2.0. Let's use those values to get an estimate of power from a Monte Carlo simulation. We will simulate drawing samples of 24 from normal distributions with means and SDs equal to those we saw in our sample. What we are about to do is, perhaps, best done with a script. That way, if
something goes wrong, we don't have to re-enter multiple lines into the R
Console. However, we will live on the edge for the sake of illustration. To
get this to work with minimal effort on our part--always a good thing--we will
need to get R to do the same set of calculations many times. This is done with
a loop, and in R, loops are most often done using
for(). The syntax is
"for (counter in 1 to number of simulations)". The statements to be looped
through repeatedly then follow enclosed in curly braces. I feel a bit like I'm
trying to describe an elephant to someone who's never seen one before. Perhaps
it would be best to just show you the elephant.
> R = 1000 > alpha = numeric(R) > for (i in 1:R) { + group1 = rnorm(24, mean=.75, sd=1.8) + group2 = rnorm(24, mean=2.33, sd=2.0) + alpha[i] = t.test(group1, group2, var.eq=T)$p.value + }The first line defines the number of random simulations we want and stores this value in the semi-official data object that R typically uses for such things, namely R. We've done 1000 sims, because that's the number that is normally done. Some people use R=999, others use 10,000, just because they can. Computers are pretty fast these days. 1000 usually gives a reasonably accurate result. The second line creates a numeric vector with 1000 elements, which will hold the result of each simulation. The third line begins our "for loop": "for i (the counter) going from 1 to 1000", do the following stuff. I.e., do the following stuff 1000 times, keeping track by incrementing the counter, i, at each step. Group 1 is generated. Group 2 is generated. The t-test is done, and the p-value is extracted and stored in "alpha", our designated storage place. In each simulation, the storage occurs in the ith position of the alpha vector. The loop was executed as soon as we typed the close curly brace. Now all that remains is to determine how many of those values are less than .05. > mean(alpha < .05) [1] 0.787To do this, we played a nice little trick. Remember, "alpha<.05" generates a logical vector of TRUEs and FALSEs, in which TRUEs count as ones and FALSEs count as zeros. We took the mean of that vector, which is to say, we calculated by a bit of trickery the proportion of TRUEs. Our simulation tells us we have a 78.7% chance of rejecting the null hypothesis under these conditions, not quite the 80% that we desired. Your results will differ, of course, because we are using a randomizing procedure after all. Notice this made quite a mess of your workspace.
> ls() [1] "alpha" "group1" "group2" "i" "R" "sleep" > rm(alpha, group1, group2, i, R)You might want to clean up a little bit. Doing It With boot() In the "boot" library there is a function,
boot(), which supposedly automates this process and makes it more
convenient. The library is not loaded at start-up, so first we have to load it.
> library("boot")I should point out, at least for simple bootstraps, I don't see how this method is in any way "convenient." You still have to write your own functions, even when the function you want is one that is predefined in R. Seems to me that doing it this way can actually be MORE work than what we've just done. Nevertheless, glutton for punishment that I am, I will see if I can get it to work. Here is the syntax for boot, and good luck to you if you have only the help page to guide you! boot(data, statistic, R, sim = "ordinary", stype = c("i", "f", "w"), Okay, got it? Nope! I can't find any way to make boot() actually do this. The help page says the function generates bootstrap replicates, so let's see what that's about. Update This might do it, but it took a lot of trickery, and I'm not sure it's
right. See the section below on parametric bootstrap for an explanation. (This
method seems quite obtuse compared to the method that was presented in the previous
section, so I'm not going to put a lot of effort into explaining it!)
> ran.gen = function(data, mle) { + c(rnorm(24, mean=.75, sd=1.789), rnorm(24, mean=2.33, sd=2.002)) + } > alpha = function(data, i) { + t.test(data[1:24], data[25:48], var.eq=T)$p.value + } > fakevect = rnorm(48, mean=1.54, sd=1.9) > b.out = boot(data=fakevect, statistic=alpha, R=1000, sim="parametric", ran.gen=ran.gen) > mean(b.out$t < .05) [1] 0.792The estimated power for groups of 24 each is 79.2%. The Bootstrap Bootstrap resampling makes no assumptions about the populations other than what we can know from the samples. Bootstrapping treats the samples as mini-populations and resamples from them with replacement to get what are called bootstrapped samples. These bootstrapped samples are then used to get p-values, estimates of population parameters, confidence intervals, etc. Above, in the "sleep" example, we acknowledged that the variances of the two samples were not homogeneous, but we still assumed that we were sampling from normal distributions. How did we know that? We didn't. (That's why it's called an assumption!) Let's acknowledge that as well. Here is a bootstrapped sample from sleep$extra, group 1. (I still have
"sleep" attached. If you don't, attach it. If you do, DON'T attach it again!)
> sample(extra[1:10], size=24, replace=T) [1] 3.4 -0.1 3.7 -0.1 0.7 -0.2 0.0 0.7 3.4 0.7 3.4 0.8 3.7 3.4 2.0 0.7 [17] 0.0 -0.1 -0.2 -0.1 3.7 -1.2 -0.2 -1.2Your result will be different because sample() takes a random sample from the specified population (first argument). The critical feature here is the sample is taken with replacement. Let's see how many times we can get an alpha less than .05 if we do this from both groups and calculate a t-test. Once again, use a script window if you're uncertain that you can type this correctly on the first try. > R = 1000 > alpha = numeric(R) > for (i in 1:R) { + group1 = sample(extra[1:10], size=24, replace=T) + group2 = sample(extra[11:20], size=24, replace=T) + alpha[i] = t.test(group1, group2, var.eq=T)$p.value + } > mean(alpha < .05) [1] 0.844The bootstrap is telling us we're okay here with samples of 24 each if we want a power of at least 80%. This is not necessarily a "different" result than the one we got above, because this is based on random (re)sampling, after all, and we will get a different answer every time we do it. You got a different answer than I did, right? I just ran it three more times, however, and my answers are pretty consistently 83-86%, which I have to assume means we are seeing a difference when we resample from the original data without making the normality assumption. Since we didn't make any assumptions about the parent population, this is called a nonparametric bootstrap. > detach(sleep) # and clean up; keep sleep "Automating" It With boot() Here's one way you could do the same thing with the boot() function, and I'm going to show you this just
to illustrate how the function works, sort of. Bear with me if you know better.
The function will require three pieces of information
from you: the name of the data set (vector, matrix, or data frame), a function
that YOU have written that defines the statistic you want to have bootstrapped,
and R. I'll call my function "alpha", give it the "sleep" data frame, and I
also have to give it an i, which the function will use internally to
keep track of what iteration it's on. (Note: I detached sleep before I did this.)
> alpha = function(sleep, i) { + group1 = sample(sleep$extra[1:10], 24, replace=T) + group2 = sample(sleep$extra[11:20], 24, replace=T) + t.test(group1, group2, var.eq=T)$p.value } > boot.out = boot(data=sleep, statistic=alpha, R=1000) > mean(boot.out$t < .05) [1] 0.852The model object, boot.out, contains a fair amount of information. Do names(boot.out) to see it all and then don't worry about most of it. The relevant output, the p-values in this case, are in boot.out$t. That's the equivalent of our alpha vector above. You should be thinking, that was really a pointless exercise! Why wrap the boot function around the same exact thing we did above without the boot function? What's the point of that? I wanted to show you that you have to write your own function to be used by boot, and that the output of that function, applied to your data, is in t inside the boot.out object. Other than that, you're right. It was quite pointless. All we did was duplicate the action of the previous script, but we let boot store the results for us. There's gotta be more to it than that, right? The problem that boot() has with this example is that we are trying to draw bootstrap samples that are larger than the original data vector. I've seen a few sources where they get around this by using a script similar to the one earlier in this section. Most sources recommend the parametric bootstrap (i.e., Monte Carlo similulation) for doing power problems. I might point out, however, that with 24 subjects per group and a two-tailed
t-test at alpha of .05, you would need an effect size of about 0.42 (Cohen's d)
to find a significant difference between the means. How often could we count on
getting an effect size that large or larger from data like these?
> cohensd = function(data, i) { + d = data[i,] + abs(t.test(d$extra ~ d$group, var.eq=T)$statistic * sqrt(2/10)) + } > boot.out = boot(data=sleep, statistic=cohensd, R=1000) > mean(boot.out$t >= .42) [1] 0.841Now we see the boot() function in its element. First, we need to write a function to calculate Cohen's d, and this one is based on the easily verifiable fact that d = t * sqrt(2/n) when there are n subjects per group in a t-test. The function is given two arguments, "data" (which will be fed to the boot() function when that is executed) and "i", the indexing variable. We will tell boot that data=sleep, a data frame. The boot() function will take care of permuting that for us. So we hand over each permuted data frame to a variable called d, and then we use that to calculate a t-test, extract the t-value with $statistic, and then multiply by sqrt(2/10) to convert it to Cohen's d, which is always positive, so the absolute value is taken as well. Then we feed the boot() function our data, our function, and R, and let it do it's thing. And it tells us we have about an 84% chance of getting an effect this large or larger from data like these, which is consistent with the last result. (Phew! It took me a long time to think of that!) Bootstrapping The t-Test The idea is to consider the pooled sample to be a mini-population of scores. Presumably, we were sampling from a larger parent distribution, but we know nothing about it for certain. The only information we have about the parent distribution is the sample we've obtained from it. Under the null hypothesis, the two groups, control and treatment, were effectively sampled from the same population. Thus, we pool the two samples to emulate that population as best we can. Then we simply bootstrap two samples from that mini-population and see how likely it is to get a result more extreme than the one we actually observed. We'll need some sort of statistic to determine "extremeness." The difference between means would be adequate. We'll do that below. It would actually be a little easier to just let R calculate t-values from
our bootstrapped samples. That will require a little more internal effort from
our hardware, but since computing power is more than up to the task these days,
what do we care? There's also an educational lesson here (students!). The
problem with just calculating a t-test and looking up critical values in the
back of the book is that, if we haven't met the assumptions of the test, then
our calculated value of t will not be distributed theoretically as t, so
comparing our calculated t to the tables will be a bit off.
Therefore, we will calculate a new sampling distribution for the test
statistic by resampling from our mini-population WITH replacement, calculating
the desired statistic, and taking a look at what we get. (Notice that since we
pooled the samples, we are still sort of making a homogeneity of variance
assumption by doing this. We are also beginning with the assumption that the
null hypothesis is true.)
> with(sleep, t.test(extra ~ group, var.eq=T)$statistic) # for reference t -1.860813 > scores = sleep$extra # the data > R = 1000 # the number of replicates > t.star = numeric(R) # storage for the results > for (i in 1:R) { + group1 = sample(scores, size=10, replace=T) + group2 = sample(scores, size=10, replace=T) + t.star[i] = t.test(group1, group2, var.eq=T)$statistic + }The (estimated or approximated or simulated) sampling distribution of the t-statistic can be visualized in a histogram, and I will plot the obtained t-value as a point on the x-axis. Finally, I'll draw a line over it showing the theoretical t-distribution for df=18. > hist(t.star, freq=F) > points(x=-1.8608, y=0, pch=16) > lines(x<-seq(from=-4, to=4, by=.1), y=dt(x, df=18), col="red") > quantile(t.star, c(.025,.05,.5,.950,.975)) 2.5% 5% 50% 95% 97.5% -1.92422717 -1.65896181 -0.07050922 1.62722956 2.09215955The simulation p-value is obtained as previously. > mean(t.star <= -1.8608) # one-tailed [1] 0.031 > mean(abs(t.star) >= 1.8608) # two-tailed [1] 0.065Our reference two-tailed p-value from the "actual" t-test is 0.079. Automating It With boot() Let's bootstrap the actual t-test. (Btw, you can bet I'm typing this into
a script window!)
> boot.tee = function(data, i) { + d = data[i,] + c(t.test(d$extra[1:10], d$extra[11:20], var.eq=T)$statistic, + t.test(d$extra[1:10], d$extra[11:20], var.eq=T)$p.value) + } > boot.out = boot(data=sleep, statistic=boot.tee, R=1000)Here's the explanation. First, I defined a function called boot.tee, which is a function of the data and i. Our boot.tee function will find out what the "data" are when we tell that to boot(). I handed each permuted value of the data to a variable called d. (I don't know if this is standard, but it's a trick I learned in Zieffler, et al, and I thought it was very clever.) Then I created a vector containing two output values: the t-value from the t-test, and p-value from that test. Notice when I did that, I did not use a formula interface, DV~IV. I would do that if I was starting with the assumption that the alternative hypothesis is true, as that would cause group 1 values to be bootstrapped from group 1, and group 2 values to be bootstrapped from group 2 of the original data. In other words, when the data frame is permuted, all columns of it are permuted, so scores (extra) remain paired with the group from whence they came in any given row of the permuted data. Instead, I was starting with the assumption that the null hypothesis is true. I let boot permute the data frame into a random order, jumbling up the rows in the process, and then I tested the first ten scores against the last ten. If I want to check to see if this function is working properly, I can run
it on the original data frame and see if it gives me the right answers.
> boot.tee(sleep) t -1.86081347 0.07918671 Then I ran the boot() function (and immediately got an error because it's
a new day, and I forgot to load the "boot" library). Let's see what we got.
> boot.out ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = sleep, statistic = boot.tee, R = 1000) Bootstrap Statistics : original bias std. error t1* -1.86081347 1.8628096 1.0646010 t2* 0.07918671 0.4121642 0.2867875I get output for both items I calculated as part of the statistic. First, t1* is the t-value, and second, t2* is the p-value. The first thing I'm told is what the values are when the function is run on the original data ("original"). Then I'm told how far off the bootstrapped values are from that ("bias"), on the average. Finally, I'm given the standard deviation of the 1000 bootstrapped value ("std. error"). > summary(boot.out$t) # boot.out$t is a matrix in this case V1 V2 Min. :-4.029412 Min. :0.0007866 1st Qu.:-0.685015 1st Qu.:0.2468949 Median :-0.011462 Median :0.4913727 Mean : 0.001996 Mean :0.4913510 3rd Qu.: 0.722583 3rd Qu.:0.7358411 Max. : 3.242622 Max. :1.0000000The actual bootstrapped values are inside boot.out in a variable called t, so I asked for a summary of that. V1 is the first thing I calculated (t-value), and V2 is the second thing I calculated (p-value). Notice that the means in both cases are equal to original + bias from the first output. For finer grained info, we'll run quantile(), and we MUST remember to ask for a specific column of the t matrix. Otherwise, we'll be getting quantiles from the t-values and p-values all mixed together. Not useful! > quantile(boot.out$t[,1], c(.025,.05,.5,.95,.975)) 2.5% 5% 50% 95% 97.5% -2.02421421 -1.75560123 -0.01146238 1.77771580 2.10951478The result is very similar to what we got in t.star, but not exactly the same because this was a new set of permutations of the data. The p-values also differ a bit. > mean(boot.out$t[,1] < -1.8608) # one-tailed [1] 0.038 > mean(abs(boot.out$t[,1]) >= 1.8608) # two-tailed [1] 0.081Note: There is a function in the "boot" library called boot.ci(), but the output doesn't agree with above, and I'm not sure why. As usual, the help page wasn't much help. The explanation is NOT that the output object is a matrix, because it doesn't work (for me) if only one stat is calculated. So I don't use this function! Here is an example of how to use it from stackoverflow.com. > x <- rnorm(100, 1, .5) > b <- boot(x, function(u,i) mean(u[i]), R = 999) > boot.ci(b, type = c("norm", "basic", "perc")) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 999 bootstrap replicates CALL : boot.ci(boot.out = b, type = c("norm", "basic", "perc")) Intervals : Level Normal Basic Percentile 95% ( 0.962, 1.206 ) ( 0.961, 1.205 ) ( 0.963, 1.206 ) Calculations and Intervals on Original Scale > quantile(b$t,c(.025,.975)) # for comparison 2.5% 97.5% 0.9632601 1.2059236Seems to work here. Didn't work above. Good luck! Note: When a test statistic such as a Student's t is bootstrapped, Zieffler, Harring, and Long (2011, see refs) call this bootstrapping using a pivot statistic (p. 163). They imply that it's a bit safer than the parametric bootstrap, which is outlined below, because fewer population parameters must be "known." Bootstrapping The Difference in Means Assuming the Null Hypothesis is True First, we'll combine all the DV values into one "population" and pull
both bootstrap samples from that. Our obtained value for the difference in
means is 1.58, for reference.
> mean.diff = function(data, i) { + d = data[i,] + mean(d$extra[11:20]) - mean(d$extra[1:10]) + } > mean.diff(sleep) # test function on original data [1] 1.58 > boot.out = boot(data=sleep, statistic=mean.diff, R=1000) > quantile(boot.out$t, c(.025,.05,.5,.95,.975)) 2.5% 5% 50% 95% 97.5% -1.72025 -1.44050 -0.04000 1.44050 1.67025 > mean(boot.out$t >= 1.58) # one-tailed p-value [1] 0.035 > mean(abs(boot.out$t) >= 1.58) # two-tailed p-value [1] 0.068The bootstrapped 95% CI on the mean difference values consistent with the null is -1.72 to 1.67. This interval includes the reference value. In this case, we handed our function the data (to be specified in boot) and an indexing variable i, we passed off the current (ith) bootstrap of the data to d. The data is to be a data frame, so two indices are needed on data[i,]. The first index, i, contains a bootstrapped sample of the rows 1:20. The second is blank, which means all columns. Since the rows of the data have now been randomly permuted, or actually randomly bootstrap sampled, any arbitrary 10 values can be used to calculate the first mean, and the other 10 values the second mean. The easiest way is 11:20 for the first mean and 1:10 for the second mean, or vice versa. (Note: This method is illustrated by Zieffler et al, p. 160.) The same procedure without boot() would look
like this.
> data = sleep > R = 1000 > results = numeric(R) > for (i in 1:R) { + indices = sample(1:20, size=20, replace=T) + d = data[indices,] + results[i] = mean(d$extra[11:20]) - mean(d$extra[1:10]) + } > quantile(results, c(.025,.05,.5,.95,.975)) 2.5% 5% 50% 95% 97.5% -1.66000 -1.40150 -0.00500 1.41050 1.70025 > mean(abs(results) >= 1.58) [1] 0.067 Not Assuming the Null Hypothesis is True This time we will take bootstrap samples from the individual groups. For
reference, the obtained mean difference is 1.58, and the parametric 95%
confidence interval is -0.20 to +3.36.
> mean.diff = function(data, i) { + d = data[i,] + mean(d$extra[d$group==2]) - mean(d$extra[d$group==1]) + } > mean.diff(sleep) # test function on original data [1] 1.58 > boot.out = boot(data=sleep, statistic=mean.diff, R=1000, strata=sleep$group) > quantile(boot.out$t, c(.025,.05,.5,.95,.975)) 2.5% 5% 50% 95% 97.5% -0.03000 0.23950 1.61000 2.87000 3.09075 > mean(boot.out$t <= 0) # one-tailed p-value [1] 0.028The bootstrapped 95% CI on the mean difference is -0.03 to 3.09, a little narrower than the parametric value. I've run this many times, and the lower bound of the CI jumps back and forth across 0. Even at R=10000, the lower bound can't quite decide which side of 0 it wants to be on. So this is a close one. When the order of rows in the data frame is permuted (bootstrapped), the group information remains with the data values. So this time the mean.diff function calculates the difference between a mean of values that came from group 2 and a mean of values that came from group 1. The strata= option is not actually necessary to make the procedure run and give an answer, but it seems to make a slight difference in the width of the CI, so I'm not sure how it's used, but it should be included to tell the procedure that the two groups are being sampled separately. (Note: This method is illustrated by Zieffler et al, p. 190.) The same procedure without boot() would look
like this.
> data = sleep > R = 1000 > results = numeric(R) > for (i in 1:R) { + indices = sample(1:20, size=20, replace=T) + d = data[indices,] + results[i] = mean(d$extra[d$group==2]) - mean(d$extra[d$group==1]) + } > quantile(results, c(.025,.05,.5,.95,.975)) 2.5% 5% 50% 95% 97.5% 0.0248750 0.1899048 1.5415862 2.9043119 3.1559799 > mean(results <= 0) [1] 0.024Important note: How do you get a poor result from a bootstrap? Use small samples. While I've used a small sample case above to illustrate the methods, you should probably be wary of using bootstrap in such cases. Permutation and Randomization Tests: The Jackknife I will use the sleep data to illustrate our next topic as well, which is permutation tests. The logic behind a permutation test is straightforward. It says, "Okay, these
are the twenty (in this example) scores we got, and the way they are divided up
between the two groups is one possible permutation of them. There are, in
fact...
> choose(20,10) [1] 184756...possible permutations (technically combinations--but same idea) of these data into two groups of ten, and each is equally likely to occur if we were to pick one at random out of a hat. Most of these permutations would give no or little difference between the group means, but a few of them would give large differences. How extreme is the obtained case?" In other words, the logic of the permutation test is quite similar to the logic of the t-test. If the obtained case is in the most extreme 5% of possible results, then we reject the null hypothesis of no difference between the means (assuming we were looking at differences between means). The advantage of the permutation test is, it does not make any assumption about normality, and in fact, doesn't make any assumption at all about parent distributions. Furthermore, a permutation test is generally more powerful than a "traditional" nonparametric test. The disadvantage of a permutation test is the number of permutations that
must be generated. Even small samples, as we have here, will choke a typical
desktop computer with repetitive calculations. Therefore, instead of generating
all possible permutations, we generate only a random sample of them. This
procedure is often called a "randomization test" instead of a permutation test.
Let's see the elephant, and then I'll explain.
> ### Begin by cleaning up your workspace, if you haven't already. > R = 1000 > scores = sleep$extra > t.star = numeric(R) > for (i in 1:R) { + index = sample(1:20, size=10, replace=F) + group1 = scores[index] + group2 = scores[-index] + t.star[i] = t.test(group1, group2, var.eq=T)$statistic + }First, we set the number of sims (replications, resamplings) to 1000. Then we made a copy of the data, which was really unnecessary. Then we established a storage vector for the results of the sims. Then we began looping. Inside the loop, we took a random sample of the vector 1 to 20, without replacement, and stored that in an object called "index". These index values were then used to extract the first resampled group of scores. The second group of scores was extracted using a trick, which in words goes, "for group two take all the scores that are not indexed by the values in index." In other words, put the remaining scores in group two. Then we did the t-test and stored the test statistic (t-value). It now remains to compare those simulated t-values to the one we obtained above when we did the actual t-test. > mean(abs(t.star) >= 1.8608) # two-tailed test [1] 0.072The resampling scheme tells us that a little more than 7% of the sims gave us a result as extreme or more extreme than the obtained case. This is the p-value resulting from the randomization test. We can see that it is pretty close to the p-value obtained in the actual t-test above (.0794). The difference may be nothing more than random noise. (Note: Once again, your result will be different due to the randomization done by sample().) The bootstrap method simulates random sampling from a population. The randomization method, sometimes called jackknife resampling, simulates random assignment to groups. It's usually considered the preferred method if your subjects are volunteers or you snatched them out of the hallway during change of classes. Another Way There are as many ways to do randomization tests as there are statistics
you can think up to assess the difference between the groups. Here is another
way. Notice in this case I've use a different, but equally effective, method
of resampling as well. I merely rearrange (permute) the scores and use the
original grouping variable to index them.
> R = 1000 > scores = sleep$extra > groups = sleep$group > mean.diff = numeric(1000) > for (i in 1:R) { + perm.scores = sample(scores, size=20) + mean.diff[i] = diff(tapply(perm.scores, groups, mean)) + } > tapply(scores, groups, mean) 1 2 0.75 2.33 > 2.33 - .75 # the obtained difference [1] 1.58 > mean(abs(mean.diff) >= 1.58) [1] 0.086 The result is similar. We'd get a narrower range of results if the samples were larger. Paired Scores "Hey! These are paired scores. Deal with that!" First, let's get some
reference info from the standard t-test for paired scores.
> t.test(extra ~ group, data=sleep, paired=T) Paired t-test data: extra by group t = -4.0621, df = 9, p-value = 0.002833 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -2.4598858 -0.7001142 sample estimates: mean of the differences -1.58The logic behind randomization in this case is that it is a matter of chance which of the paired scores occurred in the control group and which occurred in the treatment group. In other words, when we look at the sign on the difference scores, it's a coin flip as to whether than sign will be positive or negative. The difference scores are... > diffs = sleep$extra[1:10] - sleep$extra[11:20] > diffs [1] -1.2 -2.4 -1.3 -1.3 0.0 -1.0 -1.8 -0.8 -4.6 -1.4...and a lot of them are negative, but the null hypothesis says that's just dumb luck. Given that those are the difference scores, it's a matter of pure chance what sign they have. And there is the randomization we need to justify a randomization test. > R = 1000 # we'll do 1000 sims > absdiffs = abs(diffs) # strip the signs > results = numeric(R) # create a storage vector > for (i in 1:R) { # begin looping + sign = sample(c(1,-1), 10, replace=T) # get 10 random signs + results[i] = mean(absdiffs * sign) # get mean of diffs with new signs + } > quantile(results, c(.025,.975)) # 95% of them fall in what interval? 2.5% 97.5% -1.16 1.14 > mean(abs(results) >= 1.58) # two-tailed p-value [1] 0.006The result is not too different from what the t-test revealed. You can completely clear your workspace now. We are done with "sleep". About Those p-Values Some sources propose that the p-values derived as above are ever so slightly too small and that a correction should be applied that can be calculated like this. Where p is the probability as calculated above and R is the number of replications. With R = 1000, the largest correction this will ever result in is less than .001, which is undoubtedly well within the boundaries of uncertainty on the p-value anyway. To apply a correction to the third or fourth decimal place of a value that's probably not entirely accurate in the second decimal place strikes me as being a little fussy. But there you have it. Use it as you see fit. With a smaller number of reps, it could be important. Note on terminology: As we are taking a random sample of possible permutations of the data in the jackknife, jackknife resampling can reasonably be referred to as Monto Carlo resampling, and often is. Bootstrapped Confidence Intervals Bootstrap resampling is useful for estimating confidence intervals from
samples when the sample is from an unknown (and clearly nonnormal) distribution.
The data set "crabs" in the MASS package supplies an example. We will look at
the carapace length of blue crabs.
> data(crabs, package="MASS") > cara = crabs$CL[crabs$sp=="B"] > summary(cara) Min. 1st Qu. Median Mean 3rd Qu. Max. 14.70 24.85 30.10 30.06 34.60 47.10 > length(cara) [1] 100 > qqnorm(cara) # not shownHmmmm, that looks reasonably normal to me, but who can say for sure? So to get a 95% CI for the mean, we will use a bootstrap scheme rather than methods based on a normal distribution. But just for reference purposes, the normal distribution method would give the following result. > t.test(cara)$conf.int [1] 28.68835 31.42765 attr(,"conf.level") [1] 0.95The procedure is to take a large number of bootstrap samples and to examine the desired statistic. > R = 1000 > boot.means = numeric(R) > for (i in 1:R) { + boot.sample = sample(cara, 100, replace=T) + boot.means[i] = mean(boot.sample) + } > quantile(boot.means, c(.025,.975)) 2.5% 97.5% 28.71027 31.37937The bootstrapped CI is reasonably close to the one obtained by traditional methods, so the normal approximation seems to be reasonable for these data. Extension to more complex problems is straightforward. Doing It With boot() I have to admit that using boot() may be just
a smidge easier in this case.
> boot.means = function(cara, i) {mean(cara[i])} > boot.out = boot(data=cara, statistic=boot.means, R=1000) > quantile(boot.out$t, c(.025,.975)) 2.5% 97.5% 28.75858 31.39305But certainly no less cryptic! It may also be useful just too look at the output of boot() directly. > boot.out ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = cara, statistic = boot.means, R = 1000) Bootstrap Statistics : original bias std. error t1* 30.058 0.015603 0.6902185The value "original" is the mean of the original data vector "cara". The value "bias" gives the difference between "original" and the mean of the bootstrapped values for this statistic. Clearly, "bias" ought to be close to zero. The value "std. error" is the standard deviation of the bootstrapped means. Bootstrapping is more usefully applied to statistics like the median, for
which there is no formula for CIs when the distribution is not normal.
> boot.medians = function(cara, i) {median(cara[i])} > boot.out = boot(data=cara, statistic=boot.medians, R=1000) > quantile(boot.out$t, c(.025,.975)) 2.5% 97.5% 27.85 32.40 > boot.out ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = cara, statistic = boot.medians, R = 1000) Bootstrap Statistics : original bias std. error t1* 30.1 0.08145 1.461501Clean up your workspace. It's a mess! The Parametric Bootstrap So far, and in future sections as well, what we've been doing is called the
nonparametric bootstrap, because no assumptions are being made about population
parameters. We will use some data from the MASS package to illustrate an
alternative method called the parametric bootstrap.
> data(birthwt, package="MASS") > head(birthwt) low age lwt race smoke ptl ht ui ftv bwt 85 0 19 182 2 0 0 0 1 0 2523 86 0 33 155 3 0 0 0 0 3 2551 87 0 20 105 1 1 0 0 0 1 2557 88 0 21 108 1 1 0 0 1 2 2594 89 0 18 107 1 1 0 0 1 0 2600 91 0 21 124 3 0 0 0 0 0 2622 > bw = birthwt[,c("smoke","bwt")] # we need only these two columns > bw$smoke = factor(bw$smoke, labels=c("no","yes")) > summary(bw) smoke bwt no :115 Min. : 709 yes: 74 1st Qu.:2414 Median :2977 Mean :2945 3rd Qu.:3487 Max. :4990 >rm(birthwt)The data show birthweights (bwt) in grams of babies born to mothers who smoked and mothers who did not smoke. The obvious point of interest is, what's the difference, if any, in the mean birthweights for the two groups? Let's look at the data.
> with(bw, tapply(bwt, smoke, mean)) no yes 3055.696 2771.919 > with(bw, tapply(bwt, smoke, var)) no yes 566492.0 435118.2 > plot(density(bw$bwt[bw$smoke=="yes"])) # not shown > lines(density(bw$bwt[bw$smoke=="no"]),col="red")The means are different by a bit more than 280 grams. The variances are also different, but not too much. The last two lines of code plot overlapping kernel density estimates of the distributions, which show that both of them are mound shaped and not too terribly different from normal. Thus, we will assume that both groups were sampled from the same normal distribution that has parameters equal to the sample mean and variance. (We are assuming the null hypothesis is true, and that our "bwt" variable is sampled from a single normal distribution.) The first step is to create a DV that consists of standardized values as if
they were drawn from a single population with a mean of 0 and a standard
deviation of 1. This can be done with the scale()
function.
> bw$z.bwt = scale(bw$bwt)Now we'll investigate the difference between the two groups on this variable. > with(bw, tapply(z.bwt, smoke, mean)) no yes 0.1523672 -0.2367869 > with(bw, diff(tapply(z.bwt, smoke, mean))) yes -0.3891541The difference is almost four tenths of a standard deviation. The syntax for the parametric bootstrap is: boot(data, statistic, R, sim="parametric", ran.gen) The last two options are new. One of them obviously tells R that we want a parametric bootstrap. The other one is another function that we have to write to generate Monte Carlo samples from the standard normal distribution (assuming it is the normal distribution that we have decided fits our data).bwt.gen = function(data, mle) { rnorm(n=length(data)) }I have no idea why the mle is necessary, but it is. As usual, we also need the function that calculates our test statistic. mean.diff = function(data, i) { mean(data[1:115]) - mean(data[116:189]) }Since we've generated a random vector of standard scores in bwt.gen, I just took the difference in means between the first 115 of them (representing nonsmokers) and the last 74 of them (representing smokers). Once those functions are executed and in the workspace, we can run boot(). > boot.out = boot(data=bw$z.bwt, statistic=mean.diff, R=1000, sim="parametric", ran.gen=bwt.gen)The results are in the usual place. > boot.out PARAMETRIC BOOTSTRAP Call: boot(data = bw$z.bwt, statistic = mean.diff, R = 1000, sim = "parametric", ran.gen = bwt.gen) Bootstrap Statistics : original bias std. error t1* 0.9683415 -0.965374 0.1527679 > summary(boot.out$t) V1 Min. :-0.553122 1st Qu.:-0.099277 Median : 0.006685 Mean : 0.002968 3rd Qu.: 0.104059 Max. : 0.504965 > quantile(boot.out$t, c(.025,.975)) # does not contain 0.3891541, so yay! 2.5% 97.5% -0.3040931 0.3042744 > mean(abs(boot.out$t) >= 0.3891541) # two-tailed p-value [1] 0.011For the record, the parametric p-value from the t-test is .0087. It seems to me that "parametric bootstrap" is just a new name for Monte Carlo simulation. How are they different? Bootstrapping a Regression Analysis First, thanks to John Fox for a very clear example of how to use boot() here, which I found in a document posted
here. Here is where the boot()
function is going to shine, I have to admit. We will look at a problem that we
also examine in the Multiple Regression tutorial,
in which we attempt to predict life expectancy by U.S. state using various
predictors (1977 data).
> states = state.x77 > states = as.data.frame(states) # must be a data frame > colnames(states)[c(4,6)] = c("Life.Exp","HS.Grad") # fix some variable names > lm(Life.Exp ~ Murder + HS.Grad + Frost, data=states) # for reference Call: lm(formula = Life.Exp ~ Murder + HS.Grad + Frost, data = states) Coefficients: (Intercept) Murder HS.Grad Frost 71.036379 -0.283065 0.049949 -0.006912First, we define a function to get the statistic(s) we want, which are the regression coefficients. > boot.states = function(states, i) { + mod = lm(Life.Exp ~ Murder + HS.Grad + Frost, data=states[i,]) + coefficients(mod) + }Once again, the data are not just "states" but states[i,] from the ith iteration of the bootstrap. Now it remains to run boot() and examine the output. > boot.out = boot(data=states, statistic=boot.states, R=1000) > boot.out ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = states, statistic = boot.states, R = 1000) Bootstrap Statistics : original bias std. error t1* 71.036378813 -0.1117470290 1.111679000 # intercept t2* -0.283065168 0.0050440366 0.040044342 # Murder slope t3* 0.049948704 0.0009847503 0.017462937 # HS.Grad slope t4* -0.006911735 0.0002284539 0.003021178 # Frost slope > summary(boot.out$t) V1 V2 V3 V4 Min. :66.42 Min. :-0.3927 Min. :0.002082 Min. :-0.016386 1st Qu.:70.29 1st Qu.:-0.3056 1st Qu.:0.039119 1st Qu.:-0.008695 Median :71.03 Median :-0.2824 Median :0.050579 Median :-0.006672 Mean :70.92 Mean :-0.2780 Mean :0.050933 Mean :-0.006683 3rd Qu.:71.69 3rd Qu.:-0.2548 3rd Qu.:0.061776 3rd Qu.:-0.004735 Max. :73.49 Max. :-0.1007 Max. :0.122853 Max. : 0.003244 > quantile(boot.out$t[,2], c(.025,.975)) # 95% CI for Murder slope 2.5% 97.5% -0.3452561 -0.1883249 > quantile(boot.out$t[,3], c(.025,.975)) # compare this to boot.ci() answer below 2.5% 97.5% 0.01702089 0.08657885 > confint(lm(Life.Exp ~ Murder + HS.Grad + Frost, data=states)) # parametric CIs for comparison 2.5 % 97.5 % (Intercept) 69.05717472 73.015582905 Murder -0.35700149 -0.209128849 HS.Grad 0.01935043 0.080546980 Frost -0.01183825 -0.001985219 > boot.ci(boot.out, type="perc", index=3) # still doesn't give quite the right answer BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 1000 bootstrap replicates CALL : boot.ci(boot.out = boot.out, type = "perc", index = 3) Intervals : Level Percentile 95% ( 0.0169, 0.0867 ) Calculations and Intervals on Original ScaleNow, clean up your workspace in preparation for the next and final section, but KEEP the states data frame. We're going to use it again. Resampling a Oneway ANOVA We will begin by adding a factor, census region, to the "states" data frame,
which hopefully you still have left over from the last section. We're going to
change the value labels on the fly because, once again, spaces in the variable
names! Whose idea was that? (Note: state.region is a built-in data object.)
> states$Region = factor(state.region, labels=c("Northeast","South","Midwest","West")) > summary(states) Population Income Illiteracy Life.Exp Murder Min. : 365 Min. :3098 Min. :0.500 Min. :67.96 Min. : 1.400 1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12 1st Qu.: 4.350 Median : 2838 Median :4519 Median :0.950 Median :70.67 Median : 6.850 Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88 Mean : 7.378 3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89 3rd Qu.:10.675 Max. :21198 Max. :6315 Max. :2.800 Max. :73.60 Max. :15.100 HS.Grad Frost Area Region Min. :37.80 Min. : 0.00 Min. : 1049 Northeast: 9 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985 South :16 Median :53.25 Median :114.50 Median : 54277 Midwest :12 Mean :53.11 Mean :104.46 Mean : 70736 West :13 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81162 Max. :67.30 Max. :188.00 Max. :566432 > boxplot(Income ~ Region,states) # not shown > with(states, tapply(Income, Region, mean)) Northeast South Midwest West 4570.222 4011.938 4611.083 4702.615 > with(states, tapply(Income, Region, sd)) Northeast South Midwest West 559.0771 605.4505 283.0825 663.9004The relationship we are interested in is Income by Region (in case you didn't guess). It looks like we have some pretty substantial differences here, particularly between the South and the other regions. It also looks like we have heterogeneous variances. Monte Carlo Simulation Let's use a Monte Carlo simulation to get an F.star distribution assuming the
null hypothesis is correct, and the four regions are actually samples from one
and the same population, and assuming normality and homogeneity as well.
> data.star = states$Income > mean.star = mean(data.star) > sd.star = 555.0118 # the pooled standard deviation obtained from aov() > IV.star = factor(rep(c("NE","S","MW","W"), times=c(9,16,12,13))) > R = 1000 > F.star = numeric(R) > for (i in 1:R) { + DV.star = c(rnorm(9, mean=mean.star, sd=sd.star), + rnorm(16, mean=mean.star, sd=sd.star), + rnorm(12, mean=mean.star, sd=sd.star), + rnorm(13, mean=mean.star, sd=sd.star)) # a little silly, but... + F.star[i] = oneway.test(DV.star ~ IV.star, var.eq=T)$statistic + }We should now have our simulated F-distribution (F.star) for df=3,46. I'll plot it and then draw the theoretical F(3,46)-distribution over top of it. > hist(F.star, breaks=30, prob=T, xlim=c(0,7), ylim=c(0,.8)) > points(x<-seq(0,7,.1), y=df(x,3,46), type="b", cex=.6, col="red") > oneway.test(Income~Region, data=states, var.eq=T) One-way analysis of means data: Income and Region F = 4.687, num df = 3, denom df = 46, p-value = 0.006136 > mean(F.star >= 4.687) # p-value [1] 0.006That's as close as we could come given 1000 sims. Let's do it again assuming equal means but not assuming equal variances. For reference: > oneway.test(Income ~ Region, data=states) One-way analysis of means (not assuming equal variances) data: Income and Region F = 4.2638, num df = 3.000, denom df = 22.377, p-value = 0.01596 > R = 1000 > F.star = numeric(R) > for (i in 1:R) { + DV.star = c(rnorm(9, mean=mean.star, sd=559.08), + rnorm(16, mean=mean.star, sd=605.45), + rnorm(12, mean=mean.star, sd=283.08), + rnorm(13, mean=mean.star, sd=663.90)) + F.star[i] = oneway.test(DV.star ~ IV.star, var.eq=T)$statistic + } > hist(F.star, breaks=30, prob=T, xlim=c(0,8), ylim=c(0,1)) > points(x<-seq(0,7,.1), y=df(x,3,46), type="b", cex=.6, col="red") > qf(.95, 3, 46) [1] 2.806845 > quantile(F.star, .95) # okay, maybe not 95% 2.78404 > mean(F.star >= 4.2638) # p-value [1] 0.013 Randomization Test One key assumption of the randomization test, not previously mentioned, is
exchangeability. Basically, this means that a subject who ended up in one group
could just have easily have ended up in some other group. The only way to
guarantee exchangeability is to do random assignment to groups. Thus, the
randomization test is often reserved for true, randomized experiments. With
intact groups, it's hard to say that a Missourian could just have easily been
a Californian, or that California could just as easily have ended up in the
Midwest. (They do have some rather "interesting" earthquakes out there. Just
saying, speaking as a person who used to live there!) Nevertheless, just to
illustrate the method, we're going to assume exchangeability.
> income = states$Income # just making the data a little more accessible > region = states$Region > R = 1000 # 1000 reps (good thing we're not doing pushups!) > results = numeric(R) # a place to store the results > for (i in 1:R) { + permuted.income = sample(income, size=50, replace=F) + results[i] = oneway.test(permuted.income ~ region)$statistic + } > mean(results >= 4.2638) # F = 4.2638 is the result from the real data [1] 0.022Assuming the null hypothesis is true, and permuting (resampling without replacement) the income values 1000 times, we find the probability of getting a result greater than or equal to the observed one is 0.022. The test was done without assuming homogeneity of variance. Let's Try boot()ing It Zieffler et al suggest that a good statistic to bootstrap for an ANOVA is
the effect size, eta-squared. Others have suggested the treatment sum of squares,
the treatment mean square, and the variance of the treatment means. As these are
all functions of F and the degrees of freedom, it seems to me that they should
give the same result as bootstrapping F (degrees of freedom being constants for
any given problem). Let's find out.
> aov(income ~ region) Call: aov(formula = income ~ region) Terms: region Residuals Sum of Squares 4331342 14169750 Deg. of Freedom 3 46 Residual standard error: 555.0118 Estimated effects may be unbalanced > 4331342/3 [1] 1443781 > var(tapply(income, region, mean)) [1] 97939.1First we get some reference values. SST = 4331342. MST = 1443781. VarT = 97939.1. I'm going to store these, because I don't want to have to remember them. > SST = 4331342 > MST = 1443781 > VarT = 97939.1 > Eta2 = 4331342 / (4331342 + 14169750) # eta-sq = 0.2341128 > Fobs = (4331342/3) / (14169750/46) # assuming homogeneity, F = 4.687021I'm having second thoughts about the variance of the treatment means, given that this is an unbalanced design. I think it would work if the design were balanced. Now we have to write a function, to be used by the bootstrap, that calculates all these things. To make life easier, I'm going to get rid of the original data frame, make a new one from just the variables we're using, and then erase those from the workspace. > rm(states) > states = data.frame(income, region) > rm(income, region)Okay, here we go! > SS = function(x) sum(x^2) - sum(x)^2/length(x) # a function for sum of squares > boot.states = function(data,i) { + d = data[i,] + aov.out = aov(d$income ~ states$region) + SST.star = SS(d$income) - SS(aov.out$residuals) + MST.star = SST.star / 3 + VarT.star = var(tapply(d$income, states$region, mean)) + Eta2.star = SST.star / SS(d$income) + F.star = MST.star / (SS(aov.out$residuals)/46) + c(SST.star, MST.star, VarT.star, Eta2.star, F.star) + } > boot.states(states) # looks like our function passes the test [1] 4.331342e+06 1.443781e+06 9.793910e+04 2.341127e-01 4.687020e+00 > bout = boot(data=states, statistic=boot.states, R=1000) > summary(bout$t) V1 V2 V3 V4 V5 Min. : 4453 Min. : 1484 Min. : 3649 Min. :0.0002858 Min. :0.004383 1st Qu.: 460119 1st Qu.: 153373 1st Qu.: 75190 1st Qu.:0.0260488 1st Qu.:0.410097 Median : 881114 Median : 293705 Median :112246 Median :0.0495693 Median :0.799704 Mean :1145724 Mean : 381908 Mean :120906 Mean :0.0632042 Mean :1.088080 3rd Qu.:1533472 3rd Qu.: 511157 3rd Qu.:156216 3rd Qu.:0.0879430 3rd Qu.:1.478482 Max. :9006160 Max. :3002053 Max. :390738 Max. :0.3856425 Max. :9.624990Alright! Drove myself a little crazy with that one for awhile! Until I discovered that the proper ANOVA was not aov.out = aov(d$income ~ d$region). That would be assuming the null hypothesis is false. We are assuming the null hypothesis is true, so we have to do the ANOVA against the original (unbootstrapped) arrangement of region. Same thing for the variances of the treatment means (of course). Zieffler also calculated eta-squared from the original SS Total rather than the bootstrapped one, as I have above. I'm not sure what the justification is for that. Now for the moment of truth. > mean(bout$t[,1] >= SST) [1] 0.019 > mean(bout$t[,2] >= MST) [1] 0.019 > mean(bout$t[,3] >= VarT) [1] 0.031 > mean(bout$t[,4] >= Eta2) [1] 0.009 > mean(bout$t[,5] >= Fobs) [1] 0.009Here's what we've learned. Bootstrapping SST and MST give the same result. That's not surprising since one is a linear transform of the other. The variance of the group means would also give the same result if the design were balanced, because it is MST/n per group in those cases. In the many runs of this I did, the bootstrap of the treatment variances consistently resulted in the highest p-value, in this unbalanced case. The unbalanced design has, appropriately, cost us a little power. Bootstrapping eta-squared gives the same result as SST and MST if the original SS Total is used to calculate it (as per Zieffler et al). If not, it gives the same result as bootstrapping the F statistic (as here and as expected, as it is a linear transform of F). Bootstrapping the F statistic consistently gives a slightly smaller p-value than bootstrapping SST and MST (and Eta2 if the original SS Total is used to calculate it). revised 2016 March 6; updated March 8 |