R Tutorials--Reampling Techniques

RESAMPLING TECHNIQUES

Caveat emptor

Resampling techniques depend upon repeated (re)randomization or simulation of a sample. Computers do not generate random numbers. Since an algorithm is used to produce the results of a function like runif(), these results are technically referred to as pseudorandom. I've done a few casual tests of the R random number generator, and it seems to be very good, but I'm not an expert on pseudorandom number generators. So I will begin with a warning: Computer intensive resampling is only as good as your pseudorandom number generator.

Further Notes

I am scratching the surface of resampling techniques in this tutorial. I don't claim to have extensive knowledge, so the examples here will be fairly simple, just to illustrate what resampling is and how it can be done in R.

Speaking of "how it can be done in R," those of you who may have read this tutorial before it's latest revision have had an earful (eyeful?) of my opinion of the boot() function and its help page. FINALLY, someone has written a good, comprehensible explanation of the function (Zieffler, Harring, and Long, 2011, see refs). I've also found some good examples online. I will try to do it a little more justice in this revision.

Concerning the "sleep" data set, which is used here: These data are actually paired scores (see the help page), although they have often been used as an example for unpaired tests. For the sake of illustrating how things work, it doesn't really make much difference, but I just wanted to let you know that I know these are paired scores. We'll use them both ways.

I will illustrate three basic techniques in this tutorial.

Monte Carlo simulation
Jackknife resampling (randomization test)
Bootstrap resampling

Monte Carlo Estimation of Power

Although the term "resampling" is often used to refer to any repeated random or pseudorandom sampling simulation, when the "resampling" is done from a known theoretical distribution, the correct term is "Monte Carlo" simulation. I will use such a scheme here to demonstrate how power can be estimated for the two-sample t-test using Student's sleep data.

> data(sleep)
> str(sleep)
'data.frame':	20 obs. of  3 variables:
 $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
 $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
> attach(sleep)
> tapply(extra, group, mean)
   1    2 
0.75 2.33 
> tapply(extra, group, sd)
       1        2 
1.789010 2.002249 
> tapply(extra, group, length)
 1  2 
10 10

The experiment is on the effect of a soporific drug. All subjects were measured during a baseline (no-drug) period and again during the drug trial, and the "extra" variable is how much additional sleep time they got over baseline during the drug trial. Group 1 is the control (placebo) condition, and Group 2 is the treatment (drug) condition. To make it easier to keep track of this, I'm going to rename the levels of the "group" factor.

> detach(sleep)                             # NEVER modify an attached data frame!
> sleep$group = factor(sleep$group, labels=c("control","treatment"))
> summary(sleep)
     extra              group          ID   
 Min.   :-1.600   control  :10   1      :2  
 1st Qu.:-0.025   treatment:10   2      :2  
 Median : 0.950                  3      :2  
 Mean   : 1.540                  4      :2  
 3rd Qu.: 3.400                  5      :2  
 Max.   : 5.500                  6      :2  
                                 (Other):8  
> with(sleep, tapply(extra, group, mean))   # just a quick check
  control treatment 
     0.75      2.33 
> attach(sleep)                             # looks like I got it right, so reattach

Here is the t-test (treating the scores as if they were from independent samples).

> t.test(extra ~ group, var.eq=T)

	Two Sample t-test

data:  extra by group
t = -1.8608, df = 18, p-value = 0.07919
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.363874  0.203874
sample estimates:
  mean in group control mean in group treatment 
                   0.75                    2.33

We'll use these results as if they were from a pilot study. We now want to know what we have to do to get a publishable result here (i.e., p < .05), assuming our pilot data reasonably represent nature. We didn't quite reach statistical significance with n = 10 per group. How many subjects/group should we use to get a power of 80%?

The power.t.test() function gives us a way to get some idea of the answer. The function wants to be fed several arguments: n = the number of subjects/group, delta = the anticipated difference between the means, sd = the anticipated pooled standard deviation of the two groups, sig.level = the significance level we desire for the test, type = the type of t-test we will use ("two.sample" is the default), and alternative = the nature of the alternative hypothesis ("two.sided" is the default). Exactly one of those arguments should be set to NULL, and the function will estimate it for us from the other values.

> power.t.test(n=NULL, delta=1.58, sd=1.9, sig.level=.05, power=.8)

     Two-sample t test power calculation 

              n = 23.70004
          delta = 1.58
             sd = 1.9
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

We are being told to use 24 subjects/group. We've made the usual t-test assumptions of sampling from normal distributions and equal variances. But the variances were not quite equal in our samples. The SDs were 1.8 and 2.0. Let's use those values to get an estimate of power from a Monte Carlo simulation. We will simulate drawing samples of 24 from normal distributions with means and SDs equal to those we saw in our sample.

What we are about to do is, perhaps, best done with a script. That way, if something goes wrong, we don't have to re-enter multiple lines into the R Console. However, we will live on the edge for the sake of illustration. To get this to work with minimal effort on our part--always a good thing--we will need to get R to do the same set of calculations many times. This is done with a loop, and in R, loops are most often done using for(). The syntax is "for (counter in 1 to number of simulations)". The statements to be looped through repeatedly then follow enclosed in curly braces. I feel a bit like I'm trying to describe an elephant to someone who's never seen one before. Perhaps it would be best to just show you the elephant.

> R = 1000
> alpha = numeric(R)
> for (i in 1:R) {
+    group1 = rnorm(24, mean=.75, sd=1.8)
+    group2 = rnorm(24, mean=2.33, sd=2.0)
+    alpha[i] = t.test(group1, group2, var.eq=T)$p.value
+ }

The first line defines the number of random simulations we want and stores this value in the semi-official data object that R typically uses for such things, namely R. We've done 1000 sims, because that's the number that is normally done. Some people use R=999, others use 10,000, just because they can. Computers are pretty fast these days. 1000 usually gives a reasonably accurate result. The second line creates a numeric vector with 1000 elements, which will hold the result of each simulation. The third line begins our "for loop": "for i (the counter) going from 1 to 1000", do the following stuff. I.e., do the following stuff 1000 times, keeping track by incrementing the counter, i, at each step. Group 1 is generated. Group 2 is generated. The t-test is done, and the p-value is extracted and stored in "alpha", our designated storage place. In each simulation, the storage occurs in the ith position of the alpha vector. The loop was executed as soon as we typed the close curly brace. Now all that remains is to determine how many of those values are less than .05.

> mean(alpha < .05)
[1] 0.787

To do this, we played a nice little trick. Remember, "alpha<.05" generates a logical vector of TRUEs and FALSEs, in which TRUEs count as ones and FALSEs count as zeros. We took the mean of that vector, which is to say, we calculated by a bit of trickery the proportion of TRUEs. Our simulation tells us we have a 78.7% chance of rejecting the null hypothesis under these conditions, not quite the 80% that we desired. Your results will differ, of course, because we are using a randomizing procedure after all.

Notice this made quite a mess of your workspace.

> ls()
[1] "alpha"  "group1" "group2" "i"      "R"      "sleep"
> rm(alpha, group1, group2, i, R)

You might want to clean up a little bit.

Doing It With boot()

In the "boot" library there is a function, boot(), which supposedly automates this process and makes it more convenient. The library is not loaded at start-up, so first we have to load it.

> library("boot")

I should point out, at least for simple bootstraps, I don't see how this method is in any way "convenient." You still have to write your own functions, even when the function you want is one that is predefined in R. Seems to me that doing it this way can actually be MORE work than what we've just done.

Nevertheless, glutton for punishment that I am, I will see if I can get it to work. Here is the syntax for boot, and good luck to you if you have only the help page to guide you!

boot(data, statistic, R, sim = "ordinary", stype = c("i", "f", "w"),
strata = rep(1,n), L = NULL, m = 0, weights = NULL,
ran.gen = function(d, p) d, mle = NULL, simple = FALSE, ...,
parallel = c("no", "multicore", "snow"),
ncpus = getOption("boot.ncpus", 1L), cl = NULL)

Okay, got it?

Nope! I can't find any way to make boot() actually do this. The help page says the function generates bootstrap replicates, so let's see what that's about.

Update

This might do it, but it took a lot of trickery, and I'm not sure it's right. See the section below on parametric bootstrap for an explanation. (This method seems quite obtuse compared to the method that was presented in the previous section, so I'm not going to put a lot of effort into explaining it!)

> ran.gen = function(data, mle) {
+    c(rnorm(24, mean=.75, sd=1.789), rnorm(24, mean=2.33, sd=2.002))
+ }
> alpha = function(data, i) {
+    t.test(data[1:24], data[25:48], var.eq=T)$p.value
+ }
> fakevect = rnorm(48, mean=1.54, sd=1.9)
> b.out = boot(data=fakevect, statistic=alpha, R=1000, sim="parametric", ran.gen=ran.gen)
> mean(b.out$t < .05)
[1] 0.792

The estimated power for groups of 24 each is 79.2%.

The Bootstrap

Bootstrap resampling makes no assumptions about the populations other than what we can know from the samples. Bootstrapping treats the samples as mini-populations and resamples from them with replacement to get what are called bootstrapped samples. These bootstrapped samples are then used to get p-values, estimates of population parameters, confidence intervals, etc. Above, in the "sleep" example, we acknowledged that the variances of the two samples were not homogeneous, but we still assumed that we were sampling from normal distributions. How did we know that? We didn't. (That's why it's called an assumption!) Let's acknowledge that as well.

Here is a bootstrapped sample from sleep$extra, group 1. (I still have "sleep" attached. If you don't, attach it. If you do, DON'T attach it again!)

> sample(extra[1:10], size=24, replace=T)
 [1]  3.4 -0.1  3.7 -0.1  0.7 -0.2  0.0  0.7  3.4  0.7  3.4  0.8  3.7  3.4  2.0  0.7
[17]  0.0 -0.1 -0.2 -0.1  3.7 -1.2 -0.2 -1.2

Your result will be different because sample() takes a random sample from the specified population (first argument). The critical feature here is the sample is taken with replacement. Let's see how many times we can get an alpha less than .05 if we do this from both groups and calculate a t-test. Once again, use a script window if you're uncertain that you can type this correctly on the first try.

> R = 1000
> alpha = numeric(R)
> for (i in 1:R) {
+    group1 = sample(extra[1:10], size=24, replace=T)
+    group2 = sample(extra[11:20], size=24, replace=T)
+    alpha[i] = t.test(group1, group2, var.eq=T)$p.value
+ }
> mean(alpha < .05)
[1] 0.844

The bootstrap is telling us we're okay here with samples of 24 each if we want a power of at least 80%. This is not necessarily a "different" result than the one we got above, because this is based on random (re)sampling, after all, and we will get a different answer every time we do it. You got a different answer than I did, right? I just ran it three more times, however, and my answers are pretty consistently 83-86%, which I have to assume means we are seeing a difference when we resample from the original data without making the normality assumption. Since we didn't make any assumptions about the parent population, this is called a nonparametric bootstrap.

> detach(sleep)              # and clean up; keep sleep

"Automating" It With boot()

Here's one way you could do the same thing with the boot() function, and I'm going to show you this just to illustrate how the function works, sort of. Bear with me if you know better. The function will require three pieces of information from you: the name of the data set (vector, matrix, or data frame), a function that YOU have written that defines the statistic you want to have bootstrapped, and R. I'll call my function "alpha", give it the "sleep" data frame, and I also have to give it an i, which the function will use internally to keep track of what iteration it's on. (Note: I detached sleep before I did this.)

> alpha = function(sleep, i) {
+    group1 = sample(sleep$extra[1:10], 24, replace=T)
+    group2 = sample(sleep$extra[11:20], 24, replace=T)
+    t.test(group1, group2, var.eq=T)$p.value
}
> boot.out = boot(data=sleep, statistic=alpha, R=1000)
> mean(boot.out$t < .05)
[1] 0.852

The model object, boot.out, contains a fair amount of information. Do names(boot.out) to see it all and then don't worry about most of it. The relevant output, the p-values in this case, are in boot.out$t. That's the equivalent of our alpha vector above.

You should be thinking, that was really a pointless exercise! Why wrap the boot function around the same exact thing we did above without the boot function? What's the point of that? I wanted to show you that you have to write your own function to be used by boot, and that the output of that function, applied to your data, is in t inside the boot.out object. Other than that, you're right. It was quite pointless. All we did was duplicate the action of the previous script, but we let boot store the results for us. There's gotta be more to it than that, right?

The problem that boot() has with this example is that we are trying to draw bootstrap samples that are larger than the original data vector. I've seen a few sources where they get around this by using a script similar to the one earlier in this section. Most sources recommend the parametric bootstrap (i.e., Monte Carlo similulation) for doing power problems.

I might point out, however, that with 24 subjects per group and a two-tailed t-test at alpha of .05, you would need an effect size of about 0.42 (Cohen's d) to find a significant difference between the means. How often could we count on getting an effect size that large or larger from data like these?

> cohensd = function(data, i) {
+    d = data[i,]
+    abs(t.test(d$extra ~ d$group, var.eq=T)$statistic * sqrt(2/10))
+ }
> boot.out = boot(data=sleep, statistic=cohensd, R=1000)
> mean(boot.out$t >= .42)
[1] 0.841

Now we see the boot() function in its element. First, we need to write a function to calculate Cohen's d, and this one is based on the easily verifiable fact that d = t * sqrt(2/n) when there are n subjects per group in a t-test. The function is given two arguments, "data" (which will be fed to the boot() function when that is executed) and "i", the indexing variable. We will tell boot that data=sleep, a data frame. The boot() function will take care of permuting that for us. So we hand over each permuted data frame to a variable called d, and then we use that to calculate a t-test, extract the t-value with $statistic, and then multiply by sqrt(2/10) to convert it to Cohen's d, which is always positive, so the absolute value is taken as well. Then we feed the boot() function our data, our function, and R, and let it do it's thing. And it tells us we have about an 84% chance of getting an effect this large or larger from data like these, which is consistent with the last result. (Phew! It took me a long time to think of that!)

Bootstrapping The t-Test

The idea is to consider the pooled sample to be a mini-population of scores. Presumably, we were sampling from a larger parent distribution, but we know nothing about it for certain. The only information we have about the parent distribution is the sample we've obtained from it. Under the null hypothesis, the two groups, control and treatment, were effectively sampled from the same population. Thus, we pool the two samples to emulate that population as best we can.

Then we simply bootstrap two samples from that mini-population and see how likely it is to get a result more extreme than the one we actually observed. We'll need some sort of statistic to determine "extremeness." The difference between means would be adequate. We'll do that below.

It would actually be a little easier to just let R calculate t-values from our bootstrapped samples. That will require a little more internal effort from our hardware, but since computing power is more than up to the task these days, what do we care? There's also an educational lesson here (students!). The problem with just calculating a t-test and looking up critical values in the back of the book is that, if we haven't met the assumptions of the test, then our calculated value of t will not be distributed theoretically as t, so comparing our calculated t to the tables will be a bit off. Therefore, we will calculate a new sampling distribution for the test statistic by resampling from our mini-population WITH replacement, calculating the desired statistic, and taking a look at what we get. (Notice that since we pooled the samples, we are still sort of making a homogeneity of variance assumption by doing this. We are also beginning with the assumption that the null hypothesis is true.)

> with(sleep, t.test(extra ~ group, var.eq=T)$statistic)   # for reference
        t 
-1.860813 
> scores = sleep$extra                           # the data
> R = 1000                                       # the number of replicates
> t.star = numeric(R)                            # storage for the results
> for (i in 1:R) {
+    group1 = sample(scores, size=10, replace=T)
+    group2 = sample(scores, size=10, replace=T)
+    t.star[i] = t.test(group1, group2, var.eq=T)$statistic
+ }

The (estimated or approximated or simulated) sampling distribution of the t-statistic can be visualized in a histogram, and I will plot the obtained t-value as a point on the x-axis. Finally, I'll draw a line over it showing the theoretical t-distribution for df=18.

> hist(t.star, freq=F)
> points(x=-1.8608, y=0, pch=16)
> lines(x<-seq(from=-4, to=4, by=.1), y=dt(x, df=18), col="red")

We can get some quantiles.

> quantile(t.star, c(.025,.05,.5,.950,.975))
       2.5%          5%         50%         95%       97.5% 
-1.92422717 -1.65896181 -0.07050922  1.62722956  2.09215955

The simulation p-value is obtained as previously.

> mean(t.star <= -1.8608)         # one-tailed
[1] 0.031
> mean(abs(t.star) >= 1.8608)     # two-tailed
[1] 0.065

Our reference two-tailed p-value from the "actual" t-test is 0.079.

Automating It With boot()

Let's bootstrap the actual t-test. (Btw, you can bet I'm typing this into a script window!)

> boot.tee = function(data, i) {
+    d = data[i,]
+    c(t.test(d$extra[1:10], d$extra[11:20], var.eq=T)$statistic,
+      t.test(d$extra[1:10], d$extra[11:20], var.eq=T)$p.value)
+ }
> boot.out = boot(data=sleep, statistic=boot.tee, R=1000)

Here's the explanation. First, I defined a function called boot.tee, which is a function of the data and i. Our boot.tee function will find out what the "data" are when we tell that to boot(). I handed each permuted value of the data to a variable called d. (I don't know if this is standard, but it's a trick I learned in Zieffler, et al, and I thought it was very clever.) Then I created a vector containing two output values: the t-value from the t-test, and p-value from that test. Notice when I did that, I did not use a formula interface, DV~IV. I would do that if I was starting with the assumption that the alternative hypothesis is true, as that would cause group 1 values to be bootstrapped from group 1, and group 2 values to be bootstrapped from group 2 of the original data. In other words, when the data frame is permuted, all columns of it are permuted, so scores (extra) remain paired with the group from whence they came in any given row of the permuted data. Instead, I was starting with the assumption that the null hypothesis is true. I let boot permute the data frame into a random order, jumbling up the rows in the process, and then I tested the first ten scores against the last ten.

If I want to check to see if this function is working properly, I can run it on the original data frame and see if it gives me the right answers.

> boot.tee(sleep)
          t             
-1.86081347  0.07918671

Then I ran the boot() function (and immediately got an error because it's a new day, and I forgot to load the "boot" library). Let's see what we got.

> boot.out

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = sleep, statistic = boot.tee, R = 1000)

Bootstrap Statistics :
       original    bias    std. error
t1* -1.86081347 1.8628096   1.0646010
t2*  0.07918671 0.4121642   0.2867875

I get output for both items I calculated as part of the statistic. First, t1* is the t-value, and second, t2* is the p-value. The first thing I'm told is what the values are when the function is run on the original data ("original"). Then I'm told how far off the bootstrapped values are from that ("bias"), on the average. Finally, I'm given the standard deviation of the 1000 bootstrapped value ("std. error").

> summary(boot.out$t)                  # boot.out$t is a matrix in this case
       V1                  V2           
 Min.   :-4.029412   Min.   :0.0007866  
 1st Qu.:-0.685015   1st Qu.:0.2468949  
 Median :-0.011462   Median :0.4913727  
 Mean   : 0.001996   Mean   :0.4913510  
 3rd Qu.: 0.722583   3rd Qu.:0.7358411  
 Max.   : 3.242622   Max.   :1.0000000

The actual bootstrapped values are inside boot.out in a variable called t, so I asked for a summary of that. V1 is the first thing I calculated (t-value), and V2 is the second thing I calculated (p-value). Notice that the means in both cases are equal to original + bias from the first output. For finer grained info, we'll run quantile(), and we MUST remember to ask for a specific column of the t matrix. Otherwise, we'll be getting quantiles from the t-values and p-values all mixed together. Not useful!

> quantile(boot.out$t[,1], c(.025,.05,.5,.95,.975))
       2.5%          5%         50%         95%       97.5% 
-2.02421421 -1.75560123 -0.01146238  1.77771580  2.10951478

The result is very similar to what we got in t.star, but not exactly the same because this was a new set of permutations of the data. The p-values also differ a bit.

> mean(boot.out$t[,1] < -1.8608)       # one-tailed
[1] 0.038
> mean(abs(boot.out$t[,1]) >= 1.8608)  # two-tailed
[1] 0.081

Note: There is a function in the "boot" library called boot.ci(), but the output doesn't agree with above, and I'm not sure why. As usual, the help page wasn't much help. The explanation is NOT that the output object is a matrix, because it doesn't work (for me) if only one stat is calculated. So I don't use this function! Here is an example of how to use it from stackoverflow.com.

> x <- rnorm(100, 1, .5)
> b <- boot(x, function(u,i) mean(u[i]), R = 999)
> boot.ci(b, type = c("norm", "basic", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates

CALL : 
boot.ci(boot.out = b, type = c("norm", "basic", "perc"))

Intervals : 
Level      Normal              Basic              Percentile     
95%   ( 0.962,  1.206 )   ( 0.961,  1.205 )   ( 0.963,  1.206 )  
Calculations and Intervals on Original Scale

> quantile(b$t,c(.025,.975))           # for comparison
     2.5%     97.5% 
0.9632601 1.2059236

Seems to work here. Didn't work above. Good luck!

Note: When a test statistic such as a Student's t is bootstrapped, Zieffler, Harring, and Long (2011, see refs) call this bootstrapping using a pivot statistic (p. 163). They imply that it's a bit safer than the parametric bootstrap, which is outlined below, because fewer population parameters must be "known."

Bootstrapping The Difference in Means

Assuming the Null Hypothesis is True

First, we'll combine all the DV values into one "population" and pull both bootstrap samples from that. Our obtained value for the difference in means is 1.58, for reference.

> mean.diff = function(data, i) {
+    d = data[i,]
+    mean(d$extra[11:20]) - mean(d$extra[1:10])
+ }

> mean.diff(sleep)      # test function on original data
[1] 1.58

> boot.out = boot(data=sleep, statistic=mean.diff, R=1000)

> quantile(boot.out$t, c(.025,.05,.5,.95,.975))
    2.5%       5%      50%      95%    97.5% 
-1.72025 -1.44050 -0.04000  1.44050  1.67025 

> mean(boot.out$t >= 1.58)        # one-tailed p-value
[1] 0.035
> mean(abs(boot.out$t) >= 1.58)   # two-tailed p-value
[1] 0.068

The bootstrapped 95% CI on the mean difference values consistent with the null is -1.72 to 1.67. This interval includes the reference value.

In this case, we handed our function the data (to be specified in boot) and an indexing variable i, we passed off the current (ith) bootstrap of the data to d. The data is to be a data frame, so two indices are needed on data[i,]. The first index, i, contains a bootstrapped sample of the rows 1:20. The second is blank, which means all columns. Since the rows of the data have now been randomly permuted, or actually randomly bootstrap sampled, any arbitrary 10 values can be used to calculate the first mean, and the other 10 values the second mean. The easiest way is 11:20 for the first mean and 1:10 for the second mean, or vice versa. (Note: This method is illustrated by Zieffler et al, p. 160.)

The same procedure without boot() would look like this.

> data = sleep
> R = 1000
> results = numeric(R)
> for (i in 1:R) {
+    indices = sample(1:20, size=20, replace=T)
+    d = data[indices,]
+    results[i] = mean(d$extra[11:20]) - mean(d$extra[1:10])
+ }
> quantile(results, c(.025,.05,.5,.95,.975))
    2.5%       5%      50%      95%    97.5% 
-1.66000 -1.40150 -0.00500  1.41050  1.70025
> mean(abs(results) >= 1.58)
[1] 0.067

Not Assuming the Null Hypothesis is True

This time we will take bootstrap samples from the individual groups. For reference, the obtained mean difference is 1.58, and the parametric 95% confidence interval is -0.20 to +3.36.

> mean.diff = function(data, i) {
+    d = data[i,]
+    mean(d$extra[d$group==2]) - mean(d$extra[d$group==1])
+ }

> mean.diff(sleep)      # test function on original data
[1] 1.58

> boot.out = boot(data=sleep, statistic=mean.diff, R=1000, strata=sleep$group)

> quantile(boot.out$t, c(.025,.05,.5,.95,.975))
    2.5%       5%      50%      95%    97.5% 
-0.03000  0.23950  1.61000  2.87000  3.09075

> mean(boot.out$t <= 0)      # one-tailed p-value
[1] 0.028

The bootstrapped 95% CI on the mean difference is -0.03 to 3.09, a little narrower than the parametric value. I've run this many times, and the lower bound of the CI jumps back and forth across 0. Even at R=10000, the lower bound can't quite decide which side of 0 it wants to be on. So this is a close one.

When the order of rows in the data frame is permuted (bootstrapped), the group information remains with the data values. So this time the mean.diff function calculates the difference between a mean of values that came from group 2 and a mean of values that came from group 1. The strata= option is not actually necessary to make the procedure run and give an answer, but it seems to make a slight difference in the width of the CI, so I'm not sure how it's used, but it should be included to tell the procedure that the two groups are being sampled separately. (Note: This method is illustrated by Zieffler et al, p. 190.)

The same procedure without boot() would look like this.

> data = sleep
> R = 1000
> results = numeric(R)
> for (i in 1:R) {
+    indices = sample(1:20, size=20, replace=T)
+    d = data[indices,]
+    results[i] = mean(d$extra[d$group==2]) - mean(d$extra[d$group==1])
+ }
> quantile(results, c(.025,.05,.5,.95,.975))
     2.5%        5%       50%       95%     97.5% 
0.0248750 0.1899048 1.5415862 2.9043119 3.1559799
> mean(results <= 0)
[1] 0.024

Important note: How do you get a poor result from a bootstrap? Use small samples. While I've used a small sample case above to illustrate the methods, you should probably be wary of using bootstrap in such cases.

Permutation and Randomization Tests: The Jackknife

I will use the sleep data to illustrate our next topic as well, which is permutation tests.

The logic behind a permutation test is straightforward. It says, "Okay, these are the twenty (in this example) scores we got, and the way they are divided up between the two groups is one possible permutation of them. There are, in fact...

> choose(20,10)
[1] 184756

...possible permutations (technically combinations--but same idea) of these data into two groups of ten, and each is equally likely to occur if we were to pick one at random out of a hat. Most of these permutations would give no or little difference between the group means, but a few of them would give large differences. How extreme is the obtained case?" In other words, the logic of the permutation test is quite similar to the logic of the t-test. If the obtained case is in the most extreme 5% of possible results, then we reject the null hypothesis of no difference between the means (assuming we were looking at differences between means). The advantage of the permutation test is, it does not make any assumption about normality, and in fact, doesn't make any assumption at all about parent distributions. Furthermore, a permutation test is generally more powerful than a "traditional" nonparametric test.

The disadvantage of a permutation test is the number of permutations that must be generated. Even small samples, as we have here, will choke a typical desktop computer with repetitive calculations. Therefore, instead of generating all possible permutations, we generate only a random sample of them. This procedure is often called a "randomization test" instead of a permutation test. Let's see the elephant, and then I'll explain.

> ### Begin by cleaning up your workspace, if you haven't already.
> R = 1000
> scores = sleep$extra
> t.star = numeric(R)
> for (i in 1:R) {
+    index = sample(1:20, size=10, replace=F)
+    group1 = scores[index]
+    group2 = scores[-index]
+    t.star[i] = t.test(group1, group2, var.eq=T)$statistic
+ }

First, we set the number of sims (replications, resamplings) to 1000. Then we made a copy of the data, which was really unnecessary. Then we established a storage vector for the results of the sims. Then we began looping. Inside the loop, we took a random sample of the vector 1 to 20, without replacement, and stored that in an object called "index". These index values were then used to extract the first resampled group of scores. The second group of scores was extracted using a trick, which in words goes, "for group two take all the scores that are not indexed by the values in index." In other words, put the remaining scores in group two. Then we did the t-test and stored the test statistic (t-value). It now remains to compare those simulated t-values to the one we obtained above when we did the actual t-test.

> mean(abs(t.star) >= 1.8608)          # two-tailed test
[1] 0.072

The resampling scheme tells us that a little more than 7% of the sims gave us a result as extreme or more extreme than the obtained case. This is the p-value resulting from the randomization test. We can see that it is pretty close to the p-value obtained in the actual t-test above (.0794). The difference may be nothing more than random noise. (Note: Once again, your result will be different due to the randomization done by sample().)

The bootstrap method simulates random sampling from a population. The randomization method, sometimes called jackknife resampling, simulates random assignment to groups. It's usually considered the preferred method if your subjects are volunteers or you snatched them out of the hallway during change of classes.

Another Way

There are as many ways to do randomization tests as there are statistics you can think up to assess the difference between the groups. Here is another way. Notice in this case I've use a different, but equally effective, method of resampling as well. I merely rearrange (permute) the scores and use the original grouping variable to index them.

> R = 1000
> scores = sleep$extra
> groups = sleep$group
> mean.diff = numeric(1000)
> for (i in 1:R) {
+    perm.scores = sample(scores, size=20)
+    mean.diff[i] = diff(tapply(perm.scores, groups, mean))
+ }
> tapply(scores, groups, mean)
   1    2 
0.75 2.33 
> 2.33 - .75            # the obtained difference
[1] 1.58
> mean(abs(mean.diff) >= 1.58)
[1] 0.086

The result is similar. We'd get a narrower range of results if the samples were larger.

Paired Scores

"Hey! These are paired scores. Deal with that!" First, let's get some reference info from the standard t-test for paired scores.

> t.test(extra ~ group, data=sleep, paired=T)

	Paired t-test

data:  extra by group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean of the differences 
                  -1.58

The logic behind randomization in this case is that it is a matter of chance which of the paired scores occurred in the control group and which occurred in the treatment group. In other words, when we look at the sign on the difference scores, it's a coin flip as to whether than sign will be positive or negative. The difference scores are...

> diffs = sleep$extra[1:10] - sleep$extra[11:20]
> diffs
 [1] -1.2 -2.4 -1.3 -1.3  0.0 -1.0 -1.8 -0.8 -4.6 -1.4

...and a lot of them are negative, but the null hypothesis says that's just dumb luck. Given that those are the difference scores, it's a matter of pure chance what sign they have. And there is the randomization we need to justify a randomization test.

> R = 1000                                       # we'll do 1000 sims
> absdiffs = abs(diffs)                          # strip the signs
> results = numeric(R)                           # create a storage vector
> for (i in 1:R) {                               # begin looping
+    sign = sample(c(1,-1), 10, replace=T)       # get 10 random signs
+    results[i] = mean(absdiffs * sign)          # get mean of diffs with new signs
+ }
> quantile(results, c(.025,.975))                # 95% of them fall in what interval?
 2.5% 97.5% 
-1.16  1.14
> mean(abs(results) >= 1.58)                     # two-tailed p-value
[1] 0.006

The result is not too different from what the t-test revealed. You can completely clear your workspace now. We are done with "sleep".

About Those p-Values

Some sources propose that the p-values derived as above are ever so slightly too small and that a correction should be applied that can be calculated like this.

p.corrected = (pR + 1) / (R + 1)

Where p is the probability as calculated above and R is the number of replications. With R = 1000, the largest correction this will ever result in is less than .001, which is undoubtedly well within the boundaries of uncertainty on the p-value anyway. To apply a correction to the third or fourth decimal place of a value that's probably not entirely accurate in the second decimal place strikes me as being a little fussy. But there you have it. Use it as you see fit. With a smaller number of reps, it could be important.

Note on terminology: As we are taking a random sample of possible permutations of the data in the jackknife, jackknife resampling can reasonably be referred to as Monto Carlo resampling, and often is.

Bootstrapped Confidence Intervals

Bootstrap resampling is useful for estimating confidence intervals from samples when the sample is from an unknown (and clearly nonnormal) distribution. The data set "crabs" in the MASS package supplies an example. We will look at the carapace length of blue crabs.

> data(crabs, package="MASS")
> cara = crabs$CL[crabs$sp=="B"]
> summary(cara)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14.70   24.85   30.10   30.06   34.60   47.10 
> length(cara)
[1] 100
> qqnorm(cara)                         # not shown

Hmmmm, that looks reasonably normal to me, but who can say for sure? So to get a 95% CI for the mean, we will use a bootstrap scheme rather than methods based on a normal distribution. But just for reference purposes, the normal distribution method would give the following result.

> t.test(cara)$conf.int
[1] 28.68835 31.42765
attr(,"conf.level")
[1] 0.95

The procedure is to take a large number of bootstrap samples and to examine the desired statistic.

> R = 1000
> boot.means = numeric(R)
> for (i in 1:R) {
+    boot.sample = sample(cara, 100, replace=T)
+    boot.means[i] = mean(boot.sample)
+ }
> quantile(boot.means, c(.025,.975))
    2.5%    97.5% 
28.71027 31.37937

The bootstrapped CI is reasonably close to the one obtained by traditional methods, so the normal approximation seems to be reasonable for these data. Extension to more complex problems is straightforward.

Doing It With boot()

I have to admit that using boot() may be just a smidge easier in this case.

> boot.means = function(cara, i) {mean(cara[i])}
> boot.out = boot(data=cara, statistic=boot.means, R=1000)
> quantile(boot.out$t, c(.025,.975))
    2.5%    97.5% 
28.75858 31.39305

But certainly no less cryptic! It may also be useful just too look at the output of boot() directly.

> boot.out

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = cara, statistic = boot.means, R = 1000)

Bootstrap Statistics :
    original   bias    std. error
t1*   30.058 0.015603   0.6902185

The value "original" is the mean of the original data vector "cara". The value "bias" gives the difference between "original" and the mean of the bootstrapped values for this statistic. Clearly, "bias" ought to be close to zero. The value "std. error" is the standard deviation of the bootstrapped means.

Bootstrapping is more usefully applied to statistics like the median, for which there is no formula for CIs when the distribution is not normal.

> boot.medians = function(cara, i) {median(cara[i])}
> boot.out = boot(data=cara, statistic=boot.medians, R=1000)
> quantile(boot.out$t, c(.025,.975))
 2.5% 97.5% 
27.85 32.40 
> boot.out

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = cara, statistic = boot.medians, R = 1000)

Bootstrap Statistics :
    original  bias    std. error
t1*     30.1 0.08145    1.461501

Clean up your workspace. It's a mess!

The Parametric Bootstrap

So far, and in future sections as well, what we've been doing is called the nonparametric bootstrap, because no assumptions are being made about population parameters. We will use some data from the MASS package to illustrate an alternative method called the parametric bootstrap.

> data(birthwt, package="MASS")
> head(birthwt)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
88   0  21 108    1     1   0  0  1   2 2594
89   0  18 107    1     1   0  0  1   0 2600
91   0  21 124    3     0   0  0  0   0 2622

> bw = birthwt[,c("smoke","bwt")]      # we need only these two columns
> bw$smoke = factor(bw$smoke, labels=c("no","yes"))
> summary(bw)
 smoke          bwt      
 no :115   Min.   : 709  
 yes: 74   1st Qu.:2414  
           Median :2977  
           Mean   :2945  
           3rd Qu.:3487  
           Max.   :4990
>rm(birthwt)

The data show birthweights (bwt) in grams of babies born to mothers who smoked and mothers who did not smoke. The obvious point of interest is, what's the difference, if any, in the mean birthweights for the two groups?

Let's look at the data.

> with(bw, tapply(bwt, smoke, mean))
      no      yes 
3055.696 2771.919 
> with(bw, tapply(bwt, smoke, var))
      no      yes 
566492.0 435118.2
> plot(density(bw$bwt[bw$smoke=="yes"]))              # not shown
> lines(density(bw$bwt[bw$smoke=="no"]),col="red")

The means are different by a bit more than 280 grams. The variances are also different, but not too much. The last two lines of code plot overlapping kernel density estimates of the distributions, which show that both of them are mound shaped and not too terribly different from normal. Thus, we will assume that both groups were sampled from the same normal distribution that has parameters equal to the sample mean and variance. (We are assuming the null hypothesis is true, and that our "bwt" variable is sampled from a single normal distribution.)

The first step is to create a DV that consists of standardized values as if they were drawn from a single population with a mean of 0 and a standard deviation of 1. This can be done with the scale() function.

> bw$z.bwt = scale(bw$bwt)

Now we'll investigate the difference between the two groups on this variable.

> with(bw, tapply(z.bwt, smoke, mean))
        no        yes 
 0.1523672 -0.2367869 
> with(bw, diff(tapply(z.bwt, smoke, mean)))
       yes 
-0.3891541

The difference is almost four tenths of a standard deviation. The syntax for the parametric bootstrap is:

boot(data, statistic, R, sim="parametric", ran.gen)

The last two options are new. One of them obviously tells R that we want a parametric bootstrap. The other one is another function that we have to write to generate Monte Carlo samples from the standard normal distribution (assuming it is the normal distribution that we have decided fits our data).

bwt.gen = function(data, mle) {
   rnorm(n=length(data))
}

I have no idea why the mle is necessary, but it is. As usual, we also need the function that calculates our test statistic.

mean.diff = function(data, i) {
   mean(data[1:115]) - mean(data[116:189])
}

Since we've generated a random vector of standard scores in bwt.gen, I just took the difference in means between the first 115 of them (representing nonsmokers) and the last 74 of them (representing smokers). Once those functions are executed and in the workspace, we can run boot().

> boot.out = boot(data=bw$z.bwt, statistic=mean.diff, R=1000, sim="parametric", ran.gen=bwt.gen)

The results are in the usual place.

> boot.out

PARAMETRIC BOOTSTRAP

Call:
boot(data = bw$z.bwt, statistic = mean.diff, R = 1000, sim = "parametric", 
    ran.gen = bwt.gen)

Bootstrap Statistics :
     original    bias    std. error
t1* 0.9683415 -0.965374   0.1527679

> summary(boot.out$t)
       V1           
 Min.   :-0.553122  
 1st Qu.:-0.099277  
 Median : 0.006685  
 Mean   : 0.002968  
 3rd Qu.: 0.104059  
 Max.   : 0.504965  

> quantile(boot.out$t, c(.025,.975))   # does not contain 0.3891541, so yay!
      2.5%      97.5% 
-0.3040931  0.3042744 

> mean(abs(boot.out$t) >= 0.3891541)   # two-tailed p-value
[1] 0.011

For the record, the parametric p-value from the t-test is .0087.

It seems to me that "parametric bootstrap" is just a new name for Monte Carlo simulation. How are they different?

Bootstrapping a Regression Analysis

First, thanks to John Fox for a very clear example of how to use boot() here, which I found in a document posted here. Here is where the boot() function is going to shine, I have to admit. We will look at a problem that we also examine in the Multiple Regression tutorial, in which we attempt to predict life expectancy by U.S. state using various predictors (1977 data).

> states = state.x77
> states = as.data.frame(states)                           # must be a data frame
> colnames(states)[c(4,6)] = c("Life.Exp","HS.Grad")       # fix some variable names
> lm(Life.Exp ~ Murder + HS.Grad + Frost, data=states)     # for reference

Call:
lm(formula = Life.Exp ~ Murder + HS.Grad + Frost, data = states)

Coefficients:
(Intercept)       Murder      HS.Grad        Frost  
  71.036379    -0.283065     0.049949    -0.006912

First, we define a function to get the statistic(s) we want, which are the regression coefficients.

> boot.states = function(states, i) {
+    mod = lm(Life.Exp ~ Murder + HS.Grad + Frost, data=states[i,])
+    coefficients(mod)
+ }

Once again, the data are not just "states" but states[i,] from the ith iteration of the bootstrap. Now it remains to run boot() and examine the output.

> boot.out = boot(data=states, statistic=boot.states, R=1000)
> boot.out

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = states, statistic = boot.states, R = 1000)

Bootstrap Statistics :
        original        bias    std. error
t1* 71.036378813 -0.1117470290 1.111679000       # intercept
t2* -0.283065168  0.0050440366 0.040044342       # Murder slope
t3*  0.049948704  0.0009847503 0.017462937       # HS.Grad slope
t4* -0.006911735  0.0002284539 0.003021178       # Frost slope

> summary(boot.out$t)
       V1              V2                V3                 V4           
 Min.   :66.42   Min.   :-0.3927   Min.   :0.002082   Min.   :-0.016386  
 1st Qu.:70.29   1st Qu.:-0.3056   1st Qu.:0.039119   1st Qu.:-0.008695  
 Median :71.03   Median :-0.2824   Median :0.050579   Median :-0.006672  
 Mean   :70.92   Mean   :-0.2780   Mean   :0.050933   Mean   :-0.006683  
 3rd Qu.:71.69   3rd Qu.:-0.2548   3rd Qu.:0.061776   3rd Qu.:-0.004735  
 Max.   :73.49   Max.   :-0.1007   Max.   :0.122853   Max.   : 0.003244

> quantile(boot.out$t[,2], c(.025,.975))         # 95% CI for Murder slope
      2.5%      97.5% 
-0.3452561 -0.1883249 
> quantile(boot.out$t[,3], c(.025,.975))         # compare this to boot.ci() answer below
      2.5%      97.5% 
0.01702089 0.08657885 

> confint(lm(Life.Exp ~ Murder + HS.Grad + Frost, data=states))      # parametric CIs for comparison
                  2.5 %       97.5 %
(Intercept) 69.05717472 73.015582905
Murder      -0.35700149 -0.209128849
HS.Grad      0.01935043  0.080546980
Frost       -0.01183825 -0.001985219

> boot.ci(boot.out, type="perc", index=3)        # still doesn't give quite the right answer
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot.out, type = "perc", index = 3)

Intervals : 
Level     Percentile     
95%   ( 0.0169,  0.0867 )  
Calculations and Intervals on Original Scale

Now, clean up your workspace in preparation for the next and final section, but KEEP the states data frame. We're going to use it again.

Resampling a Oneway ANOVA

We will begin by adding a factor, census region, to the "states" data frame, which hopefully you still have left over from the last section. We're going to change the value labels on the fly because, once again, spaces in the variable names! Whose idea was that? (Note: state.region is a built-in data object.)

> states$Region = factor(state.region, labels=c("Northeast","South","Midwest","West"))
> summary(states)
   Population        Income       Illiteracy       Life.Exp         Murder      
 Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96   Min.   : 1.400  
 1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12   1st Qu.: 4.350  
 Median : 2838   Median :4519   Median :0.950   Median :70.67   Median : 6.850  
 Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88   Mean   : 7.378  
 3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89   3rd Qu.:10.675  
 Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60   Max.   :15.100  
    HS.Grad          Frost             Area              Region  
 Min.   :37.80   Min.   :  0.00   Min.   :  1049   Northeast: 9  
 1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985   South    :16  
 Median :53.25   Median :114.50   Median : 54277   Midwest  :12  
 Mean   :53.11   Mean   :104.46   Mean   : 70736   West     :13  
 3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81162                 
 Max.   :67.30   Max.   :188.00   Max.   :566432

> boxplot(Income ~ Region,states)                # not shown
> with(states, tapply(Income, Region, mean))
Northeast     South   Midwest      West 
 4570.222  4011.938  4611.083  4702.615 
> with(states, tapply(Income, Region, sd))
Northeast     South   Midwest      West 
 559.0771  605.4505  283.0825  663.9004

The relationship we are interested in is Income by Region (in case you didn't guess). It looks like we have some pretty substantial differences here, particularly between the South and the other regions. It also looks like we have heterogeneous variances.

Monte Carlo Simulation

Let's use a Monte Carlo simulation to get an F.star distribution assuming the null hypothesis is correct, and the four regions are actually samples from one and the same population, and assuming normality and homogeneity as well.

> data.star = states$Income
> mean.star = mean(data.star)
> sd.star = 555.0118              # the pooled standard deviation obtained from aov()
> IV.star = factor(rep(c("NE","S","MW","W"), times=c(9,16,12,13)))
> R = 1000
> F.star = numeric(R)
> for (i in 1:R) {
+    DV.star = c(rnorm(9, mean=mean.star, sd=sd.star),
+                rnorm(16, mean=mean.star, sd=sd.star),
+                rnorm(12, mean=mean.star, sd=sd.star),
+                rnorm(13, mean=mean.star, sd=sd.star))    # a little silly, but...
+    F.star[i] = oneway.test(DV.star ~ IV.star, var.eq=T)$statistic
+ }

We should now have our simulated F-distribution (F.star) for df=3,46. I'll plot it and then draw the theoretical F(3,46)-distribution over top of it.

> hist(F.star, breaks=30, prob=T, xlim=c(0,7), ylim=c(0,.8))
> points(x<-seq(0,7,.1), y=df(x,3,46), type="b", cex=.6, col="red")

They are very similar, which is not surprising given the way F.star was created. For reference:

> oneway.test(Income~Region, data=states, var.eq=T)

	One-way analysis of means

data:  Income and Region
F = 4.687, num df = 3, denom df = 46, p-value = 0.006136

> mean(F.star >= 4.687)                # p-value
[1] 0.006

That's as close as we could come given 1000 sims. Let's do it again assuming equal means but not assuming equal variances. For reference:

> oneway.test(Income ~ Region, data=states)

	One-way analysis of means (not assuming equal variances)

data:  Income and Region
F = 4.2638, num df = 3.000, denom df = 22.377, p-value = 0.01596

> R = 1000
> F.star = numeric(R)
> for (i in 1:R) {
+    DV.star = c(rnorm(9, mean=mean.star, sd=559.08),
+                rnorm(16, mean=mean.star, sd=605.45),
+                rnorm(12, mean=mean.star, sd=283.08),
+                rnorm(13, mean=mean.star, sd=663.90))
+    F.star[i] = oneway.test(DV.star ~ IV.star, var.eq=T)$statistic
+ }

> hist(F.star, breaks=30, prob=T, xlim=c(0,8), ylim=c(0,1))
> points(x<-seq(0,7,.1), y=df(x,3,46), type="b", cex=.6, col="red")

Looks like we might be a little heavier in the tails this time.

> qf(.95, 3, 46)
[1] 2.806845
> quantile(F.star, .95)                # okay, maybe not
    95% 
2.78404

> mean(F.star >= 4.2638)               # p-value
[1] 0.013

Randomization Test

One key assumption of the randomization test, not previously mentioned, is exchangeability. Basically, this means that a subject who ended up in one group could just have easily have ended up in some other group. The only way to guarantee exchangeability is to do random assignment to groups. Thus, the randomization test is often reserved for true, randomized experiments. With intact groups, it's hard to say that a Missourian could just have easily been a Californian, or that California could just as easily have ended up in the Midwest. (They do have some rather "interesting" earthquakes out there. Just saying, speaking as a person who used to live there!) Nevertheless, just to illustrate the method, we're going to assume exchangeability.

> income = states$Income          # just making the data a little more accessible
> region = states$Region
> R = 1000                        # 1000 reps (good thing we're not doing pushups!)
> results = numeric(R)            # a place to store the results
> for (i in 1:R) {
+    permuted.income = sample(income, size=50, replace=F)
+    results[i] = oneway.test(permuted.income ~ region)$statistic
+ }
> mean(results >= 4.2638)         # F = 4.2638 is the result from the real data
[1] 0.022

Assuming the null hypothesis is true, and permuting (resampling without replacement) the income values 1000 times, we find the probability of getting a result greater than or equal to the observed one is 0.022. The test was done without assuming homogeneity of variance.

Let's Try boot()ing It

Zieffler et al suggest that a good statistic to bootstrap for an ANOVA is the effect size, eta-squared. Others have suggested the treatment sum of squares, the treatment mean square, and the variance of the treatment means. As these are all functions of F and the degrees of freedom, it seems to me that they should give the same result as bootstrapping F (degrees of freedom being constants for any given problem). Let's find out.

> aov(income ~ region)
Call:
   aov(formula = income ~ region)

Terms:
                  region Residuals
Sum of Squares   4331342  14169750
Deg. of Freedom        3        46

Residual standard error: 555.0118
Estimated effects may be unbalanced
> 4331342/3
[1] 1443781
> var(tapply(income, region, mean))
[1] 97939.1

First we get some reference values. SST = 4331342. MST = 1443781. VarT = 97939.1. I'm going to store these, because I don't want to have to remember them.

> SST = 4331342
> MST = 1443781
> VarT = 97939.1
> Eta2 = 4331342 / (4331342 + 14169750)     # eta-sq = 0.2341128
> Fobs = (4331342/3) / (14169750/46)        # assuming homogeneity, F = 4.687021

I'm having second thoughts about the variance of the treatment means, given that this is an unbalanced design. I think it would work if the design were balanced. Now we have to write a function, to be used by the bootstrap, that calculates all these things. To make life easier, I'm going to get rid of the original data frame, make a new one from just the variables we're using, and then erase those from the workspace.

> rm(states)
> states = data.frame(income, region)
> rm(income, region)

Okay, here we go!

> SS = function(x)  sum(x^2) - sum(x)^2/length(x)     # a function for sum of squares
> boot.states = function(data,i) {
+    d = data[i,]
+    aov.out = aov(d$income ~ states$region)
+    SST.star = SS(d$income) - SS(aov.out$residuals)
+    MST.star = SST.star / 3
+    VarT.star = var(tapply(d$income, states$region, mean))
+    Eta2.star = SST.star / SS(d$income)
+    F.star = MST.star / (SS(aov.out$residuals)/46)
+    c(SST.star, MST.star, VarT.star, Eta2.star, F.star)
+ }
> boot.states(states)             # looks like our function passes the test
[1] 4.331342e+06 1.443781e+06 9.793910e+04 2.341127e-01 4.687020e+00
> bout = boot(data=states, statistic=boot.states, R=1000)
> summary(bout$t)
       V1                V2                V3               V4                  V5          
 Min.   :   4453   Min.   :   1484   Min.   :  3649   Min.   :0.0002858   Min.   :0.004383  
 1st Qu.: 460119   1st Qu.: 153373   1st Qu.: 75190   1st Qu.:0.0260488   1st Qu.:0.410097  
 Median : 881114   Median : 293705   Median :112246   Median :0.0495693   Median :0.799704  
 Mean   :1145724   Mean   : 381908   Mean   :120906   Mean   :0.0632042   Mean   :1.088080  
 3rd Qu.:1533472   3rd Qu.: 511157   3rd Qu.:156216   3rd Qu.:0.0879430   3rd Qu.:1.478482  
 Max.   :9006160   Max.   :3002053   Max.   :390738   Max.   :0.3856425   Max.   :9.624990

Alright! Drove myself a little crazy with that one for awhile! Until I discovered that the proper ANOVA was not aov.out = aov(d$income ~ d$region). That would be assuming the null hypothesis is false. We are assuming the null hypothesis is true, so we have to do the ANOVA against the original (unbootstrapped) arrangement of region. Same thing for the variances of the treatment means (of course). Zieffler also calculated eta-squared from the original SS Total rather than the bootstrapped one, as I have above. I'm not sure what the justification is for that. Now for the moment of truth.

> mean(bout$t[,1] >= SST)
[1] 0.019
> mean(bout$t[,2] >= MST)
[1] 0.019
> mean(bout$t[,3] >= VarT)
[1] 0.031
> mean(bout$t[,4] >= Eta2)
[1] 0.009
> mean(bout$t[,5] >= Fobs)
[1] 0.009

Here's what we've learned. Bootstrapping SST and MST give the same result. That's not surprising since one is a linear transform of the other. The variance of the group means would also give the same result if the design were balanced, because it is MST/n per group in those cases. In the many runs of this I did, the bootstrap of the treatment variances consistently resulted in the highest p-value, in this unbalanced case. The unbalanced design has, appropriately, cost us a little power. Bootstrapping eta-squared gives the same result as SST and MST if the original SS Total is used to calculate it (as per Zieffler et al). If not, it gives the same result as bootstrapping the F statistic (as here and as expected, as it is a linear transform of F). Bootstrapping the F statistic consistently gives a slightly smaller p-value than bootstrapping SST and MST (and Eta2 if the original SS Total is used to calculate it).

revised 2016 March 6; updated March 8