Bootstrapping confidence intervals

Another resampling technique that is used commonly in the biological sciences is bootstrapping. The bootstrap can be used in a very similar fashion to randomization testing, but it is also a suitable method for calculating confidence intervals. The procedure you've learned in introductory biostatistics was first developed by Jerzy Neyman and Karl Pearson in the 1930's, and uses the formula:

x̄ ± t s

where x̄ is the sample mean, t is a critical value obtained from a t-distribution, and s is the standard error. The product t s is called the "uncertainty" of the estimate of x̄, and the upper limit is thus the uncertainty added to x̄, and the lower limit is the uncertainty subtracted from x̄.

This simple procedure works fine in many cases, but not always. Bootstrapping works fine in cases where the Neyman-Pearson method is appropriate, but also works well in cases where Neyman-Pearson fails. We'll focus on a case today where either could be used, and will move on next time to a case where bootstrapping is a much better alternative.

The most important difference between bootstrapping and the randomization testing procedure we used previously is that bootstrapping involves resampling from our data with replacement. This means that in each iteration we can include some observations more than once, and not include others at all. The sample of randomly selected data points is the "bootstrap sample" for that iteration. The VBA code we write will be very similar to last week's randomization test macro, in that we'll take our recorded macro and modify it to loop through the operations 1,000 times and record an observed statistic each time through. But, the setup in Excel will be quite different because we won't be able to simply shuffle the data by sorting on a column of random numbers. Instead, we will use a method that will allow us to randomly select a bootstrap sample with replacement from our observed data each iteration.

If you're curious, the "bootstrap" was one of the earliest applications of computationally intensive methods in statistics. It was a very different approach than traditional methods of working with analytical equations to derive formulas for confidence intervals. Since it was a novel method of solving some particularly thorny statistical problems, the developers of the method likened it to pulling yourself up by your bootstraps when everything else you've tried has failed. Thus the name.

Bootstrapping in Excel

Step 1. Download this file and save it where you're saving class files.

Open the file and you'll find that the first sheet, SockeyeColor, has the data for skin color of sockeye salmon that we worked with last time (we won't worry about the kokanee this time).

The first column is an ID number that we'll use in our bootstrapping to make random selection easy. The data for skin color are in column B, and the mean for the 16 measurements is in cell B19. It is this mean we will be finding a 95% confidence interval for.

Step 2. Generate random sockeye numbers.

The way we will select our bootstrap sample is to generate sockeye numbers at random that will match the numbers in column A.

This selects the salmon numbers in our bootstrap sample, now we need to use these to get the skin color measurements.

Step 3. Select skin colors for the bootstrap sample.

To get the skin color measurements for the randomly selected salmon:

You now generate a bootstrap sample of skin colors each time the sheet recalculates. Double check that skin colors in the bootstrap sample match the random salmon numbers (that is, if the random salmon number is 1 you get a color of 0.98 returned for it).

Note that we're using a column to randomly generate the numbers, and another to return the skin color data. To make this more compact would could replace d2 in the lookup() function with randbetween(1,16), and we would only need one column to generate the bootstrap sample - in other words, if we used =lookup(randbetween(1,16), A$2:A$17, B$2:B$17) we could get our skin color for a randomly selected salmon in a single cell. We didn't do it that way to make the program easier to debug - we can see that the random numbers are being generated correctly, and that the correct skin colors are being returned. If we put the random numbers into the lookup() function we would get only the skin colors, and wouldn't be able to tell which salmon was being selected, or whether the right values were being returned for the random salmon number generated. Separating them makes the spreadsheet a little more complicated, but makes errors easier to spot.

Step 4. Calculate the mean of the bootstrap sample.

Label the mean for the bootstrap sample in D19, and calculate it in E19.

The mean of the bootstrap sample changes every time a new random sample is selected, and we will record this mean each for each of the 1000 iterations we use in our program.

Step 5. Name a column to receive means of repeated bootstrap samples.

In G1 type "Bootstrap means".

At this point your sheet should look like this - don't be concerned if the bootstrap sample is different, it's a random selection from the data and no two will be alike.

Step 6. Write the macro.

To get to the VBA macro editor we need to enable a new tab on our ribbon...

You should now have a new tab called "Developer" in the ribbon ("developer" refers to software developer, aka programmer).

If you switch to the Developer tab the button on the far left says "Visual Basic" - click on this to open the VBA editor.

The scripts we write are functions that perform specific tasks, which Visual Basic calls subroutines. It is possible to have more than one subroutine needed to perform a task. Scripts can be associated with a specific worksheet (in which case the cell references are assumed to apply to that worksheet, regardless of which sheet is active when the script is run), to the workbook (in which case the references are assumed to apply to the active sheet of the workbook), or they can be part of a module (which will work with whatever the currently active workbook and worksheet is).

We will put our scripts into a single Module for this exercise - select Insert → Module and you'll see that a Modules folder has appeared, with Module1 listed within it. Double-click on Module1 to make sure it's open and activated, and you can write your code in the blank editor window that opens up.

Before writing code, let's write the set of steps that need to be accomplished by this macro - this is called "pseudo-code", in that it lays out the way that the program will work, but uses natural language instead of computer code. Once we know what we want to do it's much easier to write a program to do it. To get a bootstrap confidence interval we need to:

That's actually it - Excel recalculates when you enter a value in a cell, so writing entering the bootstrap mean into a cell in column G causes a new set of random salmon numbers to be generated, and the mean is calculated for them each time.

The program that does this looks like this:

Script

Write this into your blank module. What does it do? One row at a time...

And that's all there is to it! Save your work!

Step 7. Debug.

Put the SockeyeColor sheet and VBA code window side by side so that you can see both - you need to be able to see column G, where the bootstrap means are recorded, as well as the VBA window for this to be useful to you.

Once you can see both the macro and the worksheet, select the VBA code window to activate it and hit F8 repeatedly to step through the commands in the macro. You should see that each time you run through the loop a new bootstrap mean is entered in the next row of column G, below the previous one.

Running through all 1000 of the runs through the loop by hitting F8 is a little tedious, so once you're sure it's working you can add a break point to the code window, like so:

Breakpoint

The red dot in the gray band on the left side of the window is the breakpoint - enter it by clicking next to the Columns("G" & i+1).Sort line where the red dot appears in my example. You can now click the Continue button:

Continue 

This will cause the program to run normally until it hits your break point, and then it will pause. We've paused at the sorting step, so if you hit F8 again you'll see that the values in G are now in ascending order.

With only a few more lines to go you can step through the rest of them with F8, and the 25th and 975th means will be entered into B20 and B21, respectively, and our 95% CI bootstrap is completed.

Click on the red dot to remove the breakpoint so that the program will run without interruption.

Step 8. Run the macro.

Once you've debugged and are sure that everything is running correctly, switch to your SockeyeColor sheet and type ctrl+shift+B. When it finishes you'll have 1000 bootstrap means in column G. It should only take a few seconds to run.

If it doesn't run, switch to the View tab, and in the Macros menu select "View Macros". Then select the BootstrapCI macro from the list, and click the "Options..." button. Click into the box in the "Shortcut Key" option, hold down the Shift key and enter a B. You should now see Ctrl+Shift+B as the shortcut key for the macro. Click OK, click "Cancel" to close the macro editor window, and then try running the macro again.

Assignment

Save your work - we'll use this spreadsheet for part 2 and 3 over the next couple of class meetings.