Bootstrap in Picture


I was waiting in line at my local bank recently and spotted an urn filled with coins on a table in the corner of the lobby. When I got to the head of the line, I inquired about it and the clerk told me the urn was part of a contest among some of the bank employees. The objective of the contest was to guess the average mint year among all the coins in the urn without going over the true average. I explained to the teller that I had an unhealthy interest in these sorts of contests and then asked for one additional clarification about the rules: could the contestants pick up a few of the coins to inspect them before submitting a guess? The clerk shrugged and said she didn’t see any problem with that.

The contest piqued my interests because it was somewhat unusual. Most guessing contests involve estimating the number of jellybeans in a jar or the number of snakes on a plane. What intrigued me about this contest was that it involved estimating the mean with the added twist that any guess over the true average would be disqualified.

Although I couldn’t participate in the bank contest directly, I decided to recreate my own contest and describe how I arrived at a guess. This post is mainly an excuse for me to expound about a resampling technique called the bootstrap. The bootstrap is an integral component of my data analysis toolbox. It provides a broadly-applicable method to manage uncertainty in complex statistical models. I find the bootstrap easiest to grok through pictures supplemented with words and math. This post attempts to conjoin all 3 methods to provide an overview of the technique using the coin contest I encountered as an example.1


In many scenarios, it’s too expensive or in some way infeasible to draw many random samples from a population in order to find an underlying sampling distribution for a statistic. The bootstrap method belongs to a family of resampling techniques that attempts to address problems of inference with limited data. The overarching idea is to use a single sample as a proxy to approximate the sampling distribution of a statistic under the assumption that the sample is a model of the population. Once the sampling distribution has been approximated, it can be used for statistical inference, such as in my coin contest.


To recreate the contest, I gathered all the spare coins I could find, 3128 in total, then placed them into a makeshift urn:

From the urn, I picked up a sample of 55 coins and recorded the mint year of each coin. Here’s the resulting empirical CDF of the mint years for the 55 coin sample:

Let $\theta$ denote a population parameter, the population mean in this example, which is the average mint year among all the coins in the urn. Let $\omega \in \Omega$ denote a single simple random sample drawn from the probability space, $\Omega$ in the urn. Let $A = \{ \omega_1, \omega_2, \dots, \omega_k \} \subset \Omega $ denote an event, which represents the sample of coins of size $k$ where $\omega_i$ is an atomic event. The realization of $A$ in my coin contest is shown in the above CDF, where $k=55$.

In the first phase of the bootstrap, $n$ bootstrap samples, $X_1^{*}, X_2^{*}, \dots, X_n^{*}$, each of size $k$ are generated by sampling with replacement from $A$. This process is depicted below where the event $A$ is represented as a histogram and each arrow shows a random variate drawn from $A$ and placed into a given bootstrap sample:

In the next phase, a bootstrap statistic, $\widehat{\theta}{}_i^{*}$ is calculated for the $i$th bootstrap sample. Since $A$ is a SRS from the urn, it follows that the bootstrap distribution is an approximation of the sampling distribution if $A$ is a reasonable approximation of the population of coins in the urn. An estimate of $\widehat{\theta}$ is then given by the expected value of the bootstrap distribution:

The bias of the original estimator is $\widehat{\theta} - \theta$. If $A$ is a reasonable model of the population, the expected value of the bootstrap distribution can be used to estimate bias via substitution:

$$ bias \, (\widehat{\theta}) \approx \widehat{\theta} - \theta \approx \mathbb{E}[\widehat{\theta}{}^{*}] - \widehat{\theta} $$

Since the sample of coins I picked from the urn is small, I made the decision to bias correct the estimator. This choice can improve accuracy, but often at the expense of increased variance. This is a bias-variance tradeoff.


Submitting a bias corrected estimate of $\widehat{\theta}$ as a guess is unwise. If the estimate turned out to be greater than $\theta$, I would be disqualified. To help safeguard against an over-estimate, the bootstrap can be used to construct confidence intervals (CIs) around the estimator. There are many different methods for generating CIs for bootstrap estimations. For this problem, I chose the bootstrap-t method because it’s applicable to a wide range of problems and also straightforward to calculate, albeit computationally intensive.

The bootstrap-t method starts by generating a $t^{*}$ statistic, sometimes called a pivot, by studentizing each bootstrap:

$$ t_i^{*} = \frac{\widehat{\theta}{}_i^{*} - \widehat{\theta}}{se(\widehat{\theta}{}_i^{*})} $$

Since the estimator is the average mint year for the $i$th bootstrap, the standard error (se) of $\widehat{\theta}$ is given by $s_i^{*} \sqrt{n}^{-1}$. The mean is a special case where the standard error is known. However, for many statistics, the standard error must be approximated empirically using an additional round of bootstrapping. In these cases, the process begins anew by generating sub-bootstrap samples for each original bootstrap sample. For the $i$th bootstrap sample, sub-bootstrap samples, $X_{i,\,1}, X_{i,\,2}, \dots, X_{i,\,n}$, are generated, where $X_{i,\,j}$ is the index of a given sub-sample. In the illustration below, the gray sub-figure represents the first round of bootstrapping. For each bootstrap sample, an additional round of bootstrapping is performed drawing from one of the original bootstrap samples shown as a black box in the sub-figure:

These sub-bootstraps provide a method to approximate the standard error of the original bootstrap distribution. The empirical standard deviation of the sub-bootstraps can now be used to approximate the standard error of the estimator, $\widehat{\theta}{}^*$. Hence, the boostrap-t statistic can now be calculated for each bootstrap to generate the bootstrap distribution of $t^{*}$. From this distribution, confidence intervals for $\widehat{\theta}$ can then be constructed where $\alpha \,/\, 2$ is a given quantile from the $t^{*}$ bootstrap distribution:

$$ [\widehat{\theta} - se(\widehat{\theta}{}^{*}) t^{*}_{\alpha\, /\, 2}, \widehat{\theta} - se(\widehat{\theta}{}^{*}) t^{*}_{1 - \alpha \,/\, 2} ]$$


In my coin contest, I recorded the mint dates of the coins in my sample as well as the mint dates of all the coins in the urn. In a real scenario, this parameter, the average mint year in the urn, would be unknown. However, calculating this parameter retrospectively reveals insight into the performance of the technique. Here’s a plot of the performance of many bootstrap replicates as a function of sample size:

The x-axis shows the sample size used in different bootstrap experiments and the y-axis shows mint years. The black horizontal line is the true average mint year of all coins in the urn. The slope of this line is zero because the true mint average in the urn is fixed irrespective of sample size. The blue and red splines show the bias corrected estimate and one-sided confidence band obtained by bootstrapping different sample sizes. The results of bootstrapping my original sample of 55 coins is shown in the plot as two points connected by a dotted line. The blue point represents my bias corrected best estimate for $\theta$, while the red point is my submission guess for the contest. The true average mint year of coins in the urn was 1995.9 years; my corrected estimate was 1996.2 years; and my submission was 1997.8 years. My submission was off by 1.9 years and my estimate was off by 0.3 years. Given that only a single sample of small size was used for inference, the bootstrap produces fairly accurate results.


The bootstrap is a powerful tool, but it shouldn’t be applied carte blanche. This technique can perform poorly when approximating the sampling distribution of a parameter on the boundary of the parameter space; for example, an extreme order statistic. In other cases, when assumptions of smoothness or bounded variance do not hold, the bootstrap can also exhibit poor performance. Two alternatives to the bootstrap are the m-out-of-n bootstrap and subsampling, both of which can sometimes exhibit better performance in situations where the naive bootstrap fails.

Another problem arises not from the bootstrap directly, but from bootstrapping a sample that is a poor model of the population. This problem is apparent in the above plot of my coin estimation. When the sample size becomes too small, the estimate and confidence band can deviate wildly. An extreme case of this problem occurs when the sample size is only a single value. Despite these issues, the bootstrap is a staple technique in my data analysis toolbox.


I’ve used a one-sided bootstrap-t function to obtain the results shown in the plots of this post. Here’s my function called boot_upper():

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
Author: Seth Brown
Description: Bootstrap-t function
Date: 2013-08-17
Python 2.7.5

from collections import namedtuple
import scipy as sp
from scipy.stats.mstats import mquantiles

def boot_upper(vec, fn, n_boots=2000, n_inboots=200, alpha=0.05, bias_corr=False):
    """ Non-parametric bootstrap-t with one-sided confidence interval

        vec: ndarray, original sample
        fn: function, statistic to bootstrap
        n_boots: int, number of outer bootstrap iterations
        n_inboots: int, number of inner bootstrap iterations
        alpha: float, alpha level
        bias_corr: bool, bias correction-defaults to False
    sample_size = len(vec)
    boot_stats = sp.zeros(n_boots)
    emp_stds = boot_stats.copy()
    for i in xrange(n_boots):
        sample_idxs = sp.random.randint(0, sample_size, size=sample_size)
        boot_sample = vec[sample_idxs]
        boot_stats[i] = fn(boot_sample)
        inboot_stats = sp.zeros(n_inboots)
        for j in xrange(n_inboots):
            insample_idxs = sp.random.randint(0, sample_size, size=sample_size)
            inboot_sample = boot_sample[insample_idxs]
            inboot_stats[j] = sp.std(inboot_sample, ddof=1)
        emp_stds[i] = sp.std(inboot_stats)
    stat = fn(vec)
    t_stats = sp.true_divide((boot_stats - stat), emp_stds)
    t_l = mquantiles(t_stats, [alpha])[0]
    std_err = sp.std(boot_stats)
    exp_stat = fn(boot_stats)
    bias = exp_stat - stat
    ci_upper = stat - t_l * std_err
    if bias_corr is True:
        stat = stat + bias
    boot_temp = namedtuple('boot', 'stat, upper_ci, bias, std_err')

    return boot_temp(stat, ci_upper, bias, std_err)

  1. Inspiration for this post comes from Efron’s fabulous Computer-intensive Methods in Statistics article in Scientific American.