Bootstrap explained simply
What bootstrap actually is
Bootstrap is a statistical technique where you take your dataset, draw a random sample from it with replacement, compute a statistic on that sample, then repeat thousands of times. The distribution of those thousands of statistics is your estimate of how the metric would vary if you reran the experiment. From that distribution you read off a confidence interval, a p-value, or a standard error.
The trick sounds suspicious the first time you hear it. You are pretending the sample you have is the population, then drawing fake samples from it. It works because the empirical distribution of your data is the best non-parametric estimate of the underlying distribution; if your sample is large enough and roughly representative, resampling it mimics the noise of sampling the true population.
Why this matters: classical formulas only exist for a handful of statistics (the mean, the proportion, the variance), and they assume normality or independence in ways that often fail on real product data. Bootstrap works on almost any statistic you can compute on a sample. Median revenue per user, 95th percentile latency, ratio of two metrics, Gini coefficient: all of these become straightforward. The main exceptions are extremes such as the sample maximum, and data with tails so heavy that the variance is infinite, where the standard bootstrap is known to be unreliable.
Why you need it
Picture this: your PM asks for the 95% confidence interval on the median checkout latency for the new flow. You cannot pull that from a t-test — there is no clean closed-form formula for the variance of a sample median. Bootstrap gives you the answer in one screen of Python.
The pattern repeats across real product metrics. Revenue per user has a fat tail, with whales pulling the mean far from where most users sit. Conversion ratios are bounded between zero and one and behave badly near the edges. Retention curves are step functions. For all of these, the t-test gives you either a wrong answer or no answer at all.
Bootstrap is also what you reach for when the metric itself combines other metrics. CTR is clicks divided by impressions — that ratio has correlated numerator and denominator, and the delta method gets ugly fast. Bootstrap sidesteps the algebra: resample your rows, recompute the ratio, repeat.
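As a minimal sketch of the ratio case, the snippet below bootstraps a CTR by resampling users. The per-user clicks and impressions are synthetic and the variable names are made up for illustration; the point is that each resampled row keeps its own numerator and denominator together, so their correlation is preserved without extra algebra.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic per-user data (an assumption for illustration, not real traffic).
n_users = 2000
impressions = rng.poisson(lam=20, size=n_users) + 1  # at least one impression each
clicks = rng.binomial(n=impressions, p=0.05)         # clicks depend on impressions

observed_ctr = clicks.sum() / impressions.sum()

n_boot = 2000
boot_ctr = np.empty(n_boot)
for i in range(n_boot):
    # Resample users (rows) so each user's clicks and impressions stay paired.
    idx = rng.integers(0, n_users, size=n_users)
    boot_ctr[i] = clicks[idx].sum() / impressions[idx].sum()

ci_low, ci_high = np.percentile(boot_ctr, [2.5, 97.5])
print(f"CTR = {observed_ctr:.4f}, 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

Resampling users rather than individual impressions is a design choice here: it treats the user as the independent unit.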
How it works
The recipe has five steps. Start with your sample of N observations. Draw N values from it uniformly at random with replacement, so the same observation can appear multiple times in one resample and other observations might not appear at all. Compute your statistic of interest on that resample. Save the result. Repeat the draw, compute, and save steps somewhere between 1,000 and 10,000 times.
When you finish, you have a vector of B bootstrap replicates of your statistic. The 2.5th and 97.5th percentiles of that vector form your 95% percentile confidence interval. The standard deviation of the vector is your bootstrap standard error. The mean of the vector minus your original sample statistic is your bias estimate.
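The three summaries described above can be sketched in one small helper. The function name bootstrap_summary and the synthetic exponential data are assumptions made for this illustration, not part of any library.

```python
import numpy as np

def bootstrap_summary(x, statistic, n_boot=5000, rng=None):
    """Return the estimate, bootstrap standard error, bias, and 95% percentile CI."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    theta_hat = statistic(x)  # statistic on the original sample
    reps = np.empty(n_boot)
    for b in range(n_boot):
        # One resample: n indices drawn uniformly with replacement.
        reps[b] = statistic(x[rng.integers(0, n, size=n)])
    ci_low, ci_high = np.percentile(reps, [2.5, 97.5])
    return {
        "estimate": theta_hat,
        "std_error": reps.std(ddof=1),     # spread of the replicates
        "bias": reps.mean() - theta_hat,   # mean of replicates minus sample statistic
        "ci": (ci_low, ci_high),           # percentile interval
    }

rng = np.random.default_rng(seed=0)
data = rng.exponential(scale=10.0, size=500)  # synthetic skewed sample (assumption)
summary = bootstrap_summary(data, np.median, rng=rng)
print(summary)
```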
The reason it works is a theorem. Under mild regularity conditions, the distribution of the bootstrap statistic minus the sample statistic converges to the distribution of the sample statistic minus the population parameter. In plain words: the variability you observe from resampling matches the variability you would see if you collected fresh samples from the population.
A worked example by hand
Say you have five checkout values from a small experiment: 100, 200, 300, 400, 500. The sample median is 300. You want a confidence interval for that median.
Resample one might draw [100, 500, 300, 300, 200], median 300. Resample two draws [200, 400, 400, 100, 500], median 400. Resample three draws [500, 300, 500, 100, 200], median 300. Each resample has the same size as the original, but values repeat or vanish.
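After 10,000 such resamples you have 10,000 median values. Sort them and find the 2.5th and 97.5th percentiles. For these five values that gives [100, 500]: a little under 6% of resamples have a median of 100 and the same share have a median of 500, so both cutoffs land on the extremes. No formula for the variance of a median and no distributional assumption were needed, but an interval that spans the whole observed range is also a reminder that five points carry very little information.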
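Five values are few enough that the bootstrap distribution of the median can be enumerated exactly instead of simulated. The sketch below walks through all 5^5 = 3,125 equally likely ordered resamples using only the standard library. It is a check on the hand example, not something you would do at realistic sample sizes.

```python
from collections import Counter
from itertools import product
from statistics import median

data = [100, 200, 300, 400, 500]

# Every ordered resample of size 5 drawn with replacement is equally likely.
median_counts = Counter(median(resample) for resample in product(data, repeat=len(data)))
total = sum(median_counts.values())

for value in sorted(median_counts):
    count = median_counts[value]
    print(f"{value}: {count}/{total} = {count / total:.4f}")
# 100: 181/3125 = 0.0579
# 200: 811/3125 = 0.2595
# 300: 1141/3125 = 0.3651
# 400: 811/3125 = 0.2595
# 500: 181/3125 = 0.0579
```

Because each extreme value carries more than 2.5% of the probability mass, the 2.5th and 97.5th percentiles of the bootstrap medians sit at 100 and 500.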
Python implementation
The core implementation is a single loop and three NumPy calls. This is enough for most production work; libraries like scipy.stats.bootstrap add bias correction and acceleration if you need them.
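import numpy as np
rng = np.random.default_rng(seed=42)  # fixed seed so reruns are reproducible
data = np.array([100, 200, 300, 400, 500])
n_boot = 10000
boot_medians = np.empty(n_boot)
for i in range(n_boot):
    # draw len(data) values with replacement and store the resample's median
    sample = rng.choice(data, size=len(data), replace=True)
    boot_medians[i] = np.median(sample)
# the 2.5th and 97.5th percentiles bound the 95% percentile interval
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% CI for median: [{ci_low:.0f}, {ci_high:.0f}]")
For larger datasets, vectorize. Instead of looping, draw a 2D matrix of indices in one shot, then compute the statistic across rows. This removes the Python loop overhead and is usually much faster, as long as the n_boot x n index matrix fits in memory. At 10,000 resamples of a million rows that matrix would have 10^10 entries, so at that scale generate the resamples in batches.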
import numpy as np
rng = np.random.default_rng(seed=42)
data = np.array([100, 200, 300, 400, 500])
n = len(data)
n_boot = 10000
# each row of idx holds the indices of one resample
idx = rng.integers(0, n, size=(n_boot, n))
samples = data[idx]
# one median per row, i.e. per resample
boot_medians = np.median(samples, axis=1)
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% CI for median: [{ci_low:.0f}, {ci_high:.0f}]")
Both versions give the same answer up to Monte Carlo noise; they consume the random stream differently, so the individual resamples are not identical. The second one is the faster of the two while the index matrix fits in memory. If you push past 10 million rows, sub-sample first or move to a Spark-backed implementation.
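When the full n_boot x n index matrix would not fit in memory, one option is to keep the vectorized style but generate resamples in batches. The helper below is a hypothetical sketch (the name bootstrap_in_batches and the batch size are assumptions); it expects a statistic that accepts an axis argument, as np.median and np.mean do.

```python
import numpy as np

def bootstrap_in_batches(x, statistic, n_boot=10_000, batch_size=500, seed=42):
    """Vectorized bootstrap that holds at most batch_size * len(x) indices at once."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = np.empty(n_boot)
    for start in range(0, n_boot, batch_size):
        stop = min(start + batch_size, n_boot)
        # Index matrix for this batch only: one row per resample.
        idx = rng.integers(0, n, size=(stop - start, n))
        reps[start:stop] = statistic(x[idx], axis=1)
    return reps

data_rng = np.random.default_rng(seed=1)
data = data_rng.lognormal(mean=3.0, sigma=1.0, size=5_000)  # synthetic (assumption)

boot_medians = bootstrap_in_batches(data, np.median, n_boot=2_000, batch_size=100)
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% CI for median: [{ci_low:.2f}, {ci_high:.2f}]")
```

Peak memory is set by batch_size rather than n_boot, so the batch size can be tuned to whatever the machine tolerates.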
Bootstrap in A/B tests
In an A/B test you have two groups, control and treatment, and you want the confidence interval for the difference between them. Resample both groups independently, compute the metric on each, take the difference, and repeat.
import numpy as np
rng = np.random.default_rng(seed=42)
# synthetic control and treatment groups for the demo
control = rng.normal(100, 20, size=5000)
treatment = rng.normal(102, 20, size=5000)
n_boot = 10000
diffs = np.empty(n_boot)
for i in range(n_boot):
    # resample each group independently, at its own original size
    c_sample = rng.choice(control, size=len(control), replace=True)
    t_sample = rng.choice(treatment, size=len(treatment), replace=True)
    diffs[i] = np.median(t_sample) - np.median(c_sample)
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for treatment effect: [{ci_low:.2f}, {ci_high:.2f}]")
If the resulting confidence interval excludes zero, the difference is statistically significant at the 5% level. If it includes zero, the test is not significant. This works for any metric: median, P95, ratio, conversion rate, custom score.
Teams at Stripe, Airbnb, and Netflix use bootstrap for A/B tests on revenue and engagement metrics precisely because those metrics have fat tails. With heavy tails the sampling distribution of the mean approaches normality slowly, so t-based intervals and p-values on revenue can be miscalibrated at realistic sample sizes. Bootstrap uses the observed distribution shape instead of a normal approximation.
Flavors of bootstrap
The vanilla version above is sometimes called the percentile bootstrap or non-parametric bootstrap. Three variations come up in practice often enough that interviewers ask about them.
The bias-corrected and accelerated bootstrap (BCa) adjusts the percentile interval to account for bias and skewness in the bootstrap distribution. It gives more accurate intervals (better coverage, not necessarily narrower) when the statistic is skewed, for example when bootstrapping a sample variance or a small-sample correlation. scipy.stats.bootstrap defaults to BCa.
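A minimal sketch of the SciPy route, assuming SciPy is installed and using a synthetic lognormal sample: the same data is run through method="BCa" and method="percentile" so the two intervals can be compared side by side. The data goes in as a one-element tuple because scipy.stats.bootstrap expects a sequence of samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # synthetic skewed sample (assumption)

# The sample variance of skewed data has a skewed sampling distribution.
res_bca = stats.bootstrap((data,), np.var, n_resamples=2000,
                          method="BCa", random_state=rng)
res_pct = stats.bootstrap((data,), np.var, n_resamples=2000,
                          method="percentile", random_state=rng)

bca_ci = res_bca.confidence_interval
pct_ci = res_pct.confidence_interval
print(f"BCa:        [{bca_ci.low:.3f}, {bca_ci.high:.3f}]")
print(f"percentile: [{pct_ci.low:.3f}, {pct_ci.high:.3f}]")
print(f"bootstrap SE: {res_bca.standard_error:.3f}")
```

On right-skewed statistics like this one, the BCa interval is typically shifted relative to the percentile interval rather than simply shrunk.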
The block bootstrap is for time series. Standard bootstrap assumes observations are exchangeable, which is false when today's revenue depends on yesterday's. Block bootstrap fixes this by resampling contiguous blocks of observations (typically 5 to 50 in length) instead of single rows. The block length is a hyperparameter and is usually chosen with a rule of thumb based on the autocorrelation decay.
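One common variant is the moving block bootstrap, sketched below on a synthetic AR(1) series. The function name, the block length of 14, and the autocorrelation of 0.7 are all assumptions chosen for illustration. The comparison at the end shows how a row-level bootstrap reports a smaller standard error for the mean on autocorrelated data.

```python
import numpy as np

def moving_block_resample(series, block_len, rng):
    """Rebuild a series of the same length from randomly chosen contiguous blocks."""
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    # Start positions of overlapping blocks that fit entirely inside the series.
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    # Expand each start into start, start+1, ..., start+block_len-1.
    idx = (starts[:, None] + np.arange(block_len)[None, :]).ravel()
    return series[idx[:n]]  # trim the overshoot from the last block

rng = np.random.default_rng(seed=42)

# Synthetic AR(1) series: each value depends on the previous one.
n = 365
noise = rng.normal(0, 1, size=n)
series = np.empty(n)
series[0] = noise[0]
for t in range(1, n):
    series[t] = 0.7 * series[t - 1] + noise[t]

n_boot = 2000
block_means = np.array([moving_block_resample(series, 14, rng).mean()
                        for _ in range(n_boot)])
iid_means = np.array([rng.choice(series, size=n, replace=True).mean()
                      for _ in range(n_boot)])

print(f"block bootstrap SE of mean: {block_means.std(ddof=1):.3f}")
print(f"row bootstrap SE of mean:   {iid_means.std(ddof=1):.3f}")
```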
The stratified bootstrap preserves group proportions. If your sample is 70% iOS and 30% Android, a vanilla bootstrap can occasionally produce resamples with very different mixes, inflating variance. Stratified bootstrap resamples within each platform separately, then combines. Use it whenever you have small strata or strong covariate effects.
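A sketch of the stratified version, with a hypothetical helper stratified_resample_indices and synthetic revenue for a 70/30 platform split: each stratum is resampled at its own size, so every resample keeps exactly the original mix.

```python
import numpy as np

def stratified_resample_indices(strata, rng):
    """Resample row indices with replacement inside each stratum separately."""
    parts = []
    for label in np.unique(strata):
        members = np.flatnonzero(strata == label)
        parts.append(rng.choice(members, size=len(members), replace=True))
    return np.concatenate(parts)

rng = np.random.default_rng(seed=42)

# Synthetic data (assumption): 700 iOS users and 300 Android users.
platform = np.array(["ios"] * 700 + ["android"] * 300)
revenue = np.where(platform == "ios",
                   rng.gamma(2.0, 15.0, size=1000),
                   rng.gamma(2.0, 8.0, size=1000))

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    idx = stratified_resample_indices(platform, rng)
    boot_means[b] = revenue[idx].mean()

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for mean revenue: [{ci_low:.2f}, {ci_high:.2f}]")
```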
Common pitfalls
The most common bootstrap mistake is using too few iterations when reporting a percentile interval. The 2.5th and 97.5th percentiles of 100 resamples are noisy estimates, so different runs of the same code will give visibly different intervals. The fix is to use at least 1,000 iterations for a quick sanity check and 10,000 for anything that goes into a doc or a deck. Vectorized NumPy makes 10,000 iterations cheap on small and mid-sized samples, and batching keeps memory in check on larger ones, so there is rarely a reason to skimp.
A second trap is forgetting replace=True. If you resample without replacement at the original sample size, you are just shuffling your data — every resample is a permutation, the statistic is identical every time, and the interval collapses to a point. Sanity-check by computing the metric on a single resample and confirming it differs slightly from the sample statistic.
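The collapse is easy to see in a few lines. NumPy's rng.choice defaults to replace=True, so the bug usually arrives through an explicit replace=False or through an API with the opposite default, such as pandas DataFrame.sample. The data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.exponential(scale=10.0, size=1000)  # synthetic sample (assumption)

# Without replacement at full size: every "resample" is a permutation of data.
shuffled = np.array([np.median(rng.choice(data, size=len(data), replace=False))
                     for _ in range(200)])

# With replacement: a real bootstrap, so the medians vary.
resampled = np.array([np.median(rng.choice(data, size=len(data), replace=True))
                      for _ in range(200)])

print("range of medians without replacement:", shuffled.max() - shuffled.min())
print("range of medians with replacement:   ", resampled.max() - resampled.min())
```

The first range is exactly zero because the median ignores ordering; the second is strictly positive.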
A third pitfall is applying bootstrap to time series without blocking. If your observations are autocorrelated — daily revenue, hourly latency, session-level events from the same user — vanilla bootstrap destroys the time structure and produces overconfident intervals. The fix is either to aggregate to a unit you can treat as independent (one row per user instead of per session) or to use a block bootstrap with a block size large enough to span the autocorrelation horizon.
A fourth issue is bootstrap with tiny samples. If N is below 20, you simply do not have enough information for bootstrap to work, no matter how many iterations you run. The resampling distribution is dominated by the few values you have, and the percentile interval is degenerate. For small samples, fall back to exact tests like Fisher's exact or use Bayesian methods with informative priors.
A fifth one is not fixing the random seed. A teammate reruns your notebook, gets a slightly different interval, and now neither of you trusts the result. Always pass a seed to default_rng() so reruns are reproducible, and store that seed in the experiment metadata.
Alternatives
When bootstrap is too slow, typically on tables with billions of rows in A/B testing systems, analytical alternatives become attractive. The delta method gives a closed-form variance for ratios and other transformations using a Taylor expansion. It is approximate but needs only a single pass over the data (or a few pre-aggregated sums) instead of B recomputations of the statistic. For very large samples it typically agrees closely with bootstrap.
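For a ratio of means, the first-order delta-method variance is (var_x / my^2 - 2 * mx * cov_xy / my^3 + mx^2 * var_y / my^4) / n, where mx and my are the sample means. The sketch below applies it to synthetic per-user clicks and impressions and compares the result with a bootstrap standard error; the function name delta_method_se is an assumption for this example.

```python
import numpy as np

def delta_method_se(x, y):
    """First-order delta-method standard error of mean(x) / mean(y) for paired rows."""
    n = len(x)
    mx, my = x.mean(), y.mean()
    cov = np.cov(x, y, ddof=1)  # 2x2 sample covariance matrix
    var_x, var_y, cov_xy = cov[0, 0], cov[1, 1], cov[0, 1]
    var_ratio = (var_x / my**2
                 - 2 * mx * cov_xy / my**3
                 + mx**2 * var_y / my**4) / n
    return np.sqrt(var_ratio)

rng = np.random.default_rng(seed=42)
n = 5000
impressions = rng.poisson(20, size=n) + 1   # synthetic (assumption)
clicks = rng.binomial(impressions, 0.05)

se_delta = delta_method_se(clicks, impressions)

# Bootstrap standard error of the same ratio, resampling users.
n_boot = 2000
boot_ratio = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot_ratio[b] = clicks[idx].sum() / impressions[idx].sum()
se_boot = boot_ratio.std(ddof=1)

print(f"delta method SE: {se_delta:.6f}")
print(f"bootstrap SE:    {se_boot:.6f}")
```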
Permutation tests are bootstrap's close cousin for hypothesis testing. Instead of resampling within groups, you shuffle group labels and recompute the test statistic. The proportion of shuffles with a test statistic at least as extreme as the observed one is your p-value. Permutation tests are exact under the null hypothesis of exchangeability and are the gold standard when that assumption holds.
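A minimal two-sided permutation test for a difference in medians might look like the sketch below. The groups are synthetic, and the +1 in the numerator and denominator is a common convention that counts the observed labeling as one of the permutations so the p-value is never exactly zero.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
control = rng.normal(100, 20, size=500)    # synthetic groups (assumption)
treatment = rng.normal(102, 20, size=500)

observed = np.median(treatment) - np.median(control)

pooled = np.concatenate([control, treatment])
n_c = len(control)

n_perm = 5000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    # Shuffling the pooled values is equivalent to shuffling the group labels.
    shuffled = rng.permutation(pooled)
    perm_stats[i] = np.median(shuffled[n_c:]) - np.median(shuffled[:n_c])

# Two-sided p-value: share of shuffles at least as extreme as the observed difference.
p_value = (np.sum(np.abs(perm_stats) >= abs(observed)) + 1) / (n_perm + 1)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```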
Jackknife is a leave-one-out resampling method that predates bootstrap. It is faster for some statistics but less general; use it only when you specifically want a quick standard error estimate for a smooth statistic.
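A short sketch of the jackknife standard error, using a hypothetical helper jackknife_se on synthetic data. For the sample mean the jackknife reproduces the textbook s / sqrt(n) exactly, which makes a convenient correctness check.

```python
import numpy as np

def jackknife_se(x, statistic):
    """Leave-one-out jackknife standard error of statistic(x)."""
    n = len(x)
    # Recompute the statistic n times, each time dropping one observation.
    loo = np.array([statistic(np.delete(x, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))

rng = np.random.default_rng(seed=42)
data = rng.normal(50, 10, size=300)  # synthetic sample (assumption)

se_jack = jackknife_se(data, np.mean)
se_formula = data.std(ddof=1) / np.sqrt(len(data))
print(f"jackknife SE of mean: {se_jack:.4f}")
print(f"s / sqrt(n):          {se_formula:.4f}")
```

The jackknife is known to behave poorly for non-smooth statistics such as the median, which is part of why it is the less general tool.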
Related reading
- How to calculate bootstrap CI in SQL
- How to calculate CUPED in SQL
- How to calculate confidence intervals in SQL
- A/B testing peeking mistake
- Why run an A/A test in A/B testing
FAQ
How many bootstrap iterations should I use?
For a quick exploratory check, 1,000 iterations is enough to see the rough shape. For numbers that go into a document, a deck, or a launch decision, use 10,000. The marginal cost of going from 1,000 to 10,000 is small with vectorized code, and the interval endpoints become noticeably more stable from run to run. More iterations do not make the interval itself narrower; they only reduce the Monte Carlo noise in where its endpoints land. Going above 10,000 gives diminishing returns.
Is bootstrap valid for revenue with a long tail?
Yes, within limits; this is one of the cases where bootstrap shines. A t-test on revenue leans on a normal approximation that a fat tail makes slow to kick in, and the heavier the tail, the worse the approximation. Bootstrap uses the observed distribution shape, including the tail, and usually gets closer to nominal coverage. If a handful of whales dominate the total, even bootstrap intervals can undercover, so treat them with caution.
Does replace=True really matter?
Yes, and skipping it is a classic bootstrap bug. Sampling without replacement at the original sample size produces a permutation of the data, which has the same value of any permutation-invariant statistic every single time. Your bootstrap distribution collapses to a single point and your interval has zero width. Always pass replace=True and sanity-check by comparing two consecutive resampled statistics.
When should I use block bootstrap instead of vanilla?
Use block bootstrap whenever observations are not exchangeable — typically time series like daily revenue, hourly request volume, or sequences of events from the same user. The block length should span the autocorrelation horizon of your data, which you can estimate by plotting the autocorrelation function and picking a length where the values drop below 0.1. Common block sizes are 5 to 50 observations.
Bootstrap or delta method in production A/B systems?
If you run thousands of experiments per day on billion-row tables, delta method is the practical choice: it needs one pass over the data instead of thousands of resamples and gives intervals that are hard to tell apart from bootstrap at large sample sizes. Bootstrap is better for ad hoc analyses, smaller experiments, and metrics where the delta-method Taylor expansion breaks down (heavy skew, near-zero denominators in ratios). Many companies use delta method for the main dashboard and bootstrap for deep-dive investigations.
Can I bootstrap a regression coefficient?
Yes. Resample rows of your dataset with replacement, refit the regression on each resample, save the coefficient. After B iterations you have B coefficient values; the percentile interval is your confidence interval. This is called pairs bootstrap or case resampling and is robust to heteroskedasticity, which makes it more reliable than the textbook OLS standard errors when the error variance is not constant.
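A sketch of the pairs bootstrap for a simple linear regression slope, on synthetic data whose noise grows with x. The helper fit_slope and the data-generating numbers are assumptions for illustration; the key detail is that x and y are resampled together, row by row.

```python
import numpy as np

def fit_slope(x, y):
    """Ordinary least squares slope of y on x, with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(seed=42)
n = 500
x = rng.uniform(0, 10, size=n)
# Synthetic heteroskedastic data (assumption): noise scale increases with x.
y = 2.0 + 1.5 * x + rng.normal(0, 0.5 + 0.3 * x)

slope_hat = fit_slope(x, y)

n_boot = 2000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    # Resample (x, y) pairs together so each row stays intact.
    idx = rng.integers(0, n, size=n)
    boot_slopes[b] = fit_slope(x[idx], y[idx])

ci_low, ci_high = np.percentile(boot_slopes, [2.5, 97.5])
print(f"slope = {slope_hat:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```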