Bootstrap statistics fundamentals
What bootstrap actually is
Bootstrap is a resampling method. The idea is mechanical, not magical: instead of trusting a closed-form confidence interval that assumes a normal distribution, you generate thousands of "virtual samples" from the data you already have and watch how the statistic of interest behaves across them. The variation across those virtual samples becomes your uncertainty estimate.
Picture the scenario that lands on a data scientist's desk every other Monday. A PM at Stripe pings you: "What is the median time-to-first-payment for new merchants, and how confident are we?" You have 500 onboarding records. The analytical formula for a median CI is a mess — it depends on order statistics and behaves poorly at small samples. Bootstrap sidesteps it. You draw 500 observations from the same 500, with replacement, compute the median, and repeat 10,000 times. You now have an empirical distribution of the median, and you read the 2.5th and 97.5th percentiles straight off it.
The phrase with replacement is the entire trick. One observation can land in the resample several times, others may not appear at all. That is what makes each bootstrap sample slightly different from the original — without replacement you would just be reshuffling the same numbers and every statistic would come out identical.
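To see what drawing with replacement does, count how many distinct original observations survive in a single resample. The sketch below assumes a placeholder sample of 500 distinct values and an arbitrary seed; neither comes from real data. The chance that a given observation is missed in n draws is (1 - 1/n)^n, which is close to 1/e, so roughly 63% of the originals should appear in each resample.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
n = 500
data = np.arange(n)  # placeholder sample: 500 distinct values

# One bootstrap resample: n draws WITH replacement.
resample = rng.choice(data, size=n, replace=True)
unique_fraction = np.unique(resample).size / n

# The same draw WITHOUT replacement is just a reshuffle of the original sample.
reshuffle = rng.choice(data, size=n, replace=False)
same_values = np.array_equal(np.sort(reshuffle), data)

# Theoretical expected share of distinct originals in one resample.
expected_fraction = 1 - (1 - 1 / n) ** n

print(f"distinct originals in resample: {unique_fraction:.3f}")
print(f"expected share: {expected_fraction:.3f}")
print(f"without replacement is a permutation: {same_values}")
```

Any order-invariant statistic (mean, median, quantile) computed on `reshuffle` equals the value on the original sample, which is why the without-replacement version carries no information about uncertainty.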
When bootstrap earns its keep
Classical formulas for confidence intervals depend on a stack of assumptions: a normal distribution, a sample large enough to invoke the central limit theorem, and a known formula for the standard error of your statistic. Bootstrap needs none of those. It trades algebra for compute, which is usually a great trade in 2026.
Four situations make bootstrap the only sane option. The first is when no formula exists. Medians, 90th percentiles, ratio metrics like revenue-per-user: these statistics do not have clean analytical confidence intervals. Even when a formula exists on paper, like the delta method, it is a first-order approximation that can be badly off when the data are heavy-tailed.
The second is skewed data. Revenue, session duration, basket size — almost any monetary metric in a real product is heavy-tailed. The z-formula for a mean CI will undercover, meaning your "95% CI" contains the truth maybe 88% of the time. Bootstrap respects the actual shape of the data.
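You can check the undercoverage claim by simulation rather than taking it on faith. The sketch below assumes a lognormal population with sigma = 1.5 and samples of n = 30; both are illustrative choices. It measures how often the classical z-interval for the mean contains the true mean. The exact coverage depends on those assumptions, so treat the result as a demonstration of the effect, not a confirmation of any specific percentage.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5                       # assumed skew parameter for the illustration
true_mean = np.exp(sigma**2 / 2)  # mean of a lognormal(mu=0, sigma) population
n, n_trials = 30, 5_000

# Draw all trials at once: one row per simulated sample of size n.
samples = rng.lognormal(mean=0.0, sigma=sigma, size=(n_trials, n))
means = samples.mean(axis=1)
std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Classical z-interval: mean +/- 1.96 * SE.
lower = means - 1.96 * std_errs
upper = means + 1.96 * std_errs
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))

print(f"z-interval coverage: {coverage:.3f} (nominal 0.95)")
```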
The third is small samples. At n below 30, parametric methods demand strong assumptions about normality, and you rarely have evidence to back them up. Bootstrap still works, with the caveat that at very small n (under 15) the resample distribution is discrete and noisy.
The fourth is A/B tests with ratio metrics. Average order value, revenue per user, sessions per user: these are ratios where the delta method often misbehaves. Bootstrap builds the CI directly from the empirical distribution of the ratio instead of relying on a linear approximation. This is how teams at Netflix and DoorDash report uplift for revenue-style metrics.
The algorithm step by step
The procedure is short enough to memorise on the walk to a whiteboard interview. You start with an original sample of size n. You then randomly draw n elements from it with replacement — that is one bootstrap sample. You compute whatever statistic you care about: the mean, the median, a difference of conversion rates, a ratio of two metrics, anything that takes data in and returns a number. You repeat the draw-and-compute step B times, where B is typically 10,000. The B values form the bootstrap distribution. To build a 95% confidence interval, you take the 2.5th and 97.5th percentiles of that distribution.
This recipe is called the percentile bootstrap. Other variants exist — BCa (bias-corrected and accelerated), studentised bootstrap, parametric bootstrap — but interviewers almost always mean the percentile version unless they explicitly say otherwise. If you can implement the percentile bootstrap from scratch in five minutes, you have covered 90% of what gets asked in data science loops.
Python: bootstrap in ten lines
```python
import numpy as np

# Small illustrative sample of 20 observations.
data = np.array([12, 15, 14, 10, 13, 18, 22, 11, 9, 16,
                 14, 20, 13, 17, 15, 19, 11, 25, 14, 12])

n_bootstrap = 10_000

# Each iteration: draw len(data) values with replacement, record the median.
boot_medians = np.array([
    np.median(np.random.choice(data, size=len(data), replace=True))
    for _ in range(n_bootstrap)
])

# Percentile bootstrap: read the 95% CI off the bootstrap distribution.
ci_lower, ci_upper = np.percentile(boot_medians, [2.5, 97.5])

print(f"Median: {np.median(data)}")
print(f"95% CI (bootstrap): [{ci_lower}, {ci_upper}]")
# Median: 14.0
# 95% CI (bootstrap): [12.5, 17.0]  (no seed is set, so endpoints can shift slightly between runs)
```

The flag replace=True is the with-replacement step. Drop it and the code stops being bootstrap and turns into a permutation of the same 20 values, which is a different method that answers a different question.
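For production code, scipy has had a built-in since version 1.7 that is vectorised and handles edge cases. Its default method is BCa, so pass method="percentile" if you want the same recipe as the hand-rolled version above:

```python
from scipy.stats import bootstrap

data_tuple = (data,)  # scipy expects a sequence of samples
# method defaults to "BCa"; pass method="percentile" for the percentile recipe
result = bootstrap(data_tuple, np.median, n_resamples=10_000)
print(f"95% CI: [{result.confidence_interval.low:.1f}, "
      f"{result.confidence_interval.high:.1f}]")
```

If you are running bootstrap inside a Snowflake pipeline rather than a notebook, you can express the same logic in SQL; see how to calculate bootstrap CI in SQL for the warehouse pattern.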
Bootstrap inside A/B tests
In A/B tests bootstrap shines for metrics that misbehave under the CLT or that lack a tidy standard-error formula. The classic example is average revenue per user. Revenue distributions are heavily right-skewed because a handful of large purchases dominate the mean. A naive t-test on the difference of means tends to produce p-values that are too optimistic — you reject the null when you should not.
The bootstrap protocol for an A/B test is straightforward. For each iteration you resample with replacement from the control group, separately resample with replacement from the treatment group, and compute the difference of the chosen statistic — usually the mean, sometimes a quantile or a ratio. You repeat 10,000 times. The resulting distribution of differences gives you a 95% CI by percentile. If that interval excludes zero, you have a significant effect at the 5% level. If it crosses zero, you do not.
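The protocol above can be sketched in a few lines. The function name `bootstrap_diff_ci`, the synthetic lognormal revenue data, and the seeds are all assumptions made for illustration; in a real pipeline `control` and `treatment` would be per-user arrays from the experiment.

```python
import numpy as np

def bootstrap_diff_ci(control, treatment, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for mean(treatment) - mean(control)."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each arm separately, with replacement, at its own size.
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(treatment, size=treatment.size, replace=True)
        diffs[i] = t.mean() - c.mean()
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Synthetic, right-skewed revenue per user (placeholder data, not real).
data_rng = np.random.default_rng(7)
control = data_rng.lognormal(mean=3.0, sigma=1.0, size=1_000)
treatment = data_rng.lognormal(mean=3.1, sigma=1.0, size=1_000)

observed = treatment.mean() - control.mean()
ci_low, ci_high = bootstrap_diff_ci(control, treatment)
significant = not (ci_low <= 0.0 <= ci_high)

print(f"observed diff: {observed:.2f}")
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}], excludes zero: {significant}")
```

Swapping `.mean()` for a quantile or any other function of the resampled arrays changes the statistic without touching the rest of the loop.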
A second common A/B use case is ratio metrics. Sessions per user is total_sessions / total_users, where both terms vary across the experiment. The delta method gives a first-order approximation that works when the denominator variance is small, but breaks down when users have wildly different session counts. Bootstrap handles it without ceremony: you resample users with replacement (not sessions — the unit of independence is the user), recompute the ratio, and let the empirical distribution do the work.
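Here is a minimal sketch of the user-level resampling for a ratio metric. It uses revenue per session (total revenue divided by total sessions) as the example ratio and simulates per-user aggregates; the metric choice, the distributions, and the variable names are assumptions for illustration. The key detail is that each resampled user brings both their numerator and their denominator along.

```python
import numpy as np

rng = np.random.default_rng(11)

# Placeholder data: one row per user (the unit of independence).
n_users = 1_000
sessions = rng.poisson(lam=3.0, size=n_users) + 1            # >= 1 session each
revenue = rng.gamma(shape=2.0, scale=5.0, size=n_users) * sessions

point_estimate = revenue.sum() / sessions.sum()  # revenue per session

n_boot = 10_000
boot_ratios = np.empty(n_boot)
for i in range(n_boot):
    # Resample users; each user's numerator and denominator travel together.
    idx = rng.integers(0, n_users, size=n_users)
    boot_ratios[i] = revenue[idx].sum() / sessions[idx].sum()

ci_low, ci_high = np.percentile(boot_ratios, [2.5, 97.5])
print(f"revenue per session: {point_estimate:.2f}")
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```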
Bootstrap vs parametric tests
| | Parametric tests | Bootstrap |
|---|---|---|
| Assumptions | Normality or large sample | Minimal |
| Speed | Instant | Seconds (10K iterations) |
| Arbitrary statistics | No (need an SE formula) | Yes |
| Accuracy at small n | Depends on assumptions | Depends on data |
| Interpretation | Familiar | Familiar |
If your data is roughly normal and you have an analytical formula, a parametric test gives the same answer faster. Bootstrap is not "better" — it is more general. The right rule of thumb is: use a parametric test when the assumptions are obviously satisfied, and reach for bootstrap when they are not or when no formula exists.
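To see the "same answer" claim in action, compare both intervals on data that really is close to normal. The sketch below assumes a simulated normal sample of 200 points and uses the 1.96 normal-approximation interval as the parametric side; the sample and seed are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=50.0, scale=10.0, size=200)  # placeholder, roughly normal

# Parametric: normal-approximation interval for the mean.
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(data.size)
param_ci = (mean - 1.96 * se, mean + 1.96 * se)

# Bootstrap: percentile interval from 10,000 resampled means.
idx = rng.integers(0, data.size, size=(10_000, data.size))
boot_means = data[idx].mean(axis=1)
boot_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

print(f"parametric: [{param_ci[0]:.2f}, {param_ci[1]:.2f}]")
print(f"bootstrap:  [{boot_ci[0]:.2f}, {boot_ci[1]:.2f}]")
```

On well-behaved data like this the two intervals should nearly coincide, so the formula wins on speed alone.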
Common pitfalls
The first trap is using too few iterations. People see "Monte Carlo" and assume 100 or 500 resamples will do — they will not. CI endpoints are themselves random variables, and at 500 iterations two runs of the same code give visibly different intervals. The minimum for stability is 5,000; the working standard is 10,000. For p-value estimation in the tails you may need 50,000 because the tails are sparser by construction.
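The run-to-run drift is easy to measure: repeat the whole bootstrap many times and see how much one CI endpoint moves. The sketch below assumes a skewed placeholder sample of 100 points and tracks the upper endpoint of a 95% CI for the mean at B = 500 and B = 10,000; the helper name `upper_endpoint` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.lognormal(mean=0.0, sigma=1.0, size=100)  # placeholder skewed sample

def upper_endpoint(sample, n_boot, rng):
    """97.5th percentile of the bootstrap distribution of the mean."""
    idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
    return np.percentile(sample[idx].mean(axis=1), 97.5)

n_runs = 30
spread = {}
for n_boot in (500, 10_000):
    # Re-run the full bootstrap n_runs times and record the endpoint each time.
    endpoints = [upper_endpoint(data, n_boot, rng) for _ in range(n_runs)]
    spread[n_boot] = np.std(endpoints)
    print(f"B={n_boot:>6}: std of upper endpoint over {n_runs} runs = {spread[n_boot]:.4f}")
```

Monte Carlo error shrinks roughly with the square root of B, so going from 500 to 10,000 resamples should cut the endpoint noise by a factor of about four to five.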
The second trap is forgetting replace=True. Without replacement, every "resample" is a permutation of the original data, and every statistic is identical to the original. The code runs, prints a CI of zero width, and looks wrong only if you check. Permutation tests are a real method too, but they test the null hypothesis that two groups are exchangeable, not how uncertain a statistic is.
The third trap is non-representative input. Bootstrap creates new samples from the data you have. If that data is biased — survivorship bias, a logging bug that drops failed sessions — the bootstrap faithfully reproduces the bias. Garbage in, garbage out, with a tight CI to make the garbage look credible. Always sanity-check the original sample before resampling it 10,000 times.
The fourth trap is dependent data. Standard bootstrap assumes observations are independent. For time series that assumption is dead on arrival because today's value is correlated with yesterday's. You need a block bootstrap, which resamples contiguous chunks rather than individual points, preserving the local autocorrelation structure. For hierarchical data you resample at the highest level of independence, not the row level.
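A moving block bootstrap can be sketched as follows. This is one simple variant (overlapping blocks of fixed length), not the only one. The AR(1) test series, the block length of 25, and the function names are assumptions for illustration. The comparison at the end shows the practical consequence: on positively autocorrelated data the naive bootstrap produces a narrower interval than the block version.

```python
import numpy as np

def moving_block_resample(series, block_len, rng):
    """One resample stitched together from random contiguous blocks."""
    n = series.size
    n_blocks = -(-n // block_len)  # ceiling division
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    # Each row is one block of consecutive positions; flatten and trim to n.
    idx = (starts[:, None] + np.arange(block_len)).ravel()[:n]
    return series[idx]

def ci_width(series, resample_fn, n_boot, rng):
    """Width of the 95% percentile CI for the mean under a given resampler."""
    means = np.array([resample_fn(series, rng).mean() for _ in range(n_boot)])
    lo, hi = np.percentile(means, [2.5, 97.5])
    return hi - lo

rng = np.random.default_rng(2)

# Placeholder AR(1) series: today's value depends on yesterday's.
n, phi = 1_000, 0.8
noise = rng.normal(size=n)
series = np.empty(n)
series[0] = noise[0]
for t in range(1, n):
    series[t] = phi * series[t - 1] + noise[t]

iid_width = ci_width(series, lambda s, r: r.choice(s, size=s.size, replace=True), 2_000, rng)
block_width = ci_width(series, lambda s, r: moving_block_resample(s, 25, r), 2_000, rng)

print(f"naive bootstrap CI width: {iid_width:.3f}")
print(f"block bootstrap CI width: {block_width:.3f}")
```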
The fifth trap is bootstrapping on the wrong unit. If your randomisation unit in an experiment is the user but you bootstrap on sessions, you underestimate variance because sessions within a user are correlated. Always resample at the unit of randomisation. This is the single most common quiet mistake in shipped experimentation pipelines.
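The variance gap between the two units is easy to reproduce on synthetic data. The sketch below assumes 300 users whose sessions share a user-level effect, so sessions within a user are correlated; all distributions and names are placeholders. It compares the CI width for the mean session value when resampling sessions versus resampling users.

```python
import numpy as np

rng = np.random.default_rng(9)

# Placeholder experiment arm: 300 users, several correlated sessions each.
n_users = 300
sessions_per_user = rng.poisson(lam=4.0, size=n_users) + 1
user_ids = np.repeat(np.arange(n_users), sessions_per_user)
user_effect = rng.normal(0.0, 2.0, size=n_users)  # shared within a user
values = 10.0 + user_effect[user_ids] + rng.normal(0.0, 1.0, size=user_ids.size)

# Per-user totals let us resample whole users cheaply.
user_sum = np.bincount(user_ids, weights=values)
user_cnt = np.bincount(user_ids)

n_boot = 5_000
session_means = np.empty(n_boot)
user_means = np.empty(n_boot)
for i in range(n_boot):
    # Wrong unit: treat every session as independent.
    session_means[i] = rng.choice(values, size=values.size, replace=True).mean()
    # Right unit: resample users, keeping all of each user's sessions together.
    idx = rng.integers(0, n_users, size=n_users)
    user_means[i] = user_sum[idx].sum() / user_cnt[idx].sum()

session_width = np.diff(np.percentile(session_means, [2.5, 97.5]))[0]
user_width = np.diff(np.percentile(user_means, [2.5, 97.5]))[0]

print(f"session-level CI width: {session_width:.3f}")
print(f"user-level CI width:    {user_width:.3f}")
```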
Interview questions you will actually hear
These are the five questions that recur in data scientist loops at Meta, Airbnb, DoorDash, and similar companies. Memorise the structure of each answer, not the words.
What is bootstrap in plain English? It is a way to estimate the uncertainty of a statistic — a mean, a median, a difference, a ratio — without needing a formula for its standard error. You take your sample, draw many new samples of the same size from it with replacement, compute the statistic on each one, and look at how it varies. The spread gives you a confidence interval directly via percentiles.
Why with replacement? Without replacement, every resample is just a permutation of the original data, so every statistic equals the original and you learn nothing. With replacement, each resample differs — some observations duplicate, some drop out — and that variability is exactly what generates the empirical distribution you need.
When is bootstrap better than analytical formulas? When there is no standard-error formula for your statistic (medians, percentiles, ratios), when the distribution is far from normal, or when your sample is small and you do not trust the parametric assumptions. For a plain mean from a roughly normal large sample, a t-test is faster and just as accurate.
How many iterations do you need? For a 95% confidence interval, 5,000 is the minimum and 10,000 is the standard. For p-value estimation, especially in the tails, you may need 50,000 or 100,000 because the rare-event probability you are estimating has high relative variance.
How do you apply bootstrap to an A/B test? Resample with replacement from control and treatment independently. For each iteration compute the difference of the chosen metric. Build a 95% CI from the 2.5th and 97.5th percentiles of those differences. If zero is outside the interval, the effect is significant at the 5% level. Resample at the unit of randomisation, not the row level.
Related reading
- Bootstrap explained simply
- How to calculate bootstrap CI in SQL
- How to calculate confidence interval in SQL
- How to calculate delta method in SQL
- A/B testing peeking mistake
FAQ
What is bootstrap in statistics?
Bootstrap is a resampling method that estimates the confidence interval of a statistic without analytical formulas. From the original sample you repeatedly draw random resamples of the same size with replacement, compute the statistic on each, and then read percentiles off the resulting empirical distribution. It works for means, medians, ratios, regression coefficients, almost anything.
How is bootstrap different from a permutation test?
Bootstrap resamples with replacement from each group separately and estimates the distribution of a statistic. A permutation test shuffles group labels without replacement and tests the null hypothesis that the two groups are exchangeable. They solve different problems: bootstrap builds confidence intervals, permutation gives p-values for null-hypothesis tests. They often coexist in the same experimentation framework.
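For contrast with the bootstrap, here is a minimal sketch of a two-sided permutation test on the difference of means. The two simulated groups and the seed are placeholders; the add-one correction in the p-value is a common convention, included here as a design choice.

```python
import numpy as np

rng = np.random.default_rng(4)

# Placeholder groups with a real shift between them.
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=1.0, scale=1.0, size=100)

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    # Shuffle the pooled values: relabels groups WITHOUT replacement.
    shuffled = rng.permutation(pooled)
    diff = shuffled[a.size:].mean() - shuffled[:a.size].mean()
    extreme += int(abs(diff) >= abs(observed))

# Add-one correction keeps the estimated p-value strictly above zero.
p_value = (extreme + 1) / (n_perm + 1)
print(f"observed diff: {observed:.3f}, permutation p-value: {p_value:.4f}")
```

The output is a p-value under the null of exchangeable groups, not an interval for the effect size, which is the practical difference from the bootstrap.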
How many iterations should I use for bootstrap?
For a stable 95% confidence interval, 5,000 iterations is the minimum and 10,000 is the working standard at most data science teams. Below 5,000 the endpoints drift visibly between runs of the same code. For p-value estimation in the tails of the distribution you may need 50,000 to 100,000 to get reliable estimates of rare-event probabilities.
Can I bootstrap a small sample?
Yes, but with caveats. Bootstrap works mechanically at any sample size, but at n below 15 the resample distribution becomes discrete and noisy. And at any sample size, if the original sample is unrepresentative, the bootstrap faithfully reproduces the bias. For very small samples, Bayesian methods with a sensible prior often give more honest uncertainty estimates than a vanilla bootstrap.
Does bootstrap work for time series?
Not the standard version. Standard bootstrap assumes independent observations, and time series are almost always autocorrelated. You need a block bootstrap, which resamples contiguous chunks of the series rather than individual points, preserving the short-range dependence structure. Choosing the block length is the main practical tuning decision.
Is bootstrap a Bayesian method?
No. Bootstrap is a frequentist resampling technique — it estimates the sampling distribution of a statistic under the assumption that the observed sample is representative of the population. There is a Bayesian cousin called the Bayesian bootstrap, which reweights observations with Dirichlet-distributed weights, but the percentile bootstrap people mean in 99% of conversations is strictly frequentist.
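The Bayesian bootstrap is short enough to sketch for comparison. Instead of resampling rows, each draw assigns the observations random weights from a flat Dirichlet distribution and computes a weighted statistic. The example below uses a weighted mean on a small 20-value sample; the choice of statistic and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
data = np.array([12, 15, 14, 10, 13, 18, 22, 11, 9, 16,
                 14, 20, 13, 17, 15, 19, 11, 25, 14, 12], dtype=float)

n_draws = 10_000
# Each row is one weight vector from a flat Dirichlet(1, ..., 1); rows sum to 1.
weights = rng.dirichlet(np.ones(data.size), size=n_draws)

# Weighted mean of the data under each weight vector.
posterior_means = weights @ data

lo, hi = np.percentile(posterior_means, [2.5, 97.5])
print(f"sample mean: {data.mean():.2f}")
print(f"95% interval (Bayesian bootstrap): [{lo:.2f}, {hi:.2f}]")
```

The resulting interval is read as a posterior credible interval, which is the interpretive difference from the percentile bootstrap even when the numbers come out similar.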