Bootstrap in A/B testing
Why bootstrap matters in A/B testing
Imagine you are a data scientist at a marketplace like Airbnb or DoorDash. A pricing experiment has been running for ten days, the dashboard says revenue per user is up 3.8%, and your PM wants a green or red call by Monday. The reflex is a two-sample t-test. The problem is that revenue per user is one of the worst-behaved metrics in the product analytics toolkit — heavy-tailed, dominated by whales, full of zeros for non-purchasers. The central limit theorem is supposed to rescue you at large N, but the convergence is painfully slow on this distribution.
Bootstrap fills that gap. It is a non-parametric way to estimate the uncertainty of any statistic without pretending the underlying distribution is normal. You hand it your data, it hands you back a confidence interval and a p-value. The price is compute, not assumptions. Netflix, Stripe, and Microsoft's experimentation platform all use bootstrap variants in production because metric definitions evolve faster than the textbook can keep up.
The resampling idea in one minute
The mechanics are short enough to memorize before a screen. Take your original sample of size N. Draw N rows from it with replacement — a single row can appear several times in one resample, and others may not appear at all. That is one bootstrap sample. Compute the statistic of interest on it, store the result, and repeat between 1,000 and 10,000 times. The 2.5th and 97.5th percentiles of that vector form a 95% percentile confidence interval; the standard deviation is the bootstrap standard error.
The reason this works is a theorem. Under mild regularity conditions, the distribution of the bootstrap statistic around the sample statistic converges to the distribution of the sample statistic around the true population parameter. In plain words: the variability you see from resampling your own data is a good proxy for the variability you would see by collecting fresh samples from the population.
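As a quick sanity check of that claim, the sketch below compares the bootstrap standard error of the mean with the textbook formula s / sqrt(N) on synthetic data. The exponential sample, its size, and the helper name bootstrap_se are illustrative assumptions, not taken from any real experiment.

```python
import numpy as np

def bootstrap_se(data, n_iter=2000, seed=0):
    """Bootstrap standard error of the mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.empty(n_iter)
    for i in range(n_iter):
        # One resample of size n, drawn with replacement
        reps[i] = rng.choice(data, size=n, replace=True).mean()
    return reps.std(ddof=1)

# Synthetic sample, assumed for illustration only
sample = np.random.default_rng(1).exponential(scale=10, size=2000)

se_boot = bootstrap_se(sample)
se_formula = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"bootstrap SE: {se_boot:.3f}   formula SE: {se_formula:.3f}")
```

For a plain mean the two numbers should land close together, which is the point: resampling reproduces the variability that the formula predicts, and it keeps working for statistics that have no such formula.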
Bootstrap for a single arm
Here is the minimal NumPy implementation for a 95% confidence interval on the mean of one arm. It works just as well for the median, the 90th percentile, or any custom function — swap sample.mean() for np.median(sample) and the rest of the code does not change.
    import numpy as np

    def bootstrap_mean(data, n_iter=10000, seed=42):
        """Return n_iter bootstrap replicates of the sample mean."""
        rng = np.random.default_rng(seed)
        n = len(data)
        means = np.empty(n_iter)  # pre-allocated output
        for i in range(n_iter):
            # Draw n rows with replacement and record the statistic
            sample = rng.choice(data, size=n, replace=True)
            means[i] = sample.mean()
        return means

    # Usage on a synthetic revenue distribution
    data = np.random.default_rng(0).exponential(scale=10, size=1000)
    boot_means = bootstrap_mean(data)
    ci_lower, ci_upper = np.percentile(boot_means, [2.5, 97.5])
    print(f"95% CI for the mean: [{ci_lower:.2f}, {ci_upper:.2f}]")

The function pre-allocates the output array instead of appending to a list, which avoids repeated list growth at large iteration counts. The default_rng API avoids the global-state surprises of the legacy np.random interface.
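To make the "swap the statistic" point concrete, here is a sketch that takes the statistic as an argument. The name bootstrap_stat and its stat_fn parameter are hypothetical, introduced only for this example, and the data is synthetic.

```python
import numpy as np

def bootstrap_stat(data, stat_fn, n_iter=5000, seed=42):
    """Bootstrap replicates of an arbitrary statistic (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    reps = np.empty(n_iter)
    for i in range(n_iter):
        # stat_fn is any function mapping a 1-D array to a single number
        reps[i] = stat_fn(rng.choice(data, size=n, replace=True))
    return reps

# Synthetic data, assumed for illustration
data = np.random.default_rng(0).exponential(scale=10, size=1000)

median_ci = np.percentile(bootstrap_stat(data, np.median), [2.5, 97.5])
p90_ci = np.percentile(
    bootstrap_stat(data, lambda x: np.percentile(x, 90)), [2.5, 97.5]
)
print(f"median 95% CI: [{median_ci[0]:.2f}, {median_ci[1]:.2f}]")
print(f"p90 95% CI:    [{p90_ci[0]:.2f}, {p90_ci[1]:.2f}]")
```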
Bootstrap for an A/B difference
For an A/B test the statistic of interest is the difference between treatment and control. Resample each arm independently, compute the difference of means on each pair of resamples, and read the confidence interval off the resulting vector.
    def bootstrap_diff(control, treatment, n_iter=10000, seed=42):
        """Return n_iter bootstrap replicates of mean(treatment) - mean(control)."""
        rng = np.random.default_rng(seed)
        n_c, n_t = len(control), len(treatment)
        diffs = np.empty(n_iter)
        for i in range(n_iter):
            # Resample each arm independently, keeping its original size
            c_sample = rng.choice(control, size=n_c, replace=True)
            t_sample = rng.choice(treatment, size=n_t, replace=True)
            diffs[i] = t_sample.mean() - c_sample.mean()
        return diffs

    # Synthetic placeholder arms; replace with your per-user revenue arrays
    gen = np.random.default_rng(0)
    control_revenue = gen.exponential(scale=10.0, size=1000)
    treatment_revenue = gen.exponential(scale=10.5, size=1000)

    diffs = bootstrap_diff(control_revenue, treatment_revenue)
    ci_lower, ci_upper = np.percentile(diffs, [2.5, 97.5])
    p_value = 2 * min(np.mean(diffs > 0), np.mean(diffs < 0))
    print(f"Diff CI: [{ci_lower:.2f}, {ci_upper:.2f}] p ≈ {p_value:.4f}")

Two things to notice. First, the p-value here is a two-sided approximation that doubles the smaller tail mass of the bootstrap distribution on either side of zero; for a strict null hypothesis test, prefer a permutation test, covered below. Second, this is a percentile CI: if the bootstrap distribution is skewed, a bias-corrected and accelerated (BCa) interval usually has coverage closer to the nominal 95%, at the cost of some additional code.
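For reference, here is a minimal sketch of a BCa interval, written for a single arm to keep it short. It follows the standard recipe: a bias correction z0 from the share of replicates below the point estimate, and an acceleration term from leave-one-out (jackknife) estimates. The function name bca_interval and the synthetic data are assumptions for illustration; a two-arm version would jackknife across both arms, and libraries such as SciPy offer method="BCa" in scipy.stats.bootstrap.

```python
import numpy as np
from statistics import NormalDist

def bca_interval(data, stat_fn=np.mean, n_iter=5000, alpha=0.05, seed=42):
    """BCa confidence interval for one sample (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    rng = np.random.default_rng(seed)
    theta_hat = stat_fn(data)

    # Ordinary bootstrap replicates
    boot = np.array(
        [stat_fn(rng.choice(data, size=n, replace=True)) for _ in range(n_iter)]
    )

    # Bias correction: how far the replicates sit from the point estimate
    norm = NormalDist()
    prop_below = np.mean(boot < theta_hat)
    prop_below = min(max(prop_below, 1 / n_iter), 1 - 1 / n_iter)  # avoid +-inf
    z0 = norm.inv_cdf(prop_below)

    # Acceleration: skewness of the leave-one-out estimates
    jack = np.array([stat_fn(np.delete(data, i)) for i in range(n)])
    d = jack.mean() - jack
    accel = np.sum(d ** 3) / (6.0 * np.sum(d ** 2) ** 1.5)

    def adjusted(q):
        # Map a nominal quantile level to its BCa-adjusted level
        z = norm.inv_cdf(q)
        return norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))

    lo_q, hi_q = adjusted(alpha / 2), adjusted(1 - alpha / 2)
    lo, hi = np.quantile(boot, [lo_q, hi_q])
    return lo, hi

# Synthetic right-skewed sample, assumed for illustration
data = np.random.default_rng(0).exponential(scale=10, size=500)
bca_lo, bca_hi = bca_interval(data)
print(f"BCa 95% CI for the mean: [{bca_lo:.2f}, {bca_hi:.2f}]")
```

The jackknife loop costs one extra pass of N statistic evaluations, which is cheap for a mean but worth keeping in mind for expensive custom metrics.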
Permutation test as a sibling
Bootstrap estimates the distribution of a statistic under the data-generating process you actually observed. A permutation test estimates the distribution under the strict null that the two arms come from the same population. Both procedures resample, but they answer different questions, and on a strict null hypothesis test the permutation answer is cleaner.
    def permutation_test(control, treatment, n_iter=10000, seed=42):
        """Two-sided permutation p-value for the difference in means."""
        rng = np.random.default_rng(seed)
        control = np.asarray(control)
        treatment = np.asarray(treatment)
        observed_diff = treatment.mean() - control.mean()
        combined = np.concatenate([control, treatment])  # fresh copy, safe to shuffle
        n_c = len(control)
        null_diffs = np.empty(n_iter)
        for i in range(n_iter):
            # Shuffle the pooled values, then split them back into pseudo-arms
            rng.shuffle(combined)
            null_diffs[i] = combined[n_c:].mean() - combined[:n_c].mean()
        # Share of shuffled differences at least as extreme as the observed one
        p_value = np.mean(np.abs(null_diffs) >= np.abs(observed_diff))
        return p_value

Use a permutation test when the question is binary (did the treatment do anything at all) and reach for bootstrap when the question is quantitative, such as "how big is the effect and how sure are we." On most modern A/B platforms both are available side by side.
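One practical wrinkle: with a finite number of shuffles, a raw proportion can come out as exactly 0, which overstates the evidence. A common convention is to add one to both the count and the number of shuffles. The sketch below shows that variant on synthetic arms; the function name permutation_p_value and all the data are assumptions for illustration.

```python
import numpy as np

def permutation_p_value(control, treatment, n_iter=10000, seed=42):
    """Permutation p-value with the add-one correction (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    observed = abs(treatment.mean() - control.mean())
    pooled = np.concatenate([control, treatment])
    n_c = len(control)
    hits = 0
    for _ in range(n_iter):
        shuffled = rng.permutation(pooled)  # shuffled copy of the pooled arms
        diff = shuffled[n_c:].mean() - shuffled[:n_c].mean()
        hits += abs(diff) >= observed
    # Add-one correction keeps the p-value strictly above zero
    return (hits + 1) / (n_iter + 1)

# Synthetic arms, assumed for illustration
gen = np.random.default_rng(0)
arm_a = gen.exponential(scale=10, size=500)
arm_b = gen.exponential(scale=10, size=500)        # same distribution as arm_a
arm_c = gen.exponential(scale=10, size=500) + 5.0  # clearly shifted

p_same = permutation_p_value(arm_a, arm_b, n_iter=2000)
p_shift = permutation_p_value(arm_a, arm_c, n_iter=2000)
print(f"A/A-style p-value: {p_same:.4f}   shifted arm p-value: {p_shift:.4f}")
```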
Bootstrap vs t-test
The t-test assumes the sampling distribution of your statistic is approximately normal. For a sample mean drawn from a roughly symmetric distribution with finite variance, that is usually true once N is in the low hundreds. For a sample mean drawn from a heavy-tailed distribution like revenue per user, sample sizes in the tens of thousands can still leave the sampling distribution visibly skewed, and in that regime the t-interval's actual coverage can fall short of its nominal level without any warning.
Bootstrap is non-parametric. It works for any statistic you can compute: medians, ratios, percentiles, custom aggregations like "average LTV per acquisition cohort." It is slower than a t-test by orders of magnitude, but usually still practical: 10,000 iterations over a million rows is ten billion draws, which a plain NumPy loop on a laptop typically works through in minutes rather than seconds. The right mental model is that the t-test is a fast default for well-behaved metrics and bootstrap is the flexible fallback. Senior analysts run both and worry when they disagree.
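A sketch of the "run both" habit: compute a classical interval and a bootstrap interval for the same difference in means and put them side by side. The classical interval below uses a normal critical value as a large-sample stand-in for the t quantile. The function names and the lognormal arms are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

def classical_interval(control, treatment, level=0.95):
    """Unequal-variance interval for the difference in means (large-sample z)."""
    diff = treatment.mean() - control.mean()
    se = np.sqrt(
        treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control)
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se

def bootstrap_interval(control, treatment, n_iter=5000, level=0.95, seed=42):
    """Percentile bootstrap interval for the difference in means."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        c = rng.choice(control, size=len(control), replace=True)
        t = rng.choice(treatment, size=len(treatment), replace=True)
        diffs[i] = t.mean() - c.mean()
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(diffs, [tail, 100 - tail])
    return lo, hi

# Synthetic heavy-tailed arms, assumed for illustration
gen = np.random.default_rng(0)
control = gen.lognormal(mean=1.0, sigma=1.5, size=3000)
treatment = gen.lognormal(mean=1.05, sigma=1.5, size=3000)

classical = classical_interval(control, treatment)
boot = bootstrap_interval(control, treatment)
print(f"classical: [{classical[0]:.2f}, {classical[1]:.2f}]")
print(f"bootstrap: [{boot[0]:.2f}, {boot[1]:.2f}]")
```

When the two intervals roughly agree, the cheap one is probably fine. When they diverge, that is a signal to look at the tail of the metric before making the call.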
When bootstrap is the right tool
Non-normal data is the textbook case. Revenue per user, session length, time-to-conversion, and ad spend per advertiser are all heavy-tailed, and all are metrics where a t-test can give you a misleadingly tight interval. They are also routinely measured in production experiments at companies like Stripe and Uber.
Custom metrics are the second case. "Average LTV across the first 90 days for users acquired through paid social" wires together a windowed sum, a join on attribution, and a per-user aggregation. Deriving the variance of that statistic in closed form is a multi-hour exercise. Bootstrapping it is a five-minute exercise.
Ratio metrics are the third: revenue per session, clicks per impression, orders per visit. Bootstrap resamples the randomization units (usually users, each carrying its own numerator and denominator), recomputes the ratio of sums, and gives you approximately the same interval as the delta method with less algebra. Quantile metrics are the fourth: the t-test is built for the mean, but bootstrap handles every quantile with the same code.
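For a ratio metric, a minimal sketch looks like this, assuming the experiment randomizes by user and each user contributes a numerator (clicks) and a denominator (impressions). The function name bootstrap_ratio and the synthetic traffic are illustrative assumptions.

```python
import numpy as np

def bootstrap_ratio(numer, denom, n_iter=5000, seed=42):
    """Bootstrap a ratio of sums by resampling users (hypothetical helper).

    numer[i] and denom[i] are the per-user totals for user i.
    """
    rng = np.random.default_rng(seed)
    numer = np.asarray(numer, dtype=float)
    denom = np.asarray(denom, dtype=float)
    n = len(numer)
    ratios = np.empty(n_iter)
    for i in range(n_iter):
        # Pick users with replacement; numerator and denominator travel together
        idx = rng.integers(0, n, size=n)
        ratios[i] = numer[idx].sum() / denom[idx].sum()
    return ratios

# Synthetic per-user traffic, assumed for illustration
gen = np.random.default_rng(0)
impressions = gen.poisson(lam=20, size=2000) + 1  # at least one per user
clicks = gen.binomial(impressions, 0.05)

ctr = clicks.sum() / impressions.sum()
ci = np.percentile(bootstrap_ratio(clicks, impressions), [2.5, 97.5])
print(f"CTR: {ctr:.4f}   95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Resampling users rather than individual impressions keeps each user's numerator and denominator paired, which is what makes the interval respect the unit of randomization.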
Common pitfalls
When teams adopt bootstrap for the first time, the most frequent trap is running it on samples that are too small. If your observed data has only 20 rows, resampling those 20 rows ten thousand times does not invent new information; the interval you compute will be tight and dangerously over-confident. The practical rule is that bootstrap is reliable once each arm has at least a few hundred observations, and the heavier the tail the higher that floor needs to be.
The second trap is forgetting the independence assumption. Bootstrap as written above assumes the rows you resample are independent and identically distributed. Time series, repeated observations of the same user across days, and clustered data such as multiple sessions per household all violate this. The fix is a block bootstrap for time-ordered data and a cluster bootstrap for grouped data — resample at the level of the group, not the row, so within-group correlation is preserved.
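A cluster bootstrap for the mean can be sketched as follows: aggregate to one row per cluster, then resample clusters. The function name, the per-user grouping, and the synthetic sessions are assumptions for illustration.

```python
import numpy as np

def cluster_bootstrap_mean(values, cluster_ids, n_iter=2000, seed=42):
    """Bootstrap the row-level mean by resampling whole clusters."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Map arbitrary cluster labels to integer codes 0..k-1
    _, codes = np.unique(cluster_ids, return_inverse=True)
    k = codes.max() + 1
    sums = np.bincount(codes, weights=values, minlength=k)  # per-cluster totals
    counts = np.bincount(codes, minlength=k)                # per-cluster row counts
    reps = np.empty(n_iter)
    for i in range(n_iter):
        pick = rng.integers(0, k, size=k)  # draw k clusters with replacement
        reps[i] = sums[pick].sum() / counts[pick].sum()
    return reps

# Synthetic data: 300 users, 10 correlated sessions each (assumed)
gen = np.random.default_rng(0)
n_users, sessions_per_user = 300, 10
user_ids = np.repeat(np.arange(n_users), sessions_per_user)
user_effect = gen.normal(0, 5, size=n_users)
session_value = 20 + user_effect[user_ids] + gen.normal(0, 1, size=len(user_ids))

cluster_se = cluster_bootstrap_mean(session_value, user_ids).std(ddof=1)
naive_se = session_value.std(ddof=1) / np.sqrt(len(session_value))
print(f"cluster SE: {cluster_se:.3f}   naive row-level SE: {naive_se:.3f}")
```

In this setup the row-level standard error treats 3,000 sessions as independent, while the cluster version recognizes there are really only 300 independent users, so it comes out noticeably larger.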
The third trap is computational cost on production-sized data. Ten thousand iterations of a complex metric on a hundred million rows can spin a notebook for hours. Vectorize ruthlessly, subsample if the metric is well-conditioned, and when the data does not fit in memory, push the resampling into the warehouse, for example by joining against a replicate index built with generate_series.
The fourth trap is heavy-tail dominance. When a metric has a long right tail, the largest few values move the bootstrap mean every iteration, and the resulting interval is wide and dominated by them. The fix is to use a more robust statistic (median, trimmed mean, or winsorized mean) so a single whale does not control the answer, keeping in mind that this changes the quantity you are estimating.
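As an illustration of the robust-statistic route, the sketch below bootstraps both the plain mean and a mean winsorized at the 99th percentile on the same synthetic heavy-tailed sample. The 99% cap, the function names, and the lognormal data are assumptions chosen for the example.

```python
import numpy as np

def winsorized_mean(x, upper_pct=99.0):
    """Mean after capping values above the upper_pct percentile."""
    cap = np.percentile(x, upper_pct)
    return np.minimum(x, cap).mean()

def percentile_ci(data, stat_fn, n_iter=3000, seed=42):
    """95% percentile bootstrap interval for stat_fn (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.empty(n_iter)
    for i in range(n_iter):
        # The cap is recomputed inside every resample via stat_fn
        reps[i] = stat_fn(rng.choice(data, size=n, replace=True))
    return np.percentile(reps, [2.5, 97.5])

# Synthetic whale-heavy revenue, assumed for illustration
revenue = np.random.default_rng(0).lognormal(mean=1.0, sigma=2.0, size=5000)

mean_lo, mean_hi = percentile_ci(revenue, np.mean)
wins_lo, wins_hi = percentile_ci(revenue, winsorized_mean)
print(f"mean CI width:       {mean_hi - mean_lo:.2f}")
print(f"winsorized CI width: {wins_hi - wins_lo:.2f}")
```

The winsorized interval is narrower because the whales no longer swing each resample, but it describes capped revenue, not total revenue, so the two intervals answer different questions.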
The fifth trap is blind application. Bootstrap does not magically rescue a broken experiment design. If your randomization is unbalanced, if there is selection bias in who entered the test, or if novelty effects are still active, no amount of resampling will fix that. Run an A/A test first, and read your bootstrap interval only after you trust the design.
Performance tricks
Vectorize before you parallelize. rng.choice accepts a 2D size such as (n_iter, N), which lets you draw all your resamples in a single call instead of looping iteration by iteration. For the simple mean and difference-of-means cases this removes the per-iteration Python overhead and can be several times faster than the explicit loop in the snippets above, at the cost of holding an n_iter-by-N array in memory.
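A minimal sketch of the single-call version for the mean, assuming the full matrix of resamples fits in memory. The function name is hypothetical and the data is synthetic.

```python
import numpy as np

def bootstrap_mean_vectorized(data, n_iter=5000, seed=42):
    """Draw every resample in one call and reduce along the row axis."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Shape (n_iter, n): each row is one bootstrap resample
    resamples = rng.choice(data, size=(n_iter, len(data)), replace=True)
    return resamples.mean(axis=1)

# Synthetic data, assumed for illustration
data = np.random.default_rng(0).exponential(scale=10, size=1000)
boot_means = bootstrap_mean_vectorized(data)
ci_lower, ci_upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: [{ci_lower:.2f}, {ci_upper:.2f}]")
```

With 5,000 iterations and 1,000 rows the intermediate array holds five million floats. For much larger inputs, drawing the resamples in batches keeps most of the speedup while capping memory.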
Subsample when the metric is well-conditioned. If your dataset has millions of rows and the metric is a simple mean, drawing each resample at the full size N gives diminishing returns past about 100,000 observations. Capping the resample size at m while keeping the iteration count steady is a common production trick, but a resample of size m carries the noise of a sample of size m, so shrink the resulting spread by sqrt(m/N) (or accept a conservative interval) before reporting.
Parallelize with multiprocessing or joblib when you cannot vectorize. Bootstrap iterations are embarrassingly parallel, and a 16-core machine can give you a 10-12x wall-clock speedup without much code change. For the largest jobs, compile with numba or run on GPU through JAX.
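On the subsampling trick, one detail is easy to miss: resamples of size m smaller than N are noisier than resamples of size N, so the raw spread overstates the uncertainty of the full-sample estimate. For a mean, scaling the bootstrap standard error by sqrt(m / N) brings it back in line. The sketch below checks that on synthetic data; the function name, the sizes, and the rescaling for a plain mean are assumptions of this example, not a general recipe for every statistic.

```python
import numpy as np

def subsampled_bootstrap_se(data, m, n_iter=1000, seed=42):
    """Standard error of the mean from size-m resamples, rescaled to size N."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    reps = np.empty(n_iter)
    for i in range(n_iter):
        reps[i] = rng.choice(data, size=m, replace=True).mean()
    # Size-m resamples are too wide by sqrt(n / m); scale the spread back
    return reps.std(ddof=1) * np.sqrt(m / n)

# Synthetic data, assumed for illustration
data = np.random.default_rng(0).exponential(scale=10, size=200_000)

se_sub = subsampled_bootstrap_se(data, m=20_000)
se_formula = data.std(ddof=1) / np.sqrt(len(data))
print(f"rescaled subsample SE: {se_sub:.4f}   formula SE: {se_formula:.4f}")
```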
The interview answer
"What is bootstrap" — resampling your observed data with replacement to estimate the sampling distribution of any statistic. The empirical distribution of your data acts as a stand-in for the population.
"When would you use it" — non-normal data, custom or ratio metrics, quantile-based metrics, and any time the analytical variance formula is missing or fragile.
"What are the alternatives" — t-test for well-behaved means, permutation tests for strict null hypothesis questions, Bayesian methods when you have informative priors, and the delta method for simple ratios.
"Is bootstrap always better" — no. It needs at least a few hundred independent observations per arm, it is heavier than a t-test, and it inherits any design flaw in the underlying experiment. In production at Netflix or Linear you would report both a bootstrap interval and a classical interval so peer reviewers can see when they disagree.
FAQ
How many bootstrap iterations are enough?
For a 95% confidence interval on a simple statistic such as the mean, 1,000 iterations are usually enough to stabilize the bounds to two significant figures. For p-values in the 0.01 range or tighter, push to 10,000. For very small effect sizes near the decision boundary, 50,000 iterations and a fixed random seed are not excessive — you want the answer to be reproducible across runs.
Does bootstrap work for medians and percentiles?
Yes, and that is one of its strongest selling points. The same code that bootstraps a mean produces a confidence interval for the median by replacing sample.mean() with np.median(sample). The same trick works for any percentile or the interquartile range. For very high percentiles such as p99 latency the bootstrap variance grows quickly, so you need both a large sample and many iterations to get a stable answer.
How is bootstrap different from a Bayesian A/B test?
They live in different statistical frameworks but produce similar-looking outputs. A Bayesian test starts from a prior on the metric, updates with the observed data, and reports a posterior. A bootstrap starts from the observed data alone and reports an empirical sampling distribution. Bayesian methods are stronger when you have meaningful prior information; bootstrap is stronger when you do not and want a non-parametric answer. Modern A/B platforms at companies like Netflix often combine both.
Can I bootstrap a tiny experiment with 30 users per arm?
Technically yes, the code runs. Practically no — with 30 users per arm the empirical distribution is a poor stand-in for the population, especially on heavy-tailed metrics, and the bootstrap interval will be misleadingly tight. Pool with historical data, switch to a variance reduction technique like CUPED, or wait for the sample to grow before reporting.
What if my data is not independent?
Standard bootstrap underestimates variance whenever the rows are positively correlated, such as time-series observations or multiple sessions per user. The fix is structural: use block bootstrap for time-ordered data, where you resample contiguous chunks instead of individual rows, and cluster bootstrap for grouped data, where you resample whole groups (a household, a user, an account) and keep all of that group's rows together.
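For time-ordered data, a moving block bootstrap can be sketched like this: pick random starting points, copy a contiguous block from each, and stitch the blocks together until the resample has the original length. The block length of 50, the function name, and the autocorrelated synthetic series are assumptions for illustration; choosing the block length is a tuning decision in practice.

```python
import numpy as np

def block_bootstrap_mean(series, block_len, n_iter=2000, seed=42):
    """Moving block bootstrap replicates of the mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    offsets = np.arange(block_len)
    reps = np.empty(n_iter)
    for i in range(n_iter):
        # Random block starts; each block is a contiguous slice of the series
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = (starts[:, None] + offsets[None, :]).ravel()[:n]
        reps[i] = series[idx].mean()
    return reps

# Synthetic AR(1) series with strong positive autocorrelation (assumed)
gen = np.random.default_rng(0)
noise = gen.normal(size=2000)
series = np.empty(2000)
series[0] = noise[0]
for t in range(1, 2000):
    series[t] = 0.8 * series[t - 1] + noise[t]

block_se = block_bootstrap_mean(series, block_len=50).std(ddof=1)
naive_se = series.std(ddof=1) / np.sqrt(len(series))
print(f"block SE: {block_se:.3f}   naive iid SE: {naive_se:.3f}")
```

On a series like this the block version reports a clearly larger standard error than the row-level formula, which is the variance that an ordinary bootstrap would miss.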
Should I report a p-value or just the confidence interval?
The interval is more informative because it tells you the size and direction of the uncertainty, not just whether it crossed an arbitrary threshold. In an interview, report the interval first and the p-value as a secondary number.