Bootstrap confidence intervals in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why bootstrap CI matters

It is Tuesday afternoon. Your PM at a marketplace pings you on Slack: "Pricing variant B looks +4.2% on revenue per user. Ship it?" The classic t-test answer would be a p-value, but revenue distributions are a mess — heavy-tailed, full of zeros for non-purchasers, dominated by a handful of whales. A t-test will quietly lie to you, because the central limit theorem needs more samples than you have whales.

This is where bootstrap confidence intervals earn their keep. Bootstrap estimates the uncertainty of any statistic — the mean, median, 95th percentile, a ratio of two columns — without assuming the underlying distribution is anything in particular. You hand it your data, it hands you back an interval. The price is compute, not assumptions. For analytics teams at Stripe, Airbnb, or DoorDash, that trade is the right one.

The other reason this question comes up in interviews is that bootstrap forces a candidate to demonstrate two things at once. Can they think probabilistically — do they understand what "the sample is the population, resampled" means? Can they implement it in pure SQL when a notebook is not available, which is common in a warehouse-first workflow on Snowflake or BigQuery? Get both right and you have signaled senior data analyst level.

The resampling idea in one paragraph

The mechanics are short enough to memorize. Take your original sample of size N. Draw N rows from it with replacement — a single row can appear more than once, and others may not appear at all. That is one bootstrap sample. Compute the statistic of interest on it. Repeat B times, usually between 1,000 and 10,000. You now have B values of the statistic. The 95% confidence interval is the empirical 2.5th and 97.5th percentile of those B values. That is the whole algorithm. The rest of this post is how to make that algorithm cooperate with SQL.

Bootstrap CI for the mean

Here is a Postgres-flavored implementation using generate_series to drive iterations and random() to drive resampling. The example computes a 95% CI for mean revenue per user in the variant B arm of a pricing experiment.

WITH source AS (
    SELECT user_id, revenue
    FROM ab_users
    WHERE experiment_id = 'pricing_v2' AND variant = 'B'
),
n_source AS (SELECT COUNT(*) AS n FROM source),
boot_iterations AS (
    SELECT generate_series(1, 1000) AS iteration
),
bootstrap_means AS (
    SELECT
        bi.iteration,
        AVG(s.revenue) AS boot_mean
    FROM boot_iterations bi
    CROSS JOIN LATERAL (
        SELECT revenue
        FROM source
        ORDER BY random()
        LIMIT (SELECT n FROM n_source)
    ) s
    GROUP BY bi.iteration
)
SELECT
    PERCENTILE_CONT(0.025) WITHIN GROUP (ORDER BY boot_mean) AS ci_lower,
    PERCENTILE_CONT(0.5)   WITHIN GROUP (ORDER BY boot_mean) AS boot_median,
    PERCENTILE_CONT(0.975) WITHIN GROUP (ORDER BY boot_mean) AS ci_upper
FROM bootstrap_means;

A subtle point is worth pausing on. The clause ORDER BY random() LIMIT N without any with-replacement trick is technically a 100% subsample. If N equals the size of source, you have just shuffled the table and taken everything back — the result is the original mean every iteration, and your CI collapses to a point. The version above works because the LATERAL subquery shuffles independently per iteration only when the engine treats random() as volatile, which Postgres does. On warehouses with deterministic compilation, you may need to inject the iteration number into the seed.

The cleaner way to express true sampling-with-replacement in SQL is to generate N uniform random indices in [1, N] for each iteration, then look up the corresponding rows via ROW_NUMBER() over source and a join. In practice, on samples above a few thousand rows, the difference between true bootstrap and 90% subsampling is invisible, and most production pipelines settle for the simpler variant. Know the distinction for interviews, even if you cut the corner in real work.

Bootstrap for an A-B difference

The whole reason analysts reach for bootstrap is comparing two groups. Here is the differences-of-means version. The verdict CASE statement encodes the standard rule: if the 95% CI of the difference excludes zero, the experiment is significant at the 5% level.

WITH source_a AS (
    SELECT revenue FROM ab_users WHERE experiment_id = 'pricing_v2' AND variant = 'A'
),
source_b AS (
    SELECT revenue FROM ab_users WHERE experiment_id = 'pricing_v2' AND variant = 'B'
),
boot AS (
    SELECT
        gs AS iteration,
        (SELECT AVG(revenue) FROM (
            SELECT revenue FROM source_b ORDER BY random()
            LIMIT (SELECT COUNT(*) FROM source_b)
        ) b) -
        (SELECT AVG(revenue) FROM (
            SELECT revenue FROM source_a ORDER BY random()
            LIMIT (SELECT COUNT(*) FROM source_a)
        ) a) AS diff
    FROM generate_series(1, 1000) gs
)
SELECT
    PERCENTILE_CONT(0.025) WITHIN GROUP (ORDER BY diff) AS ci_lower,
    PERCENTILE_CONT(0.975) WITHIN GROUP (ORDER BY diff) AS ci_upper,
    AVG(diff) AS mean_diff,
    CASE
        WHEN PERCENTILE_CONT(0.025) WITHIN GROUP (ORDER BY diff) > 0
          OR PERCENTILE_CONT(0.975) WITHIN GROUP (ORDER BY diff) < 0
        THEN 'CI excludes 0, treat AS significant'
        ELSE 'CI contains 0, do NOT ship ON this evidence'
    END AS verdict
FROM boot;

Interviewers love this template because it exposes two real-world habits. One: pairing each iteration's resample for arm A with a fresh independent resample for arm B. Two: presenting the verdict in plain language rather than dumping a number. A senior candidate at Notion or Linear will say "B is ahead by $0.42 per user, 95% CI $0.05 to $0.79, ship" instead of mumbling about percentiles.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Bias-corrected CI in plain words

The naive percentile CI is biased on skewed distributions. Imagine resampling from a distribution where the true mean is dragged up by one whale. Half your bootstrap samples miss the whale, half include them — the bootstrap distribution becomes bimodal and asymmetric, and a simple [2.5%, 97.5%] cut systematically under-covers the true mean. Bias-corrected accelerated (BCa) intervals fix this by computing a bias correction z0 and an acceleration constant a, then mapping back through the standard normal.

z0 = Phi-inverse( P(boot_stat < observed_stat) )
adjusted_lower_quantile = Phi( z0 + (z0 + z_alpha) / (1 - a * (z0 + z_alpha)) )

Implementing it in pure SQL is painful because Postgres ships no inverse normal CDF. The pragmatic interview answer: I know percentile CI is biased on skewed metrics, I would run bootstrap in SQL to materialize the B values, then compute BCa in a small Python or R post-step. That signals you understand both the theory and where SQL stops being the right tool.

Common pitfalls

The first trap is subsampling masquerading as bootstrap. ORDER BY random() LIMIT N without replacement on a table of size N returns the same N rows in a different order, so your bootstrap means collapse to the sample mean and your CI width becomes zero. The fix is either to use index-based resampling with replacement, or to set the LIMIT to roughly 90% of N and accept that you are doing subsample-bootstrap. Either way, sanity-check that your CI has nonzero width on toy data before trusting it on real data.

The second trap is running too few iterations. B = 100 produces CIs that swing by 10 to 20 percent of their own width between runs, which is unacceptable for a decision that ships pricing changes. The conventional minimum is B = 1,000 for an exploratory chart, B = 10,000 for a production decision memo. The compute cost is linear in B, and modern warehouses handle 10,000 iterations on samples of 100,000 rows in seconds. There is no good reason to skimp here.

The third trap is bootstrapping data that violates independence. If a single user contributes thirty sessions to your sample, those rows are not independent observations, and bootstrap at the row level will dramatically understate uncertainty. The fix is to aggregate to the user level before bootstrapping, so each resampled unit is a user, not a session. For genuinely autocorrelated data — clicks within a session, prices within a day — you need block bootstrap, where you resample contiguous chunks instead of individual rows.

The fourth trap is picking the wrong statistic to bootstrap. If the metric the business cares about is p99 latency, you should compute PERCENTILE_CONT(0.99) inside each bootstrap iteration and then take the CI of the resulting p99 values. A common mistake is to take the bootstrap CI of the mean and report it as if it characterized the tail. The mean tells you nothing about p99 and the bootstrap will dutifully give you a tight, useless interval.

The fifth trap is irreproducibility. The default random() in Postgres is not seeded between sessions, so two analysts running the same query get slightly different CIs and end up arguing about it. Call SETSEED(0.42) at the start of the session, document the seed, and the CI is reproducible to the cent.

Optimization tips

If your source table has more than a million rows, materialize the source CTE first so the engine does not re-read disk for every iteration. On Snowflake or BigQuery, this means writing the source filter into a temp table or a CREATE TEMP TABLE AS step before the loop. The bootstrap CTE then scans a hot, narrow table B times instead of joining the wide fact table B times. In practice this cuts wall time by ten to a hundred times on warehouse engines that are not great at LATERAL.

For very large samples, replace the explicit generate_series loop with a column-trick. Tag every source row with B random integers between 1 and B, unnest, and group by iteration. This vectorizes the resample into a single scan with B times the cardinality, which is faster than B separate scans whenever your engine has a fast UNNEST or array path — DuckDB and BigQuery both qualify. Snowflake's approach is similar but uses FLATTEN over an array.

If you only need a CI for the mean and you trust normality at large N, fall back to the asymptotic formula mean +/- 1.96 * std / sqrt(n) and skip bootstrap entirely. It is one query instead of B, and on samples above 50,000 with non-pathological tails it agrees with bootstrap to within a percent. The art is knowing when your tails are non-pathological. A quick PERCENTILE_CONT(0.99) / AVG() ratio above 50 is a strong hint to stay on bootstrap.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems built around exactly these interview patterns.

FAQ

How many bootstrap iterations should I actually run?

For an exploratory chart in a notebook, 1,000 iterations is enough — the CI will be stable to about one percent of its width. For a written analysis that informs a ship-or-not decision, run 10,000. For an internal calculator that recomputes nightly across hundreds of segments, you can drop back to 1,000 because the noise averages out across segments. The trade is purely compute, so default to 10,000 unless you have a reason not to.

Is running bootstrap in pure SQL a good idea, or should I move to Python?

For one-off analyses inside a warehouse-first stack at companies like Snowflake, Databricks, or any modern analytics org, running bootstrap directly in SQL is fine and often faster than exporting data to a notebook. For recurring production pipelines, move the loop to Python with numpy and numba — the inner loop becomes a tight C kernel and you stop fighting the planner. The break-even point in my experience is around three analysts touching the query.

When should I prefer bootstrap over a t-test?

Bootstrap wins when the metric is anything other than a mean, when the distribution is heavy-tailed or has many zeros, or when the sample is small enough that the central limit theorem has not kicked in. T-test wins when you have a clean mean on a normal-ish distribution at N > 1,000 — it is faster and the result is mathematically equivalent within rounding. The skill is recognizing which case you are in before you run the test, not after.

Do I need bias correction in real analyst work?

On metrics like revenue per user, basket size, or session length — heavy-tailed positive distributions — bias correction matters and you should plan to compute BCa intervals outside SQL. On metrics like activation rate, click-through rate, or any bounded ratio in the 1 percent to 99 percent range, the percentile CI is essentially identical to BCa and there is no point bothering. Most interview answers stop at percentile CI and that is fine, as long as you can explain when it breaks.

Can bootstrap be parallelized in SQL?

Not natively in classic row-store engines — the loop over iterations is sequential. DuckDB parallelizes the inner aggregation across cores automatically, which gets you most of the speedup. For genuinely embarrassingly parallel bootstrap on a billion-row sample, use Spark or Ray to distribute iterations across a cluster. The pure-SQL version is fine for sample sizes up to a few million rows; above that, move to a distributed engine.

What seed should I use, and does it matter?

Any constant integer you write down in the analysis. The value does not matter for the statistics — what matters is that you set one. Without SETSEED, every run of your query produces a slightly different CI, and the resulting "why is your number off by $0.03 from yesterday" conversations are a time sink. One seed line saves hours of debugging trust.