CUPED in SQL: variance reduction for A/B tests

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why CUPED matters

CUPED — Controlled-experiment Using Pre-Experiment Data — is the single highest-leverage trick in product experimentation. The setup is familiar. Your PM ships a paywall change at Stripe, Netflix, or DoorDash, and the experiment needs to either hit a 2% revenue uplift in three weeks or be killed. The raw metric is noisy because user revenue spans two orders of magnitude. You either run for three months and miss the launch window, or you call the test early and ship a coin-flip. CUPED gives you a third option: same minimum detectable effect on 20 to 50 percent fewer users, because it strips out variance that has nothing to do with the treatment.

Microsoft published the original paper in 2013, and by now every serious experimentation platform — internal ones at Meta, Netflix, Uber, Airbnb, and the commercial ones like Eppo and Statsig — ships CUPED as a default adjustment. The interview question that lands you a senior analytics role is rarely "do you know CUPED" — it is "write the SQL". Adjusting a metric with a pre-experiment covariate, computing theta from pooled data, then running the t-test on the adjusted series is a five-minute screen for whether you have ever actually shipped an A/B test or just read the wiki.

This post walks through the recipe end to end with a runnable SQL query, the algebra you need to defend in a code review, and the traps that hiring managers know good candidates fall into.

The covariate idea in one paragraph

Suppose every user in the experiment has a revenue number from the 30 days before bucketing (call it X) and a revenue number during the experiment window (call it Y). High-spending users before the test tend to spend more during the test as well — X and Y are correlated. Subtract a scaled version of X from Y and you remove the part of Y that you could already predict from history. The scaled version is theta * (X - mean(X)), where theta = cov(X, Y) / var(X) is the slope of the best-fit line from X to Y. The adjusted metric Y_cuped = Y - theta * (X - mean(X)) has the same expected value as Y but lower variance. The reduction equals (1 - rho^2) * var(Y), where rho is the correlation between X and Y. Treatment effect estimates stay unbiased; confidence intervals shrink.

CUPED in SQL

The query has three layers. First, assemble per-user covariate and outcome. Second, compute theta and the covariate mean on pooled data. Third, build the adjusted metric and aggregate per variant. Run this against any modern warehouse with the population covariance and variance functions — Snowflake, BigQuery, Redshift, Postgres, Databricks all support them.

WITH base AS (
    SELECT
        u.user_id,
        u.variant,
        u.revenue_pre AS x,              -- covariate: revenue IN the 30 days before bucketing
        COALESCE(SUM(o.amount), 0) AS y  -- metric: revenue inside the experiment WINDOW
    FROM ab_users u
    LEFT JOIN orders o
      ON o.user_id = u.user_id
     AND o.created_at >= u.bucket_at
    WHERE u.experiment_id = 'paywall_v3'
    GROUP BY u.user_id, u.variant, u.revenue_pre
),
theta_calc AS (
    SELECT
        COVAR_POP(y, x) / NULLIF(VAR_POP(x), 0) AS theta,
        AVG(x) AS mean_x
    FROM base
),
adjusted AS (
    SELECT
        b.variant,
        b.y,
        b.y - t.theta * (b.x - t.mean_x) AS y_cuped
    FROM base b
    CROSS JOIN theta_calc t
)
SELECT
    variant,
    AVG(y) AS mean_y_raw,
    AVG(y_cuped) AS mean_y_cuped,
    VAR_SAMP(y) AS var_y_raw,
    VAR_SAMP(y_cuped) AS var_y_cuped,
    1 - VAR_SAMP(y_cuped) / NULLIF(VAR_SAMP(y), 0) AS variance_reduction
FROM adjusted
GROUP BY variant;

The mean of y_cuped should be within rounding distance of the mean of y — the covariate is centered, so the adjustment integrates to zero across the full sample. The variance, in contrast, should drop meaningfully. For a Netflix-style streaming revenue metric you typically see 25 to 40 percent reduction; for a marketplace GMV at Uber or DoorDash, 15 to 30 percent is the usual band.

A subtle point that catches people in code review: theta_calc runs over the union of variants. Split it per variant and you bake the treatment effect into the slope — the adjusted metric ends up biased toward zero. Pooled theta is what the original paper specifies and what every production framework implements.

Adjusted t-test on the CUPED metric

Once you have y_cuped, the rest of the analysis is a standard two-sample test. Welch's t-test is the safe default because it does not assume equal variances across variants. The structure is the same as the raw t-test, but every input — the per-variant means, variances, and counts — is computed on y_cuped instead of y.

WITH cuped_stats AS (
    SELECT
        variant,
        COUNT(*) AS n,
        AVG(y_cuped) AS mean,
        VAR_SAMP(y_cuped) AS variance
    FROM adjusted
    GROUP BY variant
),
pair AS (
    SELECT
        MAX(CASE WHEN variant = 'A' THEN n END) AS n_a,
        MAX(CASE WHEN variant = 'A' THEN mean END) AS mean_a,
        MAX(CASE WHEN variant = 'A' THEN variance END) AS var_a,
        MAX(CASE WHEN variant = 'B' THEN n END) AS n_b,
        MAX(CASE WHEN variant = 'B' THEN mean END) AS mean_b,
        MAX(CASE WHEN variant = 'B' THEN variance END) AS var_b
    FROM cuped_stats
)
SELECT
    mean_b - mean_a AS diff_cuped,
    (mean_b - mean_a) / SQRT(var_a / n_a + var_b / n_b) AS t_statistic_cuped
FROM pair;

Because the variances feeding the denominator are smaller, the t-statistic is larger for the same observed difference. That is the entire point: significance arrives sooner, and stop-rules trigger on a test that would otherwise still be inconclusive.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Variance reduction in percent

Reviewers and stakeholders always want a single headline number. The reduction percentage is the cleanest summary because it maps directly to sample-size savings: cutting variance by 36 percent is equivalent to a 36 percent smaller required n for the same statistical power.

SELECT
    (1 - VAR_SAMP(y_cuped) / NULLIF(VAR_SAMP(y), 0)) * 100 AS variance_reduction_pct
FROM adjusted;

Typical numbers in production: 20 to 40 percent for continuous revenue metrics, 10 to 20 percent for session counts and engagement minutes, and close to zero for binary conversion metrics. The last case is the most common cause of "CUPED did nothing" — covering it in the pitfalls below.

Common pitfalls

The most expensive mistake is letting the covariate window overlap with the experiment window. If revenue_pre is computed over a range that includes any data after bucket_at, the covariate carries treatment signal. Theta will be inflated in the direction of the treatment effect, and the adjusted metric will be biased toward zero — your real lift gets erased. The fix is mechanical: define the covariate window strictly before bucket_at, per user, and never on a global clock that ignores when each user actually entered the experiment.

A close second is computing theta per variant. If you run COVAR_POP / VAR_POP separately in treatment and control, each theta partially absorbs the treatment effect, and the adjusted means no longer differ by what they should. The rule is pooled theta on the union of variants, single coefficient, then apply to both groups. This is non-negotiable for unbiased estimates and is the first thing a reviewer at any serious experimentation team will check.

CUPED on binary metrics is a different trap. Conversion is zero or one — there is almost no per-user variance for the covariate to explain. The correlation between "purchased last month" and "purchased this month" is positive but small in absolute terms, and the variance reduction usually comes out under 5 percent. CUPED is a tool for continuous outcomes. For binary metrics, look at stratified randomization, CUPAC, or just accept that you need a larger sample.

New users with no pre-period history are the next gotcha. If half your experiment is signups from the last seven days, their covariate is mostly zero or null, and theta is estimated almost entirely from the existing-user subset. The adjusted metric for new users is essentially identical to the raw metric, and the variance reduction blends with the existing-user reduction toward something disappointing. For pure new-user experiments, do not bother with CUPED — use a different variance-reduction technique like stratification on signup channel.

Finally, do not stack multiple covariates by computing several thetas independently and subtracting them all. The math falls apart because the covariates correlate with each other. The right generalization to multiple pre-period features is CUPAC — fit a regression of Y on the covariates, predict, and subtract the prediction centered on its own mean. Treat single-covariate CUPED as a special case; if you need more covariates, switch frameworks rather than hand-rolling sums of adjustments.

Optimization tips

The expensive part of the query is almost never the variance functions — it is the join from ab_users to orders. If your experiments table has millions of users and orders has billions of rows, join on (user_id, created_at >= bucket_at) with the right index or clustering key. On Snowflake, cluster orders by user_id and created_at; on BigQuery, partition by created_at and cluster by user_id; on Databricks, use liquid clustering on the same pair. If you are running CUPED nightly for many experiments, materialize the per-user (x, y) pair as a daily snapshot table keyed by experiment, then run the theta and adjustment steps on that snapshot — you avoid re-scanning the orders fact table across analyses.

A second knob is the covariate window length. Thirty days is the default, but the correlation with the in-experiment metric usually peaks somewhere between 14 and 60 days depending on how spiky the metric is. Compute theta across a few candidate windows and pick the one with the highest absolute correlation. Do this once when the experiment platform is set up, not per experiment, so the choice is not silently optimizing for false positives.

If you run hundreds of experiments per quarter at the scale of Airbnb or Notion, the third optimization is to precompute theta on a reference sample rather than recomputing per experiment. A globally-fit theta is slightly less optimal, but the variance of the theta estimate itself stops being a concern, and CUPED overhead drops to near zero.

If you want to drill problems like this every day, NAILDD is launching with 500+ SQL and stats problems from real analytics interviews.

FAQ

What covariate should I pick?

Any pre-experiment series that correlates with the outcome metric. For revenue experiments, pre-period revenue is the obvious choice. For retention experiments, pre-period active days or session count works well. For engagement experiments at a content platform, pre-period watch time or read time. The empirical rule is to pick whatever has the highest correlation with the outcome in your historical data; if two candidates correlate similarly, prefer the one with less missingness.

Is 20 percent variance reduction a lot?

In sample-size terms, a 20 percent variance reduction is equivalent to about a 20 percent smaller required n for the same minimum detectable effect and power. On a six-week experiment, that is a savings of more than a week of calendar time, which often decides whether a launch hits the quarterly target or slips. By industry standards, 20 percent is solid for revenue metrics and excellent for engagement metrics. Anything above 40 percent on a noisy metric usually means you accidentally leaked treatment signal into the covariate — go back and check.

Does CUPED change the point estimate?

No, and that is the property that makes it safe. The expected value of y_cuped equals the expected value of y, because the covariate is centered. What changes is the variance, and therefore the standard error and the confidence interval. The point estimate of the treatment effect — the difference of CUPED means — is an unbiased estimator of the same quantity as the raw difference of means, just with a tighter interval around it.

What is the generalization to multiple covariates?

The generalization is CUPAC, which fits a regression of Y on pre-period covariates and subtracts the prediction. You train an OLS or gradient-boosted model on the features, score it on the experiment users, and use the residual as the adjusted metric. Sample-size savings compound, but the framework is harder to defend in review because the adjustment is no longer linear in one number. Stick to single-covariate CUPED for explainability unless you need to push harder.

Can I run this in Postgres without an extension?

Yes. COVAR_POP, VAR_POP, and VAR_SAMP are part of standard SQL and are built into Postgres, Snowflake, BigQuery, Redshift, and Databricks SQL. No extensions, no UDFs, no Python — the entire pipeline above runs as plain SQL. If your warehouse is older than that, you can recover the same statistics from SUM, SUM of squares, and COUNT, but at that point you have bigger problems than CUPED.

How do I know if my CUPED implementation is correct?

Two sanity checks. The mean of y_cuped per variant should differ from the mean of y per variant by less than a fraction of the standard error — if it differs more, your covariate is not centered, usually because mean_x was computed per variant rather than pooled. The variance reduction should also be in the expected range for your metric type. Zero reduction on a continuous metric with known autocorrelation means theta was computed wrong or the join leaked rows. Above 60 percent on a real-world business metric almost certainly means contamination — recheck the covariate window.