May 18, 2026·11 min read

SQL for A/B tests

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Contents:

Why analysts run A/B math in SQL
The core query stack
Standard error, z, and confidence intervals
Continuous metrics and CUPED
Segmentation, guardrails, and SRM
The interview walkthrough
Common pitfalls
Optimization tips
Related reading
FAQ

Why analysts run A/B math in SQL

The fastest way to lose credibility on an analytics team is to read a number from a notebook nobody else can reproduce. When a PM asks whether the new onboarding flow lifted activation, what they want is a query they can re-run Monday with a different cohort. That query lives in your warehouse. At Stripe, Airbnb, and DoorDash, the first analysis pass runs against the experiment events table; only edge cases (sequential tests, hierarchical models) move into Python.

Every senior analyst loop asks you to compute lift, standard error, and a confidence interval from raw assignment data. The interviewer is not testing scipy syntax. They are testing whether you spot a sample ratio mismatch in three seconds and explain why pooling variance across variants is wrong.

This post walks the full stack: counts and lift, standard error and z, confidence intervals, continuous metrics, CUPED, segmentation, guardrails, SRM, and the interview script. Snippets run on Snowflake, BigQuery, or Postgres with light dialect tweaks. Assume one row per assigned user with user_id, variant, assigned_at, converted (0/1), and revenue.

The core query stack

Start with per-variant counts. This is the table your PM will screenshot, so it has to be readable. Group by variant, count users, sum conversions, and express the rate as a fraction and a percentage so nobody multiplies in their head.

SELECT
    variant,
    COUNT(*)                  AS users,
    SUM(converted)            AS conversions,
    AVG(converted)            AS cr,
    AVG(converted) * 100      AS cr_pct
FROM experiment
GROUP BY variant;

The next layer is lift. Absolute lift is the difference in rates. Relative lift expresses that difference as a percentage of control, which is what stakeholders quote in launch reviews. Compute both, because 1 percentage point is a small story on a 40 percent baseline and a huge one on a 4 percent baseline.

WITH stats AS (
    SELECT
        AVG(CASE WHEN variant = 'control' THEN converted END) AS control_cr,
        AVG(CASE WHEN variant = 'test'    THEN converted END) AS test_cr
    FROM experiment
)
SELECT
    control_cr,
    test_cr,
    test_cr - control_cr                                  AS abs_diff,
    (test_cr - control_cr) / control_cr                   AS relative_lift,
    (test_cr - control_cr) / control_cr * 100             AS lift_pct
FROM stats;

Standard error, z, and confidence intervals

A lift number without uncertainty is just a vibe. The standard error of the difference between two proportions is the square root of p_c(1-p_c)/n_c + p_t(1-p_t)/n_t. Divide the difference by the SE to get the z-statistic and compare to 1.96 for a 95 percent two-sided test.

WITH stats AS (
    SELECT
        variant,
        AVG(converted) AS p,
        COUNT(*)       AS n
    FROM experiment
    GROUP BY variant
),
pivoted AS (
    SELECT
        MAX(CASE WHEN variant = 'control' THEN p END) AS p_c,
        MAX(CASE WHEN variant = 'control' THEN n END) AS n_c,
        MAX(CASE WHEN variant = 'test'    THEN p END) AS p_t,
        MAX(CASE WHEN variant = 'test'    THEN n END) AS n_t
    FROM stats
)
SELECT
    p_c,
    p_t,
    p_t - p_c                                                                   AS diff,
    SQRT(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)                         AS se,
    (p_t - p_c) / SQRT(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)           AS z_stat,
    CASE
        WHEN ABS((p_t - p_c) / SQRT(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)) > 1.96
        THEN 'significant'
        ELSE 'NOT significant'
    END                                                                         AS verdict
FROM pivoted;

Confidence intervals are the friendlier cousin. The 95 percent CI for the difference is diff +/- 1.96 * se. If the interval contains zero, you cannot reject the null at 5 percent. Most experimentation platforms display the CI rather than a p-value because PMs find it easier to reason about a range than a probability.

SELECT
    diff - 1.96 * se AS ci_lower,
    diff,
    diff + 1.96 * se AS ci_upper
FROM pivoted;

A true p-value requires the normal CDF, which most warehouses do not expose natively (Snowflake has NORMAL_CDF, BigQuery does not, Postgres needs an extension). The realistic move is to ship the z-statistic and CI from SQL and let anyone who wants a precise p-value paste the z into Python. CI plus z covers 95 percent of launch decisions.

Continuous metrics and CUPED

Conversion is easy because variance equals p(1-p). Revenue is harder because variance is whatever your customer distribution decides it is. For continuous metrics, the SE formula becomes sqrt(s1^2/n1 + s2^2/n2) where s is the sample standard deviation.

WITH stats AS (
    SELECT
        variant,
        AVG(revenue)     AS mean_rev,
        STDDEV(revenue)  AS sd_rev,
        COUNT(*)         AS n
    FROM experiment
    GROUP BY variant
)
SELECT
    MAX(CASE WHEN variant = 'test'    THEN mean_rev END) -
    MAX(CASE WHEN variant = 'control' THEN mean_rev END) AS diff,
    SQRT(
        MAX(CASE WHEN variant = 'test'    THEN sd_rev * sd_rev / n END) +
        MAX(CASE WHEN variant = 'control' THEN sd_rev * sd_rev / n END)
    ) AS se
FROM stats;

Revenue distributions are heavy-tailed. A single whale on the test side blows up variance and makes the test look noisy. The fix is CUPED, which uses a pre-experiment covariate to reduce variance. If pre_metric is the user's revenue in the 14 days before assignment, the adjusted metric is post - theta * (pre - mean_pre) where theta = cov(post, pre) / var(pre).

WITH theta_calc AS (
    SELECT
        COVAR_POP(post_metric, pre_metric) / VAR_POP(pre_metric) AS theta,
        AVG(pre_metric)                                          AS mean_pre
    FROM experiment
),
adjusted AS (
    SELECT
        e.variant,
        e.post_metric - t.theta * (e.pre_metric - t.mean_pre) AS cuped_metric
    FROM experiment e
    CROSS JOIN theta_calc t
)
SELECT
    variant,
    AVG(cuped_metric)    AS cuped_mean,
    STDDEV(cuped_metric) AS cuped_sd,
    COUNT(*)             AS n
FROM adjusted
GROUP BY variant;

A covariate with r = 0.6 between pre and post typically cuts variance by 30 to 40 percent, equivalent to running the experiment 60 percent longer for free.

Segmentation, guardrails, and SRM

Pooled results hide heterogeneous effects. A new checkout flow can lift desktop by 3 percent and tank mobile by 2 percent, and the average looks flat. Cut by the obvious dimensions: platform, country, new vs returning, paid vs organic.

SELECT
    variant,
    platform,
    COUNT(*)         AS n,
    AVG(converted)   AS cr
FROM experiment
GROUP BY variant, platform
ORDER BY platform, variant;

Treat these as exploratory, not confirmatory. If you slice 12 ways and one shows p less than 0.05, that is what randomness looks like. Pre-register one or two segments you care about and bonferroni-correct the rest.

Guardrails catch the case where the primary metric moves the right way but something secondary breaks. The classic example: a checkout test lifts conversion but spikes refunds, leaving net revenue flat. Report guardrails alongside the primary, every time.

SELECT
    variant,
    AVG(converted)        AS cr,
    AVG(revenue)          AS arpu,
    AVG(refund)           AS refund_rate,
    AVG(session_count)    AS avg_sessions,
    AVG(crashed)          AS crash_rate
FROM experiment
GROUP BY variant;

Sample Ratio Mismatch is the silent killer. You shipped a 50/50 split, but the data shows 53/47. The cause is almost always a bug: a redirect that fires before assignment logs, a bot filter that asymmetrically removes traffic, a frontend race. If SRM is present, every other number is suspect and the analysis should be paused, not corrected.

WITH counts AS (
    SELECT variant, COUNT(*) AS n
    FROM experiment
    GROUP BY variant
),
totals AS (
    SELECT SUM(n) AS total FROM counts
)
SELECT
    c.variant,
    c.n,
    c.n * 1.0 / t.total                  AS observed_pct,
    0.5                                  AS expected_pct,
    POWER(c.n - t.total * 0.5, 2) / (t.total * 0.5) AS chi_sq_contribution
FROM counts c
CROSS JOIN totals t;

Sum chi_sq_contribution. If the total exceeds 3.84 on one degree of freedom, you have a meaningful imbalance and you stop.

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Join the waitlist

The interview walkthrough

Classic prompt: "Control conversion is 10 percent, test is 11 percent, n equals 50,000 per arm. Significant?" The interviewer wants the pattern, not a memorized number.

Start with the difference: 1 point absolute, 10 percent relative. State both. Compute the standard error out loud: sqrt(0.1 * 0.9 / 50000 + 0.11 * 0.89 / 50000) is roughly 0.00194. Divide the 0.01 difference by 0.00194 and you get a z near 5.15, well above 1.96, so significant at 95 percent. Then immediately add the guardrails: did revenue per user move, did SRM pass, did one platform drive the result. Showing that sequence is what gets you the offer.

For a continuous prompt ("revenue went from $4.20 to $4.35 with n equals 20,000"), the move is the same but you need standard deviation. If they do not give you one, ask. If they refuse, assume CV around 2 for marketplace revenue and reason about magnitude.

Common pitfalls

Reporting only the primary metric is the single most common mistake. The launch hits the dashboard, ships, and three weeks later support discovers refunds doubled. Guardrails are not optional; write them into the analysis template before the experiment starts.

Pooling variance across variants is a subtle SQL trap. Some textbooks teach the pooled-proportion formula for the standard error under the null. That formula is fine for the z-test of the null, but the confidence interval uses unpooled variance. Mixing them in the same query produces a CI that does not match the verdict.

Forgetting the SRM check turns confident launches into post-mortems. A 50.5/49.5 split sounds harmless, but with millions of users that is thousands of missing assignments, and the bias correlates with whatever broke assignment in the first place. Run the chi-square check before anything else, and bail out if it fails.

Misinterpreting a small p-value as a large effect is the analyst version of confusing statistical and practical significance. With 10 million users, a 0.1 percentage point lift is significant at p less than 0.001 and worth nearly nothing in dollars. Always pair the p-value or CI with effect size translated to revenue or sessions.

Peeking at the dashboard daily and stopping when it crosses 1.96 is a classic way to ship noise. Without sequential corrections, this inflates the false positive rate from 5 percent to roughly 30 percent over two weeks. Lock the stop date before the test starts, or use a sequential framework.

Optimization tips

For experiment tables larger than a few hundred million rows, the cheapest win is to materialize per-day per-variant aggregates and compute lift and SE from the rollup, not from raw events. Carry enough columns to support segmentation: variant, day, platform, country, conversions, users, sum_revenue, sum_revenue_squared. With sum_revenue and sum_revenue_squared you can reconstruct variance without retaining raw rows.

Cluster or partition the raw experiment table on assigned_at and experiment_id. Most A/B queries are scoped to one experiment over a fixed window. Date partitioning alone is not enough if you run dozens of concurrent tests. Add experiment_id to the cluster key on Snowflake and BigQuery; in Postgres use a multi-column index.

Use APPROX_COUNT_DISTINCT and APPROX_QUANTILES (BigQuery), HLL_COUNT_DISTINCT (Snowflake), or percentile_disc with sampling (Postgres) for distinct counts and tail percentiles. Exact versions are O(n log n) or worse and will blow up your slot quota. Approximate is fine for monitoring; switch to exact only for the final launch number.

If you want to drill A/B SQL questions like these on schedule, NAILDD is launching with hundreds of SQL problems built around exactly this pattern.

FAQ

Do I need Python to analyze A/B tests properly?

For monitoring, lift calculations, CIs, segmentation, and SRM checks, SQL is enough and arguably better because the entire team can re-run the query. Python becomes necessary for exact p-values from non-standard distributions, sequential testing with mSPRT or alpha-spending, hierarchical models, or bootstrap with custom statistics. Most companies run 80 to 90 percent of routine readouts in SQL and reach for Python only for the long tail.

How do I compute a p-value without a normal CDF?

Three options. Ship the z-statistic and CI from SQL; the team can read z greater than 1.96 as significant at 95 percent without an exact p. Or write a polynomial approximation of the normal CDF as a UDF (Abramowitz and Stegun is accurate to 1e-5). Or export z to Python and run scipy.stats.norm.sf. The first covers most decisions.

How do I handle assignment when users see both arms?

This means the assignment layer is broken or the unit of randomization does not match the unit of analysis. If users genuinely flip variants (cross-contamination), the experiment is invalid and a SQL fix cannot rescue it. If users were misassigned due to a logging race, keep the first assignment per user and drop the rest. Document the exclusion rule and make sure SRM is computed on the post-exclusion sample.

Can I trust standard SQL `STDDEV` for revenue metrics?

Yes for sample standard deviation. Most warehouses compute the n-minus-1 version by default, which is what you want. But revenue has fat tails, so a single outlier can move STDDEV by 30 percent. Trim the top 0.1 percent or winsorize before computing variance, and report both trimmed and untrimmed.

How long should I run an A/B test before I read it?

Long enough to hit the pre-registered sample size from power analysis, and long enough to cover a full weekly cycle. Reading at day three because the dashboard looks good is how you ship noise. Sequential frameworks let you peek earlier with controlled false positive rates; vanilla z-tests do not.