SQL for A/B tests
Contents:
Why analysts run A/B math in SQL
The fastest way to lose credibility on an analytics team is to read a number from a notebook nobody else can reproduce. When a PM asks whether the new onboarding flow lifted activation, what they want is a query they can re-run Monday with a different cohort. That query lives in your warehouse. At Stripe, Airbnb, and DoorDash, the first analysis pass runs against the experiment events table; only edge cases (sequential tests, hierarchical models) move into Python.
Every senior analyst loop asks you to compute lift, standard error, and a confidence interval from raw assignment data. The interviewer is not testing scipy syntax. They are testing whether you spot a sample ratio mismatch in three seconds and explain why pooling variance across variants is wrong.
This post walks the full stack: counts and lift, standard error and z, confidence intervals, continuous metrics, CUPED, segmentation, guardrails, SRM, and the interview script. Snippets run on Snowflake, BigQuery, or Postgres with light dialect tweaks. Assume one row per assigned user with user_id, variant, assigned_at, converted (0/1), and revenue.
The core query stack
Start with per-variant counts. This is the table your PM will screenshot, so it has to be readable. Group by variant, count users, sum conversions, and express the rate as a fraction and a percentage so nobody multiplies in their head.
SELECT
variant,
COUNT(*) AS users,
SUM(converted) AS conversions,
AVG(converted) AS cr,
AVG(converted) * 100 AS cr_pct
FROM experiment
GROUP BY variant;The next layer is lift. Absolute lift is the difference in rates. Relative lift expresses that difference as a percentage of control, which is what stakeholders quote in launch reviews. Compute both, because 1 percentage point is a small story on a 40 percent baseline and a huge one on a 4 percent baseline.
WITH stats AS (
SELECT
AVG(CASE WHEN variant = 'control' THEN converted END) AS control_cr,
AVG(CASE WHEN variant = 'test' THEN converted END) AS test_cr
FROM experiment
)
SELECT
control_cr,
test_cr,
test_cr - control_cr AS abs_diff,
(test_cr - control_cr) / control_cr AS relative_lift,
(test_cr - control_cr) / control_cr * 100 AS lift_pct
FROM stats;Standard error, z, and confidence intervals
A lift number without uncertainty is just a vibe. The standard error of the difference between two proportions is the square root of p_c(1-p_c)/n_c + p_t(1-p_t)/n_t. Divide the difference by the SE to get the z-statistic and compare to 1.96 for a 95 percent two-sided test.
WITH stats AS (
SELECT
variant,
AVG(converted) AS p,
COUNT(*) AS n
FROM experiment
GROUP BY variant
),
pivoted AS (
SELECT
MAX(CASE WHEN variant = 'control' THEN p END) AS p_c,
MAX(CASE WHEN variant = 'control' THEN n END) AS n_c,
MAX(CASE WHEN variant = 'test' THEN p END) AS p_t,
MAX(CASE WHEN variant = 'test' THEN n END) AS n_t
FROM stats
)
SELECT
p_c,
p_t,
p_t - p_c AS diff,
SQRT(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) AS se,
(p_t - p_c) / SQRT(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) AS z_stat,
CASE
WHEN ABS((p_t - p_c) / SQRT(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)) > 1.96
THEN 'significant'
ELSE 'NOT significant'
END AS verdict
FROM pivoted;Confidence intervals are the friendlier cousin. The 95 percent CI for the difference is diff +/- 1.96 * se. If the interval contains zero, you cannot reject the null at 5 percent. Most experimentation platforms display the CI rather than a p-value because PMs find it easier to reason about a range than a probability.
SELECT
diff - 1.96 * se AS ci_lower,
diff,
diff + 1.96 * se AS ci_upper
FROM pivoted;A true p-value requires the normal CDF, which most warehouses do not expose natively (Snowflake has NORMAL_CDF, BigQuery does not, Postgres needs an extension). The realistic move is to ship the z-statistic and CI from SQL and let anyone who wants a precise p-value paste the z into Python. CI plus z covers 95 percent of launch decisions.
Continuous metrics and CUPED
Conversion is easy because variance equals p(1-p). Revenue is harder because variance is whatever your customer distribution decides it is. For continuous metrics, the SE formula becomes sqrt(s1^2/n1 + s2^2/n2) where s is the sample standard deviation.
WITH stats AS (
SELECT
variant,
AVG(revenue) AS mean_rev,
STDDEV(revenue) AS sd_rev,
COUNT(*) AS n
FROM experiment
GROUP BY variant
)
SELECT
MAX(CASE WHEN variant = 'test' THEN mean_rev END) -
MAX(CASE WHEN variant = 'control' THEN mean_rev END) AS diff,
SQRT(
MAX(CASE WHEN variant = 'test' THEN sd_rev * sd_rev / n END) +
MAX(CASE WHEN variant = 'control' THEN sd_rev * sd_rev / n END)
) AS se
FROM stats;Revenue distributions are heavy-tailed. A single whale on the test side blows up variance and makes the test look noisy. The fix is CUPED, which uses a pre-experiment covariate to reduce variance. If pre_metric is the user's revenue in the 14 days before assignment, the adjusted metric is post - theta * (pre - mean_pre) where theta = cov(post, pre) / var(pre).
WITH theta_calc AS (
SELECT
COVAR_POP(post_metric, pre_metric) / VAR_POP(pre_metric) AS theta,
AVG(pre_metric) AS mean_pre
FROM experiment
),
adjusted AS (
SELECT
e.variant,
e.post_metric - t.theta * (e.pre_metric - t.mean_pre) AS cuped_metric
FROM experiment e
CROSS JOIN theta_calc t
)
SELECT
variant,
AVG(cuped_metric) AS cuped_mean,
STDDEV(cuped_metric) AS cuped_sd,
COUNT(*) AS n
FROM adjusted
GROUP BY variant;A covariate with r = 0.6 between pre and post typically cuts variance by 30 to 40 percent, equivalent to running the experiment 60 percent longer for free.
Segmentation, guardrails, and SRM
Pooled results hide heterogeneous effects. A new checkout flow can lift desktop by 3 percent and tank mobile by 2 percent, and the average looks flat. Cut by the obvious dimensions: platform, country, new vs returning, paid vs organic.
SELECT
variant,
platform,
COUNT(*) AS n,
AVG(converted) AS cr
FROM experiment
GROUP BY variant, platform
ORDER BY platform, variant;Treat these as exploratory, not confirmatory. If you slice 12 ways and one shows p less than 0.05, that is what randomness looks like. Pre-register one or two segments you care about and bonferroni-correct the rest.
Guardrails catch the case where the primary metric moves the right way but something secondary breaks. The classic example: a checkout test lifts conversion but spikes refunds, leaving net revenue flat. Report guardrails alongside the primary, every time.
SELECT
variant,
AVG(converted) AS cr,
AVG(revenue) AS arpu,
AVG(refund) AS refund_rate,
AVG(session_count) AS avg_sessions,
AVG(crashed) AS crash_rate
FROM experiment
GROUP BY variant;Sample Ratio Mismatch is the silent killer. You shipped a 50/50 split, but the data shows 53/47. The cause is almost always a bug: a redirect that fires before assignment logs, a bot filter that asymmetrically removes traffic, a frontend race. If SRM is present, every other number is suspect and the analysis should be paused, not corrected.
WITH counts AS (
SELECT variant, COUNT(*) AS n
FROM experiment
GROUP BY variant
),
totals AS (
SELECT SUM(n) AS total FROM counts
)
SELECT
c.variant,
c.n,
c.n * 1.0 / t.total AS observed_pct,
0.5 AS expected_pct,
POWER(c.n - t.total * 0.5, 2) / (t.total * 0.5) AS chi_sq_contribution
FROM counts c
CROSS JOIN totals t;Sum chi_sq_contribution. If the total exceeds 3.84 on one degree of freedom, you have a meaningful imbalance and you stop.
The interview walkthrough
Classic prompt: "Control conversion is 10 percent, test is 11 percent, n equals 50,000 per arm. Significant?" The interviewer wants the pattern, not a memorized number.
Start with the difference: 1 point absolute, 10 percent relative. State both. Compute the standard error out loud: sqrt(0.1 * 0.9 / 50000 + 0.11 * 0.89 / 50000) is roughly 0.00194. Divide the 0.01 difference by 0.00194 and you get a z near 5.15, well above 1.96, so significant at 95 percent. Then immediately add the guardrails: did revenue per user move, did SRM pass, did one platform drive the result. Showing that sequence is what gets you the offer.
For a continuous prompt ("revenue went from $4.20 to $4.35 with n equals 20,000"), the move is the same but you need standard deviation. If they do not give you one, ask. If they refuse, assume CV around 2 for marketplace revenue and reason about magnitude.
Common pitfalls
Reporting only the primary metric is the single most common mistake. The launch hits the dashboard, ships, and three weeks later support discovers refunds doubled. Guardrails are not optional; write them into the analysis template before the experiment starts.
Pooling variance across variants is a subtle SQL trap. Some textbooks teach the pooled-proportion formula for the standard error under the null. That formula is fine for the z-test of the null, but the confidence interval uses unpooled variance. Mixing them in the same query produces a CI that does not match the verdict.
Forgetting the SRM check turns confident launches into post-mortems. A 50.5/49.5 split sounds harmless, but with millions of users that is thousands of missing assignments, and the bias correlates with whatever broke assignment in the first place. Run the chi-square check before anything else, and bail out if it fails.
Misinterpreting a small p-value as a large effect is the analyst version of confusing statistical and practical significance. With 10 million users, a 0.1 percentage point lift is significant at p less than 0.001 and worth nearly nothing in dollars. Always pair the p-value or CI with effect size translated to revenue or sessions.
Peeking at the dashboard daily and stopping when it crosses 1.96 is a classic way to ship noise. Without sequential corrections, this inflates the false positive rate from 5 percent to roughly 30 percent over two weeks. Lock the stop date before the test starts, or use a sequential framework.
Optimization tips
For experiment tables larger than a few hundred million rows, the cheapest win is to materialize per-day per-variant aggregates and compute lift and SE from the rollup, not from raw events. Carry enough columns to support segmentation: variant, day, platform, country, conversions, users, sum_revenue, sum_revenue_squared. With sum_revenue and sum_revenue_squared you can reconstruct variance without retaining raw rows.
Cluster or partition the raw experiment table on assigned_at and experiment_id. Most A/B queries are scoped to one experiment over a fixed window. Date partitioning alone is not enough if you run dozens of concurrent tests. Add experiment_id to the cluster key on Snowflake and BigQuery; in Postgres use a multi-column index.
Use APPROX_COUNT_DISTINCT and APPROX_QUANTILES (BigQuery), HLL_COUNT_DISTINCT (Snowflake), or percentile_disc with sampling (Postgres) for distinct counts and tail percentiles. Exact versions are O(n log n) or worse and will blow up your slot quota. Approximate is fine for monitoring; switch to exact only for the final launch number.
Related reading
- SQL window functions interview questions
- SRM (Sample Ratio Mismatch) explained
- CUPED for variance reduction in A/B testing
- Guardrail metrics in A/B testing
- A/B testing peeking mistake
- How to calculate Bonferroni correction in SQL
If you want to drill A/B SQL questions like these on schedule, NAILDD is launching with hundreds of SQL problems built around exactly this pattern.
FAQ
Do I need Python to analyze A/B tests properly?
For monitoring, lift calculations, CIs, segmentation, and SRM checks, SQL is enough and arguably better because the entire team can re-run the query. Python becomes necessary for exact p-values from non-standard distributions, sequential testing with mSPRT or alpha-spending, hierarchical models, or bootstrap with custom statistics. Most companies run 80 to 90 percent of routine readouts in SQL and reach for Python only for the long tail.
How do I compute a p-value without a normal CDF?
Three options. Ship the z-statistic and CI from SQL; the team can read z greater than 1.96 as significant at 95 percent without an exact p. Or write a polynomial approximation of the normal CDF as a UDF (Abramowitz and Stegun is accurate to 1e-5). Or export z to Python and run scipy.stats.norm.sf. The first covers most decisions.
How do I handle assignment when users see both arms?
This means the assignment layer is broken or the unit of randomization does not match the unit of analysis. If users genuinely flip variants (cross-contamination), the experiment is invalid and a SQL fix cannot rescue it. If users were misassigned due to a logging race, keep the first assignment per user and drop the rest. Document the exclusion rule and make sure SRM is computed on the post-exclusion sample.
Can I trust standard SQL STDDEV for revenue metrics?
Yes for sample standard deviation. Most warehouses compute the n-minus-1 version by default, which is what you want. But revenue has fat tails, so a single outlier can move STDDEV by 30 percent. Trim the top 0.1 percent or winsorize before computing variance, and report both trimmed and untrimmed.
How long should I run an A/B test before I read it?
Long enough to hit the pre-registered sample size from power analysis, and long enough to cover a full weekly cycle. Reading at day three because the dashboard looks good is how you ship noise. Sequential frameworks let you peek earlier with controlled false positive rates; vanilla z-tests do not.