Delta method in SQL for ratio metrics
Contents:
Why delta method matters
Ratio metrics break the standard t-test, and every analyst hits this wall the first time they try to ship a clean A/B readout. Picture a Tuesday morning at Meta. A PM on the Feed team pings you: the new ranking model has been running for ten days and clickthrough rate is up 1.7 percent. The product VP wants sign-off by end of day. You sketch a t-test in your head — sum the clicks, sum the views, divide, two-sample test on the difference. Then it hits you. The denominator is random. Every user contributes a different number of views, and SUM(views) is a random variable, not a constant. Treating CTR as a plain mean and slapping STDDEV / SQRT(n) on it produces a confidence interval that is wrong, sometimes by a factor of two, and you end up either shipping a coin-flip or killing a real win.
The delta method, also called linearization, is the standard fix. Google, Netflix, Microsoft, and Airbnb all use it in their internal experimentation platforms, and platforms like Eppo and Statsig ship it as the default for any ratio over the randomization unit. The trick is to turn the ratio into a per-user score, then run the variance calculation on that score. The math is a first-order Taylor expansion around the group means. The SQL is three CTEs. When a senior data scientist asks how you would estimate variance for CTR in an A/B test, this is the answer they want to hear.
The linearization idea
A ratio metric has the form R = X / Y, where both X and Y are sums aggregated over the randomization unit. For CTR, X is total clicks across users in the variant and Y is total views. The estimator R_hat = sum(x_i) / sum(y_i) is consistent, but its variance is not var(R) in the simple sense. Both numerator and denominator fluctuate, and they fluctuate together.
The delta method handles this by expanding R as a first-order Taylor series around the population means mu_X and mu_Y. After algebra you get an approximation that looks like this:
R approx mu_X / mu_Y + (1 / mu_Y) * (X - mu_X) - (mu_X / mu_Y^2) * (Y - mu_Y)Every term on the right is linear in either X or Y, which means you can rewrite the contribution of each user i as a single scalar:
r_i = (x_i - (mu_X / mu_Y) * y_i) / mu_YThis r_i is the linearized score for user i. The point estimate is still mu_X / mu_Y, but now the variance of the estimator equals var(r_i) / n. From that point on you treat r_i like any other per-user metric — Welch's t-test, confidence intervals, MDE calculations, CUPED adjustments all work on the linearized series. The hard step is computing mu_X and mu_Y from pooled data and joining them back to per-user rows, which is exactly what SQL is good at.
Delta method in SQL
Take the CTR case. Each row in ab_events is a single user-event with event in ('click', 'view'), a user_id, a variant, and an experiment_id. Randomization is per user. The query has three layers. First, collapse events to one row per user per variant. Second, compute the group means mu_X and mu_Y on pooled per-user data. Third, build the linearized score, then aggregate variance and standard error per variant.
WITH per_user AS (
SELECT
user_id,
variant,
SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END) AS clicks,
SUM(CASE WHEN event = 'view' THEN 1 ELSE 0 END) AS views
FROM ab_events
WHERE experiment_id = 'feed_v3'
GROUP BY user_id, variant
),
group_means AS (
SELECT
variant,
AVG(clicks::FLOAT) AS mean_x,
AVG(views::FLOAT) AS mean_y,
COUNT(*) AS n
FROM per_user
GROUP BY variant
),
linearized AS (
SELECT
u.user_id,
u.variant,
(u.clicks - gm.mean_x * u.views / NULLIF(gm.mean_y, 0))
/ NULLIF(gm.mean_y, 0) AS r_i
FROM per_user u
JOIN group_means gm USING (variant)
)
SELECT
gm.variant,
gm.mean_x / NULLIF(gm.mean_y, 0) AS ratio_point_estimate,
VAR_SAMP(l.r_i) / NULLIF(gm.n, 0) AS variance_of_ratio,
SQRT(VAR_SAMP(l.r_i) / NULLIF(gm.n, 0)) AS se_of_ratio,
gm.n
FROM linearized l
JOIN group_means gm USING (variant)
GROUP BY gm.variant, gm.mean_x, gm.mean_y, gm.n
ORDER BY gm.variant;The query returns one row per variant. ratio_point_estimate is the standard sum(clicks) / sum(views) recovered from the per-user averages — it equals the naive CTR. The new columns are variance_of_ratio and se_of_ratio, which are the values you actually need for the t-test. If you skip the linearization and just slap STDDEV(clicks::FLOAT / NULLIF(views, 0)) / SQRT(n) on the per-user table, you are computing the standard error of AVG(clicks_per_view), which is a different metric with a different expected value. That confusion produces about half the failed A/B reviews I have seen at the senior level.
The cast to FLOAT inside AVG matters in Postgres and a few warehouses, where AVG over integer columns returns a numeric with limited precision. Snowflake and BigQuery handle this automatically, but the explicit cast is a habit worth keeping.
Welch's t-test on the linearized series
Once you have r_i, the difference of group means is the difference of ratios, and the variance of the difference is the sum of the per-group variances of r_i divided by their respective sample sizes. Welch's t-test handles unequal variances and unequal sample sizes, which is what you want for A/B with 50/50 splits that became 49/51 after the dedup join. Materialize the linearized CTE from the previous query into a table or a view called linearized, then run:
WITH a_stats AS (
SELECT
AVG(r_i) AS mean_r,
VAR_SAMP(r_i) AS var_r,
COUNT(*) AS n
FROM linearized
WHERE variant = 'A'
),
b_stats AS (
SELECT
AVG(r_i) AS mean_r,
VAR_SAMP(r_i) AS var_r,
COUNT(*) AS n
FROM linearized
WHERE variant = 'B'
)
SELECT
a.mean_r AS mean_r_a,
b.mean_r AS mean_r_b,
b.mean_r - a.mean_r AS diff_ratio,
SQRT(a.var_r / a.n + b.var_r / b.n) AS se_diff,
(b.mean_r - a.mean_r)
/ NULLIF(SQRT(a.var_r / a.n + b.var_r / b.n), 0) AS t_statistic
FROM a_stats a, b_stats b;By construction r_i is centered around the group's own ratio, so mean_r_a and mean_r_b are near zero inside each variant. The useful quantity is diff_ratio, which equals (mu_X_B / mu_Y_B) - (mu_X_A / mu_Y_A). The t_statistic is what you compare to a Student's t critical value with degrees of freedom from the Welch-Satterthwaite formula. For sample sizes in the tens of thousands, the t distribution is indistinguishable from a normal, and any abs(t) > 1.96 corresponds to a two-sided p-value below 0.05. The 95 percent confidence interval for the difference is diff_ratio plus or minus 1.96 * se_diff.
Common pitfalls
When teams wire delta method into a pipeline for the first time, the most common failure mode is computing CTR = SUM(clicks) / SUM(views) and never touching variance. The point estimate is correct, and the dashboard looks fine. The problem shows up in the readout: the standard error column is either missing or filled with STDDEV(some per-row ratio) / SQRT(n), which has nothing to do with the variance of the ratio of sums. The fix is to always pair the point estimate with the linearized variance in the same query, so the two travel together in code review.
The second trap is reaching for AVG(clicks::FLOAT / views) per user and treating that as the metric. This is a micro-average over users instead of a macro-average across the full denominator. A user who saw one view and one click contributes 1.0 to the average, the same weight as a user who saw a thousand views and clicked five hundred times. For most product questions, the macro version is what you want, and that is what the delta method estimates.
A subtler mistake is linearizing at the wrong unit. If you bucket per user but linearize per session, the variance estimate ignores within-user clustering and underestimates the standard error. Sessions inside the same user are correlated, often strongly, and treating them as independent shrinks the confidence interval until everything looks significant. The rule is to linearize at the randomization unit, full stop.
Treating mu_Y as a fixed constant in the algebra and then plugging in the sample mean for it is a standard approximation, but it introduces a small bias when n is below a few thousand per variant. For small panels or rare-event metrics, consider bootstrap as a sanity check — if delta-method and bootstrap confidence intervals disagree by more than ten percent, the sample is too small for the asymptotic approximation and you need more data, not more math.
Finally, users with y_i = 0 blow up the linearization if you forget the NULLIF guard or, more importantly, if you forget to decide what zero means. A user who saw no views has no CTR — they did not participate in the metric at the per-user level. The standard fix is to exclude them and document the exclusion in the experiment design. If you keep them in, you are silently redefining CTR as "fraction of users who ever clicked", which is a different metric again.
Optimization tips
The expensive step is the second pass over per_user that joins against group_means. On Snowflake or BigQuery, both passes scan the same micro-partitions, and the planner will fuse them when it sees that per_user is referenced twice. On Postgres or Redshift, materialize per_user as a temporary table when the experiment has more than a few hundred million events — the downstream joins become single-pass against a much smaller table.
Pre-aggregating events into a per-user-per-day rollup that lives in the warehouse permanently is the single biggest win for delta-method pipelines. Once the daily rollup exists, every ratio metric in the experiment library reuses the same compact table, and variance queries finish in seconds instead of minutes. Databricks and Snowflake make this trivial with materialized views; on a self-hosted warehouse, a nightly DBT model does the same job. If you are computing delta-method variance for hundreds of metrics across a dozen running experiments, push the group_means CTE up into a shared layer and let every metric pipeline read from it.
Related reading
- How to calculate CUPED in SQL
- How to calculate CTR in SQL
- How to calculate confidence interval in SQL
- How to calculate bootstrap CI in SQL
- A/B testing peeking mistake
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
When do I actually need the delta method?
Any time your metric is a ratio of two sums where both numerator and denominator vary with the randomization unit. CTR per user, revenue per session per user, orders per visit per user, conversion per impression per user — these are the canonical cases. If your metric is a plain mean over the randomization unit, like average revenue per user, you do not need delta method because there is no random denominator. The shortcut: if the metric definition includes the word "per" twice, you probably need linearization.
Is bootstrap a valid alternative?
Yes, and it is the right choice when the delta-method approximation looks shaky. Bootstrap gives you a confidence interval without writing down a closed-form variance, at the cost of resampling the data hundreds or thousands of times. The two methods agree closely when n is in the tens of thousands per variant — when they disagree, trust bootstrap and investigate the sample. For pipelines that run hundreds of metrics per day, delta method wins on cost.
What if the denominator is zero for some users?
Exclude them or change the metric definition. A user with zero views has no CTR, and including them as 0 / 0 is undefined. The cleanest convention is to filter per_user to views > 0 and state explicitly that the metric is "CTR among users who saw at least one impression". If you want the click rate over the whole exposed population, your metric is really clicks per assigned user, which is a plain mean and does not need delta method at all.
Is delta method compatible with CUPED?
Yes, and the combination is standard. Linearize first to get r_i, then apply CUPED to r_i exactly as you would to any per-user metric. The pre-experiment covariate stays the same — typically a pre-experiment value of the same ratio or a closely related per-user metric. The two adjustments compose cleanly and the variance reduction stacks. Microsoft's original CUPED paper uses this combination for ratio metrics, and every serious experimentation platform ships them together.
Does it work when the randomization unit is something other than user?
Yes. The recipe replaces user_id with whatever you bucketed on — session, account, device, account-day. The constraint is consistency. If you bucketed on account_id, the per_user CTE becomes per_account and every other aggregation respects that grain. Mixing units, like bucketing on account_id and linearizing on user_id, breaks the variance estimate the same way clustering does. The unit of randomization is the unit of linearization, always.