How to calculate effect size in SQL
Contents:
Why effect size matters
It is Tuesday and the PM ships you a Slack message: the new pricing test at Stripe hit p < 0.001 on a 14M user sample, ship it? You stare at the dashboard. The conversion lift is 0.04 percentage points. The p-value is screaming "significant" because the sample is enormous, but the magnitude is invisible to anyone outside the dashboard. This is the gap that effect size fills. P-value tells you "is there an effect at all"; effect size tells you "how big is the effect in standardized units that you can compare across experiments". Without it, every A/B test at scale looks ship-worthy and the team ends up with a roadmap of microscopic wins that do not move the business.
Data analyst and data science interviews at Meta, Airbnb, Uber, DoorDash, and Netflix lean hard on this. The classic question is a variant of "your test is significant but the effect is 0.5%, do you ship?" Candidates who get offers cite an effect size statistic (Cohen's d or Cohen's h) and tie that number to a business-meaningful raw difference. Saying "Cohen's h is 0.04, well below small, raw uplift 0.04pp on a 12% base, we are underpowered against a meaningful MDE" lands. Saying "p < 0.05 so ship" does not.
The two effect size statistics to know cold are Cohen's d for continuous metrics like revenue or session length, and Cohen's h for proportions like conversion or retention rate. Both are computable in pure SQL using AVG, VAR_SAMP, SQRT, and ASIN, and both fit into a standard A/B reporting template alongside p-value and confidence intervals.
Cohen's d for continuous metrics
Cohen's d is the standardized mean difference between two groups. The formula is the raw mean difference divided by the pooled standard deviation, where pooled SD weights each group's variance by its degrees of freedom.
d = (mean_B - mean_A) / s_pooled
s_pooled = sqrt(((n_A - 1) * var_A + (n_B - 1) * var_B) / (n_A + n_B - 2))Why pooled SD? Each variant estimates the same underlying noise under the null, and pooling gives a tighter estimate. Why n_A + n_B - 2 and not n_A + n_B? You spent two degrees of freedom estimating two group means, so the variance estimator subtracts them to stay unbiased.
In SQL:
WITH stats AS (
SELECT
variant,
COUNT(*) AS n,
AVG(revenue) AS mean,
VAR_SAMP(revenue) AS variance
FROM ab_users
WHERE experiment_id = 'pricing_v2'
GROUP BY variant
),
pair AS (
SELECT
MAX(CASE WHEN variant = 'A' THEN n END) AS n_a,
MAX(CASE WHEN variant = 'A' THEN mean END) AS mean_a,
MAX(CASE WHEN variant = 'A' THEN variance END) AS var_a,
MAX(CASE WHEN variant = 'B' THEN n END) AS n_b,
MAX(CASE WHEN variant = 'B' THEN mean END) AS mean_b,
MAX(CASE WHEN variant = 'B' THEN variance END) AS var_b
FROM stats
)
SELECT
mean_a,
mean_b,
mean_b - mean_a AS raw_diff,
SQRT(((n_a - 1) * var_a + (n_b - 1) * var_b)::NUMERIC
/ NULLIF(n_a + n_b - 2, 0)) AS pooled_sd,
(mean_b - mean_a)
/ NULLIF(SQRT(((n_a - 1) * var_a + (n_b - 1) * var_b)::NUMERIC
/ (n_a + n_b - 2)), 0) AS cohens_d
FROM pair;NULLIF guards against a single-variant experiment divide-by-zero. The ::NUMERIC cast prevents integer overflow on huge n_A * variance products in Postgres; Snowflake and BigQuery promote implicitly but the cast does not hurt.
Worked example: at a marketplace like Airbnb, variant A has 250,000 bookers with mean revenue 184.20 and variance 9,801, variant B has 250,000 bookers with mean 186.50 and variance 10,000. Pooled SD is roughly 99.5, raw diff is 2.30, and Cohen's d is about 0.023. Well under the 0.2 small threshold — report it as "+1.2% revenue per booker, d = 0.023" rather than just "p < 0.001".
Cohen's h for proportions
For binary outcomes like conversion, signup, or 30-day retention, raw difference in proportions is not standardized. A jump from 1% to 2% is qualitatively different from 50% to 51%, and Cohen's h captures that via the arcsine variance-stabilizing transform.
phi_p = 2 * arcsin(sqrt(p))
h = phi_B - phi_AWITH conv_rates AS (
SELECT
variant,
AVG(CASE WHEN converted THEN 1.0 ELSE 0 END) AS p_hat,
COUNT(*) AS n
FROM ab_users
WHERE experiment_id = 'checkout_v2'
GROUP BY variant
),
phi AS (
SELECT
variant,
p_hat,
n,
2 * ASIN(SQRT(p_hat)) AS phi_transformed
FROM conv_rates
)
SELECT
MAX(CASE WHEN variant = 'A' THEN p_hat END) AS p_a,
MAX(CASE WHEN variant = 'B' THEN p_hat END) AS p_b,
MAX(CASE WHEN variant = 'B' THEN phi_transformed END)
- MAX(CASE WHEN variant = 'A' THEN phi_transformed END) AS cohens_h
FROM phi;ASIN in Postgres expects an argument in [-1, 1]. Since p_hat is a proportion in [0, 1], SQRT(p_hat) is also in [0, 1] and will not throw a domain error. BigQuery and Snowflake both name the function ASIN and behave identically.
Worked example: a checkout test at variant A 12.0% and variant B 12.5% on 1M users per arm. Raw uplift 0.5pp, relative uplift 4.2%, Cohen's h roughly 0.015 — below the small threshold of 0.2. Even a clean 4.2% relative win on a 12% base is a tiny standardized effect; the team should not call it a "large lift" in the readout.
Interpretation thresholds
Cohen's 1988 rule-of-thumb thresholds:
| Effect | d | h |
|---|---|---|
| Small | 0.2 | 0.2 |
| Medium | 0.5 | 0.5 |
| Large | 0.8 | 0.8 |
These thresholds are conventions from psychology, where effects above 0.5 are routine. In product analytics, Cohen's d of 0.1 on revenue is usually a meaningful win, and 0.5 is exceptional. Anchor to your team's historical baseline of past shipped wins, not Cohen's table. If your last 20 ships averaged d = 0.04 on revenue and this one is d = 0.12, that is a strong relative outcome even though the absolute label is "small".
Effect size vs p-value
The most useful single readout is a 2x2 grid of p-value crossed with effect size. It disambiguates the four scenarios you will actually encounter.
SELECT
p_value,
cohens_d,
CASE
WHEN p_value < 0.05 AND ABS(cohens_d) < 0.1
THEN 'significant but trivial effect, do NOT ship'
WHEN p_value < 0.05 AND ABS(cohens_d) >= 0.1
THEN 'significant AND meaningful, ship'
WHEN p_value >= 0.05 AND ABS(cohens_d) < 0.1
THEN 'no effect detected'
WHEN p_value >= 0.05 AND ABS(cohens_d) >= 0.1
THEN 'effect present, underpowered, increase n'
END AS interpretation
FROM ab_results;The "significant but trivial" cell catches most teams past a million users per arm. The "underpowered" cell catches teams running 5-day tests on a small surface and concluding "no effect" when they just did not collect enough data.
Common pitfalls
The most common mistake when teams calculate effect size for the first time is pooling standard deviation with n_A + n_B in the denominator instead of n_A + n_B - 2. The two-degree correction matters because the variance estimator needs to account for the degrees of freedom spent on the two group means. On samples in the hundreds of thousands the gap is invisible, but on a few hundred users per arm the biased version inflates the denominator and shrinks Cohen's d in a misleading direction. The fix is to always carry the -2 term and gate the SQL with NULLIF.
Another trap is using Cohen's d on a binary outcome. The formula evaluates, but the interpretation breaks because the variance of a Bernoulli is p * (1 - p), which collapses near 0 and 1, and the resulting d is not comparable across proportions on different bases. For any conversion, retention, click-through, or true/false metric, switch to Cohen's h. The arcsine transform stretches the scale near 0 and 1, making the standardized difference comparable whether the base rate is 2% or 50%.
A third pitfall is reporting effect size without the raw business number alongside it. Cohen's d of 0.1 on revenue could mean +5% revenue per user, which is enormous at Netflix scale, or +5 seconds on session length, which is a rounding error. Always pair the standardized number with raw diff and base rate. The readout that survives review is "raw diff = X, relative lift = Y%, Cohen's d = Z (label), p = W".
A fourth issue is using Hedges' g where it converges to Cohen's d. Hedges' g is a small-sample bias correction that multiplies d by approximately 1 - 3 / (4 * (n_A + n_B) - 9). For n_A + n_B > 50 the correction is under 2% and the team should just report Cohen's d for readability. Reserve Hedges' g for pilots where each arm has under 30 users.
A fifth and subtle trap is comparing Cohen's d across metrics with different scales and pretending the comparison is meaningful. A d of 0.3 on revenue and a d of 0.3 on time-on-page are not "the same magnitude" — the underlying standard deviations belong to different distributions. Effect size standardizes within a metric, not across metrics.
Optimization tips
For wide A/B tables with hundreds of millions of rows, the bottleneck in the Cohen's d query is the VAR_SAMP scan, not the final arithmetic. Partition ab_users by experiment_id so the filter is a partition prune. On Snowflake, cluster by experiment_id and variant. On BigQuery, partition-by-day on the assignment date and cluster by experiment so multi-experiment dashboards reuse cache slices.
For Cohen's h on a streaming conversion metric, materialize the conv_rates CTE into a per-day-per-variant aggregate. The arcsine transform is cheap; the row scan to compute p_hat is not. A daily summary table of (experiment_id, variant, day, n, conversions) keeps the final query at single-digit milliseconds across months of data.
When bootstrapping CIs for Cohen's d, do not join the source table N times. Use TABLESAMPLE BERNOULLI or a hashed-row resampler and aggregate 1,000 replicates in a single CTE. The pattern is in how to calculate bootstrap CI in SQL.
Related reading
- SQL window functions interview questions
- How to calculate chi-square test in SQL
- How to calculate conversion uplift in SQL
- How to calculate confidence interval in SQL
- A/B testing peeking mistake
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
When do I use Cohen's d versus Cohen's h?
Use Cohen's d for continuous metrics: revenue per user, session length, latency, basket size, time-to-first-action. Use Cohen's h for proportions or rates: conversion, signup, click-through, 7-day retention, churn flag. The deciding question is whether the underlying observation is a number on a continuous scale or a yes/no event. If yes/no, the arcsine-transformed h gives a standardized effect comparable across base rates that the raw difference does not.
What is the minimum practically significant effect size in product analytics?
It depends on product maturity and the cost of change. For an early-stage startup where every percentage point compounds, Cohen's d or h of 0.05 to 0.1 is often worth shipping if implementation cost is low. For a mature product where engineering time is scarce, the team typically needs 0.2 or larger to justify opportunity cost. Anchor to your historical record of shipped wins rather than Cohen's textbook table.
Does effect size capture the direction of the difference?
Yes. The sign of Cohen's d and Cohen's h indicates which variant won. A positive value with the convention d = (mean_B - mean_A) / s_pooled means B beat A; negative means A beat B. Make the convention explicit in readouts so reviewers do not have to guess which arm is the treatment.
Do I need Hedges' g instead of Cohen's d?
For sample sizes typical of product experiments at any company with real traffic, no. Hedges' g is a small-sample bias correction that converges to Cohen's d as n grows. With more than 30 users per arm the correction is under 5%, and over 50 per arm it is under 2%. Reserve Hedges' g for pilots and qualitative rounds where each arm has fewer than 30 observations.
Should I report a confidence interval on effect size?
Yes whenever the audience will read the number as a decision input. The cleanest CI on Cohen's d is bootstrap resampling: draw 1,000 samples with replacement, compute d on each, and take the 2.5th and 97.5th percentiles. The asymptotic approximation SE_d = sqrt((n_A + n_B) / (n_A * n_B) + d^2 / (2 * (n_A + n_B))) works for large samples and is what most stats libraries ship.
How does effect size interact with sample size planning?
Effect size is the input to power analysis. Before launching a test, pick a minimum detectable effect (MDE) in standardized units, a power level (usually 0.8), and a significance threshold (usually 0.05), then solve for the required n per arm. Smaller MDE, larger n. Reporting Cohen's d on past experiments builds the team's library of realistic MDEs to plan against — the single biggest unlock for running tests with enough power.