How to calculate DiD in SQL
Contents:
Why DiD matters when you cannot randomize
Difference-in-differences is the workhorse causal inference method for the situation every analyst eventually walks into: a product change shipped to everyone, a price increase rolled out by region, a marketing campaign turned on for one country, and someone in leadership wants the lift number by Friday. A real A/B test is off the table because the treatment was not random. Naive before-and-after comparisons are also off the table because the world keeps moving, and last quarter's seasonality, churn, and macro shocks end up baked into your estimate. DiD splits the difference: compare the change in the treated group to the change in a control group over the same window, and the difference of those two differences is the estimated effect.
The scenario most analysts at Stripe, DoorDash, Airbnb, or Notion will recognize: a PM ships a new pricing page to users in three states on April 15th and asks on May 14th whether revenue moved. You cannot randomize because legal approved the rollout by geo. What you can do is pick the rest of the US as a comparison, line up a clean pre-period and post-period, and compute a DiD estimate the engineering manager can defend in a planning meeting. The interview question that lands you a senior analytics role is rarely "what is DiD" — it is "write the SQL, show me the parallel-trends plot, and tell me when the estimate breaks". This post walks the recipe end to end with a runnable query, the math you need to defend in code review, and the traps hiring managers know good candidates fall into.
A second reason DiD shows up so often: it scales. Once the 2x2 table is in SQL, the same skeleton extends to event-study plots, staggered rollouts with multiple treatment cohorts, and regression form with fixed effects.
The 2x2 table in one paragraph
DiD lives in a four-cell table. Rows are groups, columns are periods. Cell A is the treated group's metric before the intervention; cell B is the treated group after; cell C is the control group before; cell D is the control group after. The treated change is B minus A. The control change is D minus C. The DiD estimate is the treated change minus the control change: (B minus A) minus (D minus C). The estimate is unbiased for the causal effect only when the two groups would have moved in parallel in the absence of the intervention — the parallel-trends assumption, which we check explicitly below.
| Group | Pre | Post | Change |
|---|---|---|---|
| Treated | A | B | B minus A |
| Control | C | D | D minus C |
DiD equals (B minus A) minus (D minus C). Two subtractions, one number, one assumption to defend.
DiD in SQL
The query has three layers. First, assemble per-user pre and post revenue using filtered aggregates so each user appears once. Second, average those numbers by treatment status to fill the four cells of the 2x2. Third, plug the cells into the DiD formula in a final scalar query. Run this against any modern warehouse — Snowflake, BigQuery, Redshift, Postgres, Databricks all support FILTER (WHERE ...) or a CASE rewrite.
WITH base AS (
SELECT
u.user_id,
u.is_treated,
SUM(o.amount) FILTER (
WHERE o.order_date BETWEEN DATE '2026-03-01' AND DATE '2026-03-31'
) AS revenue_pre,
SUM(o.amount) FILTER (
WHERE o.order_date BETWEEN DATE '2026-04-15' AND DATE '2026-05-14'
) AS revenue_post
FROM users u
LEFT JOIN orders o USING (user_id)
GROUP BY u.user_id, u.is_treated
),
cells AS (
SELECT
is_treated,
AVG(COALESCE(revenue_pre, 0)) AS mean_pre,
AVG(COALESCE(revenue_post, 0)) AS mean_post
FROM base
GROUP BY is_treated
)
SELECT
MAX(CASE WHEN is_treated THEN mean_pre END) AS treated_pre,
MAX(CASE WHEN is_treated THEN mean_post END) AS treated_post,
MAX(CASE WHEN NOT is_treated THEN mean_pre END) AS control_pre,
MAX(CASE WHEN NOT is_treated THEN mean_post END) AS control_post,
(MAX(CASE WHEN is_treated THEN mean_post END)
- MAX(CASE WHEN is_treated THEN mean_pre END))
- (MAX(CASE WHEN NOT is_treated THEN mean_post END)
- MAX(CASE WHEN NOT is_treated THEN mean_pre END)) AS did_estimate
FROM cells;Two things worth noting. The COALESCE inside AVG is intentional — without it, a user with zero orders in the pre-period becomes a NULL that silently drops out of the average, shifting the denominator and biasing the estimate upward. The MAX(CASE WHEN ...) pivot is a portable way to reshape the long cells table into wide columns without depending on a PIVOT keyword spelled differently across warehouses.
Parallel-trends check
DiD only earns its causal interpretation when the treated and control groups would have moved in parallel without the intervention. There is no statistical test that proves this — the counterfactual is by definition unobserved — but you can build evidence by plotting weekly averages over the pre-period and eyeballing whether the two lines look parallel. Any modern reviewer will ask for this plot before they trust the headline number.
SELECT
DATE_TRUNC('week', o.order_date)::DATE AS week,
u.is_treated,
AVG(o.amount) AS avg_revenue,
COUNT(DISTINCT u.user_id) AS active_users
FROM users u
JOIN orders o USING (user_id)
WHERE o.order_date BETWEEN DATE '2025-12-01' AND DATE '2026-04-14' -- pre-period only
GROUP BY 1, 2
ORDER BY 1, 2;Paste the output into a notebook, plot the two series, and look for three things. Are the lines roughly parallel over the whole pre-period, including the eight weeks closest to the intervention date? Do they share the same seasonality bumps — Black Friday, the holiday week, the post-New-Year drop? Is there any sign of an anticipation effect, where one group started moving right before the intervention? If the answers are not all "yes", DiD is biased, and the cleanest path forward is a better control group, synthetic control, or an event study with leads and lags.
DiD as a regression
The 2x2 estimate equals the interaction coefficient in an OLS regression of the outcome on a treated dummy, a post dummy, and their product. Writing it as a regression makes DiD composable with controls, fixed effects, and clustered standard errors. The model:
y = alpha + beta_treated * treated + beta_post * post + beta_did * (treated * post) + epsilonbeta_did is the DiD estimate. In SQL the four-coefficient regression can be backed out from filtered means, which is useful for a quick sanity check that your warehouse number matches what statsmodels or scikit-learn returns in Python.
SELECT
AVG(CASE WHEN NOT is_treated AND NOT is_post THEN amount END)
AS alpha,
AVG(CASE WHEN is_treated AND NOT is_post THEN amount END)
- AVG(CASE WHEN NOT is_treated AND NOT is_post THEN amount END)
AS beta_treated,
AVG(CASE WHEN NOT is_treated AND is_post THEN amount END)
- AVG(CASE WHEN NOT is_treated AND NOT is_post THEN amount END)
AS beta_post,
(AVG(CASE WHEN is_treated AND is_post THEN amount END)
- AVG(CASE WHEN is_treated AND NOT is_post THEN amount END))
- (AVG(CASE WHEN NOT is_treated AND is_post THEN amount END)
- AVG(CASE WHEN NOT is_treated AND NOT is_post THEN amount END))
AS beta_did
FROM observations;For confidence intervals and clustered standard errors, run the same regression in Python with statsmodels.formula.api.ols('amount ~ treated * post', data=df).fit(cov_type='cluster', cov_kwds={'groups': df['user_id']}). The point estimate matches the SQL above to floating-point precision; the standard error is what you cannot get from AVG. Clustering at the unit of assignment — user, store, or region — is what stops you from reporting an artificially tight CI.
Common pitfalls
The first and most expensive pitfall is broken parallel trends. If the two groups were already diverging in the pre-period, DiD will hand you a number that looks crisp and is mostly noise from the pre-existing drift. Reviewers catch this in the first plot, and the fix is either a better control group, a synthetic control built from a weighted basket of comparison units, or an event-study specification with placebo leads that demonstrates how flat the pre-period really is.
The second pitfall is selection bias inside the treated group. If the treatment was assigned to the top spenders, the most engaged accounts, or the cohort the growth team handpicked because they were already ramping, the DiD estimate is contaminated by mean reversion or continued growth that has nothing to do with the intervention. The fix is either propensity-score matching to build a comparable control, restricting to a sharp cutoff that approximates random assignment, or moving to a regression-discontinuity design when one is available.
The third pitfall is an external shock in the post-period that hits one group harder than the other. A national holiday, a competitor outage, a viral news cycle, or a payment-processor incident can swing one group's metric in ways the other group does not see, and DiD will attribute the entire swing to the treatment. The fix is to widen the post-window so the shock washes out, exclude the contaminated days, or add a control for the shock in the regression form.
The fourth pitfall is a control group that is too small or too different. If your control has a few hundred users while the treated group has tens of thousands, the post-period mean is noisy and the CI on the DiD estimate explodes. If the control differs systematically on plan, geography, or tenure, even a parallel pre-period will not save you because composition changes drive the post-period. The fix is to enforce a minimum control size — at least a few thousand active users where possible — and to balance the groups on observable covariates before computing the DiD.
The fifth pitfall is ignoring within-unit correlation when you compute standard errors. A naive STDDEV over user-period rows treats every observation as independent, which it is not — the same user shows up in both pre and post. Cluster at the user, account, or region, or bootstrap at the same level if the regression machinery is not handy.
Optimization tips
Three optimizations matter at warehouse scale. First, partition the orders table by order_date and add a covering index on (user_id, order_date) if your engine supports it, so the filtered aggregates touch only the two month-long windows instead of the full history. Second, materialize the per-user pre and post sums once and reuse them across the parallel-trends plot, the headline DiD, and the segment cuts — recomputing the same SUM ... FILTER three times will dominate the plan on billion-row tables. Third, if the design is panel data with hundreds of periods, drop into a regression in Python or DuckDB with fixed effects rather than fanning the cells out in SQL; you will hit a wall on memory and readability long before correctness.
Related reading
- SQL window functions interview questions
- How to calculate conversion uplift in SQL
- How to calculate CUPED in SQL
- A/B testing peeking mistake
If you want to drill causal-inference SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
When should I use DiD instead of an A/B test?
Use a real A/B test whenever you can randomize at the unit you care about. DiD is the right tool when randomization is impossible or impractical — geo rollouts, policy changes, marketing campaigns aimed at the whole user base, and price tests where legal blocks split treatment. The trade-off is that DiD relies on the parallel-trends assumption, which you have to argue for, while a clean A/B test relies only on random assignment, which the experiment platform enforces.
What if parallel trends are clearly violated?
Switch to a method that does not require them. Synthetic control builds a weighted basket of untreated units that matches the treated unit's pre-period trajectory, and the post-period gap between the synthetic and real treated unit is the estimate. Event studies plot the treatment effect at each lead and lag, so the reviewer can see whether the pre-trend is flat or sloped. Regression discontinuity exploits a sharp cutoff in assignment when one exists. Each has its own assumptions, but these are the standard fallbacks.
Can DiD work with only two time points?
It can, but it gives up most of its diagnostic power. With a single pre-period and a single post-period you cannot eyeball parallel trends, cannot tell whether a pre-existing slope drove the estimate, and cannot run placebo leads. Whenever the panel allows it, use at least eight pre-period weeks and four post-period weeks, plot the series, and let the reviewer see what they are signing off on.
How do I handle a continuous treatment instead of a binary one?
Replace the treated dummy with the continuous treatment intensity and keep the post dummy. The interaction treatment_intensity * post estimates the effect per unit of intensity. The interpretation changes — you are estimating a slope rather than a level shift — and the parallel-trends assumption generalizes to "the slope of the outcome with respect to intensity was constant in the pre-period". The SQL is the same skeleton with a multiplication instead of a CASE.
How do I get confidence intervals for the DiD estimate?
Two clean routes. The regression route runs the OLS specification with clustered standard errors at the unit of assignment in statsmodels or fixest, and reports the CI on the interaction coefficient. The pure-SQL route bootstraps the DiD estimate by resampling users with replacement a few thousand times, recomputing the 2x2 table on each resample, and taking the 2.5th and 97.5th percentiles. Both should agree to two significant figures on any well-specified design.