Distribution drift in SQL
Contents:
Why drift monitoring matters
Your ML model was trained on a snapshot of the world. Feature distributions back then had specific means, variances, and class shares. When the world keeps moving and your model does not, accuracy degrades quietly — there are no exceptions, no failed builds, just slowly worsening predictions buried inside a predict() call. By the time the business team complains that "the model feels off," you have already lost weeks of decisions to a model that learned a distribution that no longer exists.
The same trap eats product analytics. Your PM at Stripe pings you on a Monday morning: "AOV dropped 8% week over week, what happened?" You sprint into the data warehouse and check launches, pricing, fraud rules. Nothing changed. What actually shifted was the customer mix — a paid campaign drove in a cohort of small-basket buyers and your AOV is now an average over two different populations. The product is fine, the marketing funnel is fine, the only thing broken is your mental model that AOV is a single distribution.
Drift detection is the first-line indicator for both situations. Run it on every feature, every cohort, every product metric you care about, on a daily or weekly schedule. When PSI or chi-square crosses a threshold you stop and ask: did the world change, did the data pipeline break, or did someone reclassify a column upstream? Each of those answers leads to a very different fix, but you only get to ask the question if you noticed.
PSI: Population Stability Index
PSI is the standard drift metric in credit scoring and increasingly in ML monitoring at companies like Airbnb, Uber, and DoorDash. The formula across discrete buckets is:
PSI = sum over buckets of (p_curr - p_base) * ln(p_curr / p_base)Where p_base is the share of observations in a bucket on the reference window and p_curr is the share on the current window. A perfectly stable distribution gives PSI = 0. A complete mass shift to a different bucket pushes PSI toward 1 or higher. The asymmetric (p_curr - p_base) * ln(p_curr / p_base) form penalizes both directions of movement, which is exactly what you want — a bucket that emptied out is just as suspicious as a bucket that filled up.
The math falls out of the Kullback-Leibler divergence between the two distributions, symmetrized. You do not need to derive it from scratch; the formula is small enough to fit in a single SQL SUM. The implementation work is in the bucketing, not the arithmetic.
PSI in SQL
The canonical recipe: split the reference period into 10 quantile buckets, freeze those bucket edges, then count how the current window falls into the same edges.
WITH base AS (
SELECT amount, NTILE(10) OVER (ORDER BY amount) AS bucket
FROM orders
WHERE created_at BETWEEN '2026-04-01' AND '2026-04-30'
),
base_thresholds AS (
SELECT
bucket,
MIN(amount) AS lo,
MAX(amount) AS hi,
COUNT(*)::NUMERIC / SUM(COUNT(*)) OVER () AS p_base
FROM base
GROUP BY bucket
),
current_period AS (
SELECT amount
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
),
current_distribution AS (
SELECT
bt.bucket,
bt.p_base,
COUNT(c.amount)::NUMERIC / NULLIF(SUM(COUNT(c.amount)) OVER (), 0) AS p_curr
FROM base_thresholds bt
LEFT JOIN current_period c
ON c.amount BETWEEN bt.lo AND bt.hi
GROUP BY bt.bucket, bt.p_base
)
SELECT
SUM((p_curr - p_base) * LN(NULLIF(p_curr, 0) / NULLIF(p_base, 0))) AS psi
FROM current_distribution;LN on zero raises an error or returns NULL depending on the engine. Wrap both arguments in NULLIF(x, 0) to make zeros propagate cleanly, or apply Laplace smoothing by adding + 0.0001 to every share before the log. Smoothing is more honest about the fact that a zero-count bucket is suspicious data, not infinite drift.
Chi-square for categoricals
For discrete features — country, channel, plan tier, device — PSI still works but the more natural test is Pearson's chi-square. You compute expected counts under the assumption that the categorical distribution did not shift, then sum the squared deviation from observed.
WITH base AS (
SELECT channel, COUNT(*) AS n FROM events
WHERE event_date BETWEEN '2026-04-01' AND '2026-04-30'
GROUP BY channel
),
current_period AS (
SELECT channel, COUNT(*) AS n FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY channel
),
total AS (
SELECT
(SELECT SUM(n) FROM base) AS base_total,
(SELECT SUM(n) FROM current_period) AS curr_total
),
expected AS (
SELECT
b.channel,
c.n::NUMERIC AS observed,
b.n::NUMERIC * t.curr_total / NULLIF(t.base_total, 0) AS expected
FROM base b
LEFT JOIN current_period c USING (channel)
CROSS JOIN total t
)
SELECT
SUM(POWER(observed - expected, 2) / NULLIF(expected, 0)) AS chi_square_drift
FROM expected;If chi_square_drift exceeds the critical value at df = N_categories - 1, the structure has shifted. For 10 channels the critical value at p=0.05 is about 16.9; at p=0.01 it is 21.7. Hardcoding the cutoff for the categorical cardinalities you actually monitor is fine — you do not need a CDF lookup table in SQL.
Thresholds and reactions
The industry conventions for PSI are:
SELECT
psi,
CASE
WHEN psi < 0.1 THEN 'stable'
WHEN psi < 0.25 THEN 'shift - monitor'
ELSE 'significant drift - retrain'
END AS verdict
FROM psi_result;PSI under 0.1 means the distribution is stable enough that you can trust the model and the metric. Between 0.1 and 0.25 something real has moved — log it, watch the next window, but do not page anyone yet. Above 0.25 you have significant drift and need to act: retrain the model, investigate the feature pipeline, or accept that the metric is no longer comparable to last quarter.
These thresholds come from decades of credit-scoring practice, but they are not laws of physics. Recalibrate them for your domain. A consumer app with rapid product changes might run hotter at 0.15 and only act above 0.4. A regulated lending model might page on 0.08. Pick the thresholds, write them down, and stop arguing about them in every standup.
Common pitfalls
The most common mistake is bucketing on the current period instead of the reference. PSI is defined against fixed bucket edges from the baseline — if you rebucket every run, both distributions will look like uniform-ish deciles by construction and PSI will collapse toward zero, masking real drift. The fix is to compute NTILE once on the reference window, save the bucket lower and upper bounds as constants, and then BETWEEN the current data into those frozen ranges.
A second trap is LN on zero buckets. When a category disappears or a quantile bucket goes empty in the current window, p_curr = 0 and the logarithm explodes. Engines variously return NULL, raise an error, or silently produce -Inf, and a single bad bucket can poison the entire SUM. Either wrap both p_curr and p_base in NULLIF(x, 0) so the term becomes NULL and is skipped, or apply Laplace smoothing by adding a small constant (1e-4 is standard) to every share before taking the log.
A subtler mistake is running PSI on your target variable. PSI was designed for input features. When the target distribution shifts you are looking at concept drift, which requires different tools: tracking model performance against ground truth, not just inputs. If your y distribution drifts but model accuracy holds, you may have actually learned the relationship correctly. If accuracy drops with no input drift, you have concept drift. Mixing these up sends you chasing the wrong root cause.
People also compare yesterday against today and call it drift monitoring. Day-over-day noise dominates the signal, especially for small-volume features. The reference window should be static — the model's training set, or a known-good period from before launch — and the current window should be long enough to dampen daily noise. Weekly comparisons against a fixed monthly baseline catch real shifts without firing every Monday morning.
The last big trap is ignoring seasonality. The order distribution at Amazon in December does not look like the order distribution in June, and PSI will scream every Q4 if your reference is summer data. Either pick a reference window from the same season, run year-over-year comparisons instead of period-over-period, or maintain seasonal baselines that you switch between as the calendar rolls.
Optimization tips
PSI queries scan two long windows of an events table, so the usual warehouse rules apply. On Snowflake or Databricks, partition the source table by event_date so the WHERE clauses prune partitions instead of scanning history. On Postgres, a btree index on (event_date, amount) keeps both the reference and current scans cheap. If the table is enormous and you only need approximate PSI, sample 10% on both sides — the metric is robust to sampling and the query time drops by an order of magnitude.
Materialize the reference distribution. The bucket edges and p_base shares do not change until you choose a new baseline, so cache base_thresholds as a small table refreshed monthly. Each daily PSI run then only scans the current window and joins against a 10-row constant. This is the difference between a 30-second query and a 5-minute one when the events table is at the billions-of-rows scale.
For chi-square on high-cardinality categoricals (think 10,000 country-channel combinations), pre-aggregate to top-N plus an "other" bucket. The math still works, you preserve interpretability when alerts fire, and you avoid degrees-of-freedom blowing up to the point where the critical value is essentially infinite. Most real drift lives in the head of the distribution anyway.
Related reading
- How to calculate chi-square test in SQL
- How to calculate data quality score in SQL
- How to monitor data volume anomaly in SQL
- SQL window functions interview questions
If you want to drill SQL questions like these every day, NAILDD is launching with 500+ SQL problems built around exactly this pattern.
FAQ
PSI versus the KS test — which one should I use?
PSI operates on bucketed distributions and is the default in credit scoring and ML monitoring because it gives a single bounded number you can threshold. The Kolmogorov-Smirnov test works on continuous distributions and is more sensitive on small samples, but it returns a p-value that is awkward to threshold over time. Use PSI for ongoing monitoring dashboards. Use KS when you have a one-shot question like "did this batch of users come from the same distribution as training" and you care about statistical significance.
How many buckets should I use for PSI?
Ten is the industry standard and what every credit-scoring textbook uses. Fewer buckets lose resolution and hide real shifts inside a bucket; more buckets make the metric statistically noisy because individual bucket counts get small. For features with natural cut points (age groups, plan tiers) use those instead of quantiles and the number-of-buckets question answers itself.
How do I choose the reference window?
For a model in production, the reference is the training set itself — that is the distribution the model was calibrated against, and drift away from it is exactly what you want to measure. For pure product metrics with no model, pick a known-stable month and freeze it. The crucial property is that the reference is fixed; rolling references make drift look smaller than it is because the baseline drifts along with the current.
My PSI is above 0.5 — what does that mean?
Anything past 0.5 is dramatic and almost always indicates a broken pipeline rather than a real-world shift. Check upstream: did a column type change, did a join start dropping rows, did an event get renamed, did a vendor switch their schema. Genuine market drift of that magnitude takes months to build up; pipeline drift of that magnitude takes one bad deploy.
Can I use PSI on binary features?
Yes — PSI on a binary feature is mathematically equivalent to a chi-square test on two buckets, so you get the same drift signal. The interpretation is the same, the thresholds are the same, and the SQL is even simpler because you only have two buckets to compute. Run PSI on every binary feature in your model and put them on the same dashboard as continuous ones.
How often should I run the drift check?
Daily for high-volume features and any feature feeding a real-time model; weekly for slower-moving product metrics. The point is not the frequency, it is keeping the cadence steady so you can reason about whether a spike is a real change or a calendar artifact. Set the schedule once and only change it when the cadence demonstrably misses something important.