How to detect anomalies in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why anomaly detection matters

DAU yesterday was 50K. Today it is 80K. Is that a real spike, a marketing push paying off, or a logging bug double-counting events? Without a defensible answer your dashboard becomes background noise, and you end up paging engineers on noise while missing the one revenue drop that actually mattered. Anomaly detection is the layer that turns a raw time series into a signal you can act on: "expected 48 to 52K, actual 80K, three standard deviations above baseline, please investigate."

This question shows up in interviews at Stripe, Uber, DoorDash, Airbnb, Netflix, and pretty much every consumer company that runs on metrics. The framing is almost always the same: imagine your VP of Product Slacks you on a Sunday saying revenue is "weird" today — write SQL that flags whether it is actually weird, and how weird. Strong candidates pick a method, justify the threshold, and call out at least one trap (seasonality, contaminated baseline, weekend dip) before writing a single line of code. Weak candidates jump straight to AVG() and ship false positives.

The good news is you do not need a model in production to answer this. Three SQL patterns cover ninety percent of the real cases: z-score against a clean baseline, interquartile range for skewed metrics, and a rolling window when your series has a trend. This post walks each one end to end with runnable SQL, then collects the pitfalls that get candidates dinged in on-site loops.

Pick the right method

Method Best for
Z-score Roughly normal data, point anomalies
IQR Skewed data, robust to outliers
Rolling window Time series with a trend
Seasonal decomposition Strong weekly or daily cycles (DAU, sales)
ML (isolation forest, autoencoder) Multivariate anomalies you cannot reduce

In an interview, lead with the simplest method that fits the data. If the prompt says "metric is roughly stable week over week" go with z-score. If the prompt mentions revenue per user or session duration, switch to IQR because the tail is fat. If the prompt mentions "growing product" or shows a clear upward slope, you need a rolling window. Naming the method and one sentence of why is worth more than a clever query.

Z-score method

The z-score method assumes your metric is roughly bell-shaped. You compute the mean and standard deviation over a clean baseline window — say the last ninety days, excluding the most recent day — then score each new observation as the number of standard deviations away from that mean. Anything beyond three standard deviations is flagged as an anomaly, anything between two and three as suspicious.

The one detail that matters: exclude the day you are evaluating from the baseline. Otherwise a true anomaly inflates the standard deviation and hides itself. This is the kind of detail interviewers explicitly look for.

WITH stats AS (
    SELECT
        AVG(value)    AS mean,
        STDDEV(value) AS std
    FROM metrics
    WHERE DATE >= CURRENT_DATE - INTERVAL '90 days'
      AND DATE <  CURRENT_DATE - INTERVAL '1 day'
)
SELECT
    m.DATE,
    m.value,
    (m.value - s.mean) / NULLIF(s.std, 0) AS z_score,
    CASE
        WHEN ABS((m.value - s.mean) / NULLIF(s.std, 0)) > 3 THEN 'anomaly'
        WHEN ABS((m.value - s.mean) / NULLIF(s.std, 0)) > 2 THEN 'suspicious'
        ELSE 'normal'
    END AS status
FROM metrics m
CROSS JOIN stats s
WHERE m.DATE >= CURRENT_DATE - INTERVAL '7 days'
ORDER BY m.DATE DESC;

A z of three corresponds, under a normal assumption, to about one in three hundred days being flagged — roughly one false positive per year per metric. With fifty metrics on your dashboard that is one false alarm a week from pure noise. Keep that math in your head when you set thresholds.

IQR method

When your metric is skewed — revenue per order, session duration, time to first purchase — the mean and standard deviation are themselves pulled around by outliers, which is the opposite of what you want for a detector. The interquartile range fixes this by anchoring on the twenty-fifth and seventy-fifth percentiles, which are barely affected by extreme values. You then flag anything more than 1.5 IQR below Q1 or above Q3.

WITH percentiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM metrics
    WHERE DATE >= CURRENT_DATE - INTERVAL '90 days'
)
SELECT
    m.DATE,
    m.value,
    p.q1,
    p.q3,
    p.q3 - p.q1                              AS iqr,
    p.q1 - 1.5 * (p.q3 - p.q1)               AS lower_bound,
    p.q3 + 1.5 * (p.q3 - p.q1)               AS upper_bound,
    CASE
        WHEN m.value < p.q1 - 1.5 * (p.q3 - p.q1)
          OR m.value > p.q3 + 1.5 * (p.q3 - p.q1)
        THEN 'anomaly'
        ELSE 'normal'
    END AS status
FROM metrics m
CROSS JOIN percentiles p
WHERE m.DATE >= CURRENT_DATE - INTERVAL '7 days'
ORDER BY m.DATE DESC;

In practice teams default to IQR for any metric they have not profiled, then switch to z-score only after they have plotted a histogram and seen the normality. It is the safer first choice in an interview, and you can mention that if asked to compare.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Rolling window method

The previous two methods assume the baseline is stationary. That assumption breaks the moment your product is growing — a healthy 10 percent week-over-week increase will trip the z-score detector every Monday for the same reason a tide trips a flood gauge. The rolling window method swaps the fixed baseline for a moving one, computed over the last N days, so the detector tracks the trend.

formula:
    rolling_z = (value - rolling_mean) / rolling_std
    rolling_mean and rolling_std are computed
    over the previous 14 days, excluding today.
WITH rolling AS (
    SELECT
        DATE,
        value,
        AVG(value)    OVER (
            ORDER BY DATE
            ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
        ) AS rolling_mean,
        STDDEV(value) OVER (
            ORDER BY DATE
            ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
        ) AS rolling_std
    FROM metrics
)
SELECT
    DATE,
    value,
    rolling_mean,
    rolling_std,
    (value - rolling_mean) / NULLIF(rolling_std, 0) AS rolling_z,
    CASE
        WHEN ABS((value - rolling_mean) / NULLIF(rolling_std, 0)) > 3
        THEN 'anomaly'
        ELSE 'normal'
    END AS status
FROM rolling
WHERE DATE >= CURRENT_DATE - INTERVAL '7 days'
ORDER BY DATE DESC;

A two-week window is a reasonable default — long enough to absorb day-to-day noise, short enough to follow trend. Stretch to twenty-eight days if your weekly seasonality is strong; shrink to seven if your product is changing every release. The same ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING pattern shows up in plenty of analyst on-sites — if you want to drill these, the SQL window functions interview questions post has the full set.

Common pitfalls

When teams roll their own anomaly detection for the first time, the most common mistake is to use a baseline that already contains the anomaly. If yesterday's logging bug is sitting in the ninety-day mean, the mean and the standard deviation are both inflated, and the bug hides itself in plain sight. The fix is either to exclude the day being scored from the baseline window or to first clean the baseline by removing known incidents from a separate incidents table.

A close second is ignoring seasonality. Daily active users on a Saturday are routinely thirty percent below the weekday average for almost every B2B product, which means a flat z-score detector will fire every weekend. The fix is to compare same-day-of-week — Saturday against Saturdays — by partitioning the baseline on EXTRACT(DOW FROM date). Hourly metrics need the same treatment with hour-of-day, and seasonal businesses like e-commerce need year-over-year deltas around holidays.

A third trap is treating the z-score threshold of three as a law of physics. Three works on roughly normal data, but on noisy event streams it produces too many false positives, and on heavily smoothed metrics it produces too few. Calibrate the threshold to your false-positive budget: if the on-call engineer can tolerate one alert per month per metric, work backwards from there and pick the z that delivers that frequency on a backtest of the last year of data.

A fourth pitfall is monitoring a single metric in isolation. Real anomalies are usually multivariate — payments are down because checkout latency is up because a cache is cold. Single-metric z-scores will catch the symptom but miss the cause, and you end up paging the wrong team. The practical fix is to layer detectors: keep univariate z-scores for the headline metrics, but add a correlation check that flags when two metrics that normally move together suddenly diverge.

A fifth pitfall is alerting without context. "Revenue anomaly!" is useless at 2 a.m. The on-call engineer needs to know which segment, which country, which platform, and how much the value deviated. Bake the decomposition into the alert payload — top three contributing dimensions, last known good value, and a link to the dashboard — and the mean time to triage drops by an order of magnitude.

Optimization tips

On tables larger than a few hundred million rows the three queries above turn from milliseconds to minutes. The first lever is partitioning: anomaly detection queries always filter on date, so daily or weekly partitions cut the scanned bytes by one to two orders of magnitude. On Snowflake and BigQuery this is free as long as you cluster or partition on the date column at table creation; on Postgres it requires explicit declarative partitioning.

The second lever is precomputing baselines. The ninety-day mean and standard deviation barely change from one day to the next, so it is wasteful to recompute them on every alert run. A materialized view refreshed once a day — or a small baseline_stats table updated by an Airflow task — is enough. Your detection query then becomes a single scan of the last seven days joined to a tiny stats row.

The third lever is keeping STDDEV and PERCENTILE_CONT off raw event tables. Both are non-additive, which means they cannot use the partial aggregates Snowflake and BigQuery cache for SUM and COUNT. Roll up to a daily metric table first — one row per metric per day — and run the detector on top. The roll-up scans the big table once a day; everything downstream is cheap.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

Z-score or IQR — which should I reach for first?

Default to IQR if you have not yet plotted a histogram of the metric. IQR is robust to the long tails that almost every business metric has — revenue per order, session length, request latency — and it does not punish you for being wrong about normality. Switch to z-score only after you have confirmed the metric is roughly bell-shaped, which in practice means counts of independent events like signups per day or page views per minute on a stable surface.

What threshold should I use?

Three standard deviations is the textbook default for z-score, and 1.5 IQR is the textbook default for the interquartile method. Both are starting points, not laws. Calibrate them on a backtest: pull the last twelve months of the metric, count how many points each threshold would flag, and ask whether that matches your tolerance. On noisy event streams you often end up at z > 4 or 2 IQR; on smoothed daily metrics you can sometimes tighten to z > 2.5.

Real-time or batch?

Batch every fifteen minutes is enough for ninety percent of analytics metrics — DAU, signups, orders, conversion rate. Real-time, meaning sub-minute detection, is only worth the engineering cost for metrics where the cost of a delayed alert exceeds the cost of a false positive: payments fraud, infrastructure outages, account takeover signals. Start batch, prove the detector works, then escalate the highest-value metrics to streaming.

Should I run anomaly detection in SQL or in a dedicated tool?

SQL is the right tool for exploration and one-off investigations. For production monitoring most teams eventually move to a dedicated stack — Prometheus with Alertmanager for infrastructure, Datadog or Grafana for application metrics, a homegrown service on Kafka for business KPIs. The SQL versions here are not throwaway though: they are exactly what you need for the interview question, and they double as the prototype before you commit to a vendor.

What do I do after the detector fires?

Four steps. Validate the signal — is the data healthy, or did an ETL job fail and produce a fake spike? Notify the metric owner — Slack the team, not the whole company, and include the decomposition. Investigate root cause — segment the metric by the dimensions you have, and find the smallest cell that explains the deviation. Document the incident so the next person who sees a similar shape can find your notes.

Does this work for multivariate anomalies?

Not really. The three SQL methods here detect anomalies in a single series at a time. For multivariate anomalies — where each metric on its own looks fine but the joint distribution is off — you need isolation forests, autoencoders, or a correlation check that flags when two normally-coupled metrics decouple. Most teams start with univariate detectors on the top KPIs and add multivariate models once the univariate layer is calibrated.