How to calculate forecast bias in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why forecast bias matters

Mean absolute error and root mean square error answer the question "how wrong was the forecast" but they say nothing about which direction the model leans. Forecast bias is the direction of the systematic error. If bias is plus five percent, every prediction is on average five percent below reality and the model is structurally broken — yet MAE will look perfectly reasonable. In supply chain, inventory planning, capacity provisioning, and finance, knowing the sign of the error is often more important than knowing its magnitude.

Your demand planner pings you on a Monday morning: "We keep running out of stock in three cities, and we keep over-buying in two others. The MAE on the model has been flat for months. What is going on?" The answer is almost certainly that the model is unbiased overall but heavily biased inside each region — the positive and negative errors cancel in the global average and stay hidden until someone looks at the segment breakdown. This pattern shows up in analytics rounds at Amazon, DoorDash, Uber, Stripe, Snowflake, Databricks, and Netflix, and the candidates who handle it well start with bias, not with MAE.

This guide walks through the SQL recipes that come up in those interviews and in real on-call rotations. We will compute the plain mean error, normalize it by the actual value, decompose it by segment and over time, and end with the optimization tricks that make these queries fast on a billion-row backtest table. The patterns are portable across Snowflake, BigQuery, Redshift, and vanilla Postgres — only the date-truncation syntax changes.

Mean error in SQL

The simplest bias metric is the signed mean error: the average of actual minus prediction over the backtest window. A positive value means the model is underforecasting — actuals are running higher than predictions, and the team is consistently caught short. A negative value means the model is overforecasting — predictions are higher than reality, and the team is consistently over-buying or over-staffing.

SELECT
    AVG(actual - prediction)                  AS mean_error,
    SUM(actual - prediction)                  AS total_error,
    COUNT(*)                                  AS n,
    CASE
        WHEN AVG(actual - prediction) > 0 THEN 'underforecasting'
        WHEN AVG(actual - prediction) < 0 THEN 'overforecasting'
        ELSE 'balanced'
    END                                       AS direction
FROM forecasts_backtest
WHERE event_date >= CURRENT_DATE - INTERVAL '30 day';

A single point estimate of mean_error can be noise, especially on short windows or volatile series. The honest move is to bootstrap a confidence interval — resample the daily errors with replacement a few thousand times, take the 2.5th and 97.5th percentiles of the bootstrap means, and check whether zero falls inside that interval. If zero is inside, you cannot reject the null that the model is unbiased. If zero is outside, the bias is statistically real and worth fixing.

Carry the total error alongside the average. Mean error of plus two units sounds harmless until you multiply by a thousand days and realize the cumulative shortfall is two thousand units, which translated into a real inventory line means weeks of stockouts on a popular SKU.

Percentage bias and signed sMAPE

Mean error is scale-dependent. An error of ten on a daily series that averages a million is invisible; the same error on a series averaging twenty is catastrophic. Normalizing by the actual value converts the error into a unit-free percentage that is comparable across products and time ranges.

SELECT
    AVG((actual - prediction) / NULLIF(actual, 0)) * 100                       AS mpe_pct,
    AVG((actual - prediction) / NULLIF((actual + prediction) / 2.0, 0)) * 100  AS smape_signed_pct
FROM forecasts_backtest
WHERE event_date >= CURRENT_DATE - INTERVAL '30 day'
  AND actual > 0;

The NULLIF and the explicit actual > 0 filter guard against division by zero when the series has empty days. On retail and marketplace data this matters — closed stores, zero-traffic Sundays, and brand-new SKUs all produce legitimate zeros that would otherwise blow up the percentage. If you have many such days, switch to the symmetric variant in the second column: signed sMAPE divides by the average of actual and prediction, which stays finite even when one side is zero. The trade-off is that sMAPE has a hard ceiling around plus or minus two hundred percent and is asymmetric in subtle ways — overforecasts and underforecasts of the same magnitude do not produce equal-magnitude sMAPE values. Carry both columns: MPE is the cleaner interpretation when actuals are well above zero, signed sMAPE is the safer fallback when the series has sparse or zero days.

Bias by segment

A bias of zero overall can still hide large opposing biases inside segments. The classic case is iOS at plus ten percent and Android at minus ten percent on the same product — the model is wrong on every device, but the global average looks perfect because the errors cancel.

SELECT
    segment,
    COUNT(*)                                                                  AS n,
    AVG(actual - prediction)                                                  AS mean_error,
    AVG((actual - prediction) / NULLIF(actual, 0)) * 100                      AS mpe_pct,
    CASE
        WHEN ABS(AVG((actual - prediction) / NULLIF(actual, 0))) > 0.05
            THEN 'biased'
        ELSE 'ok'
    END                                                                       AS verdict
FROM forecasts_backtest
WHERE event_date >= CURRENT_DATE - INTERVAL '90 day'
  AND actual > 0
GROUP BY segment
HAVING COUNT(*) >= 30
ORDER BY ABS(AVG(actual - prediction)) DESC;

The HAVING COUNT(*) >= 30 filter drops segments with too few observations to read the bias signal — anything smaller is dominated by noise and produces false positives. Five percent is a reasonable starting tolerance for product analytics; for finance and inventory the band is usually tighter, around one to two percent. The segments that matter depend on the product: split by platform, country, and acquisition channel for consumer apps; by category and seller cohort for marketplaces; by plan tier and tenure bucket for SaaS. When a segment crosses the threshold, the standard next step is to retrain on segment-specific features or to introduce a per-segment bias-correction layer on top of the global forecast.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Bias over time

The third decomposition is over time: compute bias inside rolling weeks or months and watch how it moves. The two failure modes show up clearly in this view — a bias that flips sign every few weeks (unstable model) and a bias that drifts monotonically in one direction (stale model).

SELECT
    DATE_TRUNC('week', event_date)::DATE  AS week,
    COUNT(*)                              AS n,
    AVG(actual - prediction)              AS mean_error,
    AVG((actual - prediction) / NULLIF(actual, 0)) * 100 AS mpe_pct
FROM forecasts_backtest
WHERE event_date >= CURRENT_DATE - INTERVAL '180 day'
GROUP BY DATE_TRUNC('week', event_date)
ORDER BY week;

An unstable bias usually points to a model that is overreacting to recent data, often because the smoothing weight is too high or because a recent feature was added with insufficient regularization. The fix is a longer training window or a smaller learning rate. A monotonically drifting bias is more dangerous: the model used to be calibrated, and now it is consistently over or undershooting by more each week. The cause is almost always concept drift — the underlying distribution has shifted, and the training data is no longer representative. Cross-reference this with distribution drift and changepoint detection signals before you commit to a retrain.

Common pitfalls

The first trap is reporting only MAE or RMSE and concluding the model is healthy when bias is large. A model can have low MAE and high bias at the same time. If MAE is ten and mean error is eight, roughly eighty percent of the error is one-sided — every day the model leans the same way, and the team downstream pays for it whether the bias is in inventory, capacity, or revenue. Bias and accuracy are independent dimensions and both belong on the monitoring dashboard.

The second trap is reading "bias equals zero" as "everything is fine." A perfectly balanced average can still hide enormous individual errors that happen to cancel: plus fifty one day, minus fifty the next, ten thousand days in a row, mean error zero, the operations team in tears. Bias must always be reported alongside MAE so reviewers can tell a calm model from a wildly oscillating one. The pair mean_error close to zero, MAE small is the only safe configuration.

The third trap is computing MPE without guarding against zero actuals. On any series that occasionally hits zero — closed days, new products, paused campaigns — a single zero day produces a divide-by-zero that either crashes the query or, worse, silently coerces to null and biases the average. The fix is the explicit actual > 0 filter or the symmetric sMAPE variant. Document the choice so reviewers know which days were excluded.

The fourth trap is failing to segment. A model can look unbiased in the headline number and be catastrophically biased inside every segment that matters. iOS at plus ten percent and Android at minus ten percent average out to zero; California overforecast and Texas underforecast average out to zero. The segment query above is not optional — it is the diagnostic that separates a healthy model from a hidden disaster.

The fifth trap is measuring bias on in-sample data. Training-set bias is essentially zero for any model fit with a mean-squared-error objective — the optimization explicitly drives the residuals to balance. Bias must be measured on out-of-sample data, ideally on a rolling backtest that mimics how the model is actually deployed. If you cannot demonstrate that the bias number was computed on data the model never saw, the number is not trustworthy.

Optimization tips

The most expensive part of a bias monitoring stack is the rolling-window recomputation. Every day, the dashboard wants to know the bias over the trailing thirty, ninety, and one hundred eighty days, broken out by ten segments. Computed naively, that is ten queries scanning six months of backtest data every morning. The single highest-leverage change is to materialize daily bias snapshots: one row per day per segment per window, populated once by a scheduled job, read in milliseconds by the dashboard.

Partition the backtest table by event_date. On Snowflake the equivalent is clustering by the date column; on BigQuery use partitioned tables with _PARTITIONDATE; on Postgres use declarative partitioning by month. A six-month bias window should touch six partitions, not the entire backtest history. If EXPLAIN shows the query scanning the whole table, partition pruning is not firing and the predicates need to be tightened.

If you want to drill SQL questions like forecast diagnostics every day, NAILDD is launching with 500+ SQL problems built around exactly this pattern.

FAQ

What level of forecast bias is acceptable?

For consumer-product analytics, an absolute MPE under two percent is excellent, two to five percent is workable, and anything above five percent is a signal to investigate. For inventory and capacity planning the tolerance is tighter — most ops teams flag anything above one percent because the downstream cost of stockouts and overstock dominates the cost of the model fix. The right threshold is set by the business, not by the model: ask what one percent of overforecast costs in dollars, and the threshold writes itself.

Bias is zero but MAE is large — what does that mean?

The model is noisy but not directionally wrong. Predictions overshoot on some days and undershoot on others by roughly equal amounts, and the errors cancel in the mean. This is a different failure mode than systematic bias and calls for different fixes: feature engineering, more training data, or a model with lower variance. Bias-correction layers will not help when the model is unbiased in expectation but high variance in practice.

Is positive bias always bad?

Positive bias means the model is underforecasting. Whether that is bad depends on the asymmetric cost downstream. For inventory, underforecast leads to stockouts, which are usually more painful than overstock — so positive bias is worse than the equivalent negative bias. For capacity, underforecast leads to overload and service degradation, again usually worse than running with spare headroom. For revenue planning, underforecast is conservative and often preferred over a confident overforecast that the board then has to walk back.

How do I fix a biased forecast?

The cheapest fix is a bias-correction layer on top of the existing model: estimate the residual mean on a rolling backtest, subtract it from every new prediction, and monitor whether the corrected residuals stay centered on zero. For drifting biases the right move is retraining with a more recent window, sometimes with concept-drift weighting that downweights older observations. For segment-specific biases the fix is segment-specific features or per-segment correction terms; a single global correction will not solve a problem that lives in the segment breakdown.

Is forecast bias the same as tracking signal?

They are closely related but not identical. Forecast bias is a single number — the average signed error over a window. Tracking signal is bias normalized by mean absolute deviation and monitored over time, typically with control limits at plus and minus four or six. Tracking signal is the operational monitoring metric — it tells the on-call analyst "the bias has crossed the threshold, page the model owner." Forecast bias is the diagnostic metric — it tells the analyst what direction the model is leaning and by how much.