How to calculate Brier score in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

What Brier score is and why it matters

Your PM at Stripe slides into Slack on a Monday morning asking whether the fraud probability model is "actually calibrated." Finance wants to know if a 0.8 score really corresponds to an 80% chargeback rate. Accuracy will not answer that, and AUC tells you whether the ranking is good but not whether the probabilities are honest. This is where Brier score earns its keep.

Brier score is the mean squared error between predicted probabilities and the actual binary outcome. A model that always outputs 0.5 on a balanced dataset gets a Brier of 0.25. A perfect oracle that outputs 1.0 on every positive and 0.0 on every negative gets a Brier of 0.0. Everything else lives between those endpoints, and lower is better.

Analysts at Meta, Airbnb, and DoorDash reach for Brier when calibration matters as much as ranking. Log loss explodes when you predict 0.01 and the event happens, which is sometimes useful and sometimes a numerical headache. Brier squares the residual, so a 0.1 miss costs 0.01 and a 0.5 miss costs 0.25. It is the metric you want when the downstream pipeline multiplies probabilities by money.

The Brier score formula

The binary Brier score is the average squared error between the predicted probability and the binary label:

brier = (1/N) * sum_i (p_i - y_i)^2

y_i is the actual label (0 or 1), and p_i is the predicted probability of the positive class, bounded in [0, 1]. A naive constant predictor that always returns the base rate scores roughly p * (1 - p), so on a balanced dataset that baseline lands at 0.25. Anything below 0.2 in a real product setting usually signals a competent model.

For multi-class problems with K mutually exclusive classes, the standard generalization sums squared errors across classes:

brier_multi = (1/N) * sum_i sum_k (p_ik - y_ik)^2

y_ik is 1 if class k is the true class for observation i and 0 otherwise. In SQL you materialize the one-hot representation with a JOIN against the class list, then average the squared deltas.

Brier score in SQL

The simplest case in a transactional warehouse like Snowflake, BigQuery, or Postgres is a table with one row per scored event, the predicted probability, and the realized binary outcome. Here is the canonical query for a 30-day rolling window:

SELECT
    AVG(POWER(predicted_proba - actual_label, 2)) AS brier_score,
    COUNT(*) AS n
FROM model_predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days';

A few details matter. The predicted_proba column should be a true probability in [0, 1]; if your model emits logits, apply a sigmoid before squaring. Treat NULLs explicitly because POWER(NULL - 1, 2) returns NULL and AVG skips NULLs silently. For late-arriving labels, decide whether to exclude them or to wait for a settlement window before scoring.

You will often want a per-segment Brier — by country, channel, or model version:

SELECT
    model_version,
    country,
    AVG(POWER(predicted_proba - actual_label, 2)) AS brier_score,
    COUNT(*) AS n,
    AVG(actual_label) AS base_rate
FROM model_predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY model_version, country
HAVING COUNT(*) >= 500
ORDER BY model_version, country;

The HAVING COUNT(*) >= 500 clause is a guardrail. With small samples the variance of Brier is high, and a few unlucky positives can make a healthy model look bad. At Uber-scale logging this rarely matters, but for a niche partner program with a few hundred events per month, you want the minimum-sample gate.

Brier skill score against a baseline

Raw Brier numbers depend on the base rate. A model that scores 0.04 on a 1%-positive dataset is not automatically better than one scoring 0.20 on a 50%-positive dataset — the easy problem has lower variance. Brier skill score (BSS) normalizes against a "predict the base rate for everyone" baseline:

BSS = 1 - brier_model / brier_baseline

BSS of 0 means your model matches the constant baseline. Positive values mean the model adds skill; negative values mean it is worse than not modeling at all, which happens more often than people admit with miscalibrated tree ensembles on a stale window.

The baseline Brier is p_base * (1 - p_base) where p_base is the average label, but it is cleaner to compute it explicitly:

WITH base AS (
    SELECT
        AVG(actual_label) AS p_base,
        COUNT(*) AS n
    FROM model_predictions
    WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
),
metrics AS (
    SELECT
        AVG(POWER(mp.predicted_proba - mp.actual_label, 2)) AS brier_model,
        AVG(POWER(b.p_base - mp.actual_label, 2)) AS brier_baseline
    FROM model_predictions mp
    CROSS JOIN base b
    WHERE mp.prediction_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
    brier_model,
    brier_baseline,
    1 - brier_model / NULLIF(brier_baseline, 0) AS bss
FROM metrics;

Rule of thumb: BSS above 0.05 is meaningful for noisy domains like ad click prediction, BSS above 0.2 is solid for well-instrumented churn models, and BSS above 0.5 should make you suspect data leakage. Always verify on a held-out window the model never touched.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Reliability diagrams and calibration

Brier decomposes into calibration, resolution, and uncertainty. You do not need the full Murphy decomposition for day-to-day work, but you do need to know whether predictions are well-calibrated across the score range. A model can have a respectable Brier and still be systematically overconfident in the 0.7-0.9 bucket, which is where business decisions get made.

The standard tool is a reliability diagram: bin predictions into deciles, compute mean predicted probability and actual observed rate inside each bin, then table the gap:

WITH bins AS (
    SELECT
        WIDTH_BUCKET(predicted_proba, 0, 1, 10) AS bin,
        AVG(predicted_proba) AS mean_pred,
        AVG(actual_label) AS actual_rate,
        COUNT(*) AS n
    FROM model_predictions
    WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY WIDTH_BUCKET(predicted_proba, 0, 1, 10)
)
SELECT
    bin,
    n,
    mean_pred,
    actual_rate,
    actual_rate - mean_pred AS calibration_gap
FROM bins
ORDER BY bin;

A well-calibrated model produces mean_pred and actual_rate that track within a percentage point or two in each bin, with the gap oscillating around zero. A positive gap in the high-probability bins means underconfidence — saying 0.7 when reality is 0.85, which costs nothing in ranking but everything in expected-value math. A negative gap is overconfidence and is the more dangerous failure mode because downstream automations treat 0.95 as near-certainty.

For platforms without WIDTH_BUCKET, use FLOOR(predicted_proba * 10). Clamp with LEAST(predicted_proba, 0.9999) so a value of exactly 1.0 lands in bin 10 rather than spilling into bin 11.

Common pitfalls

The first pitfall is scoring Brier on hard class predictions instead of probabilities. If your serving layer rounds the model output to 0 or 1 before logging, you have recreated accuracy and lost everything that makes Brier useful. Log the raw probability, not the threshold-applied class. This sounds obvious until you discover someone added a ROUND(probability) to the warehouse ingestion job to "save space" three sprints ago.

The second pitfall is comparing Brier across datasets with different base rates. A churn model scoring 0.04 on a 5%-churn SaaS product is not automatically better than one scoring 0.12 on a 30%-churn marketplace. The easy problem has a smaller variance ceiling, and the constant baseline already gets to 0.0475 versus 0.21. Always compare BSS, not raw Brier, when crossing population boundaries.

The third pitfall is mixing up Brier and log loss when reading model selection reports. Log loss penalizes confidently-wrong predictions much more harshly because the loss approaches infinity as the predicted probability approaches zero on a positive example. Brier squares a residual bounded at one, so the worst possible miss costs you 1.0. Use log loss when you need to drive overconfidence out of an automated payout system, and Brier when you want a smooth bounded metric for cohort analysis.

The fourth pitfall is forgetting to divide by N when implementing the decomposition manually. Classical Brier is a mean, not a sum. Aggregating with SUM(POWER(p - y, 2)) will produce a number that grows with traffic and looks like a regression for no reason. Always confirm you are looking at AVG(POWER(...)) before exposing the dashboard.

The fifth pitfall is naively extending the binary formula to multi-class without one-hot encoding. The multi-class Brier needs one squared term per class, summed before averaging. Keep a class dimension table, CROSS JOIN it with predictions, then express y_ik as CASE WHEN actual_class = class_id THEN 1 ELSE 0 END. Computing Brier only on the predicted class loses every other class contribution and silently inflates the score.

Optimization tips

Brier is cheap per row — one subtraction and one squaring. The pain is upstream: scanning the full model_predictions table when you need 30 days, or joining a labels table not clustered on prediction_date. Partition or cluster on the date column your dashboard filters on, and the query runs in seconds on Snowflake or BigQuery.

For high-throughput models like ad ranking or real-time fraud, pre-aggregate Brier into a daily summary table with one row per (model_version, segment, date). Store sum_sq_error, sum_y, and n. Dashboards reconstruct Brier as sum_sq_error / n and the base rate as sum_y / n without scanning raw events. The same summary powers BSS because brier_baseline = p_base * (1 - p_base).

If you run calibration diagnostics often, materialize the binned reliability table as a daily view. The bin assignment never changes for a given probability, so compute it once at write time. This pattern works well with Databricks Delta Live Tables or Snowflake dynamic tables, where the materialization is incremental.

If your warehouse rejects POWER(x, 2) due to older engine quirks, use (x) * (x) directly. The multiplication form sidesteps cases where POWER casts to DOUBLE PRECISION and loses precision against NUMERIC arithmetic.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

Brier score vs log loss — which one should I report?

Both measure probability quality but differently. Brier is mean squared error between probability and label, bounded in [0, 1], smooth across the range. Log loss is cross-entropy, unbounded on the upside, and penalizes confidently-wrong predictions exponentially. Use Brier when you want a number that travels well across teams and pairs naturally with calibration plots. Use log loss when training the model or when the downstream cost of confidently-wrong predictions is severe, like a credit system paying out millions on a 0.99 score. Many teams report both and treat agreement between them as a sanity check.

What counts as a "good" Brier score?

It depends on the base rate. A naive constant baseline scores p * (1 - p), so on a 5%-positive churn problem the baseline is 0.0475, while on a 50%-positive task it is 0.25. Always compare a candidate model against this baseline using BSS. BSS above 0.05 is meaningful in noisy domains, above 0.2 is solid for clean prediction tasks, and above 0.5 should trigger a leakage investigation. Reporting raw Brier without the baseline is like reporting accuracy without the class balance.

Can I use Brier score for regression problems?

No. Brier is defined for probability outputs against binary or categorical labels. If your target is continuous — order revenue, time until churn, latency in milliseconds — use MSE, RMSE, or MAE instead. Brier is exactly MSE applied to probabilities, so the formulas look identical, but the interpretation only makes sense when both prediction and label live in [0, 1] and the label is an event indicator.

Is Brier score the same thing as calibration?

Not quite, though calibration is one of its three components. The Murphy decomposition splits Brier into calibration (how close mean predicted probabilities are to observed frequencies in each bin), resolution (how much predictions vary across bins), and uncertainty (the irreducible variance of the label). Two models can have the same Brier with very different decompositions — one well-calibrated but low-resolution, the other high-resolution but miscalibrated. Pair every Brier number with a reliability diagram.

Should I compute Brier on validation or test data?

Test, almost always. Validation Brier reflects data the modeling pipeline has touched during threshold and hyperparameter selection. Reporting validation Brier to stakeholders as "the model's performance" is the classic trap that gets you a surprised look six weeks after launch. Compute Brier on a held-out test window the model has never seen, and rotate the test window monthly so calibration drift surfaces in dashboards rather than one-off slides.

How do I extend Brier to multi-class models?

The multi-class Brier sums squared errors across all classes, then averages over observations. In SQL you typically have a predictions table with one row per (observation_id, class_id) and the predicted probability for that class. Join to a labels table, materialize the one-hot indicator as CASE WHEN actual_class = class_id THEN 1 ELSE 0 END, and average POWER(predicted_proba - indicator, 2) * K for the sum-over-classes convention. Pick a convention, document it next to the query, and never switch mid-quarter.