How to calculate confusion matrix in SQL
Contents:
Why confusion matrix belongs in your SQL toolbox
"Our fraud model has 95 percent accuracy" lands in a Stripe or Airbnb standup, and the senior data scientist asks the same question every time: "what's the confusion matrix?" Accuracy alone flatters every model on imbalanced data — predict "not fraud" on every transaction and you clear 99.7 percent without catching a single bad actor. The four cells (TP, FP, TN, FN) are the smallest faithful summary of a binary classifier, and every other metric you'll defend in a review — precision, recall, F1, specificity — is a ratio of those four numbers.
Most analysts compute the matrix in pandas. Fine for a one-off slide; wrong tool when the model produces millions of predictions a day in Snowflake or Databricks, or when the product team wants a per-segment breakdown next to latency and error rate. In those settings the matrix lives in SQL. This post walks the queries you'll ship: binary matrix, derived metrics in one pass, per-segment slice, multi-class, and threshold sweep. Examples run in Postgres with minor edits for Snowflake, BigQuery, and Redshift.
The data model
A single fact table of predictions joined back to the realized outcome.
predictions (prediction_id, entity_id, predicted_proba, actual_label, prediction_date)predicted_proba is the model's continuous output. actual_label is the realized binary outcome. prediction_date is when the model scored the example; the label may arrive minutes later for clicks, weeks later for churn or chargebacks.
Binary confusion matrix
For a binary classifier with a 0.5 threshold:
WITH thresholded AS (
SELECT
actual_label,
CASE WHEN predicted_proba >= 0.5 THEN 1 ELSE 0 END AS predicted_label
FROM predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
AND actual_label IS NOT NULL
)
SELECT
SUM(CASE WHEN actual_label = 1 AND predicted_label = 1 THEN 1 ELSE 0 END) AS tp,
SUM(CASE WHEN actual_label = 0 AND predicted_label = 1 THEN 1 ELSE 0 END) AS fp,
SUM(CASE WHEN actual_label = 0 AND predicted_label = 0 THEN 1 ELSE 0 END) AS tn,
SUM(CASE WHEN actual_label = 1 AND predicted_label = 0 THEN 1 ELSE 0 END) AS fn
FROM thresholded;Four aggregates, one pass. The actual_label IS NOT NULL filter is not cosmetic — unresolved labels show up as NULL, and comparing NULL to 0 or 1 silently drops the row, biasing every cell. The 0.5 threshold is a default; if your operating point is "catch 70 percent of fraud," the right threshold is whatever yields that recall on the holdout set.
Derived metrics in one query
Every classification metric is a ratio of the four cells. Compute them in a single pass:
WITH cm AS (
SELECT
SUM(CASE WHEN actual_label = 1 AND predicted_label = 1 THEN 1 ELSE 0 END) AS tp,
SUM(CASE WHEN actual_label = 0 AND predicted_label = 1 THEN 1 ELSE 0 END) AS fp,
SUM(CASE WHEN actual_label = 0 AND predicted_label = 0 THEN 1 ELSE 0 END) AS tn,
SUM(CASE WHEN actual_label = 1 AND predicted_label = 0 THEN 1 ELSE 0 END) AS fn
FROM predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
AND actual_label IS NOT NULL
)
SELECT
tp, fp, tn, fn,
(tp + tn)::NUMERIC / NULLIF(tp + fp + tn + fn, 0) AS accuracy,
tp::NUMERIC / NULLIF(tp + fp, 0) AS precision,
tp::NUMERIC / NULLIF(tp + fn, 0) AS recall,
tn::NUMERIC / NULLIF(tn + fp, 0) AS specificity,
2.0 * tp::NUMERIC / NULLIF(2.0 * tp + fp + fn, 0) AS f1_score,
(tp + fn)::NUMERIC / NULLIF(tp + fp + tn + fn, 0) AS base_rate
FROM cm;NULLIF is doing real work — on a quiet hour with a new model you can land on a window with zero positives, and tp + fn = 0 would crash without it. The base_rate column is the trick most analysts skip: a 0.20 precision on a 0.5 percent base rate is forty times better than chance; the same 0.20 on a 30 percent base rate is broken.
Per-segment confusion matrices
A global matrix that hides a failing slice is the most common silent model failure. Group by segment to surface it:
SELECT
e.segment,
SUM(CASE WHEN p.actual_label = 1 AND p.predicted_label = 1 THEN 1 ELSE 0 END) AS tp,
SUM(CASE WHEN p.actual_label = 0 AND p.predicted_label = 1 THEN 1 ELSE 0 END) AS fp,
SUM(CASE WHEN p.actual_label = 0 AND p.predicted_label = 0 THEN 1 ELSE 0 END) AS tn,
SUM(CASE WHEN p.actual_label = 1 AND p.predicted_label = 0 THEN 1 ELSE 0 END) AS fn,
SUM(CASE WHEN p.actual_label = 1 AND p.predicted_label = 1 THEN 1 ELSE 0 END)::NUMERIC
/ NULLIF(SUM(CASE WHEN p.predicted_label = 1 THEN 1 ELSE 0 END), 0) AS precision_by_segment
FROM predictions p
JOIN entities e USING (entity_id)
WHERE p.prediction_date >= CURRENT_DATE - INTERVAL '30 days'
AND p.actual_label IS NOT NULL
GROUP BY e.segment
HAVING SUM(CASE WHEN p.actual_label = 1 THEN 1 ELSE 0 END) >= 30
ORDER BY precision_by_segment ASC;Sort ascending so the worst slice surfaces first. iOS precision 0.92 and Android precision 0.45 on the same model means retraining is overdue. The HAVING ... >= 30 filter is not optional — a segment with three positives is noise, and a 1.00 precision from one lucky flag is the dashboard that gets a model promoted on Friday and rolled back on Monday.
Multi-class confusion matrix
For an N-class classifier (sentiment, intent, category), the matrix is N×N. Build the long form with one pair of aggregations:
SELECT
actual_label AS actual,
predicted_label AS predicted,
COUNT(*) AS n
FROM predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
AND actual_label IS NOT NULL
GROUP BY actual_label, predicted_label
ORDER BY actual_label, predicted_label;The diagonal (actual = predicted) is correct classifications; off-diagonal cells are the confusions that matter. In a five-class intent model, actual = "billing", predicted = "support" tells you the model routes billing tickets to general support — a fixable failure mode. For per-class precision:
WITH per_class AS (
SELECT
predicted_label AS class,
SUM(CASE WHEN actual_label = predicted_label THEN 1 ELSE 0 END) AS tp,
SUM(CASE WHEN actual_label <> predicted_label THEN 1 ELSE 0 END) AS fp
FROM predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
AND actual_label IS NOT NULL
GROUP BY predicted_label
)
SELECT
class,
tp,
fp,
tp::NUMERIC / NULLIF(tp + fp, 0) AS precision
FROM per_class
ORDER BY precision ASC;Macro precision is the simple mean of per-class precision; micro precision is computed from total TP and total FP. They can diverge by ten points on imbalanced multi-class data — the right choice depends on whether business value is per-class or proportional to volume.
Threshold sweep
The 0.5 threshold is rarely optimal. Sweep thresholds to find the one that matches business cost:
WITH thresholds AS (
SELECT generate_series(0.05, 0.95, 0.05)::NUMERIC AS thr
)
SELECT
t.thr,
SUM(CASE WHEN p.actual_label = 1 AND p.predicted_proba >= t.thr THEN 1 ELSE 0 END) AS tp,
SUM(CASE WHEN p.actual_label = 0 AND p.predicted_proba >= t.thr THEN 1 ELSE 0 END) AS fp,
SUM(CASE WHEN p.actual_label = 1 AND p.predicted_proba < t.thr THEN 1 ELSE 0 END) AS fn,
SUM(CASE WHEN p.actual_label = 1 AND p.predicted_proba >= t.thr THEN 1 ELSE 0 END)::NUMERIC
/ NULLIF(SUM(CASE WHEN p.predicted_proba >= t.thr THEN 1 ELSE 0 END), 0) AS precision_at,
SUM(CASE WHEN p.actual_label = 1 AND p.predicted_proba >= t.thr THEN 1 ELSE 0 END)::NUMERIC
/ NULLIF(SUM(CASE WHEN p.actual_label = 1 THEN 1 ELSE 0 END), 0) AS recall_at
FROM predictions p
CROSS JOIN thresholds t
WHERE p.prediction_date >= CURRENT_DATE - INTERVAL '30 days'
AND p.actual_label IS NOT NULL
GROUP BY t.thr
ORDER BY t.thr;For a spam filter, false positives are catastrophic — a real email in spam erodes trust. Push the threshold up until precision is above 0.99. For fraud, false negatives are catastrophic — a missed chargeback costs the merchant. Push the threshold down until recall covers the agreed loss rate.
Common pitfalls
The first trap is leaving NULL labels in the aggregation. A NULL actual_label compared with = 1 evaluates to UNKNOWN, not TRUE, and the row silently falls out of every CASE WHEN. Cells still sum to a plausible number, just the wrong one. Filter actual_label IS NOT NULL and document the label-freshness lag so the next analyst doesn't run the matrix on a half-resolved window.
The second trap is computing the matrix on probabilities without thresholding. The four-cell matrix is defined on discrete labels; actual_label = predicted_proba compares an integer to a float and almost never returns TRUE. Threshold first, or use a probability-aware metric like Brier score or AUC.
The third trap is ignoring class imbalance when reading accuracy. On a 0.3 percent positive base rate, predicting all-negative yields 99.7 percent accuracy and a matrix that's a vertical strip — the model is useless. Report precision, recall, and base rate together; never accuracy alone on imbalanced data.
The fourth trap is averaging per-class metrics without thinking about which average. Macro precision treats every class equally; weighted precision averages by support; micro precision is computed from total TP and total FP. The three can diverge by ten points on imbalanced data — the interview question is whether you know they differ.
Optimization tips
Partition the predictions table by prediction_date. Every confusion matrix query is range-bounded by time, so partition pruning collapses a "last 30 days" scan to one or two daily partitions when the filter is on the partition key.
Materialize per-segment per-day aggregates. Store tp, fp, tn, fn, and n per (segment, date). Every dashboard query then reduces to a SUM across rows and a divide — sixteen weeks of metrics goes from a multi-minute scan to a sub-second roll-up. For threshold sweeps, batch the threshold list to ten or twenty values, not fifty; the precision-recall curve is smooth enough that twenty points is plenty for a slide.
A Python sanity check
When the SQL matrix looks off, recompute in scikit-learn and compare. The two should agree exactly on the same threshold and window.
from sklearn.metrics import confusion_matrix, classification_report
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(classification_report(actual, predicted))If cells differ by even one, the cause is usually NULL handling — scikit-learn raises on NULL by default, SQL silently drops. The second most likely cause is a threshold mismatch — sklearn defaults to argmax for predict(), but the SQL applied a hard 0.5 cutoff. Reconcile on the same threshold and the same non-null subset.
Related reading
- How to calculate AUC ROC in SQL — when threshold-free ranking quality is the question.
- How to calculate Brier score in SQL — calibration measurement the confusion matrix cannot give you.
- SQL window functions interview questions —
RANK,LAG, and the patterns underneath threshold sweep queries.
If you want to drill SQL like this every day, NAILDD is launching with hundreds of analytics and ML-flavored problems.
FAQ
What threshold should I use to build a confusion matrix?
Start with 0.5 for a sanity check; ship the threshold that matches business cost. A spam filter where false positives erode trust runs at 0.9 or higher; a fraud detector where false negatives cost real dollars runs at 0.2 or lower. The right threshold comes from the precision-recall curve and the cost ratio of FP to FN, not from convention.
How do I read a multi-class confusion matrix with N classes?
The diagonal is correct classifications. Off-diagonal cells are the confusions that matter. Confusions tend to cluster among semantically similar classes — intents sharing trigger words, product categories overlapping in attributes. The fix is usually a feature engineering pass on the confused pair, not a global hyperparameter tweak.
What's the difference between per-class precision and macro precision?
Per-class precision is computed independently for each class — N numbers for N classes. Macro precision is the unweighted mean. Weighted precision is the mean weighted by class support. The three diverge on imbalanced data: per-class for diagnostics, macro when every class matters equally, weighted when business value is proportional to support.
Why is F1 = 2 × precision × recall / (precision + recall)?
F1 is the harmonic mean of precision and recall. The harmonic mean penalizes the smaller value more than the arithmetic mean would — a model with precision 1.0 and recall 0.01 has an arithmetic mean of 0.505 but an F1 of 0.02. The model that catches one fraud out of a thousand is not "good on average," it's broken on recall, and the metric should say so.
When is accuracy the wrong metric to report?
On any dataset with class imbalance — fraud, churn, click-through, medical screening. A 99 percent positive or 1 percent positive class collapses accuracy to "predict the majority." Report the four-cell matrix and the base rate; let the consumer compute the ratios. Accuracy is only honest when classes are roughly balanced and the costs of FP and FN are equal.
Can I compute the confusion matrix on continuous probabilities directly?
No — the four-cell structure is defined on discrete labels. If you have only probabilities, threshold first or use a probability-aware metric. The closest probabilistic analogs are the Brier score (mean squared error of probability versus outcome) and AUC ROC (threshold-free ranking quality). For calibration, a reliability diagram with binned probabilities is the right tool.