May 18, 2026·10 min read

How to calculate confusion matrix in SQL

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Contents:

Why confusion matrix belongs in your SQL toolbox
The data model
Binary confusion matrix
Derived metrics in one query
Per-segment confusion matrices
Multi-class confusion matrix
Threshold sweep
Common pitfalls
Optimization tips
A Python sanity check
Related reading
FAQ

Why confusion matrix belongs in your SQL toolbox

"Our fraud model has 95 percent accuracy" lands in a Stripe or Airbnb standup, and the senior data scientist asks the same question every time: "what's the confusion matrix?" Accuracy alone flatters every model on imbalanced data — predict "not fraud" on every transaction and you clear 99.7 percent without catching a single bad actor. The four cells (TP, FP, TN, FN) are the smallest faithful summary of a binary classifier, and every other metric you'll defend in a review — precision, recall, F1, specificity — is a ratio of those four numbers.

Most analysts compute the matrix in pandas. Fine for a one-off slide; wrong tool when the model produces millions of predictions a day in Snowflake or Databricks, or when the product team wants a per-segment breakdown next to latency and error rate. In those settings the matrix lives in SQL. This post walks the queries you'll ship: binary matrix, derived metrics in one pass, per-segment slice, multi-class, and threshold sweep. Examples run in Postgres with minor edits for Snowflake, BigQuery, and Redshift.

The data model

A single fact table of predictions joined back to the realized outcome.

predictions (prediction_id, entity_id, predicted_proba, actual_label, prediction_date)

predicted_proba is the model's continuous output. actual_label is the realized binary outcome. prediction_date is when the model scored the example; the label may arrive minutes later for clicks, weeks later for churn or chargebacks.

Binary confusion matrix

For a binary classifier with a 0.5 threshold:

WITH thresholded AS (
    SELECT
        actual_label,
        CASE WHEN predicted_proba >= 0.5 THEN 1 ELSE 0 END AS predicted_label
    FROM predictions
    WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
      AND actual_label IS NOT NULL
)
SELECT
    SUM(CASE WHEN actual_label = 1 AND predicted_label = 1 THEN 1 ELSE 0 END) AS tp,
    SUM(CASE WHEN actual_label = 0 AND predicted_label = 1 THEN 1 ELSE 0 END) AS fp,
    SUM(CASE WHEN actual_label = 0 AND predicted_label = 0 THEN 1 ELSE 0 END) AS tn,
    SUM(CASE WHEN actual_label = 1 AND predicted_label = 0 THEN 1 ELSE 0 END) AS fn
FROM thresholded;

Four aggregates, one pass. The actual_label IS NOT NULL filter is not cosmetic — unresolved labels show up as NULL, and comparing NULL to 0 or 1 silently drops the row, biasing every cell. The 0.5 threshold is a default; if your operating point is "catch 70 percent of fraud," the right threshold is whatever yields that recall on the holdout set.

Derived metrics in one query

Every classification metric is a ratio of the four cells. Compute them in a single pass:

WITH cm AS (
    SELECT
        SUM(CASE WHEN actual_label = 1 AND predicted_label = 1 THEN 1 ELSE 0 END) AS tp,
        SUM(CASE WHEN actual_label = 0 AND predicted_label = 1 THEN 1 ELSE 0 END) AS fp,
        SUM(CASE WHEN actual_label = 0 AND predicted_label = 0 THEN 1 ELSE 0 END) AS tn,
        SUM(CASE WHEN actual_label = 1 AND predicted_label = 0 THEN 1 ELSE 0 END) AS fn
    FROM predictions
    WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
      AND actual_label IS NOT NULL
)
SELECT
    tp, fp, tn, fn,
    (tp + tn)::NUMERIC / NULLIF(tp + fp + tn + fn, 0) AS accuracy,
    tp::NUMERIC       / NULLIF(tp + fp, 0)            AS precision,
    tp::NUMERIC       / NULLIF(tp + fn, 0)            AS recall,
    tn::NUMERIC       / NULLIF(tn + fp, 0)            AS specificity,
    2.0 * tp::NUMERIC / NULLIF(2.0 * tp + fp + fn, 0) AS f1_score,
    (tp + fn)::NUMERIC / NULLIF(tp + fp + tn + fn, 0) AS base_rate
FROM cm;

NULLIF is doing real work — on a quiet hour with a new model you can land on a window with zero positives, and tp + fn = 0 would crash without it. The base_rate column is the trick most analysts skip: a 0.20 precision on a 0.5 percent base rate is forty times better than chance; the same 0.20 on a 30 percent base rate is broken.

Per-segment confusion matrices

A global matrix that hides a failing slice is the most common silent model failure. Group by segment to surface it:

SELECT
    e.segment,
    SUM(CASE WHEN p.actual_label = 1 AND p.predicted_label = 1 THEN 1 ELSE 0 END) AS tp,
    SUM(CASE WHEN p.actual_label = 0 AND p.predicted_label = 1 THEN 1 ELSE 0 END) AS fp,
    SUM(CASE WHEN p.actual_label = 0 AND p.predicted_label = 0 THEN 1 ELSE 0 END) AS tn,
    SUM(CASE WHEN p.actual_label = 1 AND p.predicted_label = 0 THEN 1 ELSE 0 END) AS fn,
    SUM(CASE WHEN p.actual_label = 1 AND p.predicted_label = 1 THEN 1 ELSE 0 END)::NUMERIC
        / NULLIF(SUM(CASE WHEN p.predicted_label = 1 THEN 1 ELSE 0 END), 0) AS precision_by_segment
FROM predictions p
JOIN entities    e USING (entity_id)
WHERE p.prediction_date >= CURRENT_DATE - INTERVAL '30 days'
  AND p.actual_label IS NOT NULL
GROUP BY e.segment
HAVING SUM(CASE WHEN p.actual_label = 1 THEN 1 ELSE 0 END) >= 30
ORDER BY precision_by_segment ASC;

Sort ascending so the worst slice surfaces first. iOS precision 0.92 and Android precision 0.45 on the same model means retraining is overdue. The HAVING ... >= 30 filter is not optional — a segment with three positives is noise, and a 1.00 precision from one lucky flag is the dashboard that gets a model promoted on Friday and rolled back on Monday.

Multi-class confusion matrix

For an N-class classifier (sentiment, intent, category), the matrix is N×N. Build the long form with one pair of aggregations:

SELECT
    actual_label    AS actual,
    predicted_label AS predicted,
    COUNT(*)        AS n
FROM predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
  AND actual_label IS NOT NULL
GROUP BY actual_label, predicted_label
ORDER BY actual_label, predicted_label;

The diagonal (actual = predicted) is correct classifications; off-diagonal cells are the confusions that matter. In a five-class intent model, actual = "billing", predicted = "support" tells you the model routes billing tickets to general support — a fixable failure mode. For per-class precision:

WITH per_class AS (
    SELECT
        predicted_label AS class,
        SUM(CASE WHEN actual_label = predicted_label THEN 1 ELSE 0 END) AS tp,
        SUM(CASE WHEN actual_label <> predicted_label THEN 1 ELSE 0 END) AS fp
    FROM predictions
    WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
      AND actual_label IS NOT NULL
    GROUP BY predicted_label
)
SELECT
    class,
    tp,
    fp,
    tp::NUMERIC / NULLIF(tp + fp, 0) AS precision
FROM per_class
ORDER BY precision ASC;

Macro precision is the simple mean of per-class precision; micro precision is computed from total TP and total FP. They can diverge by ten points on imbalanced multi-class data — the right choice depends on whether business value is per-class or proportional to volume.

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Join the waitlist

Threshold sweep

The 0.5 threshold is rarely optimal. Sweep thresholds to find the one that matches business cost:

WITH thresholds AS (
    SELECT generate_series(0.05, 0.95, 0.05)::NUMERIC AS thr
)
SELECT
    t.thr,
    SUM(CASE WHEN p.actual_label = 1 AND p.predicted_proba >= t.thr THEN 1 ELSE 0 END) AS tp,
    SUM(CASE WHEN p.actual_label = 0 AND p.predicted_proba >= t.thr THEN 1 ELSE 0 END) AS fp,
    SUM(CASE WHEN p.actual_label = 1 AND p.predicted_proba <  t.thr THEN 1 ELSE 0 END) AS fn,
    SUM(CASE WHEN p.actual_label = 1 AND p.predicted_proba >= t.thr THEN 1 ELSE 0 END)::NUMERIC
        / NULLIF(SUM(CASE WHEN p.predicted_proba >= t.thr THEN 1 ELSE 0 END), 0) AS precision_at,
    SUM(CASE WHEN p.actual_label = 1 AND p.predicted_proba >= t.thr THEN 1 ELSE 0 END)::NUMERIC
        / NULLIF(SUM(CASE WHEN p.actual_label = 1 THEN 1 ELSE 0 END), 0)         AS recall_at
FROM predictions p
CROSS JOIN thresholds t
WHERE p.prediction_date >= CURRENT_DATE - INTERVAL '30 days'
  AND p.actual_label IS NOT NULL
GROUP BY t.thr
ORDER BY t.thr;

For a spam filter, false positives are catastrophic — a real email in spam erodes trust. Push the threshold up until precision is above 0.99. For fraud, false negatives are catastrophic — a missed chargeback costs the merchant. Push the threshold down until recall covers the agreed loss rate.

Common pitfalls

The first trap is leaving NULL labels in the aggregation. A NULL actual_label compared with = 1 evaluates to UNKNOWN, not TRUE, and the row silently falls out of every CASE WHEN. Cells still sum to a plausible number, just the wrong one. Filter actual_label IS NOT NULL and document the label-freshness lag so the next analyst doesn't run the matrix on a half-resolved window.

The second trap is computing the matrix on probabilities without thresholding. The four-cell matrix is defined on discrete labels; actual_label = predicted_proba compares an integer to a float and almost never returns TRUE. Threshold first, or use a probability-aware metric like Brier score or AUC.

The third trap is ignoring class imbalance when reading accuracy. On a 0.3 percent positive base rate, predicting all-negative yields 99.7 percent accuracy and a matrix that's a vertical strip — the model is useless. Report precision, recall, and base rate together; never accuracy alone on imbalanced data.

The fourth trap is averaging per-class metrics without thinking about which average. Macro precision treats every class equally; weighted precision averages by support; micro precision is computed from total TP and total FP. The three can diverge by ten points on imbalanced data — the interview question is whether you know they differ.

Optimization tips

Partition the predictions table by prediction_date. Every confusion matrix query is range-bounded by time, so partition pruning collapses a "last 30 days" scan to one or two daily partitions when the filter is on the partition key.

Materialize per-segment per-day aggregates. Store tp, fp, tn, fn, and n per (segment, date). Every dashboard query then reduces to a SUM across rows and a divide — sixteen weeks of metrics goes from a multi-minute scan to a sub-second roll-up. For threshold sweeps, batch the threshold list to ten or twenty values, not fifty; the precision-recall curve is smooth enough that twenty points is plenty for a slide.

A Python sanity check

When the SQL matrix looks off, recompute in scikit-learn and compare. The two should agree exactly on the same threshold and window.

from sklearn.metrics import confusion_matrix, classification_report

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(classification_report(actual, predicted))

If cells differ by even one, the cause is usually NULL handling — scikit-learn raises on NULL by default, SQL silently drops. The second most likely cause is a threshold mismatch — sklearn defaults to argmax for predict(), but the SQL applied a hard 0.5 cutoff. Reconcile on the same threshold and the same non-null subset.

How to calculate AUC ROC in SQL — when threshold-free ranking quality is the question.
How to calculate Brier score in SQL — calibration measurement the confusion matrix cannot give you.
SQL window functions interview questions — RANK, LAG, and the patterns underneath threshold sweep queries.

If you want to drill SQL like this every day, NAILDD is launching with hundreds of analytics and ML-flavored problems.

FAQ

What threshold should I use to build a confusion matrix?

Start with 0.5 for a sanity check; ship the threshold that matches business cost. A spam filter where false positives erode trust runs at 0.9 or higher; a fraud detector where false negatives cost real dollars runs at 0.2 or lower. The right threshold comes from the precision-recall curve and the cost ratio of FP to FN, not from convention.

How do I read a multi-class confusion matrix with N classes?

The diagonal is correct classifications. Off-diagonal cells are the confusions that matter. Confusions tend to cluster among semantically similar classes — intents sharing trigger words, product categories overlapping in attributes. The fix is usually a feature engineering pass on the confused pair, not a global hyperparameter tweak.

What's the difference between per-class precision and macro precision?

Per-class precision is computed independently for each class — N numbers for N classes. Macro precision is the unweighted mean. Weighted precision is the mean weighted by class support. The three diverge on imbalanced data: per-class for diagnostics, macro when every class matters equally, weighted when business value is proportional to support.

Why is F1 = 2 × precision × recall / (precision + recall)?

F1 is the harmonic mean of precision and recall. The harmonic mean penalizes the smaller value more than the arithmetic mean would — a model with precision 1.0 and recall 0.01 has an arithmetic mean of 0.505 but an F1 of 0.02. The model that catches one fraud out of a thousand is not "good on average," it's broken on recall, and the metric should say so.

When is accuracy the wrong metric to report?

On any dataset with class imbalance — fraud, churn, click-through, medical screening. A 99 percent positive or 1 percent positive class collapses accuracy to "predict the majority." Report the four-cell matrix and the base rate; let the consumer compute the ratios. Accuracy is only honest when classes are roughly balanced and the costs of FP and FN are equal.

Can I compute the confusion matrix on continuous probabilities directly?

No — the four-cell structure is defined on discrete labels. If you have only probabilities, threshold first or use a probability-aware metric. The closest probabilistic analogs are the Brier score (mean squared error of probability versus outcome) and AUC ROC (threshold-free ranking quality). For calibration, a reliability diagram with binned probabilities is the right tool.