How to calculate F1 score in SQL
Contents:
Why F1 score matters
Picture the scenario every applied ML interviewer at Meta, Stripe, or Airbnb loves to spring on a candidate: your fraud model flags 0.4% of transactions, the dashboard shows 99.6% accuracy, and the risk team keeps complaining. F1 score is the harmonic mean of precision and recall — the single most informative number for a classifier when the positive class is rare.
Accuracy is dishonest under imbalance. A model that predicts "not fraud" for every transaction hits 99.6% accuracy on a 0.4% fraud dataset and tells you nothing about catching real fraud. Precision tells you what fraction of your alerts were real. Recall tells you what fraction of real positives you caught. F1 forces you to be good at both — the harmonic mean punishes lopsided scores so hard that 0.99 precision and 0.01 recall land at F1 = 0.02, not the 0.5 an arithmetic mean would give.
Interviewers ask you to write F1 in SQL rather than call a library because they want to see you handle the confusion matrix at the row level. Every analyst can paste a one-liner from sklearn. Far fewer can structure a CTE that derives TP, FP, FN, and TN over a 30-day prediction log without silently dividing by zero. This guide walks the SQL pattern that holds up in production, the multi-class extension, the threshold sweep, and the five mistakes that show up in every code review.
The SQL formula
Lock in the math first. Precision is TP / (TP + FP). Recall is TP / (TP + FN). F1 is the harmonic mean, 2 * P * R / (P + R), which simplifies to 2 * TP / (2 * TP + FP + FN) — the compact form you want in SQL because it skips the intermediate division and only nulls out when there are no positives in the window.
For weekly retrains, a rolling 30-day window is the canonical default. For daily-scored models with a short evaluation lag, 7 to 14 days is typical. Pick the window, write the confusion matrix once, and reuse it everywhere downstream.
WITH confusion AS (
SELECT
COUNT(*) FILTER (WHERE predicted = 1 AND actual = 1) AS tp,
COUNT(*) FILTER (WHERE predicted = 1 AND actual = 0) AS fp,
COUNT(*) FILTER (WHERE predicted = 0 AND actual = 1) AS fn,
COUNT(*) FILTER (WHERE predicted = 0 AND actual = 0) AS tn
FROM predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
tp,
fp,
fn,
tn,
tp::NUMERIC / NULLIF(tp + fp, 0) AS precision,
tp::NUMERIC / NULLIF(tp + fn, 0) AS recall,
2.0 * tp / NULLIF(2 * tp + fp + fn, 0) AS f1
FROM confusion;Three details interviewers check. First, the ::NUMERIC cast — without it Postgres returns integer division and zero whenever TP is smaller than the denominator. Second, NULLIF on every division so a degenerate TP + FP = 0 returns NULL instead of erroring. Third, the compact F1 form is what production dashboards at Snowflake and Databricks use because it is one less subexpression to inspect when something looks wrong.
Macro vs micro for multi-class
Macro F1 averages per-class F1 with equal weight. Micro F1 sums TP, FP, FN across all classes, then computes one global F1. Macro is sensitive to rare classes — if the model is great at the dominant class and terrible at a 2% class, macro F1 drops hard. Micro is dominated by whichever class has the most observations, so the majority class drowns out the minority signal.
The pattern in SQL is a CROSS JOIN between predictions and the set of labels, then per-class confusion counts, then a final aggregation.
WITH classes AS (
SELECT DISTINCT actual AS class_name
FROM predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
),
per_class AS (
SELECT
c.class_name,
COUNT(*) FILTER (WHERE p.predicted = c.class_name AND p.actual = c.class_name) AS tp,
COUNT(*) FILTER (WHERE p.predicted = c.class_name AND p.actual <> c.class_name) AS fp,
COUNT(*) FILTER (WHERE p.predicted <> c.class_name AND p.actual = c.class_name) AS fn
FROM predictions p
CROSS JOIN classes c
WHERE p.prediction_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.class_name
),
per_class_f1 AS (
SELECT
class_name,
2.0 * tp / NULLIF(2 * tp + fp + fn, 0) AS f1
FROM per_class
)
SELECT
AVG(f1) AS macro_f1,
2.0 * SUM(tp) / NULLIF(SUM(2 * tp + fp + fn), 0) AS micro_f1
FROM per_class_f1
JOIN per_class USING (class_name);Always report both. If macro F1 is materially lower than micro F1, the model is failing on a minority class — the conversation needs to shift to class weighting, resampling, or a different loss. If they agree, the model is fairly uniform and you can focus on the global threshold.
Threshold sweep for probabilistic models
Most modern classifiers output a calibrated probability between 0 and 1, not a hard prediction. The 0.5 default cutoff is a convention, not an optimum. Sweep across candidate cutoffs, compute the confusion matrix at each, and pick the maximum F1.
WITH thresholds AS (
SELECT unnest(ARRAY[0.05, 0.1, 0.2, 0.3, 0.4,
0.5, 0.6, 0.7, 0.8, 0.9, 0.95]) AS threshold
),
f1_per_threshold AS (
SELECT
t.threshold,
COUNT(*) FILTER (WHERE p.score >= t.threshold AND p.actual = 1) AS tp,
COUNT(*) FILTER (WHERE p.score >= t.threshold AND p.actual = 0) AS fp,
COUNT(*) FILTER (WHERE p.score < t.threshold AND p.actual = 1) AS fn
FROM thresholds t
CROSS JOIN predictions p
WHERE p.prediction_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY t.threshold
)
SELECT
threshold,
tp,
fp,
fn,
tp::NUMERIC / NULLIF(tp + fp, 0) AS precision,
tp::NUMERIC / NULLIF(tp + fn, 0) AS recall,
2.0 * tp / NULLIF(2 * tp + fp + fn, 0) AS f1
FROM f1_per_threshold
ORDER BY threshold;Inspect the table by hand. You usually see a smooth curve peaking between 0.3 and 0.6 — a peak below 0.3 means the model is conservative and you should widen the alert net, a peak above 0.7 means it is aggressive and you should tighten. Lock the chosen threshold in config and recompute on every retrain.
F-beta variations
F1 treats precision and recall as equally important, which is rarely true in production. F-beta generalizes: F-beta = (1 + beta^2) * P * R / (beta^2 * P + R). Beta below 1 weights precision more — right when false positives are expensive, like a fraud alert that freezes a customer account. Beta above 1 weights recall more — right when false negatives are catastrophic, like a missed security incident at Microsoft.
-- F2 (recall priority): beta = 2, beta^2 = 4
SELECT
threshold,
5.0 * tp / NULLIF(5 * tp + 4 * fn + fp, 0) AS f2
FROM f1_per_threshold
ORDER BY threshold;The same template works for F0.5 with beta^2 = 0.25. Pick the beta that matches the asymmetric cost of your business problem before tuning the threshold — otherwise you are optimizing the wrong objective.
Common pitfalls
The first pitfall is reading F1 as accuracy by another name. F1 ignores true negatives entirely, which is the whole point under imbalance — true negatives are cheap and abundant in a fraud dataset, so a metric that rewards them rewards predicting nothing. Always report F1 alongside the raw confusion counts so a stakeholder can see the asymmetry between TP, FP, and FN.
The second pitfall is the silent integer divide. Postgres and several other engines default to integer division, and TP / (TP + FP) returns zero whenever TP is smaller than the denominator. Production dashboards ship with a hardcoded zero in the precision column and nobody notices until the quarterly review. Cast to NUMERIC on every metric, and pair every division with NULLIF so the zero-denominator case returns NULL instead of erroring.
The third pitfall is ignoring class imbalance when picking a threshold. If your positive class is 1% of the data, the default 0.5 cutoff is almost certainly wrong, and the threshold sweep is not optional. The optimal threshold also drifts over time as the underlying distribution changes — bake the sweep into the retrain pipeline and log the chosen threshold next to the model version so you can debug drift later.
The fourth pitfall is conflating per-class F1 with macro and micro. A 12-class model with 0.85 macro F1 and 0.92 micro F1 is failing on at least one class. Most candidates report only the aggregate and miss the diagnostic signal. Pull the per-class F1 alongside the aggregates, sort by ascending F1, and inspect the bottom three classes — that is where the next iteration of feature engineering or label correction pays off.
The fifth pitfall is using F1 for the wrong problem. F1 is a classification metric. For regression use RMSE, MAE, or R-squared. For ranking use NDCG, MAP, or MRR. For recommendation use precision-at-k or recall-at-k. Reaching for F1 outside classification is a signal that the problem framing needs another pass before any metric is chosen.
Optimization tips
For large prediction logs the expensive part is the CROSS JOIN in the threshold sweep — O(predictions x thresholds). At Notion or DoorDash scale that can be billions of rows. Push the date filter into a CTE that materializes once so the cross join scans the filtered set. On Snowflake or BigQuery, switch to a single-pass approach where each prediction contributes to all thresholds it crosses — the engine sorts by score once and the threshold logic becomes a running count.
Partition the predictions table by prediction_date if you control the schema — the rolling-window scan becomes partition pruning. On Postgres, a BRIN index on prediction_date gets most of the same benefit without the maintenance overhead.
If you ship F1 as a live dashboard metric, cache the daily confusion matrix and aggregate up. Compute (date, tp, fp, fn, tn) per day into a summary table, then sum across the date window before computing precision, recall, and F1 at read time. This is how analytics teams at OpenAI and Anthropic structure model evaluation pipelines — heavy aggregation runs once, dashboard reads are cheap, any historical window is reproducible from the cached counts.
Related reading
- How to calculate AUC ROC in SQL
- How to calculate confusion matrix in SQL
- How to calculate Brier score in SQL
- How to calculate Cohen's kappa in SQL
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
What is a good F1 score?
There is no universal threshold. Spam filters at scale routinely hit 0.95 because the features are clean. Fraud detection lives in the 0.55 to 0.80 band — the adversary is adaptive and the positive class is rare. Medical diagnosis targets 0.80 or higher, but the precise number is set by the recall constraint the clinical team imposes. Benchmark against the null model and the previous production model — relative improvement matters more than the absolute number.
What is the difference between F1 and accuracy?
Accuracy is the fraction of predictions that were correct, counting TP and TN equally. F1 ignores TN and focuses on the positive class. The gap widens as imbalance grows. On a balanced dataset the two are usually within a few points. On a 1% positive class, accuracy can be 0.99 while F1 is 0.05 — the model is right almost every time but useless because it never predicts the positive class. Use accuracy only when classes are balanced and the cost of FP and FN is symmetric.
When should I use macro F1 versus micro F1?
Macro F1 fits when every class matters equally — fraud taxonomy with five fraud types, or content moderation where a small policy category is as important as a large one. Micro F1 fits when you care about the global error rate and are happy to let the majority class drive the number. Report both. If you can only show one number to a non-technical stakeholder, macro F1 is safer because it surfaces minority-class failures that micro F1 hides.
F1 ignores true negatives, so what evaluates the negative class?
Balanced accuracy is the simplest complement — it averages sensitivity and specificity, so a model that ignores negatives gets punished. AUC ROC integrates over all thresholds. Fraud and abuse teams at Uber and DoorDash track F1 alongside specificity at a fixed recall target — F1 covers the positive class and specificity covers legitimate users.
How does F1 change as I move the threshold?
Precision and recall move in opposite directions as the threshold sweeps from 0 to 1. At threshold = 0, every prediction is positive, so recall is 1 and precision equals the base rate. At threshold = 1, almost nothing is positive, so precision is high but recall collapses. F1 peaks somewhere in the middle, and the location depends on calibration. A well-calibrated model usually peaks near 0.4 to 0.5 on a balanced problem and lower on an imbalanced one.
Why is F1 a harmonic mean rather than an arithmetic mean?
The harmonic mean punishes lopsided values. With 0.99 precision and 0.01 recall, the arithmetic mean is 0.5 — looks fine until you realize the model catches 1% of positives. The harmonic mean returns 0.02, correctly reflecting that the model is useless on the positive class. F1 only goes up when precision and recall improve together — exactly the signal you want when tuning a classifier under imbalance.