How to calculate log loss in SQL
Contents:
What log loss is and why it matters
A staff data scientist at Stripe ships a new fraud model on Thursday. The shadow dashboard shows accuracy up two points and precision steady at the production threshold. Friday payments operations pings saying recent chargeback decisions feel "overconfident in a weird way." Log loss surfaces this pathology before the message arrives, because it punishes a model harder for being wrong with confidence than for being wrong with hesitation.
Log loss, also called binary cross-entropy, is the canonical objective function for logistic regression, gradient-boosted classifiers, and any neural network with a sigmoid output. A 0.99 score on an event that did not happen is a near-catastrophic miss; a 0.51 score on the same event is a small one. That asymmetry makes log loss the metric of choice for downstream systems that multiply the predicted probability by money, credit, or risk capacity.
Analysts at Meta, DoorDash, Airbnb, and Databricks reach for log loss when the model output flows into automation. Accuracy collapses confidence into a binary decision, AUC measures only ranking, and Brier squares a bounded residual. Log loss is the only one of the four that maps cleanly to the negative log-likelihood the model was trained to minimize — the apples-to-apples scoring rule when comparing a logistic regression and a gradient-boosted classifier on the same window.
The log loss formula
For a binary classifier with predicted probabilities p_i for N examples with labels y_i in {0, 1}:
log_loss = -(1/N) * sum_i [ y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i) ]The lower bound is zero, achieved only when every prediction matches its label exactly. A constant predictor returning 0.5 on a balanced dataset lands at ln(2) ≈ 0.693. A baseline returning the empirical base rate p_base scores -(p_base * ln(p_base) + (1 - p_base) * ln(1 - p_base)), about 0.198 on a 5%-positive class. Anything below the empirical baseline means the model adds information beyond the marginal rate.
For multi-class problems with K mutually exclusive classes and one-hot labels y_ik:
log_loss_multi = -(1/N) * sum_i sum_k [ y_ik * ln(p_ik) ]Only the true class contributes per row because every other y_ik is zero. In SQL, materialize the one-hot indicator with a CASE expression.
Log loss in SQL
The pattern in most warehouses — Snowflake, BigQuery, Postgres, Databricks — is a fact table with one row per scored event, a predicted probability, and a realized binary outcome. The canonical 30-day rolling query:
WITH safe_predictions AS (
SELECT
actual_label AS y,
GREATEST(1e-15, LEAST(1 - 1e-15, predicted_proba)) AS p
FROM model_predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
-AVG(y * LN(p) + (1 - y) * LN(1 - p)) AS log_loss,
COUNT(*) AS n
FROM safe_predictions;The GREATEST(1e-15, LEAST(1 - 1e-15, p)) expression is the clipping step every production log loss query needs. Without it, a stored probability of exactly 0 or 1 makes LN(0) blow up and the query returns an error or a NULL aggregate depending on the engine. The 1e-15 floor matches the scikit-learn default, so a SQL query and a Python sanity check land on the same decimal.
If your warehouse stores logits instead of probabilities, apply a sigmoid first: 1 / (1 + EXP(-logit)). Never run the formula against raw logits — log loss is defined on probabilities.
Per-segment log loss by country, channel, or model version:
WITH safe_predictions AS (
SELECT
model_version,
country,
actual_label AS y,
GREATEST(1e-15, LEAST(1 - 1e-15, predicted_proba)) AS p
FROM model_predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
model_version,
country,
-AVG(y * LN(p) + (1 - y) * LN(1 - p)) AS log_loss,
COUNT(*) AS n,
AVG(y) AS base_rate
FROM safe_predictions
GROUP BY model_version, country
HAVING COUNT(*) >= 500
ORDER BY model_version, country;The HAVING COUNT(*) >= 500 guardrail keeps a tiny sample from producing a noisy log loss that reflects sampling variance rather than real degradation. For small segments raise the floor to 1000 — log loss variance is brutal on rare-event tails.
Comparing models on the same test set
When two or more candidate models run in parallel, compare log losses on a single held-out test window so both score the same examples with the same base rate. Mixing test sets is the fastest way to publish a misleading benchmark.
WITH safe_predictions AS (
SELECT
model_name,
actual_label AS y,
GREATEST(1e-15, LEAST(1 - 1e-15, predicted_proba)) AS p
FROM model_predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
model_name,
-AVG(y * LN(p) + (1 - y) * LN(1 - p)) AS log_loss,
COUNT(*) AS n
FROM safe_predictions
GROUP BY model_name
ORDER BY log_loss;The model with the lowest log loss is the better-calibrated one on this window. The window must contain the same prediction events for every model, otherwise base rate differences contaminate the comparison. A parallel scoring pipeline that writes one row per (event_id, model_name) to the same table handles this implicitly.
Log loss differences below 0.005 on a few thousand examples are often within noise. Bootstrap across resamples to get a confidence interval, or run a paired sign test on per-row losses. A 0.32 versus 0.33 gap can disappear once you put error bars on it.
Log loss under class imbalance
When 95 percent of events are negative, a "predict the base rate" baseline already scores fairly well. A constant predictor returning p = 0.05 on a 5%-positive dataset gets a log loss near 0.198. A real model must beat that — not the ln(2) = 0.693 random baseline — to count as useful.
WITH base AS (
SELECT AVG(actual_label) AS p_pos
FROM model_predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
),
safe AS (
SELECT
actual_label AS y,
GREATEST(1e-15, LEAST(1 - 1e-15, predicted_proba)) AS p
FROM model_predictions
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
-(b.p_pos * LN(b.p_pos) + (1 - b.p_pos) * LN(1 - b.p_pos)) AS baseline_log_loss,
-AVG(s.y * LN(s.p) + (1 - s.y) * LN(1 - s.p)) AS model_log_loss,
1 - (-AVG(s.y * LN(s.p) + (1 - s.y) * LN(1 - s.p)))
/ NULLIF(-(b.p_pos * LN(b.p_pos) + (1 - b.p_pos) * LN(1 - b.p_pos)), 0)
AS log_loss_skill
FROM safe s CROSS JOIN base b
GROUP BY b.p_pos;The "log loss skill" expression mirrors the Brier skill score: 1 - model_loss / baseline_loss. Positive values mean the model adds information beyond the empirical base rate; zero matches the constant baseline; negative means worse than not modeling at all — common with stale features or label-distribution shifts.
Multi-class log loss
For a problem stored as one row per (observation_id, class_id, predicted_proba) with an actual_class field, the SQL is a straightforward extension. Materialize the one-hot indicator inline and sum across classes before averaging:
WITH safe AS (
SELECT
observation_id,
class_id,
actual_class,
GREATEST(1e-15, LEAST(1 - 1e-15, predicted_proba)) AS p,
CASE WHEN class_id = actual_class THEN 1 ELSE 0 END AS y
FROM model_predictions_multi
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
),
per_observation AS (
SELECT
observation_id,
-SUM(y * LN(p)) AS row_loss
FROM safe
GROUP BY observation_id
)
SELECT
AVG(row_loss) AS log_loss_multi,
COUNT(*) AS n
FROM per_observation;Only the true class contributes a non-zero term per row, but the SUM still scans every (observation_id, class_id) pair. Pre-aggregating into a (observation_id, true_class_log_p) summary table cuts the scan by K.
Common pitfalls
The first pitfall is forgetting to clip probabilities before applying the logarithm. A stored probability of exactly 0.0 or 1.0 evaluates LN(0) which is negative infinity, and the engine either errors out or returns NULL. Clipping at 1e-15 matches the scikit-learn default; use 1e-7 for float32 probabilities from a deep model where the smaller floor would already be flushed to zero.
The second pitfall is computing log loss on hard class predictions. If the serving layer rounds the output to 0 or 1 before logging, the formula collapses to two possible values per row and loses the entire confidence signal. Log the raw probability — the threshold-applied class belongs in a separate column.
The third pitfall is comparing log losses across populations with different base rates. A churn model scoring 0.10 on a 1%-positive segment and 0.30 on a 25%-positive segment is not strictly worse on the second — the maximum-information bound moves with the base rate. Pair raw log loss with the baseline, or report the skill score.
The fourth pitfall is confusing log loss with AUC. Log loss measures calibration; AUC measures ranking. A tree ensemble can produce a strong AUC and a mediocre log loss because ranking is correct but probabilities are systematically too confident. The fix is isotonic regression or Platt scaling, and it is invisible if you only look at AUC.
The fifth pitfall is scoring log loss on training data. Train log loss is biased low because the model has already seen those rows. Compute on a held-out test window the training pipeline never touched, and rotate the window monthly so calibration drift surfaces in dashboards rather than post-incident reviews.
Optimization tips
Log loss is cheap per row — two logarithms and a multiplication — so the bottleneck is almost always the upstream scan. Partition or cluster model_predictions on prediction_date so the rolling-window filter prunes to the days you need. On Snowflake, CLUSTER BY (prediction_date) typically takes a 30-day query from minutes to seconds.
For high-throughput models like ad ranking or real-time fraud, pre-aggregate log loss into a daily summary table with one row per (model_version, segment, date) storing sum_y_ln_p, sum_one_minus_y_ln_one_minus_p, and n. The dashboard reconstructs log loss as -(sum_y_ln_p + sum_one_minus_y_ln_one_minus_p) / n without scanning raw events.
Bound the rolling window to what stakeholders read. Sixteen weeks of weekly log loss is enough for drift monitoring; a year of daily granularity is mostly storage cost with no decision value. For a quarterly review, query the summary table, not the raw events.
Related reading
- SQL window functions interview questions — the patterns underneath every monitoring query in this post.
- How to calculate AUC ROC in SQL — the ranking metric you pair with log loss when both calibration and order matter.
- How to calculate Brier score in SQL — the bounded squared-error cousin of log loss.
- How to calculate confusion matrix in SQL — threshold-dependent counts that complement the threshold-free log loss view.
- How to calculate F1 score in SQL — the harmonic mean of precision and recall when you do pick a threshold.
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
What is the range of log loss?
Log loss is bounded below by zero, achieved only by a model that predicts the true label with probability 1 on every example. There is no fixed upper bound because a confidently wrong prediction at p = 1e-9 produces a loss near 20.7 per row, and the average grows without limit as predictions approach 0 or 1 on the wrong side. Clipped log loss on a healthy production model lives between 0.05 and 0.7, with the upper end corresponding to a balanced random baseline.
Is log loss the same as cross-entropy?
For binary classification with 0/1 labels they are the same number under different names. Cross-entropy is the general information-theoretic quantity; binary cross-entropy is its specialization to two classes. Multi-class cross-entropy generalizes to K classes with one-hot labels and per-class probabilities, matching the SQL above. Pick one name and stick with it inside the same document.
What clipping value should I use?
Use 1e-15 for double-precision probabilities and 1e-7 for float32 from a deep model. The scikit-learn default is 1e-15, so matching it makes the SQL and the Python sanity check return the same decimal. Clipping is a numerical-stability hack, not a hyperparameter — never tune it for better log loss. If clipping shifts the reported number by more than a fraction of a percent, the model is producing pathological predictions and the right fix is upstream calibration.
Why does a model with strong accuracy have a high log loss?
Accuracy collapses every prediction to a 0/1 decision at a threshold, ignoring confidence. Log loss penalizes confident mistakes much more than hesitant ones. A tree ensemble predicting 0.95 on every positive and 0.05 on every negative has the same accuracy as one predicting 0.55 and 0.45, but a much lower log loss. When the two metrics disagree, the model is usually overconfident — fix it with isotonic regression or Platt scaling.
When should I use log loss versus Brier score?
Use log loss when overconfidence is expensive, like credit decisioning or automated payouts where a 0.99 score on a wrong call moves real money. Use Brier when you want a smooth bounded metric that compares well across teams and pairs with reliability diagrams. Many teams report both and treat agreement as a sanity check; disagreement points at outliers that log loss amplifies and Brier dampens.
How do I extend log loss to multi-class models?
Multi-class log loss sums -y_ik * ln(p_ik) across all classes per observation, then averages. With a predictions table holding one row per (observation_id, class_id), materialize the indicator as CASE WHEN class_id = actual_class THEN 1 ELSE 0 END and follow the query above. Document whether you compute the per-observation or per-class average — switching mid-quarter invalidates the trendline.