Regression metrics in the data science interview

Why regression metrics dominate DS loops

Walk into a senior data science loop at Stripe, Netflix, Airbnb, DoorDash, or Snowflake and the regression block is rarely about whether you can spell MAE. The hiring manager at Uber asks how you would measure a surge-pricing model that errs by twenty cents on a five-dollar ride but by twelve dollars on a sixty-dollar airport run. The forecasting interviewer at Databricks hands you a demand curve with a stockout floor at zero and asks whether MAPE makes sense.

Candidates lose points not because they cannot recall the formula. They lose them because they cannot say in one breath why a metric is wrong for a prompt: that MAPE blows up at zero, that R-squared can go negative on out-of-sample data, that MSE lets a single bad outlier dominate the training signal. This guide walks through the six regression metrics that show up over and over, and the way a senior interviewer expects you to choose between them.

MAE — mean absolute error

Mean absolute error averages the absolute residual across every prediction. The formula is the cleanest in the family and the answer is in the same units as the target, which is why product partners trust it without an interpretation table.

MAE = (1 / N) * sum( | y_i - y_hat_i | )

Two properties make MAE useful in interviews. Every observation contributes linearly, so a single outlier moves the metric by the size of its own error, not by the square of it. And the population minimizer of expected absolute error is the median, not the mean, so a model trained to optimize MAE tilts toward the conditional median. That matters when an interviewer at DoorDash asks about a long-tailed delivery-time distribution — the right answer is MAE, paired with the framing that you care about the typical experience rather than the variance-weighted average.

import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
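
The median claim can be checked numerically. The sketch below uses a made-up long-tailed sample (the `delivery_minutes` values are hypothetical) and scans constant predictions on a grid, to see which constant minimizes absolute error and which minimizes squared error.

```python
import numpy as np

# Hypothetical long-tailed sample: most deliveries are quick, one is very late.
delivery_minutes = np.array([18.0, 20.0, 21.0, 22.0, 24.0, 25.0, 95.0])

# Score every constant prediction on a fine grid under both losses.
candidates = np.linspace(10.0, 100.0, 9001)  # step of 0.01 minutes
diffs = delivery_minutes[None, :] - candidates[:, None]
abs_scores = np.abs(diffs).mean(axis=1)
sq_scores = (diffs ** 2).mean(axis=1)

best_constant_for_mae = float(candidates[np.argmin(abs_scores)])
best_constant_for_mse = float(candidates[np.argmin(sq_scores)])

print(best_constant_for_mae, float(np.median(delivery_minutes)))  # both close to 22
print(best_constant_for_mse, float(np.mean(delivery_minutes)))    # both close to 32.14
```

The single late delivery drags the squared-error optimum roughly ten minutes above the typical order, while the absolute-error optimum stays at the median.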

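The drawback is differentiability. Absolute value has a kink at zero, so the gradient of MAE is undefined at a zero residual and has the same magnitude everywhere else, which tells a gradient-based learner nothing about how close a prediction already is. Most teams report MAE and train on a smooth proxy such as Huber loss, or fall back on subgradients when they optimize absolute or quantile loss directly.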
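
As a rough illustration of the smooth proxy, here is a minimal Huber loss sketch. The function name and the default `delta=1.0` are assumptions for the example, not a recommendation; the threshold is normally tuned to the scale of the target.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small residuals, linear beyond delta (illustrative sketch)."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_residuals = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2                 # behaves like MSE near zero
    linear = delta * (abs_residuals - 0.5 * delta)   # behaves like MAE in the tail
    return float(np.mean(np.where(abs_residuals <= delta, quadratic, linear)))

print(huber_loss([0.0], [0.5]))  # 0.125, inside the quadratic zone
print(huber_loss([0.0], [3.0]))  # 2.5, inside the linear zone
```

The two branches meet at `delta` with matching value and slope, which is what removes the kink.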

MSE — mean squared error

Mean squared error squares each residual before averaging. The squared term punishes large misses far more than small ones — a residual of ten contributes one hundred to the sum, while a residual of one contributes one.

MSE = (1 / N) * sum( ( y_i - y_hat_i )^2 )

MSE is the workhorse training loss for linear regression and most neural-network regressors because it is smooth, convex in the predictions, and minimized in expectation by the conditional mean, with a closed-form solution in the linear case. The catch is that the units are squared. If your target is dollars, MSE is in dollars-squared, which no business partner understands. Candidates who report MSE directly to a product manager rather than taking its square root lose stakeholder-credibility points even if the math is right.

The squaring also means MSE is exquisitely sensitive to outliers. A single mislabelled row with a residual of one hundred contributes ten thousand to the sum of squares, as much as ten thousand well-fit rows with a residual of one. When an interviewer at Tesla asks about a battery-life regressor where a few sensors return broken readings, the right answer is Huber loss, a winsorized target, or a robust metric like MAE.
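
The arithmetic is easy to reproduce. This sketch builds a synthetic residual vector (ten thousand residuals of one, plus one residual of one hundred) and compares how far each metric moves.

```python
import numpy as np

residuals_clean = np.ones(10_000)                    # every row misses by 1
residuals_dirty = np.append(residuals_clean, 100.0)  # add one broken row

mse_clean = float(np.mean(residuals_clean ** 2))
mse_dirty = float(np.mean(residuals_dirty ** 2))
mae_clean = float(np.mean(np.abs(residuals_clean)))
mae_dirty = float(np.mean(np.abs(residuals_dirty)))

print(mse_clean, mse_dirty)  # MSE roughly doubles
print(mae_clean, mae_dirty)  # MAE moves by about one percent
```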

RMSE — root mean squared error

Root mean squared error takes the square root of MSE to bring the metric back into the target's units. RMSE is what most teams actually report when they want a single squared-loss number that is interpretable to humans.

RMSE = sqrt( MSE ) = sqrt( (1 / N) * sum( ( y_i - y_hat_i )^2 ) )

import numpy as np

def rmse(y_true, y_pred):
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(residuals ** 2)))

The interview framing for RMSE versus MAE is one of the cleanest tradeoffs in DS interviewing. RMSE is always at least as large as MAE on the same dataset, and the gap between them measures how heavy the residual tail is. Compare two price-prediction models: model A with RMSE 18 and MAE 11, and model B with RMSE 14 and MAE 12. Model A has a heavier tail of large misses, model B is more uniformly mediocre, and the right pick depends on whether outsized errors hurt the marketplace more than a slightly worse average.
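
One way to turn that gap into a number is the ratio RMSE / MAE. The helper below is a sketch (the name `tail_ratio` is made up for this example) and it assumes at least one non-zero residual, since the ratio is undefined when MAE is zero.

```python
import numpy as np

def tail_ratio(residuals):
    """RMSE divided by MAE: 1.0 when all misses are equal in size, larger with heavier tails."""
    residuals = np.asarray(residuals, dtype=float)
    rmse_value = np.sqrt(np.mean(residuals ** 2))
    mae_value = np.mean(np.abs(residuals))
    return float(rmse_value / mae_value)

print(18 / 11)                      # model A from the text, about 1.64
print(14 / 12)                      # model B from the text, about 1.17
print(tail_ratio([1, -1, 1, -1]))   # 1.0, every miss has the same size
print(tail_ratio([0, 0, 0, 10]))    # 2.0, one miss carries everything
```

For reference, normally distributed residuals give a ratio of sqrt(pi / 2), roughly 1.25, so model A sits well above that and model B slightly below.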

MAPE and sMAPE

Mean absolute percentage error scales each absolute residual by the actual value and averages.

MAPE = (100 / N) * sum( | y_i - y_hat_i | / | y_i | )

MAPE is the metric retail and forecasting teams reach for when the business wants a percentage, because a ten percent average error reads better to a CFO than "RMSE 47." Interviewers at Amazon or Uber demand to know its failure modes.

The first failure is undefined behavior at zero. If any actual value is zero (and demand series, stockouts, and event counts hit zero all the time) the denominator is zero and the term is undefined. Naive handlers drop those rows, which biases the metric toward the easier high-volume regime. The second failure is asymmetry. For a non-negative forecast, under-prediction is bounded at one hundred percent, while over-prediction is unbounded: a forecast of two hundred for an actual of ten contributes 1900 percent. Models tuned to MAPE will under-forecast on average.
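
Both failure modes show up on single rows. The sketch below uses a hypothetical per-row helper in plain Python to keep the arithmetic visible.

```python
def absolute_percentage_error(actual, forecast):
    """Per-row MAPE term in percent; raises ZeroDivisionError when actual is 0."""
    return abs(actual - forecast) / abs(actual) * 100.0

print(absolute_percentage_error(10, 200))  # 1900.0, over-forecast is unbounded
print(absolute_percentage_error(10, 0))    # 100.0, worst case for a non-negative under-forecast

try:
    absolute_percentage_error(0, 5)
except ZeroDivisionError:
    print("undefined when the actual is zero")
```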

Symmetric MAPE — sMAPE — replaces the denominator with the average of the actual and the prediction to flatten the asymmetry.

sMAPE = (100 / N) * sum( | y_i - y_hat_i | / ( ( | y_i | + | y_hat_i | ) / 2 ) )

sMAPE bounds the over-forecast penalty but introduces its own quirks: it is still undefined when both actual and prediction are zero, it caps each term at two hundred percent, which can hide catastrophic outliers, and it is not perfectly symmetric either, because an under-forecast scores worse than an over-forecast of the same absolute size. When an interviewer asks "why sMAPE over MAPE" the senior answer mentions both the asymmetry fix and the capped-magnitude tradeoff.
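
A per-row sketch of the sMAPE formula above makes these quirks concrete. The helper name is an assumption for the example, and returning NaN for the zero-over-zero case is one possible convention, not a standard.

```python
import numpy as np

def smape_terms(y_true, y_pred):
    """Per-row sMAPE terms in percent; NaN where actual and prediction are both zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    terms = np.full(y_true.shape, np.nan)
    # Divide only where the denominator is positive; other rows stay NaN.
    np.divide(np.abs(y_true - y_pred), denom, out=terms, where=denom > 0)
    return 100.0 * terms

terms = smape_terms([10, 100, 100, 0], [0, 50, 150, 0])
print(terms)  # about [200, 66.7, 40, nan]
```

The first row hits the two-hundred-percent cap, the second and third rows are equal-sized misses in opposite directions that score differently, and the last row is the undefined case.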

R-squared and adjusted R-squared

The coefficient of determination measures the proportion of variance in the target that the model explains, relative to a constant-mean baseline.

R-squared = 1 - ( sum( ( y_i - y_hat_i )^2 ) / sum( ( y_i - y_bar )^2 ) )

R-squared of one is a perfect fit, zero is no better than predicting the mean, and it can go negative on out-of-sample data — a common surprise on interviews that signals a model losing to the constant baseline.
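
A tiny made-up holdout shows the negative case. The sketch assumes the baseline is the mean of the evaluated targets, as in the formula above, and that the targets are not all identical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination against the constant-mean baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_holdout = [10.0, 12.0, 14.0, 16.0]
print(r_squared(y_holdout, [13.0] * 4))  # 0.0, identical to predicting the mean
print(r_squared(y_holdout, [20.0] * 4))  # about -9.8, far worse than the mean
```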

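The trap is that R-squared on training data always increases or stays flat when you add features to a least-squares fit, even garbage features, because the model has more freedom to overfit. Adjusted R-squared corrects for the number of predictors and is the right metric when comparing linear models with different feature counts on the same data. For a comparison against a feature-rich gradient boost, where the predictor count is not a meaningful measure of model freedom, compare held-out R-squared on the same fold instead.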

Adjusted R-squared = 1 - ( ( 1 - R-squared ) * ( N - 1 ) / ( N - p - 1 ) )

Where N is the number of rows and p is the number of predictors. R-squared is also not robust to outliers, and it behaves differently in-sample versus out-of-sample — using the training number to claim model quality is the cardinal sin of regression interviews.
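
The adjustment is a one-liner. The sketch below applies the formula above to illustrative numbers, showing how the same raw R-squared is discounted more heavily as the predictor count grows.

```python
def adjusted_r_squared(r_squared, n_rows, n_predictors):
    """Adjusted R-squared from the formula above; needs n_rows > n_predictors + 1."""
    dof = n_rows - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more rows than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_rows - 1) / dof

print(round(adjusted_r_squared(0.80, n_rows=50, n_predictors=5), 3))   # 0.777
print(round(adjusted_r_squared(0.80, n_rows=50, n_predictors=20), 3))  # 0.662
```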

Picking the right metric for the prompt

A senior interviewer is listening for the chain "what is the cost asymmetry, what is the unit, what is the tail." If outliers represent fraud or sensor errors that should not steer the model, you want MAE. If outliers are real and expensive — a thirty-thousand-dollar fraud prediction error versus a hundred small ones — you want RMSE or MSE so the loss reflects the business cost.

If the business partner wants a percentage and the target is strictly positive, MAPE is acceptable and sMAPE is safer. If the target can hit zero or change sign, switch to weighted absolute percentage error, mean absolute scaled error, or just report RMSE with a baseline. If the question is "how much variance does this model explain," report R-squared on a held-out fold. If the question is "how confident are we about a single prediction," none of these point metrics suffice, so pivot to quantile (pinball) loss or prediction intervals.

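import numpy as np

def regression_report(y_true, y_pred):
    """Return headline regression metrics for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae_value = float(np.mean(np.abs(residuals)))
    mse_value = float(np.mean(residuals ** 2))
    rmse_value = float(np.sqrt(mse_value))

    # MAPE is undefined where the actual is zero: skip those rows, but report
    # how many were skipped so the omission is never silent.
    mask = y_true != 0
    if mask.any():
        mape_value = float(np.mean(np.abs(residuals[mask]) / np.abs(y_true[mask])) * 100)
    else:
        mape_value = float("nan")

    # R-squared is undefined for a constant target (ss_tot == 0).
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2_value = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {
        "mae": mae_value, "mse": mse_value, "rmse": rmse_value,
        "mape_percent": mape_value, "r_squared": r2_value,
        "n_zero_actuals": int((~mask).sum()),
    }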
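
For the "how confident are we about a single prediction" case, a minimal pinball loss sketch looks like the following. The function name and argument order are assumptions for this example; `quantile` is the target quantile between 0 and 1.

```python
import numpy as np

def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss: under-prediction costs q, over-prediction costs 1 - q."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1.0) * diff)))

# For the 0.9 quantile, predicting too low is nine times as costly as too high.
print(pinball_loss([10.0], [8.0], quantile=0.9))   # about 1.8
print(pinball_loss([10.0], [12.0], quantile=0.9))  # about 0.2
```

At `quantile=0.5` the loss is half of MAE, which ties it back to the median property discussed for MAE.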

Common pitfalls

The most expensive trap is reporting MSE to a non-technical stakeholder. MSE is in squared units, and one bad row dominates the loss in ways that misrepresent typical behavior. Always root MSE into RMSE before reporting to humans, and pair RMSE with MAE so the reader can see whether the tail is dragging the squared metric upward.

A second pitfall is using MAPE on a target that touches zero. MAPE divides by the actual value, so any zero blows up the term, and many candidates silently drop those rows, biasing the metric toward the high-volume regime. Declare the zero-handling rule before computing, or switch to WAPE where the denominator is summed before dividing.

A third pitfall is comparing R-squared across different target transformations. If model A predicts revenue and model B predicts log-revenue, their R-squared numbers are not on the same scale. Invert the transformation, compute RMSE on the original units, and compare in dollars. The senior interviewer will probe this — if you say "model B has higher R-squared" without checking units, you lose the round.
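
A sketch of the back-transform comparison, with invented revenue numbers. It assumes model B was trained on the natural log of revenue, so `np.exp` inverts it; a different transform would need its own inverse.

```python
import numpy as np

def rmse(y_true, y_pred):
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

revenue = np.array([100.0, 200.0, 400.0])              # actuals in dollars
pred_a = np.array([110.0, 190.0, 420.0])               # model A predicts dollars
pred_b_log = np.log(np.array([105.0, 210.0, 380.0]))   # model B predicts log-dollars

# Put both models on the same scale before comparing.
pred_b = np.exp(pred_b_log)
rmse_a = rmse(revenue, pred_a)
rmse_b = rmse(revenue, pred_b)
print(rmse_a, rmse_b)  # about 14.1 and 13.2, both in dollars
```

One caveat worth stating in an interview: exponentiating a log-scale prediction does not, in general, recover the conditional mean on the original scale, so the back-transformed model may need a bias correction before the comparison is fully fair.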

A fourth pitfall is reporting an in-sample number as if it generalized. R-squared on training data is biased upward by every extra feature, and any team that ships a model on in-sample fit will be embarrassed by the holdout. Every reported metric should specify the fold, and a held-out or cross-validated number is the minimum bar.

A fifth pitfall is forgetting that point metrics hide calibration. Two models with identical RMSE can have wildly different reliability — one over-predicts the low end and under-predicts the high end, while the other is well-calibrated. Pair every regression metric with a calibration plot or per-decile residual table, and flag heteroscedasticity if residual variance grows with the prediction.
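
A per-decile residual table can be sketched with NumPy alone. The function name and output layout are assumptions for this example, and it expects at least as many rows as bins.

```python
import numpy as np

def residuals_by_prediction_decile(y_true, y_pred, n_bins=10):
    """Mean and spread of residuals within equal-count bins of the prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(y_pred)  # rows sorted from lowest to highest prediction
    rows = []
    for bin_index, idx in enumerate(np.array_split(order, n_bins)):
        residuals = y_true[idx] - y_pred[idx]
        rows.append({
            "bin": bin_index,
            "mean_prediction": float(y_pred[idx].mean()),
            "mean_residual": float(residuals.mean()),  # drifts from 0 if miscalibrated
            "residual_std": float(residuals.std()),    # grows if heteroscedastic
        })
    return rows

for row in residuals_by_prediction_decile(np.arange(100.0) + 1.0, np.arange(100.0)):
    print(row)
```

A mean residual that changes sign from the low bins to the high bins is the miscalibration pattern described above, and a residual spread that climbs with the bin index is the heteroscedasticity signal.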

FAQ

Why is RMSE always at least as large as MAE?

By Jensen's inequality applied to the concave square-root function, the root of the average of squared residuals is bounded below by the average of absolute residuals. The two are equal only when every residual has the same absolute value, which essentially never happens in real data. The gap between them is a practical diagnostic for tail weight: when RMSE is much larger than MAE, a few big misses are doing most of the work.

Should I ever report MSE directly?

Almost never to a stakeholder, frequently to a model-training pipeline. MSE is the right loss for gradient-based training of regressors that target the conditional mean, but its units make it unfit for human-readable reporting. If an interviewer asks for MSE specifically, they are usually checking whether you understand the unit problem — answer "MSE is the training-time loss, RMSE is the reporting metric" and you have signaled the right depth.

What is the difference between MAPE and weighted absolute percentage error?

MAPE divides each absolute residual by the corresponding actual, then averages the percentages. WAPE sums absolute residuals across the dataset and divides by the sum of absolute actuals — equivalent to weighting each row's percentage error by its actual. WAPE is more stable when actuals span several orders of magnitude or include zeros, because big-volume rows dominate the denominator and small actuals do not blow up individual terms.
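
A minimal WAPE sketch, with made-up demand numbers that include a zero actual. It assumes the actuals are not all zero, since the summed denominator would then vanish.

```python
import numpy as np

def wape(y_true, y_pred):
    """Weighted absolute percentage error: total absolute error over total absolute actuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100.0)

# The zero-actual row contributes its error to the numerator without blowing up.
print(wape([0, 100, 200], [5, 90, 220]))  # about 11.67
```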

When does R-squared go negative and what should I tell my manager?

R-squared goes negative on out-of-sample data when the model predicts worse than the constant-mean baseline. It typically signals that the holdout regime has shifted (different time period, segment, or feature distribution) or that the model has overfit the training data. Tell your manager that the model is worse than predicting the average and that the next investigation is feature drift and target drift, not metric tuning.

Is there a single metric I should report by default?

If forced to pick one, report RMSE on a held-out fold in the original units, alongside a baseline RMSE from a constant-mean model so the reader can see the relative improvement. Pair it with a residual histogram. Candidates rarely lose a senior DS interview for picking RMSE, but they lose every time they pick a percentage metric without checking the zero behavior of the target.
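
The default report can be sketched as follows. The function name and return keys are assumptions for this example, and the baseline here is the training-set mean applied to the held-out rows, which is one reasonable reading of "constant-mean model".

```python
import numpy as np

def rmse_vs_mean_baseline(y_train, y_holdout, y_pred):
    """Held-out RMSE for the model and for a constant predictor set to the training mean."""
    y_train = np.asarray(y_train, dtype=float)
    y_holdout = np.asarray(y_holdout, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    model_rmse = float(np.sqrt(np.mean((y_holdout - y_pred) ** 2)))
    baseline_rmse = float(np.sqrt(np.mean((y_holdout - y_train.mean()) ** 2)))
    # Relative improvement is undefined if the baseline is already perfect.
    improvement = 1.0 - model_rmse / baseline_rmse if baseline_rmse > 0 else float("nan")
    return {"model_rmse": model_rmse, "baseline_rmse": baseline_rmse,
            "relative_improvement": improvement}

print(rmse_vs_mean_baseline([10, 20, 30], [10, 30], [12, 28]))
# model_rmse 2.0, baseline_rmse 10.0, relative_improvement about 0.8
```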