Class imbalance in DS interviews

Why imbalance breaks naive models

Picture fraud detection at Stripe: 99.9% legitimate, 0.1% fraud. A model that predicts "all clean" lands at accuracy 99.9% — zero business value. This is the canonical DS interview opener: "a classifier has 99% accuracy on a 99/1 split — is that good?" The panel wants a short, brutal answer: useless, because predict(majority) matches it. Then they probe whether you can name precision, recall, PR-AUC without scrambling.

The cost of not knowing this surfaces fast on the job. A DS at Airbnb ships a churn model with AUC 0.92 and accuracy 96%. A week later the PM checks the dashboard — recall on the churned class is 7%, retention is missing almost every at-risk user, and the model goes back to the drawing board.

Imbalance is the default state in most production ML: fraud, churn, rare conditions, equipment failure, click prediction, cold-start recsys. No silver bullet. Just a toolkit — and the interview tests whether you know when to reach for each tool.

Load-bearing rule: never report accuracy alone on imbalanced data. If you say "accuracy" without immediately following with PR-AUC, recall, or a confusion matrix, your interviewer is already lowering the level on the rubric.

Metrics that survive imbalance

Accuracy is the first metric to drop. A trivial predict(majority_class) baseline beats most early models on accuracy, which means accuracy stops telling you anything about model quality. The metrics that do survive are the ones built around the minority class explicitly.
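To make the baseline argument concrete, here is a minimal sketch on a made-up 99:1 label set. The data and the helper name `confusion_counts` are illustrative assumptions, not part of any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical 99:1 dataset: 990 negatives, 10 positives.
y_true = [0] * 990 + [1] * 10
# The trivial baseline: always predict the majority class.
y_majority = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_majority)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The baseline scores 99% accuracy while catching none of the positives, which is why accuracy alone says nothing here.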

| Metric | What it measures | When to use it |
|---|---|---|
| Precision | Of predicted positives, how many are real | FP is expensive (wrongly blocking a payment) |
| Recall | Of real positives, how many caught | FN is expensive (missing fraud, missing cancer) |
| F1 | Harmonic mean of P and R | Symmetric cost or unknown business weight |
| F-beta | Weighted P/R tradeoff | F2 favors recall, F0.5 favors precision |
| PR-AUC | Area under precision-recall curve | Strong imbalance, minority class matters most |
| ROC-AUC | Area under ROC curve | Mild imbalance or comparing models broadly |
| Cohen's kappa | Accuracy adjusted for chance agreement | Multi-class with skewed priors |

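A frequent follow-up: "why prefer PR-AUC over ROC-AUC on 99:1?" Because ROC-AUC stays optimistic: the huge true-negative pool sits in the FPR denominator, FP / (FP + TN), so even thousands of false positives barely move the curve. Precision and recall are built from TPs, FPs, and FNs only and never touch the true negatives, so PR-AUC tracks how the model behaves on the class you actually care about.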
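The effect is easy to see with plain arithmetic. The sketch below uses an invented confusion matrix (every count is an assumption chosen for illustration) and computes the quantities each curve is built from.

```python
# Hypothetical confusion matrix on a roughly 999:1 dataset (all counts are made up).
tp, fn = 80, 20          # 100 real positives
fp, tn = 999, 98_901     # 99,900 real negatives

fpr = fp / (fp + tn)           # x-axis of the ROC curve
recall = tp / (tp + fn)        # y-axis of ROC, x-axis of the PR curve
precision = tp / (tp + fp)     # y-axis of the PR curve

print(f"FPR={fpr:.3f}  recall={recall:.2f}  precision={precision:.3f}")
```

A 1% false-positive rate looks excellent on a ROC plot, yet at this operating point fewer than one in ten flagged rows is a real positive. The PR curve exposes that; the ROC curve does not.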

Pick the metric from the business cost. FP expensive (wrongly freezing an account) means optimize precision. FN expensive (missing fraud) means optimize recall. Symmetric cost means F1. Always inspect the confusion matrix — a single scalar hides which cell is bleeding.
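As a rough illustration of how the choice of beta encodes the cost asymmetry, this sketch implements the standard F-beta formula by hand. The operating point (precision 0.9, recall 0.5) is a made-up example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical operating point: precise, but misses half the positives.
p, r = 0.9, 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(p, r, beta):.3f}")
```

The same model scores highest under F0.5 and lowest under F2, because F2 punishes the weak recall. If a missed positive is the expensive error, F2 is the number that reflects it.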

Resampling: oversample vs undersample

Oversampling duplicates minority examples until classes are balanced. It's the cheapest fix and the one that breaks first under scrutiny.

import pandas as pd
from sklearn.utils import resample

# df_majority / df_minority: the training rows of each class
df_minority_upsampled = resample(
    df_minority, replace=True, n_samples=len(df_majority), random_state=42,
)
df_balanced = pd.concat([df_majority, df_minority_upsampled])

The downside is overfitting: the model sees the same minority rows dozens of times and learns their idiosyncrasies, not the underlying pattern. Tree-based models tolerate this better than logistic regression, but it's still a smell.

Undersampling drops majority rows until classes match. On large datasets (think DoorDash order logs with billions of rows) this is fine: you lose redundant signal, not unique signal. On small datasets it's reckless: in a 10k-row training set with 1k minority rows, balancing means throwing away 8k of the 9k majority rows, which leaves you with 2k total rows and a fragile model.

df_majority_downsampled = resample(
    df_majority, replace=False, n_samples=len(df_minority), random_state=42,
)
df_balanced = pd.concat([df_majority_downsampled, df_minority])

The non-negotiable rule: resample only on the training set, after the split. Validation and test must stay in the original distribution, otherwise your metrics describe a fantasy world that doesn't exist in production.

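from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
X_train_res, y_train_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)  # train only
model.fit(X_train_res, y_train_res)
pr_auc = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])  # test stays imbalanced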

If the interviewer asks you to "fix imbalance" and you reach for df.sample() before the split, the loop is effectively over.
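The split-then-resample order can be sketched without any ML library. The helpers `stratified_split` and `oversample_minority` below are hypothetical names written for this illustration, and the 95:5 labels are made up.

```python
import random

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split indices per class so train and test keep the original class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_frac))
        test_idx += idx[:cut]
        train_idx += idx[cut:]
    return train_idx, test_idx

def oversample_minority(idx, labels, seed=42):
    """Duplicate minority indices (with replacement) until both classes match."""
    rng = random.Random(seed)
    pos = [i for i in idx if labels[i] == 1]
    neg = [i for i in idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return idx + extra

labels = [0] * 950 + [1] * 50            # hypothetical 95:5 dataset

train_idx, test_idx = stratified_split(labels)   # 1. split first
train_res = oversample_minority(train_idx, labels)  # 2. resample train only

train_pos_rate = sum(labels[i] for i in train_res) / len(train_res)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positives: {train_pos_rate:.2f}, test positives: {test_pos_rate:.2f}")
```

Training ends up balanced while the test fold keeps the original 5% positive rate, and no duplicated row can appear on both sides because duplication happens after the split.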

SMOTE and its variants

SMOTE (Synthetic Minority Oversampling Technique) generates new minority points by interpolating between a minority row and one of its k nearest neighbors of the same class. Instead of duplicating, you get synthetic neighbors that fill in the feature space.

from imblearn.over_sampling import SMOTE
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
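To show what the interpolation step does, here is a hand-rolled sketch of generating one synthetic point. It is a simplification for intuition, not imbalanced-learn's implementation; the function name `smote_point` and the toy data are assumptions.

```python
import math
import random

def smote_point(x, minority, k=5, rng=random):
    """Create one synthetic point between x and one of its k nearest minority neighbors."""
    others = [m for m in minority if m is not x]
    neighbors = sorted(others, key=lambda m: math.dist(x, m))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

# Hypothetical 2-D minority class (numeric features only); the last row is an outlier.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [5.0, 5.0]]
rng = random.Random(42)
synthetic = [smote_point(x, minority, k=2, rng=rng) for x in minority]
print(synthetic)
```

Every synthetic point lies on a segment between two real minority rows. Note the outlier at (5, 5): its synthetic child lands somewhere between it and the main cluster, in a region where no real minority row exists.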

The strengths: no duplication, smoother decision boundary, often a real lift on logistic regression and shallow trees. The weaknesses are sharper than people admit. Linear interpolation in feature space is meaningless for categorical features — interpolating between "country=US" and "country=DE" gives you a non-existent country. And on noisy minority points, SMOTE amplifies the noise by sampling around it.

The variants address specific failure modes:

  • Borderline-SMOTE — synthesizes only near the decision boundary, where it matters
  • SMOTE-NC — handles mixed numeric and categorical features without producing nonsense
  • ADASYN — adaptive density, generates more synthetic points where the minority is "hard"
  • SMOTE + Tomek links — pairs oversampling with cleanup of borderline majority points; often the best combination on tabular data

Gotcha: never apply SMOTE before train/test split. The synthetic points leak information about the test distribution through the nearest-neighbor lookup, and your reported PR-AUC becomes a fiction.

Class weights and cost-sensitive learning

Instead of touching the data, you can touch the loss. Weighting the minority class higher in the loss function penalizes the model more for mispredicting it. No duplication, no synthetic noise, no dropped data — which is why on modern gradient boosters this is often the first thing to try.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced', max_iter=1000)
# 'balanced' computes w_c = n_samples / (n_classes * count(c))

'balanced' is the lazy default: a class at 1% prevalence gets a weight roughly 99x the majority. If your business cost is asymmetric (say, an FN on fraud costs 10x more than an FP), pass explicit weights.
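The 99x figure follows directly from the formula in the comment above. This sketch reimplements that formula in plain Python on made-up 99:1 labels; `balanced_weights` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter

def balanced_weights(labels):
    """w_c = n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

y = [0] * 990 + [1] * 10          # hypothetical 99:1 labels
w = balanced_weights(y)
print(w, "ratio:", w[1] / w[0])
```

The minority class gets a weight of 50 and the majority about 0.505, so each minority row counts roughly 99 times as much in the loss.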

model = LogisticRegression(class_weight={0: 1, 1: 10})

For XGBoost and LightGBM the parameter is named differently but the idea is identical:

from xgboost import XGBClassifier
ratio = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(scale_pos_weight=ratio, eval_metric='aucpr')

For PyTorch, pass the per-class weights to the loss as a tensor:

import torch
import torch.nn as nn
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 10.0]))

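On tabular data with gradient boosters, class weights often match or beat SMOTE without inflating training time. The model gets a cost-aware objective and you skip the synthetic-data debate entirely. The trade-off is that reweighting shifts predicted probabilities toward the minority class, so recalibrate if you need them as real probabilities.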

Focal loss and threshold tuning

Focal loss (Lin et al., 2017, originally for object detection in RetinaNet) is a re-weighting of cross-entropy that down-weights easy examples and focuses gradient on hard ones:

FL = -alpha * (1 - p_t)^gamma * log(p_t)

Here p_t is the probability the model assigns to the true class, and alpha is a class-balancing weight. The (1 - p_t)^gamma term is the focusing factor: when the model is already confident and correct, (1 - p_t) is near zero, and that example contributes almost nothing to the loss. Hard, ambiguous examples (the ones near the boundary) dominate the gradient. gamma=2 is the typical setting; gamma=0 reduces to alpha-weighted cross-entropy.
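A quick numeric check of the focusing factor, as a per-example sketch in plain Python. It assumes alpha=1 and evaluates three arbitrary values of p_t (the probability assigned to the true class).

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=1.0):
    """Focal loss for one example; p_t is the predicted probability of the true class."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

for p_t in (0.95, 0.5, 0.1):
    ce = focal_loss(p_t, gamma=0.0)   # gamma=0 is plain cross-entropy
    fl = focal_loss(p_t, gamma=2.0)
    print(f"p_t={p_t:.2f}  CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.4f}")
```

With gamma=2 the ratio FL/CE equals (1 - p_t)^2, so an easy example at p_t=0.95 keeps only 0.25% of its cross-entropy loss while a hard one at p_t=0.1 keeps 81%.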

Focal loss shines on extreme imbalance in deep learning (object detection, segmentation, sometimes NLP with very rare labels). On classic tabular ML it's overkill — class weights do the same job with less ceremony.

Threshold tuning is the most underrated trick in this entire toolkit. A binary classifier outputs probabilities, and the default of proba > 0.5 is almost never optimal on imbalanced data. Train the model normally, then sweep the threshold on validation to maximize whatever you actually care about.

import numpy as np
from sklearn.metrics import precision_recall_curve

proba = model.predict_proba(X_val)[:, 1]
prec, rec, thr = precision_recall_curve(y_val, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-9)
best_thr = thr[np.argmax(f1[:-1])]
print(f"Best threshold: {best_thr:.3f}, F1: {f1.max():.3f}")

Tuning the threshold costs nothing, breaks nothing, and often delivers the biggest single jump in business metric. If your interviewer asks for one technique they should always try first, this is the answer.
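The same sweep works for a business cost instead of F1. The sketch below is one possible setup, assuming an FN costs 10x an FP; the validation labels, scores, and the helper `expected_cost` are all invented for illustration.

```python
def expected_cost(y_true, proba, threshold, cost_fp=1.0, cost_fn=10.0):
    """Total cost of errors when predicting positive for proba >= threshold."""
    fp = sum(1 for y, p in zip(y_true, proba) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, proba) if y == 1 and p < threshold)
    return cost_fp * fp + cost_fn * fn

# Hypothetical validation labels and model scores.
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
proba = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.35, 0.30, 0.45, 0.60]

# Only thresholds equal to an observed score can change the predictions.
candidates = sorted(set(proba))
best_thr = min(candidates, key=lambda t: expected_cost(y_val, proba, t))
print(f"best threshold={best_thr}, cost={expected_cost(y_val, proba, best_thr)}")
print(f"cost at 0.5={expected_cost(y_val, proba, 0.5)}")
```

On this toy data the cost-minimizing threshold is 0.30 (two false positives, cost 2), while the default 0.5 misses a positive and costs 10. Pick the threshold on validation, then report the final numbers on the untouched test set.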

Common pitfalls

The single most common pitfall is resampling before the train/test split. When SMOTE runs on the full dataset, synthetic points generated near test rows leak information about the test distribution into training. Your reported PR-AUC looks excellent in the notebook and collapses in production. The fix is mechanical: split first, resample the training fold only, and never touch validation or test.

A second trap is reporting accuracy and stopping there. On a 99:1 split, accuracy 99% is the baseline floor, not a result worth showing. Always pair accuracy with precision, recall, the confusion matrix, and PR-AUC. If the panel sees only an accuracy number in your slide, they assume you don't understand the problem.

Applying SMOTE to categorical features without SMOTE-NC is a silent killer. Linear interpolation between two one-hot encoded categories produces a vector that doesn't correspond to any real category, and the model learns from this garbage. If your features include categoricals, either use SMOTE-NC, encode categoricals as embeddings first, or skip SMOTE entirely and use class weights.

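Balancing to 1:1 when production is 99:1 breaks calibration. The model now believes the world is 50/50 and outputs probabilities that overestimate the minority class everywhere. If you need calibrated probabilities (for expected-value decisions, for thresholding by cost), either train on the original ratio without reweighting and handle imbalance through the threshold, or recalibrate after training on held-out data in the original distribution, for example with CalibratedClassifierCV. Heavy class weights shift probabilities the same way resampling does, so they need the same recalibration.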
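To see how large the overestimate is, one option is the standard prior-shift correction on the odds. This is an analytical alternative to fitting a calibrator, not something the text above prescribes, and it assumes resampling changed only the class ratio. The function name `correct_for_prior` is hypothetical.

```python
def correct_for_prior(p_model, train_rate, true_rate):
    """Rescale a probability from a model trained at train_rate to the true base rate.

    Multiplies the predicted odds by the ratio of true odds to training odds.
    Assumes resampling changed only the class ratio, not the feature distributions.
    """
    odds = p_model / (1.0 - p_model)
    shift = (true_rate / (1.0 - true_rate)) / (train_rate / (1.0 - train_rate))
    corrected_odds = odds * shift
    return corrected_odds / (1.0 + corrected_odds)

# Hypothetical: model trained on a 1:1 balanced set, production is 99:1.
for p in (0.5, 0.9, 0.99):
    print(f"model says {p:.2f} -> corrected {correct_for_prior(p, 0.5, 0.01):.4f}")
```

A score of 0.5 from the balanced model corresponds to about 1% at the production base rate, and even 0.99 maps to only 0.5. Ranking is unchanged, so PR-AUC is unaffected, but any decision that uses the raw probability is off by a wide margin.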

Ignoring the business cost of FP vs FN is a meta-pitfall. Optimizing F1 when an FN is genuinely 10x more painful than an FP wastes model capacity. The first thing to ask a PM at Netflix or DoorDash before training: "what does each error cost in dollars or user trust?" That number sets the metric, the threshold, and the class weights all at once.

Finally, the seductive belief that SMOTE is always better. On modern gradient boosters (XGBoost, LightGBM, CatBoost) with scale_pos_weight, you typically match or beat SMOTE with one parameter and no synthetic data. SMOTE is a tool, not a default.

If you want to drill imbalance scenarios and ML interview questions until the patterns are automatic, NAILDD has 1500+ problems with this exact shape across DS, ML, and SQL.

FAQ

When should I use resampling versus class weights?

Class weights are the cheaper first move — no data is duplicated, no synthetic rows are invented, training time barely changes. Start there. If the result is still weak (low recall, poor PR-AUC), reach for SMOTE on tabular data or undersampling on very large datasets. On gradient boosters specifically, class weights via scale_pos_weight are often all you need and SMOTE rarely adds measurable lift.

What ratio counts as "serious" imbalance?

A rough working scale: 1:10 is mild (most models handle it with minor tweaks), 1:100 is serious (you need explicit handling — weights, resampling, or threshold tuning), 1:1,000 and beyond is extreme (focal loss, anomaly-detection framing, or even reframing as a one-class problem). The exact cutoff depends on dataset size and feature quality — 1:100 on 10M rows is much easier than 1:100 on 10k rows.

Does SMOTE help neural networks?

Less than it helps linear models and shallow trees. Networks can learn useful representations on their own, and synthetic interpolation in raw feature space often confuses them. At extreme imbalance, focal loss and class weights in the loss function tend to outperform SMOTE. SMOTE in embedding space (after a learned representation) sometimes works, but it's rarely worth the complexity in an interview answer.

How do I handle multi-class imbalance?

Use class_weight='balanced' for a global weighting, or pass explicit per-class weights when costs vary. For metrics, macro-F1 averages F1 equally across classes (treats rare classes as important as common ones), while weighted-F1 scales by class size (closer to accuracy). On multi-class SMOTE, set sampling_strategy='auto' to balance every minority class against the largest one.
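The gap between macro-F1 and weighted-F1 is easiest to see when a rare class is missed entirely. This sketch computes both by hand on made-up 3-class labels; `per_class_f1` is an illustrative helper.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating it as the positive label (one-vs-rest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical 3-class labels: class "c" is rare and always predicted as "a".
y_true = ["a"] * 80 + ["b"] * 15 + ["c"] * 5
y_pred = ["a"] * 80 + ["b"] * 15 + ["a"] * 5

classes = sorted(set(y_true))
f1s = {c: per_class_f1(y_true, y_pred, c) for c in classes}
macro_f1 = sum(f1s.values()) / len(classes)
weighted_f1 = sum(f1s[c] * y_true.count(c) for c in classes) / len(y_true)
print(f"macro-F1={macro_f1:.3f}, weighted-F1={weighted_f1:.3f}")
```

Weighted-F1 stays above 0.92 even though class "c" is never detected, while macro-F1 drops to about 0.66. If the rare class matters, macro-F1 is the number that shows the failure.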

Can threshold tuning replace resampling entirely?

In many cases, yes. Train the model on the original imbalanced data, get calibrated probability outputs, then sweep the threshold on validation to maximize the business metric (max F1, max recall at precision ≥ X, etc.). You keep calibration intact, you skip the resampling debate, and you get a tunable knob for ops to adjust later. It won't replace SMOTE on truly tiny minority classes where the model struggles to learn at all, but for the common 1:50 to 1:200 range it's often enough.

Is this guidance authoritative?

No. It's a synthesis of common interview patterns and the foundational papers — Chawla et al. 2002 for SMOTE, Lin et al. 2017 for focal loss — combined with imbalanced-learn and scikit-learn docs. Always validate on your specific dataset before shipping.