Cross-validation strategies for the DS interview
Why interviewers grill you on CV
When a hiring manager at Stripe, DoorDash, or Netflix asks how you validated a model, they are testing whether you can be trusted to ship without quietly burning revenue. The wrong cross-validation scheme is the fastest way to produce a model that posts 0.95 AUC offline and 0.60 in production. Senior interviewers have all seen this failure, and "how would you validate this?" is how they screen it out.
Three signals separate strong candidates from weak ones. The first is whether you reach for the right scheme by default — random k-fold for i.i.d. tabular classification, stratified for imbalanced labels, time-series split for timestamps, group k-fold when the same entity recurs. The second is whether you can name the specific leakage mode each scheme prevents. The third is whether you understand when CV is overkill and a single hold-out is cheaper and just as honest.
Load-bearing rule: if your data has a timestamp or a repeated entity (user, patient, document), random k-fold is wrong. Default to TimeSeriesSplit or GroupKFold, then justify any deviation.
Strategy comparison at a glance
| Scheme | Use when | Avoids leakage from | Typical k | Cost |
|---|---|---|---|---|
| K-fold | i.i.d. tabular, balanced labels | nothing structural | 5–10 | low |
| Stratified k-fold | classification with imbalance | minority class disappearing from a fold | 5–10 | low |
| Time-series split | timestamps in the data | future leaking into past | 3–10 | medium |
| Group k-fold | same entity appears in many rows | entity memorization | 5–10 | low |
| Leave-one-out | dataset under ~1k rows | none — high variance estimate | n | very high |
| Nested CV | small data + hyperparameter search | tuning bias inflating reported metric | 5×3 typical | very high |
This table is the one slide most candidates wish they had memorized before the on-site.
Plain k-fold
The default. Split the dataset into K equal parts, train on K-1, evaluate on the remaining one, rotate K times, average the metric. The output is one point estimate with a standard deviation — both numbers matter, because a 5-fold AUC of 0.82 ± 0.01 is a very different result from 0.82 ± 0.09.
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='roc_auc')
print(scores.mean(), scores.std())
Why shuffle=True is not optional. If rows arrived sorted by date or class, folds will be systematically biased. Shuffling is the fix for tabular i.i.d. data. The exception is time-series, where shuffling is itself the problem.
Choosing K. K=10 lowers bias because each training fold sees 90% of the data, but costs twice as much compute. K=5 is the sweet spot for most answers — fast, robust, and acceptable variance on anything above a few thousand rows.
Stratified k-fold
The moment your label distribution drifts from 50/50, plain k-fold produces folds that are representative on average but unbalanced individually. A fraud dataset with a 0.5% positive rate on 10,000 rows has only 50 positives, so each of 5 folds holds roughly 10. The random variation in that count alone can swing AUC by several points fold to fold for reasons unrelated to the model.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    ...
Stratified k-fold preserves the class proportions in every fold. It is the silent default whenever an sklearn classifier is passed through cross_val_score with an integer (or omitted) cv and a binary or multiclass target, although that implicit splitter does not shuffle. The interview answer to know: yes, it's the sklearn default for classifiers, but I would pass it explicitly to make the choice auditable.
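To make the fold-level imbalance concrete, the sketch below counts positives per validation fold for plain and stratified k-fold. The data is synthetic (10,000 rows with exactly 50 positives, matching the fraud example), and the helper name `positives_per_fold` is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels (assumption): 10,000 rows with exactly 50 positives (0.5%).
rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[rng.choice(10_000, size=50, replace=False)] = 1
X = np.zeros((10_000, 1))  # features do not matter for counting positives


def positives_per_fold(splitter):
    """Return the number of positive labels in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


plain_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=42))
strat_counts = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("plain k-fold:     ", plain_counts)
print("stratified k-fold:", strat_counts)
```

The plain k-fold counts drift around 10 depending on the seed, while the stratified counts stay fixed at 10 per fold.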
When you have both class imbalance and group structure — fraud prediction where the same merchant appears across rows — sklearn ships StratifiedGroupKFold since version 1.0. It preserves class proportions while keeping groups intact, and naming it is a strong senior signal.
Time-series CV
This is the leakage trap that produces the most catastrophic offline-versus-online metric gaps. If your data has a timestamp and your model will be deployed forward in time, you cannot shuffle. Random k-fold lets the model train on Tuesday and Friday to predict Wednesday, which is impossible in production.
The correct scheme is an expanding (or sliding) window:
fold 1: [train: 1-100 ][val: 101-110]
fold 2: [train: 1-110 ][val: 111-120]
fold 3: [train: 1-120 ][val: 121-130]
fold 4: [train: 1-130 ][val: 131-140]
from sklearn.model_selection import TimeSeriesSplit
tss = TimeSeriesSplit(n_splits=5, gap=7, test_size=10)
for train_idx, val_idx in tss.split(X):
    ...
Expanding versus sliding window. Expanding keeps growing the training set, which is honest for systems that retrain on all history. Sliding holds training at a fixed length (max_train_size in TimeSeriesSplit), which is appropriate when older patterns no longer apply (concept drift, regime change).
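The two window styles can be compared directly with TimeSeriesSplit. This is a small sketch on 140 synthetic time-ordered rows; the choice of `max_train_size=50` for the sliding variant is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(140).reshape(-1, 1)  # 140 time-ordered rows (synthetic)

# Expanding window: every fold trains on all rows before its validation block.
expanding = TimeSeriesSplit(n_splits=4, test_size=10)
# Sliding window: max_train_size caps training at the most recent 50 rows.
sliding = TimeSeriesSplit(n_splits=4, test_size=10, max_train_size=50)

expanding_sizes = [len(train_idx) for train_idx, _ in expanding.split(X)]
sliding_sizes = [len(train_idx) for train_idx, _ in sliding.split(X)]

print("expanding train sizes:", expanding_sizes)
print("sliding train sizes:  ", sliding_sizes)
```

The expanding sizes grow by one validation block per fold, while the sliding sizes stay constant.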
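The gap parameter. If your target is known with a 7-day lag (a delinquency label, a chargeback), then the last 7 days of training rows carry labels that were only resolved during the validation window, so they leak the future into training. Setting gap=7 enforces a buffer, with one caveat: gap counts rows, not days, so it equals 7 days only when the data has one row per day. López de Prado's Advances in Financial ML generalizes this into "purged" and "embargo" k-fold, worth naming at Two Sigma or Citadel.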
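A quick way to check what the gap does is to print the last training index and the first validation index of each fold. The sketch assumes 60 synthetic rows at one row per day, so that a gap of 7 rows corresponds to 7 days.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(60).reshape(-1, 1)  # assumption: one row per day, 60 days

tss = TimeSeriesSplit(n_splits=3, test_size=10, gap=7)
boundaries = []
for train_idx, val_idx in tss.split(X):
    # Record where training stops and where validation starts.
    boundaries.append((int(train_idx[-1]), int(val_idx[0])))

for last_train, first_val in boundaries:
    skipped = first_val - last_train - 1
    print(f"train ends at {last_train}, validation starts at {first_val}, "
          f"rows skipped: {skipped}")
```

With several rows per day, the gap would need to be scaled up to the number of rows that span the label lag.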
Feature-side leakage. The split is not enough. A rolling 30-day mean computed before splitting leaks future into past. Compute lag features, rolling stats, and target encodings inside each fold. This is where most answers fall apart even after candidates correctly name TimeSeriesSplit.
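One way to keep a target encoding inside each fold is sketched below. The `merchant` column, the synthetic data, and the helper `encode_fold` are all hypothetical; the point is only that the category means come from the training rows of the fold and are then applied to the validation rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
# Synthetic frame (assumption): rows are already sorted by time.
df = pd.DataFrame({
    "merchant": rng.choice(["a", "b", "c"], size=200),  # hypothetical feature
    "y": rng.integers(0, 2, size=200),
})


def encode_fold(train, val, col="merchant", target="y"):
    """Target-encode `col` using statistics from the training rows only."""
    means = train.groupby(col)[target].mean()
    prior = train[target].mean()  # fallback for categories unseen in training
    train_enc = train[col].map(means)
    val_enc = val[col].map(means).fillna(prior)
    return train_enc, val_enc


for train_idx, val_idx in TimeSeriesSplit(n_splits=4, test_size=20).split(df):
    train, val = df.iloc[train_idx], df.iloc[val_idx]
    train_enc, val_enc = encode_fold(train, val)
    # Fit on train_enc and score on val_enc here; validation targets are
    # never used to build the encoding.
```

The same pattern applies to rolling statistics and lag features: compute them from rows available up to the training cutoff, then apply them to the validation block.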
Group k-fold
When the same entity appears multiple times in your data, random splits will place some of its rows in training and some in validation. The model then memorizes the entity rather than learning the underlying pattern, and your CV metric overstates generalization by a wide margin.
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=user_ids):
    ...
The canonical cases are not subtle once listed. Multiple purchases per customer in a conversion model. Multiple sessions per user in churn. Multiple slides per patient in medical imaging. Multiple listings per host on Airbnb. Multiple rides per driver at Uber. The right unit of split is the entity, not the row.
The interview answer: I would split by the entity that the model will see fresh in production. If you score new users at Notion, split by user. If you score new merchants at Stripe, split by merchant. The right answer is defined by the production scoring boundary, not by what is convenient at modeling time.
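A simple audit for group leakage is to count how many entities appear on both sides of each split. The sketch below uses synthetic `user_ids` (about 10 rows per user) and a hypothetical helper `shared_entities`, and compares shuffled k-fold with GroupKFold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

rng = np.random.default_rng(0)
user_ids = rng.integers(0, 100, size=1_000)  # synthetic: ~10 rows per user
X = rng.normal(size=(1_000, 4))
y = rng.integers(0, 2, size=1_000)


def shared_entities(splitter, groups):
    """Count entities present in both train and validation, per fold."""
    counts = []
    for train_idx, val_idx in splitter.split(X, y, groups=groups):
        overlap = np.intersect1d(groups[train_idx], groups[val_idx])
        counts.append(len(overlap))
    return counts


random_overlap = shared_entities(
    KFold(n_splits=5, shuffle=True, random_state=42), user_ids
)
group_overlap = shared_entities(GroupKFold(n_splits=5), user_ids)

print("users shared under random k-fold:", random_overlap)
print("users shared under group k-fold: ", group_overlap)
```

Random k-fold shares most users across every split, while GroupKFold shares none.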
Leave-one-out and leave-p-out
LOO is k-fold with K = N. Every observation gets to be the validation set exactly once. It minimizes bias of the performance estimate (the model trains on N-1 rows) at the cost of maximum variance and N times the training cost.
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
The honest answer is that LOO is rarely the right choice on a modern dataset. Under 1,000 rows with expensive labeling (a clinical trial, a manually annotated benchmark), LOO earns its place. Above 10,000 rows, 5-fold or 10-fold dominates LOO on every axis except theoretical bias. Naming LOO and then explaining why you would not use it is the strong-candidate move.
Leave-p-out generalizes by holding out p observations per iteration. It is combinatorially expensive and effectively never used in industry.
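The cost difference is easy to quantify, since leave-p-out needs one model fit per way of choosing p rows out of n. The sketch below counts the fits for a synthetic 100-row dataset and uses `math.comb` to show how the number grows at 1,000 rows.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.zeros((100, 1))  # synthetic: 100 rows

n_loo = LeaveOneOut().get_n_splits(X)    # one fit per row
n_lpo = LeavePOut(p=2).get_n_splits(X)   # one fit per pair of rows

print("leave-one-out fits on 100 rows:", n_loo)
print("leave-2-out fits on 100 rows:  ", n_lpo)
print("leave-2-out fits on 1,000 rows:", comb(1_000, 2))
```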
Nested CV for hyperparameter tuning
If you use the same CV split to choose hyperparameters and to report performance, the reported metric is optimistically biased — you have implicitly tuned to the validation folds. Nested CV fixes this by wrapping an inner CV (for tuning) inside an outer CV (for reporting).
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
clf = GridSearchCV(estimator=model, param_grid=grid, cv=inner_cv)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
Nested CV is mandatory when publishing, when tuning bias is large relative to true error, or when an Anthropic or OpenAI interviewer probes for it. On industrial datasets with hundreds of thousands of rows, a train / validation / test three-way split is cheaper and just as honest: tune on validation, report on test, never touch test again. Picking the right one for the scale of data is the senior answer.
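A three-way split can be sketched with two calls to `train_test_split`. The 60/20/20 proportions and the synthetic data are assumptions, and the random split only suits i.i.d. rows; data with timestamps or repeated entities would need the time-based or group-based equivalent.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))      # synthetic i.i.d. features
y = rng.integers(0, 2, size=10_000)   # synthetic binary labels

# First carve off the test set, which is touched once for the final report.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Then split the remainder into train and validation (0.25 of 80% = 20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```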
Common pitfalls
Random shuffling on time-series data. The number-one cause of inflated offline metrics in production. The candidate who says "I ran 5-fold and got 0.93 AUC" on a churn model with timestamps has failed the loop. Reach for TimeSeriesSplit whenever a timestamp exists, and explain why before being asked.
Skipping stratification with imbalanced labels. With a positive rate under 5%, a non-stratified fold can end up with one or two positives, or none at all. The metric becomes unstable for reasons unrelated to the model. Stratified k-fold costs nothing extra, and although sklearn applies it by default for classifiers when cv is an integer, passing it explicitly keeps the choice visible.
Splitting users across train and validation. This is group leakage. The model learns "this specific user converts" instead of "users with these features convert", inflating CV by however much user identity explains the target. Ask yourself: what is the unit of generalization at scoring time? Split on that.
Fitting preprocessing before the CV loop. Running StandardScaler.fit(X) on the full dataset before cross_val_score leaks statistics from validation into training. Wrap preprocessing and model in a Pipeline so fit is called inside each fold — make_pipeline(StandardScaler(), model) is the one-liner.
Tuning against the test set. If you re-tune three times on test, test has stopped being honest. Touch test exactly once for the metric you report. Iterate on validation.
Ignoring gap in time-series CV. A delinquency label with a 30-day lag means training contains rows whose labels you would not have known at scoring time. Set gap to cover at least the label resolution period, and remember that gap is counted in rows, not in days.
Treating LOO as the gold standard. A textbook reflex that does not survive industrial data. The LOO estimate is often noisier than 5- or 10-fold on realistic data, and it costs N model fits instead of 5 or 10. Use it only under a thousand rows.
Forgetting random_state. Two runs producing different numbers is not science. Fix random_state in every splitter, shuffler, and model that takes a seed.
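The preprocessing pitfall above is the easiest one to fix in code. A minimal sketch, assuming a synthetic dataset from `make_classification` and a logistic regression as a stand-in model: the scaler sits inside the pipeline, so it is re-fit on the training portion of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a stand-in model (assumptions for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The scaler is a pipeline step, so cross_val_score calls fit on it
# using only the training rows of each fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")

print(scores.mean(), scores.std())
```

Fixing `random_state` on both the data generator and the splitter also makes the printed numbers repeatable from run to run.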
FAQ
How many folds should I pick?
Five is the default for almost any industrial dataset and the answer to give unless the interviewer pushes back. Ten lowers bias slightly at twice the cost — worth it when the dataset is small (under a few thousand rows) and the metric variance matters. Three is fine on huge datasets where each fold is already statistically large and compute is a real constraint. The number itself is rarely the most interesting part of the answer; the scheme is.
Does stratified CV exist for regression?
Yes, by bucketing the target. Bin y into quartiles or deciles, then stratify on the bin index. This is especially useful when the target is heavily skewed — log-transformed revenue, lifetime value, latency tail metrics — because random folds can otherwise leave you with one fold that has all the high-value outliers. StratifiedKFold does not do this for you automatically with continuous targets, so you build the bins manually and pass them as the stratification key.
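The binning approach can be sketched as follows. The lognormal target is synthetic and the choice of deciles is an assumption; `pd.qcut` produces the bin index, which is passed to StratifiedKFold in place of the continuous target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=1.0, size=1_000)  # synthetic skewed target
X = rng.normal(size=(1_000, 3))

# Decile index 0..9 for each row; this is the stratification key.
y_bins = np.asarray(pd.qcut(y, q=10, labels=False))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
top_decile_counts = []
for train_idx, val_idx in skf.split(X, y_bins):
    # Models are still trained on the continuous y; bins only drive the split.
    top_decile_counts.append(int((y_bins[val_idx] == 9).sum()))

print("top-decile rows per validation fold:", top_decile_counts)
```

Each validation fold receives an equal share of the highest-value rows instead of leaving them concentrated in one fold.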
Can I use plain k-fold on time-series data?
Only if the prediction task is not forward-in-time. If you are predicting a property of a user that does not depend on the order of events (geographic segment, language preference, propensity within a fixed window with no future leakage), random k-fold is fine. The moment the target is what happens next, the answer is no.
What is purged k-fold?
A time-series CV variant from quantitative finance that adds a purge (removing training samples whose labels overlap the validation window) and an embargo (a buffer period after validation where training samples are also dropped). The name is from López de Prado's Advances in Financial Machine Learning, and it is the right answer for any role at a hedge fund or quant trading desk. Naming it correctly is a senior-level signal in those interviews.
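The idea can be sketched with plain index arithmetic. This is a simplified illustration, not the book's reference implementation: it assumes rows are time-ordered and that the label of row i is determined by rows i through i + `label_horizon`. The function name and both parameters are hypothetical.

```python
import numpy as np


def purged_kfold_indices(n_samples, n_splits=5, label_horizon=5, embargo=5):
    """Yield (train_idx, val_idx) with purging and an embargo.

    Simplifying assumptions: rows are time-ordered, and the label of row i
    depends on rows i .. i + label_horizon.
    """
    indices = np.arange(n_samples)
    for val_idx in np.array_split(indices, n_splits):
        val_start, val_end = val_idx[0], val_idx[-1]
        # Purge: drop earlier rows whose label window reaches into validation.
        before = indices[indices < val_start - label_horizon]
        # Validation labels extend label_horizon rows past val_end; drop those
        # rows, plus an extra embargo buffer, from the later training rows.
        after = indices[indices > val_end + label_horizon + embargo]
        yield np.concatenate([before, after]), val_idx


folds = list(purged_kfold_indices(100, n_splits=5, label_horizon=5, embargo=5))
train_idx, val_idx = folds[1]
print("validation rows:", val_idx[0], "to", val_idx[-1])
print("training rows kept before:", train_idx[train_idx < val_idx[0]].max())
print("training rows resume at:  ", train_idx[train_idx > val_idx[-1]].min())
```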
My CV metric is much higher than production performance — what happened?
The shortlist is: group leakage (same user in train and val), time leakage (random shuffle on temporal data), feature leakage (target-derived feature computed on the full dataset), distribution shift between training data and production traffic, or label drift over time. Run adversarial validation — train a classifier to distinguish training rows from production rows; if it succeeds, your data has shifted. Re-audit the CV scheme against the production scoring boundary.
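Adversarial validation can be sketched in a few lines. The data below is synthetic, the shift of 1.5 on one feature is an arbitrary assumption, and `adversarial_auc` is a hypothetical helper; it labels training rows 0 and production rows 1 and reports how well a classifier separates them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def adversarial_auc(train_X, prod_X, seed=0):
    """Cross-validated AUC of a classifier separating train from production."""
    X = np.vstack([train_X, prod_X])
    is_prod = np.r_[np.zeros(len(train_X)), np.ones(len(prod_X))]
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, is_prod, cv=cv, scoring="roc_auc").mean()


rng = np.random.default_rng(0)
train_X = rng.normal(size=(500, 4))
prod_same = rng.normal(size=(500, 4))   # same distribution as training
prod_shifted = rng.normal(size=(500, 4))
prod_shifted[:, 0] += 1.5               # assumed shift in one feature

auc_same = adversarial_auc(train_X, prod_same)
auc_shifted = adversarial_auc(train_X, prod_shifted)
print(f"no shift: {auc_same:.2f}, shifted: {auc_shifted:.2f}")
```

An AUC near 0.5 means the two samples are indistinguishable; a value well above that points to a shift, and the classifier's strongest features indicate where to look.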
Is this content official?
No. This guide is based on the scikit-learn documentation, Hastie, Tibshirani, and Friedman's Elements of Statistical Learning, and López de Prado's Advances in Financial Machine Learning, distilled into interview-relevant patterns. Always cross-check the latest sklearn API for parameter names that may have changed.