Data drift in the data science interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why drift owns the production loop

The model you shipped on Monday is not the model running on Friday. Inputs shift, user behavior shifts, upstream pipelines silently rename a column, and an offline AUC of 0.86 quietly collapses to 0.71 in production, with nobody noticing for three weeks. That is why data drift is now a standard mid-loop question at Meta, Stripe, DoorDash, and Anthropic-style ML interviews.

The interviewer is not testing whether you can recite the textbook definition of covariate shift. They want to know if you can build the loop: monitor the right features, pick the right test, set a threshold that does not page on-call at 3 a.m. for noise, and decide whether to retrain, roll back, or fix the pipeline. This is the difference between a junior who reads about MLOps and a senior who owns a model in production.

Coverage: definitions of the three drift families, the detection toolkit, PSI vs KL, the retraining playbook, and the pitfalls that catch candidates.

Three flavors of drift

The single most common interview mistake is conflating data drift with concept drift. They are different problems with different fixes, and senior interviewers will not let you slide on this.

Data drift (covariate shift). The distribution of inputs P(X) changes, while the conditional P(Y|X) stays the same. The classic example: your recommender model trained on US users now serves traffic from a new market launch in Brazil. Feature distributions move; the underlying physics of the problem do not. A well-generalized model may absorb this without much damage.

Concept drift. The conditional P(Y|X) changes — the mapping from features to target. A fraud model trained in 2019 sees the same transaction_amount and merchant_category features, but fraudsters changed tactics. The features look identical; the label they predict no longer holds. Concept drift is the dangerous one because the features themselves can look perfectly stable while the model rots.

Label drift (prior shift). The marginal P(Y) changes — class balance shifts. A churn model trained when monthly churn was 3% now serves a market where churn jumped to 8% after a pricing change. Calibrated probabilities drift even if P(X|Y) is stable.
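The recalibration step for prior shift can be sketched with the standard Bayes-rule rescaling. This is a minimal illustration, not a full recalibration pipeline: it assumes the model's scores are calibrated on the training base rate and that P(X|Y) really is stable. The function name and the 0.10 example score are made up for illustration.

```python
def adjust_for_prior_shift(p, train_prior, new_prior):
    """Rescale a calibrated P(y=1|x) from the training base rate to a new one.

    Assumes pure prior shift: P(X|Y) unchanged, only P(Y) moved.
    """
    pos = p * new_prior / train_prior
    neg = (1.0 - p) * (1.0 - new_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


# A score of 0.10 produced under a 3% churn rate, re-expressed for an 8% rate.
adjusted = adjust_for_prior_shift(0.10, train_prior=0.03, new_prior=0.08)
print(round(adjusted, 3))  # 0.238
```

The same score means a much higher churn probability once the base rate has moved, which is why thresholds tuned on the old prior stop behaving as expected.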

Drift type    | What moves | Features look | Fix
Data drift    | P(X)       | Different     | Retrain or extend training set
Concept drift | P(Y | X)   | Same          | Retrain with fresh labels, possibly new features
Label drift   | P(Y)       | Same          | Recalibrate, reweight, retrain

Load-bearing trick: if your features look stable but performance is sliding, you are almost certainly looking at concept drift — and no amount of input monitoring will catch it. You need label feedback or a proxy.

Detection methods that interviewers ask for

Detection splits into four families. A strong answer mentions at least two and explains when each is the right tool.

Statistical hypothesis tests. Kolmogorov-Smirnov for continuous features, chi-square for categoricals, Mann-Whitney U when you care about the median. Compute the test between a reference window (last training set) and a current window (last 7 days). A p-value below 0.05 flags the feature. The catch: with millions of rows, even a negligible difference comes out statistically significant, and with dozens of features tested daily some will cross 0.05 by chance. Pair tests with effect-size cutoffs.
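To see why significance alone is not enough at scale, here is a small NumPy-only sketch of the two-sample KS statistic. It uses the asymptotic 5% critical value (coefficient 1.358) in place of an exact p-value, and the 0.05 effect-size cutoff is a hypothetical choice, not a standard. The data are synthetic.

```python
import numpy as np


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    reference = np.sort(reference)
    current = np.sort(current)
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / len(reference)
    cdf_cur = np.searchsorted(current, grid, side="right") / len(current)
    return float(np.max(np.abs(cdf_ref - cdf_cur)))


def ks_critical_value(n, m, coefficient=1.358):
    """Asymptotic rejection threshold; 1.358 corresponds to alpha = 0.05."""
    return coefficient * np.sqrt((n + m) / (n * m))


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 1_000_000)
current = rng.normal(0.02, 1.0, 1_000_000)  # shifted by 0.02 standard deviations

d = ks_statistic(reference, current)
significant = d > ks_critical_value(len(reference), len(current))
material = d > 0.05  # hypothetical effect-size cutoff on the statistic itself

print(f"D={d:.4f} significant={significant} material={material}")
```

With a million rows per window, a shift of 0.02 standard deviations is rejected at the 5% level even though the CDFs never differ by more than about one percentage point. Thresholding on D (or another effect size) keeps that kind of shift out of the alert queue.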

Distance metrics. PSI, KL divergence, Jensen-Shannon, and Wasserstein give you a scalar to threshold and chart. PSI is the industry default in credit risk. KL is information-theoretic and asymmetric. Wasserstein behaves well on continuous shifts and needs no binning at all.

Classifier-based detection. Train a binary classifier to discriminate reference from current. If held-out ROC-AUC > 0.7, the distributions are distinguishable. The killer feature: importances from this discriminator tell you which columns moved — the multivariate detector that catches drifts univariate tests miss.
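A minimal sketch of the discriminator idea, kept NumPy-only so it is self-contained. Logistic regression trained by gradient descent is swapped in here for brevity; it only picks up shifts a linear boundary can separate, and in practice a tree ensemble is a more common choice. The function names, the 70/30 split, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum identity (assumes continuous scores, no ties)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def domain_classifier_drift(reference, current, steps=500, lr=0.1, seed=0):
    """Fit a classifier to tell reference rows (0) from current rows (1)."""
    rng = np.random.default_rng(seed)
    X = np.vstack([reference, current])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    # Standardize so the absolute weights are comparable across features.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    order = rng.permutation(len(y))
    split = int(0.7 * len(y))
    train, test = order[:split], order[split:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):  # plain batch gradient descent on log-loss
        p = 1.0 / (1.0 + np.exp(-(X[train] @ w + b)))
        w -= lr * X[train].T @ (p - y[train]) / len(train)
        b -= lr * np.mean(p - y[train])
    auc = roc_auc(y[test], X[test] @ w + b)  # scored on held-out rows only
    return auc, np.abs(w)


rng = np.random.default_rng(1)
reference = rng.normal(size=(5000, 3))
current = rng.normal(size=(5000, 3))
current[:, 1] += 1.0  # only the second feature drifts

auc, importance = domain_classifier_drift(reference, current)
print(round(float(auc), 2), int(np.argmax(importance)))
```

The held-out AUC lands above the 0.7 cutoff, and the largest weight points at the column that was shifted, which is the triage signal you would trace upstream.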

Performance monitoring with delayed labels. When ground truth arrives with lag (clicks in minutes, conversions in days, fraud in months), monitor live AUC, precision-at-k, log-loss, or business metrics directly. The gold standard when you can afford the latency. Pair with distribution-level early warnings so you are not waiting two weeks for confirmation.

PSI and KL divergence side by side

These two come up in nearly every drift round. Know them by heart.

PSI (Population Stability Index). Bin the variable into 10 to 20 buckets, compute the share in each bucket for reference and current, and sum.

PSI = Σ (p_new(i) - p_old(i)) · ln(p_new(i) / p_old(i))

Interpretation cutoffs you can quote in an interview:

  • PSI < 0.10 — no meaningful change.
  • 0.10 ≤ PSI < 0.25 — moderate shift, investigate.
  • PSI ≥ 0.25 — major drift, action required.

PSI is symmetric (formally it is the symmetrized KL divergence, KL(P||Q) + KL(Q||P)), interpretable, and comes with industry-accepted thresholds. It is unbounded in theory, but with smoothed bins it stays in a small range in practice. The accepted thresholds are why credit risk teams love it: you do not have to defend the threshold to a compliance committee.

KL divergence. The information-theoretic distance between distributions:

KL(P || Q) = Σ P(x) · log(P(x) / Q(x))

KL is asymmetric: KL(P||Q) ≠ KL(Q||P). It blows up when Q(x) = 0 and P(x) > 0, which is why you smooth with a small epsilon or switch to JS divergence (JSD = 0.5·KL(P||M) + 0.5·KL(Q||M), symmetric and bounded by log 2). Wasserstein distance is another popular alternative because it handles continuous distributions without binning and respects the underlying geometry.
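The asymmetry and the blow-up on empty bins are easy to see on a toy example. The sketch below assumes the inputs are already binned shares; the epsilon value and the two example distributions are arbitrary choices for illustration.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) for two discrete distributions, with epsilon smoothing."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # the mixture is nonzero wherever p or q is
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.5, 0.5, 0.0]  # third bin is empty in P
q = [0.1, 0.6, 0.3]

print(kl_divergence(p, q))  # moderate: P puts no mass where Q does
print(kl_divergence(q, p))  # large: Q has mass in a bin P calls (almost) impossible
print(js_divergence(p, q), js_divergence(q, p))  # identical in both directions
```

Note that the large direction depends almost entirely on the epsilon you picked, which is a good reason to prefer JS or PSI with a documented smoothing constant when the number ends up on a dashboard.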

Quick example in Python:

import numpy as np

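def psi(expected, actual, bins=10):
    # Quantile bin edges from the reference sample, so each bin holds ~1/bins of it
    # (fixed edges on [0, 1] would silently drop any value outside that range)
    breakpoints = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    # Open the outer edges so current values beyond the reference range are counted
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_pct = np.histogram(actual, breakpoints)[0] / len(actual)
    # Smooth empty bins so the log ratio stays finite
    expected_pct = np.where(expected_pct == 0, 1e-6, expected_pct)
    actual_pct = np.where(actual_pct == 0, 1e-6, actual_pct)
    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))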

Sanity check: before alerting on PSI, run a stability test on two consecutive reference windows from the same historical period. If PSI between two pre-deployment weeks is already 0.15, your bins are too fine for the window size or your feature is naturally noisy. Recalibrate the threshold before shipping monitoring.
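One way to run that stability test is to compute PSI between each pair of consecutive historical windows and treat the maximum as a noise floor. The sketch below is self-contained and uses synthetic lognormal weeks; `psi_noise_floor` is a hypothetical helper name, and the compact `psi` here is a quantile-binned variant written for this example.

```python
import numpy as np


def psi(expected, actual, bins=10):
    """PSI with quantile bins taken from the reference sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]  # interior cuts
    e = np.bincount(np.searchsorted(edges, expected), minlength=bins) / len(expected)
    a = np.bincount(np.searchsorted(edges, actual), minlength=bins) / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # smooth empty bins
    return float(np.sum((a - e) * np.log(a / e)))


def psi_noise_floor(weekly_windows, bins=10):
    """PSI between each pair of consecutive historical windows."""
    return [psi(prev, curr, bins) for prev, curr in zip(weekly_windows, weekly_windows[1:])]


rng = np.random.default_rng(42)
# Six synthetic pre-deployment weeks drawn from the same distribution.
weeks = [rng.lognormal(3.0, 1.0, 5_000) for _ in range(6)]
floor = psi_noise_floor(weeks)
drifted = psi(weeks[-1], rng.lognormal(4.0, 1.0, 5_000))

print(f"noise floor max={max(floor):.4f} drifted={drifted:.2f}")
```

If the noise floor on real history sits anywhere near 0.10, the usual cutoffs will page on nothing; coarser bins or longer windows bring it down.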

What to do when drift fires

Detection is the cheap part. The expensive part is the response, and this is where the interview pivots into a systems question.

Investigate before retraining. Which feature moved, and why? A drift on country_code because marketing launched a new region is different from a drift on transaction_amount because an upstream ETL started writing cents instead of dollars. The second is a bug, and retraining on bad data makes it worse. Rank features with the classifier-based detector, then trace each suspect upstream.

Retrain on fresh data. If drift is genuine and labels exist, retraining on a recent window is usually right. Validate offline on a recent holdout before promoting. Sanity check: if retraining on the last 30 days underperforms the current model on the last 7, your detector is firing on noise.

Online or incremental learning. For high-velocity streams (ad ranking, fraud) where nightly is too slow, partial fits keep the model fresh. The trade-off is stability — a noisy day can poison the model — so pair with learning-rate decay and shadow scoring.

Feature engineering update. When drift is systematic (seasonality, macro shifts, regime changes), add features that explain the shift — time-of-week, holiday flags, macro indices, deployment-version tags. The model then absorbs the drift as signal.

Sliding window training. Train on the last N days instead of all history. Pick N by cross-validating on rolling origins. Standard at companies with seasonal cycles like DoorDash and Uber.
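Rolling-origin evaluation for choosing N can be sketched as a generator of train and test day ranges. This is only the splitting logic, with a hypothetical 7-day horizon and step; the model fit and the metric are left to whatever training code you already have.

```python
def rolling_origin_splits(n_days, window, horizon=7, step=7):
    """Yield (train_days, test_days) ranges for a sliding training window.

    Each split trains on `window` days and tests on the `horizon` days
    immediately after, then the origin moves forward by `step` days.
    """
    start = window
    while start + horizon <= n_days:
        yield range(start - window, start), range(start, start + horizon)
        start += step


splits = list(rolling_origin_splits(n_days=60, window=30))
for train_days, test_days in splits:
    print(f"train {train_days.start}-{train_days.stop - 1}, "
          f"test {test_days.start}-{test_days.stop - 1}")
```

Run this for each candidate N (say 14, 30, 60, 90), average the holdout metric across splits, and keep the N with the best average. Testing only on days after the training window is what keeps the comparison honest.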

Tooling and monitoring stack

You should be able to name two open-source tools and one commercial vendor without breaking eye contact.

Evidently AI is the most common open-source choice: it produces HTML reports and integrates with Airflow, Prefect, and Dagster. Whylogs summarizes each batch into a compact statistical profile and compares profiles over time; the profiles are cheap to ship over the wire.

Arize, Fiddler, and Aporia lead the commercial side with managed dashboards, alerting, and root-cause analysis. MLflow plus a custom drift script is the standard for teams already on MLflow. Vertex AI Model Monitoring and SageMaker Model Monitor are cloud-native if you live inside one ecosystem.

Tool                | Open source | Best for
Evidently AI        | Yes         | Batch jobs, scheduled reports
Whylogs             | Yes         | Streaming, low-bandwidth profiles
Arize / Fiddler     | No          | Managed dashboards, alerting
MLflow + custom     | Yes         | Teams already on MLflow
SageMaker / Vertex  | No          | Cloud-native, single-vendor stacks

The honest answer: most production teams write a thin custom layer on top of one of these because every product has weird business-logic checks the off-the-shelf tools do not cover.

Common pitfalls

When teams set up drift monitoring for the first time, the most common mistake is waiting for the metric to drop. By the time accuracy or revenue moves visibly, users have been getting bad predictions for days or weeks. The fix is to layer distribution-level monitoring on top of performance monitoring so you see the input shift before the output shift, then triage which alerts are worth waking someone up for.

A second trap is alerting on every change. With dozens of features and daily windows, you will get drift signals every single day. Suppress noise with effect-size thresholds (PSI ≥ 0.25, not just p < 0.05), multi-window confirmation (a feature has to be flagged for 3 consecutive days), and severity tiers. The team should triage drift alerts the same way they triage SEV-3 versus SEV-1 incidents.
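Multi-window confirmation is a few lines of plain Python. The sketch below assumes you already store one PSI value per feature per day; the feature names and values are made up, and the 0.25 threshold and 3-day rule are the ones quoted above.

```python
def confirmed_alerts(daily_psi, threshold=0.25, consecutive=3):
    """Return features whose PSI was at or above threshold on each of the
    last `consecutive` days."""
    alerts = []
    for feature, series in daily_psi.items():
        recent = series[-consecutive:]
        if len(recent) == consecutive and all(v >= threshold for v in recent):
            alerts.append(feature)
    return alerts


daily_psi = {  # hypothetical daily PSI history, oldest first
    "transaction_amount": [0.08, 0.31, 0.29, 0.33],
    "country_code": [0.05, 0.40, 0.07, 0.28],
    "merchant_category": [0.02, 0.03, 0.02, 0.04],
}
print(confirmed_alerts(daily_psi))  # ['transaction_amount']
```

`country_code` crossed the threshold twice but never three days running, so it stays out of the page and can go to a lower-severity digest instead.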

The third pitfall is monitoring only the global distribution. A model can look healthy on the population while a critical subgroup — your top 1% revenue customers, or one specific country — is severely drifted. Segmented monitoring on the slices that drive business value catches this. This is also why dashboards that show only headline AUC miss the silent regressions.

The fourth is leaning entirely on ground truth. Labels often arrive with multi-day or multi-week lag — payment defaults, lifetime churn, fraud chargebacks. If you wait for labels to confirm a regression, the damage is done. Distribution-based early warnings buy you the lead time labels cannot give you.

The fifth is assuming drift always means retrain. Often the root cause is a pipeline bug, a schema change upstream, or a logging mistake. Retraining on corrupted data produces a worse model and burns a model-deployment slot. Always investigate the source of the drift before you touch the training pipeline.

If you want to drill drift-detection and MLOps interview questions every day, NAILDD ships hundreds of ML system-design problems built around exactly this pattern.

FAQ

Can I use KS test for multivariate drift detection?

KS is strictly univariate: it compares two empirical CDFs on a single dimension. For multivariate drift, switch to Maximum Mean Discrepancy (MMD) with a Gaussian kernel, energy distance, or the classifier-based approach. With a characteristic kernel such as the Gaussian, MMD is a proper distance between distributions, and it works directly on raw feature vectors without binning. The classifier approach is easier to explain to non-ML stakeholders because feature importances tell you which columns moved.
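For reference, a biased estimate of squared MMD with a Gaussian kernel fits in a few lines of NumPy. The bandwidth of 2.0 is an arbitrary choice for this synthetic 5-dimensional example (a median-distance heuristic is a common alternative), and the pairwise kernel matrices make this suitable for samples in the low thousands, not full traffic.

```python
import numpy as np


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian (RBF) kernel."""
    def kernel(a, b):
        sq_dists = (np.sum(a**2, axis=1)[:, None]
                    + np.sum(b**2, axis=1)[None, :]
                    - 2 * a @ b.T)
        return np.exp(-np.maximum(sq_dists, 0.0) / (2 * bandwidth**2))

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean())


rng = np.random.default_rng(7)
reference = rng.normal(size=(500, 5))
same = rng.normal(size=(500, 5))           # fresh sample, same distribution
shifted = rng.normal(size=(500, 5)) + 0.5  # every feature moved by 0.5

baseline = mmd_squared(reference, same, bandwidth=2.0)
drifted = mmd_squared(reference, shifted, bandwidth=2.0)
print(f"same={baseline:.4f} shifted={drifted:.4f}")
```

The raw number has no universal cutoff. A common way to get one is a permutation test: shuffle the pooled rows between the two groups many times and compare the observed value against that null distribution.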

How big should the reference and current windows be?

Big enough that the test has power, small enough to catch drift before damage. Starting point: reference is the last training set (weeks to months), current is a rolling 1 to 7 days. With huge traffic, 1 day is plenty. With a thousand requests a day, you need a longer window to avoid false negatives. Validate by injecting known drift and checking detection works.

Is data drift the same as out-of-distribution detection?

They overlap but differ. Data drift is a population-level question: has the input distribution shifted between two windows? OOD detection is per-sample: is this request outside the training distribution? Use drift detection for monitoring and retraining triggers, OOD for per-request safety (reject, fall back, escalate). OOD often uses energy scores, Mahalanobis distance, or learned density estimators.
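As a contrast with window-level drift checks, here is a minimal per-request OOD gate based on Mahalanobis distance. The class name and the 0.999 training quantile used as a threshold are illustrative assumptions, and the sketch assumes at least two roughly Gaussian numeric features.

```python
import numpy as np


class MahalanobisOOD:
    """Flags single requests that sit far from the training feature cloud."""

    def __init__(self, train_features, quantile=0.999):
        self.mean = train_features.mean(axis=0)
        cov = np.cov(train_features, rowvar=False)
        self.precision = np.linalg.pinv(cov)  # pseudo-inverse tolerates collinearity
        # Threshold: distance exceeded by only (1 - quantile) of training rows.
        self.threshold = np.quantile(self.distance(train_features), quantile)

    def distance(self, x):
        centered = np.atleast_2d(x) - self.mean
        return np.sqrt(np.einsum("ij,jk,ik->i", centered, self.precision, centered))

    def is_ood(self, x):
        return self.distance(x) > self.threshold


rng = np.random.default_rng(3)
detector = MahalanobisOOD(rng.normal(size=(10_000, 4)))

typical = detector.is_ood(np.zeros(4))        # at the center of the training data
extreme = detector.is_ood(np.full(4, 6.0))    # six standard deviations out on every axis
print(bool(typical[0]), bool(extreme[0]))     # False True
```

The gate answers "should this one request be scored at all", which is a different question from "has the population moved"; a system can pass one check and fail the other.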

How often should I check for drift in production?

Daily for batch pipelines. Hourly or per-batch for streaming systems. The cost is compute and alert fatigue, not math — PSI on 50 features takes seconds. The harder question is the action SLA: if you detect drift, how fast can you investigate and retrain? Match cadence to response capacity; hourly checks when retraining takes two days is theatre.

What is the difference between drift and seasonality?

Seasonality is predictable, recurring — weekday vs weekend, December vs February. Drift is unexpected change beyond seasonal expectation. Fix: model the seasonality explicitly (time-of-week features, holiday flags, year-over-year baselines) so the detector compares against a seasonally-adjusted baseline. Otherwise every Monday morning looks like drift.

Is this guidance vendor-neutral?

Yes. The detection methods (KS, chi-square, PSI, KL, classifier-based) are standard across the MLOps literature, and the tooling list reflects the open-source and commercial leaders in 2026. PSI cutoffs are credit-risk conventions adopted broadly; adjust for your data volume before shipping.