Canary and shadow deployment on the DS interview
Why interviewers care
Picture this: you joined a recsys team at Netflix, your new ranking model beat the old one on offline metrics by +3.2% NDCG, and the staff engineer asks how you would ship it. If your answer is "merge and deploy", the interview is effectively over. Production ML deployment is never a plain replace — it is a sequence of shadow, canary, and guardrail checks that survives the moment your model meets real traffic.
Senior DS and MLE loops at Meta, Uber, DoorDash, Stripe, and Anthropic routinely ask the same three questions: what is the difference between canary and shadow, how do you decide the rollout schedule, and how fast can you roll back when the dashboard turns red. The expected answer is not a definition — it is a deployment playbook with concrete percentages, monitoring windows, and automated triggers.
This guide walks through the four strategies in the order you should mention them on a whiteboard, then bolts on a rollback section that most candidates skip. Read it as the script you will deliver when the interviewer says "okay, the model is trained — now what".
Load-bearing trick: Shadow proves the system can serve the model. Canary proves users will tolerate the model. A/B proves the model is better. They are not interchangeable.
Shadow deployment
In shadow mode the new model runs in parallel with the production model on the exact same incoming requests, but its predictions are never returned to the user. They are logged to a side table for later analysis. The user sees only the legacy response, so there is zero blast radius even if the new model returns garbage.
Request → old model → response (served to user)
        → new model → response (logged only)

Shadow buys you three things at once. First, real-traffic latency numbers: the p50, p95, and p99 you cannot get from synthetic load tests, because shadow traffic has the same skew and tail-key distribution as production. Second, prediction parity: for every shadow log row you can diff old_pred against new_pred and quantify how much the new model would have changed user-visible behavior. Third, dependency stress: feature lookups, model server cold starts, and downstream batchers are all exercised under realistic load.
The cost is brutal but honest: 2× compute for the duration of the shadow window. At Stripe-scale this is a real budget line item, which is why most teams cap shadow at 3 to 7 days before promoting to canary or shutting it down.
Shadow is the only deployment mode that lets you compare predictions on the same input — A/B and canary split traffic, so you never see the same request scored by both models.
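The old_pred versus new_pred diff described above can be sketched in a few lines. This is a minimal illustration, not a production serving path: `serve_with_shadow`, `parity_report`, the in-memory `shadow_log` list, and the 0.05 tolerance are all hypothetical names and values, and a real system would usually score the shadow model asynchronously so it cannot add user-facing latency.

```python
import math
from typing import Callable, Dict, List

Features = Dict[str, float]
Model = Callable[[Features], float]


def serve_with_shadow(features: Features, old_model: Model, new_model: Model,
                      shadow_log: List[dict]) -> float:
    """Return the production prediction; score the shadow model for logging only."""
    old_pred = old_model(features)
    try:
        new_pred = new_model(features)
    except Exception:
        # A shadow failure must never affect the user-facing response.
        new_pred = math.nan
    shadow_log.append({"old_pred": old_pred, "new_pred": new_pred})
    return old_pred


def parity_report(shadow_log: List[dict], tolerance: float = 0.05) -> dict:
    """Share of logged rows where the shadow model failed or moved the score."""
    n = len(shadow_log)
    if n == 0:
        return {"rows": 0, "failure_rate": 0.0, "changed_rate": 0.0}
    failed = sum(1 for r in shadow_log if math.isnan(r["new_pred"]))
    changed = sum(
        1 for r in shadow_log
        if not math.isnan(r["new_pred"])
        and abs(r["new_pred"] - r["old_pred"]) > tolerance
    )
    return {"rows": n, "failure_rate": failed / n, "changed_rate": changed / n}
```

The user-facing return value depends only on `old_model`, which is the zero-blast-radius property; everything about the new model lives in the log.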
Canary deployment
Canary routes a small percentage of real traffic to the new model, typically 1%, then 5%, 10%, 25%, 50%, and finally 100%, with a soak period at each stage that grows from about an hour at low traffic to 12 hours at 50%. If guardrail metrics stay green, the next stage triggers automatically. If anything flips red, traffic snaps back to the old model.
Stage 1: 99% old / 1% new → soak 60 min → check guardrails
Stage 2: 95% old / 5% new → soak 60 min → check guardrails
Stage 3: 90% old / 10% new → soak 2 hrs → check guardrails
Stage 4: 75% old / 25% new → soak 4 hrs → check guardrails
Stage 5: 50% old / 50% new → soak 12 hrs → check guardrails
Stage 6: 0% old / 100% new

The point of canary is bounded blast radius. A model that quietly returns NaN, a feature pipeline that timed out and silently returned the default, an embedding table that loaded the wrong shard: all of these would scorch production in a one-shot deploy. Under canary, exposure is capped at the current stage's share of traffic, so a bug caught at stage 1 or 2 reaches at most 1% to 5% of users before the auto-rollback fires.
Guardrail metrics on the canary slice fall into three layers. Business metrics — click-through rate, conversion, GMV per session — answer "did the model help". System metrics — error rate, latency p99, timeout rate — answer "is the model healthy". Model-specific metrics — prediction distribution drift, feature-value drift, calibration — answer "is the model behaving like it did in offline eval".
Sanity check: If you cannot name the three guardrails that block stage promotion before you start the canary, you are not ready to deploy. Write them in the deploy ticket.
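A guardrail-gated ramp can be sketched as follows. The stage percentages and soak times come from the schedule above; everything else (`Stage`, `routes_to_new`, `run_ramp`, and the `guardrails_green` callback) is a hypothetical name chosen for illustration, and the actual waiting and traffic routing are left as comments.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    new_pct: int        # share of traffic on the new model
    soak_minutes: int   # how long to hold before checking guardrails


RAMP: List[Stage] = [
    Stage(1, 60), Stage(5, 60), Stage(10, 120),
    Stage(25, 240), Stage(50, 720), Stage(100, 0),
]


def routes_to_new(user_id: str, new_pct: int) -> bool:
    """Deterministic bucketing: a user in the canary stays in it as the ramp grows."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < new_pct


def run_ramp(stages: List[Stage],
             guardrails_green: Callable[[Stage], bool]) -> int:
    """Return the final share on the new model: 100 on success, 0 after rollback."""
    for stage in stages:
        # A real controller would set routing to stage.new_pct here
        # and wait stage.soak_minutes before evaluating guardrails.
        if not guardrails_green(stage):
            return 0  # snap all traffic back to the old model
    return stages[-1].new_pct
```

Hashing the user ID rather than sampling per request keeps each user's experience stable across stages, which also keeps per-user business metrics on the canary slice interpretable.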
Blue-green for ML
Blue-green keeps two complete environments running — blue (current) and green (new) — and flips a router between them. There is no traffic split. The cutover is instant for 100% of users, and rollback is symmetric: flip the router back. Total switch latency is usually under 30 seconds.
For stateless services (web servers, REST APIs, batch jobs) blue-green is excellent. For ML it is rarely the right default because you lose the gradual rollout that catches subtle behavior changes. Blue-green makes sense when you are swapping infra, not behavior — moving from one model server framework to another with the same model artifact, or migrating from CPU to GPU inference where you trust the model itself.
A common hybrid is blue-green for the model server, canary for the weights: the router cuts over to the new fleet, then new weights are introduced via canary on top.
A/B testing for ML
A/B is a statistically powered experiment with random assignment, a primary KPI, guardrail metrics, and a pre-registered analysis plan. Treatment and control are usually 50/50, sometimes 90/10 if the team is risk-averse.
The thing interviewers love to probe is the canary versus A/B distinction. Canary is a safety mechanism — it asks "will this break production". A/B is an evidence mechanism — it asks "is the new model actually better than the old one at the metric we care about". You can run them in sequence: canary at 5% to confirm safety, then A/B at 50% to confirm lift.
Duration matters more for ML than for UI experiments. A button-color test reaches significance in 2 weeks because the signal-to-noise ratio is huge. A recsys ranking change often needs 3 to 6 weeks because the primary KPI has high variance, user re-engagement cycles span days, and novelty effects distort week-one numbers. CUPED (Controlled-experiment Using Pre-Experiment Data) is the standard variance-reduction lever: it uses each user's pre-experiment metric as a covariate to adjust the in-experiment metric, and it typically cuts required sample size by 30 to 50%.
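To make the CUPED adjustment concrete, here is a minimal sketch on synthetic data. The adjusted metric is y - theta * (x - mean(x)) with theta = cov(x, y) / var(x), where x is the pre-experiment covariate. The function name `cuped_adjust` and all of the simulated numbers are assumptions for illustration; in practice theta is usually estimated on pooled treatment and control data.

```python
import random
import statistics
from typing import List


def cuped_adjust(y: List[float], x: List[float]) -> List[float]:
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / statistics.variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


# Synthetic users: the in-experiment metric is correlated with the pre-period one.
rng = random.Random(0)
pre = [rng.gauss(10.0, 3.0) for _ in range(5000)]
post = [0.8 * p + rng.gauss(0.0, 1.5) for p in pre]

adjusted = cuped_adjust(post, pre)
reduction = 1 - statistics.variance(adjusted) / statistics.variance(post)
print(f"variance reduction: {reduction:.0%}")
```

The adjustment leaves the mean unchanged and shrinks variance by roughly the squared correlation between the pre-period and in-experiment metric, so the reduction on this synthetic data reflects the correlation baked into the simulation and is not a benchmark.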
Common ML A/B traps the interviewer will fish for: interaction effects when the new ranker changes the candidate pool the next stage sees, Simpson's paradox when a model wins on aggregate but loses on every segment, and long-term metric drift when click-rate goes up but 30-day retention goes down.
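The Simpson's paradox case is easier to explain with numbers on the whiteboard. The counts below are invented for illustration: the treatment arm happens to contain far more heavy users, which is the kind of segment imbalance that a broken assignment or a mid-experiment ramp change can produce.

```python
from typing import Dict, Tuple

# (clicks, users) per segment; all numbers are made up for this example.
control: Dict[str, Tuple[int, int]] = {"heavy": (30, 100), "light": (100, 1000)}
treatment: Dict[str, Tuple[int, int]] = {"heavy": (270, 1000), "light": (9, 100)}


def rate(clicks: int, users: int) -> float:
    return clicks / users


def aggregate(arm: Dict[str, Tuple[int, int]]) -> float:
    """Click-through rate over all segments pooled together."""
    clicks = sum(c for c, _ in arm.values())
    users = sum(u for _, u in arm.values())
    return clicks / users


per_segment_lift = {
    seg: rate(*treatment[seg]) - rate(*control[seg]) for seg in control
}
aggregate_lift = aggregate(treatment) - aggregate(control)

print(per_segment_lift)   # negative in both segments
print(aggregate_lift)     # positive overall
```

Treatment loses in both segments yet wins on the pooled rate, purely because its traffic mix is dominated by the high-CTR segment. Checking segment balance between arms before reading the topline is the usual defense.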
Rollback strategy
The rollback section is where most candidates lose points. Rollback must be automated, fast, and rehearsed. The number to anchor on is under 5 minutes from alert to traffic restored. Manual rollback that requires a human in Slack to redeploy is not a rollback strategy — it is a postmortem waiting to happen.
Auto-rollback triggers should fire when any of these breach for two consecutive minutes:
- Error rate on the new model fleet > 2× baseline
- Latency p99 > 1.5× baseline
- Primary business metric drop > 5% with p < 0.01
- Prediction-distribution KL divergence > a pre-set threshold
- Feature-pipeline staleness > the SLA on any feature in the top-10 by importance
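The prediction-distribution trigger in the list above can be sketched as a KL divergence between two score histograms. This is a minimal version under stated assumptions: scores are taken to lie in [0, 1], the bin count and the `eps` smoothing constant are arbitrary choices, and the names `histogram` and `kl_divergence` are hypothetical helpers, not a library API.

```python
import math
from typing import List, Sequence


def histogram(scores: Sequence[float], bins: int = 10,
              eps: float = 1e-6) -> List[float]:
    """Bucket scores in [0, 1] into equal-width bins; return smoothed probabilities."""
    counts = [0] * bins
    for s in scores:
        idx = max(0, min(int(s * bins), bins - 1))
        counts[idx] += 1
    total = len(scores)
    # eps keeps every bin strictly positive so the log below is always defined.
    return [(c + eps) / (total + eps * bins) for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Usage: compare live canary scores against an offline-eval reference.
reference = histogram([i / 1000 for i in range(1000)])
live = histogram([0.5 * i / 1000 for i in range(1000)])  # scores squeezed low
drift = kl_divergence(live, reference)
```

The threshold itself has to be calibrated per model, for example from the spread of day-over-day KL values on the current production model.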
The mechanism is almost always a feature flag, not a redeploy. A boolean like use_new_ranker_v2 = false flips traffic back in under 10 seconds because the old model is still loaded in memory. Redeploying the old image takes 3 to 8 minutes — too slow for a real incident.
| Trigger condition | Detection latency | Action | Restore latency |
|---|---|---|---|
| Error rate spike (>2× baseline) | 60–120 sec | Flag flip | <10 sec |
| Latency p99 spike (>1.5× baseline) | 60 sec | Flag flip | <10 sec |
| Business KPI drop (p<0.01) | 5–30 min | Flag flip | <10 sec |
| Prediction drift (KL > threshold) | 15–60 min | Flag flip | <10 sec |
| Full infra failure | 30 sec | Blue-green swap | <30 sec |
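A minimal sketch of the ratio-based triggers wired to a flag flip, assuming one metric sample per minute so that two consecutive breaches correspond to the two-minute rule. `Trigger`, `RollbackMonitor`, and the plain dict standing in for a feature-flag service are hypothetical; only the thresholds and the `use_new_ranker_v2` flag name come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trigger:
    metric: str
    ratio_limit: float           # breach when observed > ratio_limit * baseline
    consecutive_needed: int = 2  # one sample per minute: two consecutive minutes
    streak: int = 0

    def observe(self, observed: float, baseline: float) -> bool:
        """Record one sample; return True once the breach streak is long enough."""
        if observed > self.ratio_limit * baseline:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.consecutive_needed


class RollbackMonitor:
    def __init__(self, triggers: List[Trigger], flags: Dict[str, bool]) -> None:
        self.triggers = {t.metric: t for t in triggers}
        self.flags = flags

    def ingest(self, metric: str, observed: float, baseline: float) -> bool:
        """Feed one sample; flip the flag off if the trigger fires."""
        fired = self.triggers[metric].observe(observed, baseline)
        if fired:
            # Flag flip, not a redeploy: the old model is still loaded.
            self.flags["use_new_ranker_v2"] = False
        return fired


flags = {"use_new_ranker_v2": True}
monitor = RollbackMonitor(
    [Trigger("error_rate", 2.0), Trigger("latency_p99_ms", 1.5)], flags
)
```

Requiring consecutive breaches trades a minute of detection latency for fewer false rollbacks on single noisy samples.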
The rehearsal part matters too. Teams at Netflix run gamedays where they intentionally trigger a rollback in production to verify that the alerting is wired through and the flag actually flips. Mentioning gamedays in the interview signals you have been on call.
Strategy comparison table
| Strategy | User impact | Compute cost | Use when | Risk level |
|---|---|---|---|---|
| Shadow | Zero — predictions logged only | 2× baseline | Validating latency, parity, infra readiness | Lowest |
| Canary | 1–50% see new model during the ramp | ~1.05× baseline | Default ML deploy with gradual ramp | Low |
| Blue-green | 100% instant cutover | 2× during transition | Infra swap, framework migration | Medium |
| A/B test | 50/50 split, long duration | ~1× baseline | Proving lift on a primary KPI | Low–medium |
| One-shot replace | 100% instant | 1× | Never (this is the wrong answer) | High |
Common pitfalls
The first trap is trusting staging traffic. Staging usually has synthetic load, a fraction of real QPS, and a feature distribution that diverged from production weeks ago. A model that passes every staging gate can still fail on real traffic because the long tail of feature values only exists in production. The fix is to make shadow deployment a non-negotiable step between staging and canary, even for small model changes.
The second trap is shipping without guardrails. A canary that ramps purely on a timer with no metric checks is just a slow one-shot deploy. Each stage promotion must depend on a green guardrail check on the canary slice, and the guardrails must be defined before you start — not invented on the fly when something looks weird.
The third trap is letting shadow run forever. Shadow costs 2× compute for as long as it runs, and the longer it runs the more tempting it is to keep adding "one more thing to compare". After 7 days max you should either promote to canary or kill the deployment. Indefinite shadow is a budget line item your finance partner will eventually notice.
The fourth trap is comparing against the wrong baseline. Your new model should be benchmarked against the current production model, not the model from six months ago that the team remembers as "the last good one". Production drifts. Features get added. The control variant in your A/B must be today's prod, end of story.
The fifth trap is monitoring only the model. If the feature pipeline silently fails and starts serving stale or default values, the model will return predictions that look reasonable but are based on lies. Feature freshness belongs in the same guardrail dashboard as model error rate. Some of the worst recsys incidents came from feature staleness, not bad models.
FAQ
What is switchback testing and when do you use it?
Switchback is an experiment design where the entire user base sees one variant during a time slot — say 30 minutes on treatment, then 30 on control — alternating throughout the experiment. It is the standard A/B replacement for two-sided marketplaces like ride-share or delivery where treatment users affect outcomes for control users through shared supply. Uber and DoorDash rely on switchback because a normal user-split A/B has interference bias so large that the result is meaningless.
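The time-slot assignment can be sketched as a pure function of the timestamp. This follows the strict alternation described above with a 30-minute slot; the function name `switchback_variant` is hypothetical, and real designs often randomize which variant each slot gets (and split by region) rather than alternating deterministically.

```python
from datetime import datetime, timedelta, timezone


def switchback_variant(ts: datetime, slot_minutes: int = 30) -> str:
    """Every request in the same time slot gets the same variant."""
    slot = int(ts.timestamp()) // (slot_minutes * 60)
    return "treatment" if slot % 2 == 0 else "control"


# Usage: variants flip at each slot boundary, for the whole user base at once.
start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
schedule = [
    switchback_variant(start + timedelta(minutes=30 * i)) for i in range(4)
]
```

Because assignment depends only on time, the unit of analysis becomes the slot rather than the user, which is what removes the shared-supply interference between arms.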
Why not one-shot replacement if offline metrics look great?
Offline evaluation operates on a fixed dataset that does not capture distribution shift, infra failures, novelty effects, or feedback loops. A model that wins by +5% AUC offline can still tank CTR by 8% in production. Shadow and canary are how you discover this before the entire user base feels it.
How long should a canary stage soak before promotion?
A reasonable default is 30 to 60 minutes at low traffic stages (1%, 5%) to catch system-level breakage, then 2 to 12 hours at mid stages (10–50%) to evaluate business KPIs. For metrics with diurnal seasonality, you want at least one full daily cycle at the 50% stage before going to 100%.
Can shadow and A/B run simultaneously?
Yes, and at large companies they often do, but on different models at different lifecycle stages. Model v3 is in production, v4 is in A/B at 50/50 against v3, v5 is in shadow against v4. Shadowing the same model you are A/B testing does not work — you double-count traffic and pay 3× compute for one experiment.
What is the right rollback time-to-mitigate target?
Under 5 minutes end-to-end from alert fire to traffic restored, with the flag flip itself taking under 10 seconds. Fully automated rollback on tier-1 guardrails like error rate and latency is industry standard; rollback on business KPI usually still requires a human because the false-positive rate is higher.