Feed ranking ML system design for DS interviews
Why feed ranking shows up in every DS loop
When a recruiter at Meta, TikTok, Pinterest, Snap, or LinkedIn books a 60-minute ML system design slot, the prompt is almost always the same flavor — "design the For You feed" or "rank stories on the home page". It is the canonical end-to-end recsys problem because it forces you to talk about billion-item catalogs, sub-200ms latency, real-time freshness, and the tension between short-term CTR and long-term retention in a single hour.
The interviewer is not testing whether you can recite a paper. They are checking three things: do you partition the problem into candidate generation → ranking → re-ranking without prompting, do you pick reasonable models and features for each stage, and do you anticipate the failure modes — feedback loops, position bias, clickbait, diversity collapse. Get the structure right in the first five minutes and the rest of the hour is downhill.
This post is the working template I recommend. It is dense by design — system design loops reward candidates who can keep the conversation moving, not those who narrate every option.
Load-bearing trick: when you sketch the architecture, draw the funnel with concrete numbers — 1B → 1,000 → 100 → 10. Interviewers internalize the diagram, then anchor every follow-up question to a specific stage. You control the next 45 minutes.
Framing the problem
Open with assumptions before any modeling. A typical brief looks like a TikTok or Instagram-style feed: show the top-10 items per session per user. Pin the constraints out loud so the interviewer either confirms or corrects you, which saves a wrong-track tangent later.
| Constraint | Typical value | Why it matters |
|---|---|---|
| Catalog size | 1B+ active items | Forces a retrieval layer, not just ranking |
| Request load | ~1M peak QPS | Drives sharding and embedding cache design |
| End-to-end p99 | < 200ms | Splits the latency budget across stages |
| Freshness target | New items live in < 60s | Needs streaming index, not nightly batch |
| Personalization | Per-user, session-aware | Rules out a single global popularity ranker |
State the objective function as early as you state the constraints. "Maximize weighted multi-task score (click + like + share + watch-time) subject to a diversity floor and policy filters" beats "maximize engagement" by a mile — it signals that you have lived inside a real ranking team.
Multi-stage architecture
Every modern feed is a cascade. Each stage trades cost for accuracy: the cheap stage prunes ruthlessly, the expensive stage scores precisely, and the final stage handles business rules and diversity.
    1B items
      ↓ Candidate generation (cheap retrieval, ~50ms)
    1,000 candidates
      ↓ Ranking (heavy neural model, ~100ms)
    100 ranked items
      ↓ Re-ranking + diversity + policy (~10ms)
    10 final items → user

This funnel is the single most important picture in the whole interview: if the interviewer takes a photo of the whiteboard, it will be this. Walk it top to bottom and explicitly state the latency budget per stage, the model class, and the failure mode you are guarding against.
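The funnel numbers can be written down as a small config and sanity-checked. The sketch below is illustrative only: the `Stage` class and `check_funnel` helper are hypothetical names, the stage sizes and budgets are copied from the diagram, and the 200 ms ceiling is the p99 target from the constraints table. Summing per-stage budgets is a planning simplification, since real tail latencies do not add exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    items_in: int
    items_out: int
    budget_ms: int


# Sizes and budgets copied from the funnel diagram.
FUNNEL = [
    Stage("candidate_generation", 1_000_000_000, 1_000, 50),
    Stage("ranking", 1_000, 100, 100),
    Stage("re_ranking", 100, 10, 10),
]
P99_TARGET_MS = 200  # end-to-end target from the constraints table


def check_funnel(stages, target_ms):
    """Return unallocated latency headroom; raise if the stages do not chain or overrun."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.items_out != nxt.items_in:
            raise ValueError(
                f"{prev.name} emits {prev.items_out} items, "
                f"{nxt.name} expects {nxt.items_in}"
            )
    total_ms = sum(stage.budget_ms for stage in stages)
    if total_ms > target_ms:
        raise ValueError(f"stage budgets sum to {total_ms} ms, over {target_ms} ms")
    return target_ms - total_ms


headroom_ms = check_funnel(FUNNEL, P99_TARGET_MS)
```

With these numbers the stages use 160 ms, which leaves 40 ms for everything the diagram omits, such as network hops and feature fetches.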
Candidate generation
The goal of the first stage is brutal pruning: reduce 1B items to roughly 1,000 in under 50ms. You do not need a precise score here, only a high-recall set that contains the eventual winners.
In practice the candidate pool is a union of multiple sources, deduped and merged before they flow into ranking. The model has to recall the right items — precision is the next stage's problem.
| Source | Method | Typical share of pool |
|---|---|---|
| Embedding retrieval | Two-tower DSSM + ANN (HNSW, ScaNN, FAISS) | 40-60% |
| Social graph | Items liked / shared by friends, follows | 10-20% |
| Trending | Globally popular last N hours | 10-15% |
| Fresh | Items published within the last hour | 5-10% |
| Heuristic / editorial | Topic-based, locale-based, cold-start | 5-10% |
The two-tower model is the workhorse: a user tower and an item tower are trained with in-batch negatives, item embeddings are indexed offline in an ANN store, and the online lookup takes single-digit milliseconds. Train it on implicit feedback (clicks, watch-time) with a softmax loss and you should get a recall@1000 high enough that the ranker has real signal to work with.
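On the training side, the in-batch-negatives softmax loss can be sketched in a few lines. This is a minimal pure-Python illustration, not a production trainer: it assumes dot-product scoring, treats row i of the item batch as the positive for user i, and the `temperature` parameter and function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_embs, item_embs, temperature=1.0):
    """Mean cross-entropy where item i is the positive for user i and
    every other item in the batch acts as a negative."""
    losses = []
    for i, user in enumerate(user_embs):
        logits = [dot(user, item) / temperature for item in item_embs]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each user embedding matches exactly one item embedding.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = in_batch_softmax_loss(aligned, aligned)
loss_swapped = in_batch_softmax_loss(aligned, aligned[::-1])
```

The loss is lower when each user is paired with its matching item than when the positives are swapped, which is the signal the towers are trained on.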
    # Two-tower scoring at serve time
    user_emb = user_tower(user_features)              # 1 x D user embedding
    candidates = ann_index.query(user_emb, k=1000)    # 1000 item ids + scores
    # Union the ANN hits with the other candidate sources, then dedupe
    candidates = dedupe_and_merge(
        candidates,
        friend_based_candidates(user_id),
        trending_candidates(locale),
        fresh_candidates(user_id, last_hour=True),
    )

Sanity check: if the items the ranker would have placed in its top-k never make it into the 1,000 retrieved candidates, no amount of fancy ranking will save you. Always monitor retrieval recall as a guardrail metric, not just final CTR.
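The retrieval-recall guardrail is cheap to compute from logs. A minimal sketch, assuming you have, per user, the retrieved candidate ids in rank order and the set of items the user later engaged with (the function name and the toy ids are hypothetical):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=1000):
    """Fraction of relevant items that appear among the first k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return None  # recall is undefined when there are no positives
    top_k = set(retrieved_ids[:k])
    return len(relevant & top_k) / len(relevant)


retrieved = ["a", "b", "c", "d", "e"]
engaged = ["b", "e", "z"]  # "z" was never retrieved
recall_5 = recall_at_k(retrieved, engaged, k=5)
recall_1 = recall_at_k(retrieved, engaged, k=1)
```

Averaging this per user, and tracking it per candidate source, shows which source stopped contributing when the number drops.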
Ranking
Stage two carefully scores the ~1,000 candidates with a heavy neural model. Latency budget is roughly 50-100ms, which is enough for a Wide & Deep, DCN-v2, DLRM, or a small Transformer-based interaction model.
Features split into four buckets and the interviewer will ask about each:
- User features — long-term embedding, short-term session embedding, demographics, device, locale.
- Item features — content embeddings (text, image, video), freshness, author reputation, historical CTR with smoothing.
- Cross features — user-author affinity, user-topic affinity, prior interactions, recency of last impression.
- Context features — time of day, day of week, network type, app cold-start vs warm session.
The ranker is almost always multi-task. Predict click, like, share, comment, follow, watch-time, and completion-rate as separate heads, then combine the head outputs into a single final score using business-set weights. Those weights are the lever the product team tunes when retention dips or comment volume drops, so keep them out of the model and in a config.
    # Multi-task scoring with business weights
    from math import log1p

    preds = ranker(user_feats, item_feats, cross_feats, ctx)
    # preds = {"p_click": ..., "p_like": ..., "p_share": ..., "watch_sec": ...}
    score = (
        W_CLICK * preds["p_click"]
        + W_LIKE * preds["p_like"]
        + W_SHARE * preds["p_share"]
        + W_WATCH * log1p(preds["watch_sec"])  # log-compress watch time
    )

Calibration matters more than people expect. If p_click is 2x miscalibrated relative to p_like, the weights become meaningless and product launches turn into hyperparameter archaeology. Apply isotonic regression or temperature scaling on a holdout, and re-check calibration after every retrain.
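To make the isotonic-regression step concrete, here is a minimal pool-adjacent-violators sketch in pure Python. It is an illustration under simplifying assumptions (a step-function fit with no interpolation, hypothetical function names, toy holdout data); in practice you would more likely reach for a library implementation such as scikit-learn's `IsotonicRegression`.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: fit a non-decreasing step function.

    Returns a list of (upper_score_bound, calibrated_value) pairs.
    """
    blocks = []  # each block is [label_sum, count, max_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, label_sum / count) for label_sum, count, upper in blocks]


def isotonic_predict(model, score):
    """Look up the calibrated value of the first block covering `score`."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]


# Toy holdout: raw p_click scores and observed click labels.
holdout_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
holdout_clicks = [0, 1, 0, 0, 1, 1]
calibrator = isotonic_fit(holdout_scores, holdout_clicks)
calibrated = isotonic_predict(calibrator, 0.25)
```

Fit one calibrator per head on the holdout so that the business weights operate on comparable probabilities.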
Re-ranking and diversity
The last 10ms turn 100 scored items into the final top-10. This is where business rules, diversity, and policy live. Skip this layer and your feed will happily show ten cat videos in a row because the ranker correctly identified that the user loves cats.
The standard moves are:
- MMR (Maximal Marginal Relevance) or DPP (Determinantal Point Processes) to penalize items that are too similar to already-selected items.
- Position-bias correction so that slot 1 does not steal credit from slot 10.
- Freshness boosts for items published in the last few minutes.
- Policy filters for sponsored slots, regulated content, or creator-fairness caps.
A useful mental model: ranking is what the user probably wants right now, re-ranking is what gives the user a healthy session. Diversity acts like a regularizer on user attention — without it, session length collapses within a week of launch.
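A greedy MMR pass is short enough to sketch on the whiteboard. The version below is a minimal illustration: it assumes each item carries a ranker score and an embedding, uses cosine similarity as the redundancy measure, and the trade-off weight `lam` and the toy items are hypothetical.

```python
import math


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / norm


def mmr_rerank(items, k=10, lam=0.7):
    """Greedy MMR over (item_id, relevance, embedding) triples.

    Each step picks the item maximizing
    lam * relevance - (1 - lam) * max similarity to anything already picked.
    """
    selected = []
    remaining = list(items)
    while remaining and len(selected) < k:
        def mmr_score(item):
            _, relevance, emb = item
            redundancy = max((cosine(emb, s[2]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [item_id for item_id, _, _ in selected]


pool = [
    ("cat_1", 0.95, [1.0, 0.0]),
    ("cat_2", 0.94, [0.9, 0.1]),  # near-duplicate of cat_1
    ("dog_1", 0.80, [0.0, 1.0]),
]
top_2 = mmr_rerank(pool, k=2, lam=0.7)
```

Pure relevance ordering would return both cat videos; with the redundancy penalty the second slot goes to the dissimilar item instead.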
Metrics offline and online
You need both an offline scoreboard for model iteration and an online scoreboard for launch decisions. Interviewers love to ask why offline wins do not always translate online — be ready to talk about selection bias, feedback loops, and proxy mismatch.
| Layer | Metric | What it measures |
|---|---|---|
| Offline | NDCG@10, Hit@10, MRR | Ranking quality on logged data |
| Offline | AUC / LogLoss per task | Per-head calibration and discrimination |
| Offline | Retrieval Recall@1000 | Candidate-gen recall before ranking |
| Online | CTR, like-rate, share-rate | Short-term engagement |
| Online | Watch-time, completion-rate | Depth of engagement |
| Online | DAU, WAU, D30 retention | Long-term health, the only thing that pays |
| Online | Creator-side fairness | Distribution of impressions across authors |
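For reference, NDCG@10 from the table can be computed as follows. This is a minimal sketch that assumes graded relevance labels listed in the order the model ranked them, uses the linear-gain form of DCG, and normalizes against the ideal ordering of that same list; the function names are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(gains_in_ranked_order, k=10):
    """DCG of the model's ordering divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(gains_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items in the list
    return dcg_at_k(gains_in_ranked_order, k) / ideal


ndcg_perfect = ndcg_at_k([3, 2, 1, 0])   # best items first
ndcg_reversed = ndcg_at_k([0, 1, 2, 3])  # best items last
```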
In production, short-term and long-term metrics actively conflict. Clickbait can push CTR up by 5-8% while cutting D30 retention by 2-3% within a quarter. The fix is guardrail metrics on every experiment (watch-time-per-session, D7 retention, comment-to-impression ratio) and a launch policy that blocks a CTR win if any guardrail shows a statistically significant regression at p < 0.05.
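The launch policy can be expressed as a small gate. The sketch below is one possible encoding, not a standard API: `MetricResult`, `launch_decision`, and the "hold" outcome for an inconclusive primary metric are all hypothetical, and the p-values are assumed to come from your experimentation platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    name: str
    relative_delta: float  # treatment vs control, e.g. -0.02 means -2%
    p_value: float
    higher_is_better: bool = True


def guardrail_tripped(metric, alpha=0.05):
    """A guardrail trips when it moves in the bad direction with p < alpha."""
    if metric.higher_is_better:
        regressed = metric.relative_delta < 0
    else:
        regressed = metric.relative_delta > 0
    return regressed and metric.p_value < alpha


def launch_decision(primary, guardrails, alpha=0.05):
    """Return ("block" | "ship" | "hold", names of tripped guardrails)."""
    tripped = [g.name for g in guardrails if guardrail_tripped(g, alpha)]
    if tripped:
        return "block", tripped
    is_win = primary.relative_delta > 0 and primary.p_value < alpha
    return ("ship" if is_win else "hold"), []


ctr = MetricResult("ctr", +0.06, 0.001)
guardrails = [
    MetricResult("watch_time_per_session", +0.01, 0.40),
    MetricResult("d7_retention", -0.02, 0.01),
]
decision, reasons = launch_decision(ctr, guardrails)
```

Here a 6% CTR win is still blocked because D7 retention regressed significantly.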
Common pitfalls
The first trap is treating ranking as a single model. Candidates who try to fit one giant network over the full 1B catalog burn the latency budget, miss the freshness story, and lose the interviewer in the first ten minutes. State the cascade explicitly, then drill into each stage when asked. The fix is structural, not algorithmic.
The second trap is ignoring feedback loops. The ranker is trained on data the previous ranker produced, so any bias amplifies over weeks. Items the model never showed have no labels, and the rich-get-richer dynamic squeezes new creators out. Counter it with random exploration slots (typically 3-5% of impressions), explicit cold-start scoring for new items, and importance weighting on logged data using inverse propensity scores.
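Exploration slots and propensity logging go together: the inverse propensity score is only available later if the serving path records how likely each impression was. A minimal epsilon-style sketch, assuming the exploration pool is disjoint from the ranked list and using hypothetical names throughout:

```python
import random


def choose_slot_item(ranked_ids, explore_pool, epsilon, rng):
    """Pick one item for a slot and return (item_id, propensity, explored).

    With probability epsilon the slot goes to a uniformly random item from the
    exploration pool; otherwise it goes to the ranker's top item. The propensity
    is logged so training can weight the example by 1 / propensity.
    """
    if not explore_pool:
        return ranked_ids[0], 1.0, False
    if rng.random() < epsilon:
        item_id = rng.choice(explore_pool)
        return item_id, epsilon / len(explore_pool), True
    return ranked_ids[0], 1.0 - epsilon, False


rng = random.Random(0)
item_id, propensity, explored = choose_slot_item(
    ranked_ids=["top_1", "top_2"],
    explore_pool=["new_1", "new_2"],
    epsilon=0.04,  # inside the 3-5% range mentioned above
    rng=rng,
)
```

Real systems usually spread exploration across slots and users instead of flipping one coin per slot, but the logging requirement is the same.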
The third trap is using watch-time as the only objective. It optimizes for autoplay-friendly long videos and degrades short-form variety, which then collapses session diversity and tanks retention. Multi-task with clicks, likes, shares, comments, and follows. Apply a log-transform on watch-time so a single 30-minute outlier does not dominate the gradient.
The fourth trap is forgetting position bias. The same item shown at slot 1 looks 2-3x more clicky than at slot 10, purely because of placement. If you train on raw clicks without correction, the ranker learns "items that the previous ranker liked are good" which is a tautology. Use position as a feature at train time and fix it to slot 1 at inference, or apply IPS weighting.
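The IPS option can be illustrated with a toy position-based model in which a click requires the user to examine the slot first. The examination propensities below are hypothetical placeholders (in practice they are estimated, for example from randomized-slot traffic), and the function names are illustrative.

```python
# Hypothetical examination propensities: chance a user even looks at a slot.
EXAM_PROPENSITY = {1: 1.0, 5: 0.6, 10: 0.4}


def raw_ctr(impressions):
    """Plain click-through rate over (slot, clicked) pairs."""
    return sum(clicked for _, clicked in impressions) / len(impressions)


def ips_ctr(impressions, propensity=EXAM_PROPENSITY):
    """Click rate with each click up-weighted by 1 / P(examined | slot)."""
    weighted = sum(clicked / propensity[slot] for slot, clicked in impressions)
    return weighted / len(impressions)


# The same item logged at two slots: 20/100 clicks at slot 1, 8/100 at slot 10.
slot_1_logs = [(1, 1)] * 20 + [(1, 0)] * 80
slot_10_logs = [(10, 1)] * 8 + [(10, 0)] * 92

raw_gap = raw_ctr(slot_1_logs) / raw_ctr(slot_10_logs)
debiased_gap = ips_ctr(slot_1_logs) / ips_ctr(slot_10_logs)
```

On raw clicks the item looks 2.5x better at slot 1; after weighting, both slots give the same estimate of its relevance.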
The fifth trap is declaring victory on offline metrics. NDCG@10 going up by 0.5% is meaningless without an online A/B test, because logged data is biased toward what the old system showed. Always run a real experiment, always check guardrails, never ship on offline wins alone.
FAQ
How long should I spend on each stage in a 60-minute interview?
Roughly 5 minutes framing the problem and constraints, 5 minutes drawing the funnel, 15 minutes on candidate generation, 15 minutes on ranking, 5 minutes on re-ranking, and 10 minutes on metrics and experimentation. Reserve the last 5 minutes for whatever the interviewer pushes hardest on — usually feedback loops or cold-start.
Two-tower vs graph-based candidate generation — which one should I propose?
Two-tower is the safe default because it scales linearly, indexes cleanly into ANN stores, and serves in single-digit milliseconds. Graph-based methods (PinSage, GraphSAGE) shine when you have a strong social or content graph and can tolerate the extra infra. In an interview, pick two-tower as the backbone and mention graph signals as one of the candidate sources, not the whole stage.
How do you handle cold-start users and cold-start items?
For cold-start users, fall back to trending and locale-popular candidates, lean on demographic and device features, and aggressively explore in the first session — typically 30-50% exploration impressions for the first 10 sessions. For cold-start items, score them with content-only features (text, image, video embeddings), give them a small free impression budget, and use Thompson sampling or UCB to escalate winners quickly.
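Thompson sampling for cold-start items fits in a few lines. The sketch below is a toy simulation, not a serving implementation: it assumes a Beta(1, 1) prior on each item's CTR, one impression per round, and made-up true CTRs; the function names are hypothetical.

```python
import random


def thompson_pick(stats, rng):
    """stats maps item_id -> [clicks, impressions]. Draw a CTR for each item
    from its Beta posterior and return the item with the highest draw."""
    def sampled_ctr(item_id):
        clicks, impressions = stats[item_id]
        return rng.betavariate(1 + clicks, 1 + impressions - clicks)

    return max(stats, key=sampled_ctr)


def simulate(true_ctr, rounds, seed=0):
    """Spend `rounds` impressions across new items and return their stats."""
    rng = random.Random(seed)
    stats = {item_id: [0, 0] for item_id in true_ctr}
    for _ in range(rounds):
        item_id = thompson_pick(stats, rng)
        stats[item_id][1] += 1
        if rng.random() < true_ctr[item_id]:
            stats[item_id][0] += 1
    return stats


# Made-up true CTRs for three brand-new items.
new_item_stats = simulate({"new_a": 0.02, "new_b": 0.10, "new_c": 0.03},
                          rounds=5000, seed=7)
```

Impressions concentrate on the strongest item as evidence accumulates, which is the "escalate winners quickly" behavior, while weaker items still receive occasional traffic.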
Should I use deep learning or gradient boosting for the ranker?
Both work. Gradient boosting (XGBoost, LightGBM, CatBoost) is faster to train, easier to debug, and competitive when feature engineering is strong. Deep models (Wide & Deep, DCN-v2, DLRM) win when you have rich embeddings, cross features at scale, and multi-task heads. In production most large feeds run a deep model for the main ranker and a gradient-boosted model for a lightweight pre-ranker between candidate generation and the final ranker.
How do you A/B test a new ranker safely?
Start with a 1% treatment exposure to confirm there are no obvious regressions, ramp to 5%, then to 50%, then to full launch over 2-4 weeks. Gate every step on a fixed guardrail panel: CTR, watch-time, D7 retention, complaint rate, creator-side concentration index. Run sample-ratio-mismatch checks every day: if the observed assignment split drifts from the configured split by more than 1%, halt the experiment and audit logging before reading any metrics.
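One common way to operationalize the sample-ratio-mismatch check is a chi-square goodness-of-fit test on the assignment counts. The sketch below is a minimal stdlib version for two arms; the function name, the example counts, and the 0.001 alert threshold are assumptions, not a fixed standard.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square test (1 degree of freedom) of observed vs configured split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_treatment - expected_treatment) ** 2 / expected_treatment
            + (n_control - expected_control) ** 2 / expected_control)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


SRM_ALPHA = 0.001  # assumed alert threshold

p_balanced = srm_p_value(50_000, 50_000)
p_skewed = srm_p_value(50_000, 51_500)
halt_experiment = p_skewed < SRM_ALPHA
```

A test like this scales the tolerance with sample size, so a drift that is harmless at 1,000 users correctly raises an alarm at 100,000.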
What is the single most common reason offline wins do not replicate online?
Selection bias in the logged training data. The offline metric is computed on impressions the old ranker chose to show, so any item the old ranker never surfaced has zero label coverage. The new ranker can look great on that biased slice and still lose online when it explores items the old model ignored. Fix it with logged-then-replayed counterfactual evaluation, off-policy estimators, and a small permanent exploration budget so the training distribution stays honest.
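The simplest off-policy estimator is inverse propensity scoring. The sketch below assumes the logs contain the propensity with which the old policy showed each item, which is exactly what the exploration budget makes possible; the clipping constant, function names, and toy logs are hypothetical.

```python
def ips_estimate(logs, new_policy_prob, clip=10.0):
    """Estimate the new policy's average reward from old-policy logs.

    logs: list of (context, action, reward, logging_propensity).
    new_policy_prob(context, action): probability the new policy shows `action`.
    Weights are clipped to limit variance, at the cost of some bias.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        weight = min(new_policy_prob(context, action) / propensity, clip)
        total += weight * reward
    return total / len(logs)


# Toy logs: the old policy showed items A and B uniformly at random.
logs = [
    ("user_1", "A", 1, 0.5),
    ("user_1", "A", 0, 0.5),
    ("user_1", "B", 1, 0.5),
    ("user_1", "B", 1, 0.5),
]


def always_b(context, action):
    return 1.0 if action == "B" else 0.0


def always_a(context, action):
    return 1.0 if action == "A" else 0.0


value_always_b = ips_estimate(logs, always_b)
value_always_a = ips_estimate(logs, always_a)
```

Items the old policy never showed have zero propensity and no log rows, so no estimator can recover their value; that is the gap the permanent exploration budget closes.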