Collaborative filtering in DS interviews
Contents:
Why this shows up in every DS loop
Recommender systems pay the bills at Netflix, YouTube, Amazon, DoorDash, and Pinterest. That makes collaborative filtering one of the few ML topics where the interviewer will keep digging long after you finished the textbook answer. The question is rarely "what is CF" — it is "walk me through how you'd ship a recsys for a new homepage carousel given mostly implicit signals and a sparse user-item matrix".
For a Senior Data Scientist loop at a FAANG-tier company (base $190k-$240k, TC north of $350k at Meta or Google L5), expect a full round on this. Mid-level loops at Stripe, Airbnb, or Uber spend twenty minutes on matrix factorization and another twenty on production trade-offs. The candidates who clear the bar are the ones who can explain why MSE is wrong for clicks and what they'd do instead.
Load-bearing trick: every collaborative filtering algorithm — from 1990s user-based to 2024 two-tower — is some form of "learn embeddings such that observed (user, item) pairs are closer in latent space than unobserved ones". If you internalize that, the algorithm zoo collapses into one mental model.
Memory-based: user-based and item-based
The oldest school. No model, no embeddings — just a similarity matrix and a lookup.
User-based CF answers "users like you also liked X". For a target user u, you find the k nearest neighbors by cosine or Pearson similarity over their rating vectors, then recommend items those neighbors liked that u hasn't seen. Item-based CF flips it: for every pair of items, you precompute similarity from co-rating patterns, then recommend items similar to ones the user already interacted with.
Item-based is the production default — Amazon's classic paper showed why. Items are stable, users drift. You can precompute the item-item similarity matrix overnight and serve from a key-value lookup with single-digit-millisecond latency. User-based forces you to recompute neighborhoods as behavior shifts, which gets expensive fast.
| Approach | Strengths | Weaknesses | When to use |
|---|---|---|---|
| User-based | Intuitive, easy to explain | O(N²) over users, drifts daily | Small communities, social graphs |
| Item-based | Stable, precomputable, fast | Misses serendipity, popularity bias | E-commerce, content catalogs |
| Matrix factorization | Latent factors, handles sparsity | Cold start, batch training | Default baseline at scale |
| Two-tower / neural | Rich features, real-time | Infra heavy, harder to debug | Modern production at FAANG |
The honest weakness: a similarity matrix is O(N²). At 100M users or 10M items you cannot materialize it, and sparse overlaps make similarity numerically unstable.
Matrix factorization, the load-bearing trick
Here is where most interviews actually live. R is your users × items interaction matrix — mostly empty, with explicit ratings (1-5 stars) or implicit signals (clicked, watched, purchased) in the observed cells. Matrix factorization assumes:
R ≈ U · Vᵀ
U: users × k latent user embeddings
V: items × k latent item embeddings
k: latent dim, typically 32 to 256The loss, for explicit feedback, is squared error over observed cells only, with L2 regularization to keep embeddings well-behaved:
L = Σ_{(u,i) ∈ observed} (R_ui - U_u · V_iᵀ)² + λ(||U||² + ||V||²)The single most-missed nuance in interviews is "only over observed cells". If you sum over all (u, i) and treat missing as zero, you are telling the model that every unseen pair is a strong negative. That kills recall for niche items and rewards popularity bias.
# Vanilla SGD for explicit MF
import numpy as np
def train_mf(observed, n_users, n_items, k=64, lr=0.01, lam=0.02, epochs=20):
U = np.random.normal(scale=0.1, size=(n_users, k))
V = np.random.normal(scale=0.1, size=(n_items, k))
for epoch in range(epochs):
np.random.shuffle(observed)
for u, i, r in observed:
pred = U[u] @ V[i]
err = r - pred
U[u] += lr * (err * V[i] - lam * U[u])
V[i] += lr * (err * U[u] - lam * V[i])
return U, VOnce trained, the score for any pair is U_u · V_i. Top-K is the top dot products over the item matrix — use ANN libraries like FAISS or ScaNN in production. Similar items are cosine neighbors of V[i].
The embeddings are not interpretable on their own, but their geometry encodes co-occurrence structure — PCA them and clusters often match genre or behavior segment.
ALS — Alternating Least Squares
SGD works fine on a single box up to ~100M interactions. Beyond that you reach for ALS, which is what Spark MLlib's ALS runs under the hood.
The trick: fix U, then optimizing V is linear least-squares with a closed form. Fix V, optimize U. Alternate. Each step is embarrassingly parallel because updating V_i only needs U_u for users who interacted with item i.
∂L/∂V_i = 0 → V_i = (Σ_u U_u U_uᵀ + λI)⁻¹ Σ_u r_ui U_uSanity check: ALS converges in 10-20 iterations on most production datasets and needs no learning rate. The trade-off is memory — a rank-128 model over 100M users wants something like 100GB across executors.
ALS wins on distributed training, no learning-rate babysitting, and stable convergence. It is less flexible than SGD for exotic loss terms or feature interactions.
BPR and implicit feedback
Real user data is implicit: clicks, watches, dwell time, add-to-cart, purchase. You rarely get 1-5 star ratings outside streaming services circa 2009.
Naive MF with MSE on implicit data fails because absence of signal is not a negative. A user didn't click because they didn't see it, not because they hated it. Train MSE on r ∈ {0, 1} and the model spends most capacity predicting zero — wrong objective.
Two production-grade fixes.
BPR (Bayesian Personalized Ranking) reframes the problem as pairwise ranking. For each observed positive (u, i+), sample a negative (u, i-) the user has not interacted with, and push the model toward score(u, i+) > score(u, i-):
L = -Σ log σ(U_u · V_{i+}ᵀ - U_u · V_{i-}ᵀ) + λ · regularizationThis is the canonical implicit-feedback objective and it is what powers most pre-deep-learning recsys at scale.
WALS (Weighted ALS) keeps the squared-error frame but introduces confidence weights c_ui = 1 + α · r_ui. Observed interactions get high confidence, unobserved cells get low confidence — so the model still learns from missing data, but doesn't treat it as a hard zero.
L = Σ_{all (u,i)} c_ui · (p_ui - U_u · V_iᵀ)² + λ · regWhere p_ui = 1 if interaction observed, 0 otherwise. Implicit ALS in Spark is exactly this. α ≈ 40 is a common starting point per Hu et al. 2008.
Cold start without panic
Cold start is the question interviewers love because it has no clean answer and reveals system-thinking. There are three flavors.
User cold start — new user, no history. Fall back to popularity top-N, an onboarding question ("pick 3 artists"), demographic features, or a content-based model until you have 5-10 interactions.
Item cold start — new product, no interactions, so MF has no embedding for it. Fix: use content features (text embeddings, image embeddings, metadata) projected into the same latent space. This is where two-tower shines — the item tower learns from features rather than IDs.
System cold start — building recsys from scratch. Start with popularity, ship in two weeks, graduate to MF around 1M interactions. Don't reach for transformers on day one.
Modern stack: two-tower, sequential, graph
2010s was MF. 2020s is neural and graph, but the intuition is identical — learn embeddings.
Two-tower (DSSM) runs a user tower and item tower as separate networks. User features (history, context) feed one; item features (text, image, category) feed the other. Score is the dot product. This is the dominant production architecture at Google, Meta, and most large platforms.
# Sketch — two-tower retrieval
user_emb = UserTower(user_features) # (batch, d)
item_emb = ItemTower(item_features) # (batch, d)
score = (user_emb * item_emb).sum(dim=-1) # dot productSequential models treat user history as a sequence. SASRec uses self-attention; BERT4Rec does masked-item prediction. These dominate session-based recsys.
Graph methods operate on the user-item bipartite graph. LightGCN is a stripped-down graph convolution that often beats heavy variants. PinSage powers Pinterest's home feed.
And in production at any serious scale, you do multi-stage retrieval:
Candidate generation → Ranking → Re-ranking
1M items → ~1000 candidates → ~50 → top 10
(two-tower / ANN) (cross-features, (diversity,
gradient boosting, business rules,
deep models) exploration)Get this picture into your interview answer and you instantly sound senior.
Common pitfalls
The most expensive mistake is using MSE on implicit feedback. The model spends most of its gradient predicting the average click rate (near zero), and recommendations collapse to whatever was popular in the training window. The fix is BPR, WALS, or sampled softmax — pick one and explain it.
Another trap is ignoring popularity bias. MF rewards popular items because they have more signal, so your "personalized" feed looks suspiciously like the global top-100. Mitigate via negative sampling weighted by item popularity, inverse propensity weights, or diversity injection at re-ranking. If you cannot name one debiasing method, expect a follow-up.
A third pitfall is computing cosine similarity on raw interaction counts without normalizing for user activity. A power user with 500 watches dominates every calculation; an occasional user contributes noise. Always normalize — TF-IDF-style item weights or L2-normalized user vectors.
The classic eval mistake is a random train/test split on temporal data, which leaks future information and inflates offline metrics. Recsys needs temporal splits: train before time t, test on t and after. Otherwise your Hit@10 of 0.42 offline becomes CTR of 1.2% in production and you own a bad retro.
Finally, trusting offline metrics blindly. Hit@K and NDCG@K are necessary but not sufficient — the literature is full of papers where offline improved and online A/B regressed. Plan the online experiment, define guardrails (session length, retention, revenue per user), and budget 2-4 weeks for a clean read.
Related reading
- Bias and fairness in data science interviews
- Hyperparameter tuning for ML interviews
- How to calculate cosine similarity in SQL
- How to build lookalike audiences in SQL
- How to calculate Jaccard similarity in SQL
If you want to drill recsys interview questions across MF, ALS, BPR, and two-tower like this every day, NAILDD is launching with structured problem sets across exactly this pattern.
FAQ
How many latent factors should I pick for matrix factorization?
The standard range is k = 32 to 256, with 64 or 128 as common defaults. Cross-validate against your real ranking metric (NDCG@10 or Hit@K) on a temporal holdout. Larger k overfits faster on sparse data — if you go from 64 to 256, also bump λ and add early stopping.
Is matrix factorization still good enough for production at a serious company?
Yes, as a baseline and often as the candidate-generation layer. Spark ALS handles billions of interactions and produces a strong retrieval set. Modern systems layer a deeper ranking model (gradient boosting or a neural ranker) on top. Skipping MF and jumping to transformers is a fast way to ship something that costs ten times as much and performs worse.
What are Factorization Machines and should I mention them?
FMs generalize MF to arbitrary feature interactions — beyond user and item IDs, you can plug in context, device, time of day, and the model learns pairwise interactions via low-rank factors. Useful for cold start because they score new items from features. Mentioning FMs signals you read past the textbook chapter.
How do I evaluate a recsys offline before A/B testing?
Use temporal splits, not random. Train before time t, predict for users in the test window, compute Hit@K, NDCG@K, MAP@K, MRR at multiple K values. Always sanity-check against a popularity baseline — if a two-tower model only beats popularity by 2 percentage points, the complexity isn't worth it.
Is this content official?
No. This is an interview-prep digest based on standard references — Koren 2009, Rendle 2009 (BPR), Hu et al. 2008 (implicit feedback), plus LightGCN and SASRec. Treat it as a map, not gospel.