May 9, 2026·13 min read

Embedding quality evaluation for DS interviews

Contents:

Why interviewers ask this
Intrinsic evaluation
Extrinsic evaluation
MTEB and public benchmarks
Building a domain-specific eval set
Common pitfalls
Related reading
FAQ

Why interviewers ask this

Picture the moment: a senior DS at Stripe asks, "We swapped text-embedding-3-small for an open-weight model to cut inference cost. How would you prove the new embeddings are good enough to ship?" That question bundles three skills the panel is testing: do you know the difference between intrinsic and extrinsic evaluation, can you read an MTEB leaderboard without taking the top row at face value, and can you design an eval set that reflects your own domain.

Embeddings sit at the boundary between research and production — search at Notion, recommendation at Spotify, RAG at every AI-native startup, fraud signals at Stripe. When the interviewer probes quality evaluation, they want to see that you can defend a model choice with numbers, not vibes, not a single accuracy score on a single dataset.

This is the answer you would give with ten minutes on a whiteboard. The three load-bearing ideas: intrinsic eval probes the geometry of the embedding space, extrinsic eval measures task utility, and MTEB is a starting point, not a verdict.

Intrinsic evaluation

Intrinsic evaluation looks at properties of the embedding space without committing to a specific downstream task. It is fast, cheap, and tells you whether the geometry roughly matches human intuition about meaning. It is also the category most likely to mislead you if it is the only thing you measure.

The classic intrinsic probe is the similarity task. You take a curated set of word pairs or sentence pairs annotated by humans with similarity scores — say, the STS-B benchmark with scores from 0 to 5 — and you compute cosine similarity between the corresponding embeddings. The metric is the Spearman correlation between cosine and human score. A correlation of 0.85+ is competitive for modern sentence encoders; anything under 0.70 is a red flag for general-purpose use.
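The similarity probe above can be sketched in a few lines of standard-library Python. This is an illustration, not a reference implementation: the function names (`cosine`, `average_ranks`, `spearman`, `sts_spearman`) are hypothetical, and the embeddings are assumed to arrive as plain lists of floats, one per sentence.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def sts_spearman(emb_a, emb_b, human_scores):
    """emb_a[i] and emb_b[i] are the embeddings of the two sentences in pair i."""
    cosines = [cosine(u, v) for u, v in zip(emb_a, emb_b)]
    return spearman(cosines, human_scores)
```

Because Spearman only looks at rank order, the 0 to 5 human scale and the cosine range never need to be rescaled to match each other.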

A second intrinsic probe is the analogy task, popularized by word2vec: king − man + woman ≈ queen. You measure the share of analogies the model resolves correctly when nearest-neighbor search is run on the resulting offset vector. Modern contextual encoders score worse than static embeddings on this, which is a feature, not a bug: analogies stopped being load-bearing once sentence-level models replaced word-level ones.

Clustering quality is the third intrinsic signal. Take a labeled corpus, embed each item, run k-means or HDBSCAN, and measure adjusted Rand index or normalized mutual information against the true labels. If the clusters do not line up with the labels, the geometry is not separating the categories you care about — and no downstream classifier will fix that without a lot of fine-tuning.
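For the scoring step, the adjusted Rand index can be computed directly from the two label lists. The sketch below assumes the cluster ids already exist (from k-means, HDBSCAN, or anything else); the function name is hypothetical, and in practice a library implementation would be the usual choice.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between gold labels and cluster ids (1.0 = identical partitions)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # a single item cannot be clustered wrongly
    # Pairs of items that fall together in both partitions, or in each alone.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in one cluster on both sides
    return (sum_both - expected) / (max_index - expected)
```

ARI ignores the cluster ids themselves, so a clustering that matches the labels up to renaming still scores 1.0, and a clustering no better than chance lands near 0.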

| Intrinsic task | Typical metric | Strong score | What it tells you |
| --- | --- | --- | --- |
| Word/sentence similarity | Spearman ρ | 0.80+ | Geometry matches human intuition |
| Analogy | Top-1 accuracy | 0.65+ (static) | Linear structure in space |
| Clustering | NMI / ARI | 0.55+ | Categorical separation |
| Isotropy | Mean cosine | < 0.10 | No global "drift" direction |

Load-bearing point: Intrinsic eval is necessary but never sufficient. A model that wins STS-B by two points can still lose your production retrieval benchmark by fifteen. Use it as a smoke test, not as a shipping criterion.

Extrinsic evaluation

Extrinsic evaluation uses the embeddings inside an actual downstream task and measures task performance. This is what the panel really wants to hear about, because it is what the business cares about: did revenue go up, did support tickets resolve faster, did the search box stop returning garbage on the top result.

The three downstream tasks that dominate interview discussions are classification, retrieval, and semantic textual similarity. Classification is the simplest: freeze the embeddings, train a linear probe or shallow MLP on top, and report accuracy or macro-F1. The trick is to keep the head small — if you stack a 3-layer transformer on top, you are no longer measuring the embedding, you are measuring the head.

Retrieval is the workhorse evaluation for RAG and search. You build a corpus, a set of queries, and a relevance judgment for each query–document pair (often binary: relevant or not). You then rank documents by cosine similarity and compute NDCG@10, MRR, Recall@k, and sometimes Hit Rate. Read every difference against the noise floor of your eval set: if the candidate gets an NDCG@10 of 0.45 and the baseline gets 0.42, that three-point gap is usually a real lift worth a launch, provided the set has a few hundred queries or more; if it is 0.451 vs 0.450, that is noise and you should stop optimizing.
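The rank-based metrics named above are short enough to write out. A minimal sketch, assuming each query maps to a ranked list of document ids and a set of relevant ids; the function names and the dictionary layout are illustrative choices, not a standard API.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Share of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def hit_rate_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0

def mean_over_queries(metric, rankings, qrels, **kwargs):
    """Average a per-query metric; rankings and qrels are keyed by query id."""
    return sum(metric(rankings[q], qrels[q], **kwargs) for q in rankings) / len(rankings)

# Hypothetical toy data: two queries, three retrieved documents each.
rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4", "d6"]}
qrels = {"q1": {"d1"}, "q2": {"d5", "d6"}}
mrr = mean_over_queries(reciprocal_rank, rankings, qrels)          # (1/2 + 1) / 2
recall_2 = mean_over_queries(recall_at_k, rankings, qrels, k=2)    # (1 + 1/2) / 2
```

MRR rewards getting one good answer near the top, while Recall@k rewards covering all relevant documents, which is why the two can disagree on the same ranking.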

Semantic Textual Similarity sits between intrinsic and extrinsic. You predict a similarity score for sentence pairs and correlate with human ratings — same as the intrinsic similarity task, but on longer, more task-realistic text. It is the standard supervised eval for sentence encoders and the bulk of the MTEB STS suite.

The downside of extrinsic eval is task specificity. An embedding that wins your retrieval task may lose someone else's classification task. Different tasks can rank embeddings in completely different orders, which is why papers report a battery of benchmarks rather than a single number.

MTEB and public benchmarks

The Massive Text Embedding Benchmark is the de facto leaderboard for sentence embeddings. It hosts 56+ tasks across 8 categories: classification, clustering, pair classification, reranking, retrieval, STS, summarization, and bitext mining. The Hugging Face leaderboard ranks models by mean score across all tasks, and new releases — Cohere Embed v3, OpenAI text-embedding-3-large, Voyage, BGE, GTE, E5 — first prove themselves there.

Treat the MTEB ranking as a prior, not a verdict. The leaderboard mostly covers English Wikipedia-flavored text. If your production data is SQL query logs, legal contracts, medical notes, or chat transcripts, the ranking on that data can shift by 10+ positions. A model that scores 64.2 on MTEB average may underperform a 60.0-average model on your customer support tickets because the lower-ranked model was trained on conversational data.

Interview gold: when asked which embedding model you would pick, do not name a model. Instead, walk through the evaluation pipeline — we would shortlist three or four top-quartile MTEB models, then run our own retrieval benchmark on a sampled 1,000-query slice from production, then pick based on cost-per-query and latency at the chosen recall. That answer signals seniority. "We use OpenAI" does not.

| Benchmark | Coverage | Best for |
| --- | --- | --- |
| MTEB | 56+ tasks, English-heavy | Initial shortlist |
| BEIR | Retrieval, 18 datasets | RAG-focused selection |
| MIRACL | Multilingual retrieval | Non-English production |
| LoCo | Long-context retrieval | Documents > 2k tokens |

Building a domain-specific eval set

The single highest-leverage move a DS can make in this conversation is to describe a domain eval set the team can run in CI. Generic benchmarks miss the quirks that actually matter — that account means something different at a bank than at Notion, that python is a snake on most of the internet and a programming language in your support inbox.

Start by collecting 300 to 1,000 representative query–document pairs from production. Sample queries weighted by frequency, not uniformly: you want the eval to reflect what users actually ask, not the long tail of unique requests. Label each pair as relevant or not relevant, ideally with two annotators and a third for ties. Even 300 well-labeled pairs beat 10,000 noisy ones for catching regressions.

A minimal Python harness looks like this — short enough that you can sketch it on a whiteboard:

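```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def ndcg_at_k(scores, relevance, k=10):
    """NDCG@k for one query; scores and relevance are aligned per document."""
    relevance = np.asarray(relevance, dtype=float)
    order = np.argsort(-np.asarray(scores))[:k]    # top-k documents by similarity
    rel = relevance[order]
    # Size the discounts to the ranked list so corpora smaller than k still work.
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(relevance)[::-1][:k]           # best achievable ordering
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(model, queries, docs, qrels, k=10):
    """Mean NDCG@k; qrels[i][j] is the relevance of docs[j] for queries[i]."""
    q_emb = model.encode(queries)
    d_emb = model.encode(docs)
    sims = cosine_similarity(q_emb, d_emb)         # shape: (n_queries, n_docs)
    return float(np.mean([
        ndcg_at_k(sims[i], qrels[i], k)
        for i in range(len(queries))
    ]))
```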

Run this harness on every candidate model and you have a number to defend. Bonus points for tracking it in CI and alerting when a model swap drops NDCG@10 by more than two points.

Sanity check: If your domain eval set takes longer than five minutes to run, nobody on the team will run it. Keep it small, fast, and version-controlled next to the code.

For high-stakes domains such as medical, legal, and finance, domain-specific models like SciBERT, Legal-BERT, or fine-tuned BGE variants can beat the best general models by 5 to 15 points NDCG@10 once they are tuned for similarity. The interview-grade answer is: shortlist generic models, fine-tune one on a few thousand in-domain pairs with a contrastive loss, and benchmark all of them on the same domain eval set.

Common pitfalls

The first pitfall is picking by MTEB average alone. The leaderboard rewards models that are decent at everything; your job is to be excellent at your task. A model that ranks 12th on average can still be the best for your retrieval slice. The fix is to ignore the average and look only at the categories that match your use case.

The second pitfall is evaluating on the same data you trained or tuned on. This happens constantly when teams use synthetic eval sets generated by the same LLM that produced training labels. The contamination inflates every metric and you ship a model that crashes on real traffic. The fix is to hold out a time-based slice from production before any model touches it.
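A time-based holdout is simple to express in code. The sketch below assumes each production record is a dict with hypothetical `timestamp` and `query` fields; the leak check on normalized query text is an extra sanity step added here for illustration, not something the split itself requires.

```python
from datetime import datetime

def time_based_holdout(records, cutoff):
    """Split records by timestamp; everything at or after cutoff is eval-only."""
    train_pool = [r for r in records if r["timestamp"] < cutoff]
    holdout = [r for r in records if r["timestamp"] >= cutoff]
    return train_pool, holdout

def leaked_queries(train_pool, holdout):
    """Normalized query strings that appear on both sides of the split."""
    seen = {r["query"].strip().lower() for r in train_pool}
    return sorted({r["query"].strip().lower() for r in holdout} & seen)

# Hypothetical records for illustration only.
records = [
    {"timestamp": datetime(2026, 1, 5), "query": "reset password"},
    {"timestamp": datetime(2026, 2, 1), "query": "export invoice"},
    {"timestamp": datetime(2026, 3, 10), "query": "Reset Password "},
]
train_pool, holdout = time_based_holdout(records, cutoff=datetime(2026, 3, 1))
overlap = leaked_queries(train_pool, holdout)
```

Whether to drop the overlapping queries is a judgment call: repeated head queries are real traffic, but reporting metrics with and without them shows how much of the score comes from memorization.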

The third pitfall is ignoring isotropy. Many encoders produce embeddings clustered in a narrow cone — all cosine similarities sit between 0.6 and 0.9, and rank order becomes brittle. Whitening can claw back 3 to 8 points NDCG@10 on retrieval. Measure mean off-diagonal cosine on a random sample and apply whitening if it is above 0.3.
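The measurement half of that advice fits in a few lines. This sketch computes the mean off-diagonal cosine over a sample of embeddings given as plain lists of floats; the function names are hypothetical, the 0.3 threshold is the one quoted above, and the whitening step itself is left out. The loop is quadratic in the sample size, so a few hundred vectors is a sensible sample.

```python
import math
from itertools import combinations

def mean_offdiag_cosine(embeddings):
    """Mean cosine similarity over all distinct pairs of non-zero vectors."""
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            unit.append([x / norm for x in vec])
    if len(unit) < 2:
        raise ValueError("need at least two non-zero embeddings")
    sims = [sum(a * b for a, b in zip(u, v)) for u, v in combinations(unit, 2)]
    return sum(sims) / len(sims)

def needs_whitening(embeddings, threshold=0.3):
    """Flag a narrow-cone embedding space using the rule of thumb above."""
    return mean_offdiag_cosine(embeddings) > threshold
```

Unrelated texts should average close to zero; a high mean on a random sample suggests that one shared direction dominates every vector.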

The fourth pitfall is conflating embedding quality with chunking quality. In RAG, chunk boundaries — fixed-size, sentence-aware, semantic — often dominate the eval metric more than the embedding model does. If you swap the model and see no movement, the embedding is probably not your bottleneck. Ablate chunking and embedding independently.

The fifth pitfall is forgetting cost and latency. The best-scoring model on MTEB may take 80 ms per query on CPU when your product budget is 10 ms. A second-best model at 8 ms is the right call. Report a Pareto frontier of quality vs latency vs cost-per-million-queries, not a single quality number.
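One way to build that Pareto frontier is to drop every model that another model beats or ties on all three axes. The sketch below is illustrative: the model names, the field names (`ndcg`, `latency_ms`, `cost`), and every number are made up.

```python
def pareto_frontier(models):
    """Keep models that no other model dominates (higher quality, lower latency and cost)."""
    def dominates(a, b):
        no_worse = (a["ndcg"] >= b["ndcg"]
                    and a["latency_ms"] <= b["latency_ms"]
                    and a["cost"] <= b["cost"])
        strictly_better = (a["ndcg"] > b["ndcg"]
                           or a["latency_ms"] < b["latency_ms"]
                           or a["cost"] < b["cost"])
        return no_worse and strictly_better
    return [m for m in models if not any(dominates(other, m) for other in models)]

# Hypothetical candidates; cost is per million queries in arbitrary units.
candidates = [
    {"name": "A", "ndcg": 0.52, "latency_ms": 80, "cost": 40},
    {"name": "B", "ndcg": 0.50, "latency_ms": 8, "cost": 6},
    {"name": "C", "ndcg": 0.47, "latency_ms": 9, "cost": 7},   # dominated by B
    {"name": "D", "ndcg": 0.44, "latency_ms": 3, "cost": 2},
]
frontier = pareto_frontier(candidates)
within_budget = [m for m in frontier if m["latency_ms"] <= 10]
pick = max(within_budget, key=lambda m: m["ndcg"])
```

With these invented numbers the top-quality model A stays on the frontier but misses the 10 ms budget, so the pick is B, which mirrors the second-best-model argument above.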

FAQ

How many queries do I need in a domain eval set?

A practical floor is 300 well-labeled query–document pairs drawn from production traffic. Below that, NDCG@10 swings by 2 to 4 points from sampling noise and you cannot tell a real lift from a coin flip. Above 1,000 pairs you stop getting meaningful tightening of the confidence interval and mostly pay annotation cost for no decision benefit. Two-annotator agreement on a 300-pair set is the sweet spot.
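You can check the noise on your own set with a percentile bootstrap over per-query scores. A minimal sketch using only the standard library; the function name and defaults are assumptions, and the input is whatever per-query metric you already compute (for example one NDCG@10 value per query).

```python
import random

def bootstrap_ci(per_query_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a per-query metric."""
    n = len(per_query_scores)
    if n == 0:
        raise ValueError("need at least one per-query score")
    rng = random.Random(seed)  # fixed seed so CI runs are reproducible
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement and record the resampled mean.
        sample = [per_query_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

To compare two models, pass the per-query differences (candidate minus baseline) instead of raw scores: if the interval excludes zero, the lift is unlikely to be sampling noise.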

Is MTEB enough for picking a production embedding model?

No. MTEB is the right tool for shortlisting — pull the top 5 to 10 models in the categories that match your task, then evaluate them on a domain set you control. The leaderboard is English-Wikipedia-flavored; if your traffic is multilingual, conversational, or long-context, the ranking can shift dramatically. Use MTEB to narrow the funnel and your own eval to decide.

Should I prefer intrinsic or extrinsic evaluation when time is short?

Always extrinsic, on a task that mirrors your production use case. Intrinsic metrics are a fast sanity check — if STS correlation tanks after a model swap, something is wrong — but they should never be the deciding signal. The right ordering is: smoke-test intrinsic, decide with extrinsic on a domain set, monitor in production with online metrics like CTR or task completion.

When does it make sense to fine-tune your own embedding model?

Three signals. First, you have at least 5,000 labeled or weakly labeled positive pairs from your domain, and ideally closer to 10,000. Second, the top open-source model on your domain eval scores below your business threshold. Third, you have inference budget to host it. Contrastive fine-tuning of BGE or GTE base on a few thousand pairs typically lifts NDCG@10 by 5 to 12 points on the target domain.

How do I evaluate embeddings for a multilingual product?

Use MIRACL or a multilingual slice of MTEB to shortlist, and do not assume that strong English performance transfers. Multilingual encoders like multilingual-E5, BGE-M3, or Cohere multilingual tend to dominate. Build a per-language eval slice, even 100 pairs per language, and report metrics per language, not as a global average. A model can look fine on average and be unusable in Japanese.

What is the single biggest mistake junior candidates make here?

Naming a model instead of describing a process. "I would use text-embedding-3-large" is a guess, not an answer. The interview-grade response describes the funnel: shortlist by leaderboard, narrow by domain eval, decide on a quality-cost-latency Pareto, monitor in production. Even if you end up picking the same model, the reasoning is what gets you the offer.