May 20, 2026·13 min read

Embeddings in the Data Scientist interview

Contents:

What embeddings actually are
word2vec
GloVe and FastText
Contextual embeddings: BERT
Sentence embeddings
Similarity search and vector stores
Common pitfalls
Related reading
FAQ

What embeddings actually are

An embedding is a fixed-length vector that represents a token, sentence, image, or user inside a semantically meaningful space. Two objects that are similar in meaning sit close together; two that are unrelated end up far apart. That single property — geometry encodes meaning — is what makes embeddings the lingua franca of modern ML: from search and recommendations to retrieval-augmented generation and zero-shot classification.

If your interviewer at OpenAI, Anthropic, or a Stripe ML team asks "how do you turn text into a vector?", they are not looking for a one-word answer. They want to hear the lineage: count-based vectors, word2vec, contextual transformers, sentence encoders, and what each of them gets wrong. The whole point of the question is to see whether you understand why we stopped using the previous generation, not to recite acronyms.

king   -> [ 0.21, -0.45,  0.78, ...,  0.13]   (dim 300)
queen  -> [ 0.19, -0.40,  0.81, ...,  0.15]
apple  -> [-0.32,  0.67,  0.12, ..., -0.81]

The cosine distance between king and queen is small; the distance between king and apple is large. That is the geometry the rest of the pipeline assumes. Break it — by mixing models, by skipping normalization, by silently changing the dimension — and every downstream system breaks in a quiet, hard-to-debug way.

word2vec

word2vec (Mikolov et al., 2013) was the first widely adopted method to produce word vectors good enough for real downstream tasks. It rests on the distributional hypothesis: words that appear in similar contexts have similar meanings. word2vec turned that idea into a tractable training objective.

There are two architectures worth knowing by name. CBOW (Continuous Bag of Words) predicts a target word from the surrounding context — fast to train, decent on common words. Skip-gram does the opposite and predicts the context from a target word; it is slower but consistently better on rare words. Both train a single hidden layer whose width equals the embedding dimension, typically 100 to 300.

The training trick that made word2vec viable on billion-token corpora is negative sampling. Instead of a softmax over the entire vocabulary at every step, you frame the task as binary classification: is this (word, context) pair real, or is the context word a random sample from the vocabulary? You only update weights for a handful of negatives per positive, which collapses the cost from O(vocab) to O(k).
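
The binary-classification framing above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference word2vec implementation: the function name `sgns_loss`, the toy sizes, and the random vectors are all invented for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_loss(center, context, negatives):
    """Skip-gram negative-sampling loss for one (word, context) pair.

    center:    (d,)   vector of the target word
    context:   (d,)   vector of the observed context word (label 1)
    negatives: (k, d) vectors of k sampled noise words (label 0)
    """
    pos = np.log(sigmoid(context @ center))
    neg = np.log(sigmoid(-(negatives @ center))).sum()
    return float(-(pos + neg))


rng = np.random.default_rng(0)
d, k = 8, 5  # toy sizes, chosen for illustration only
center = rng.normal(size=d)
context = rng.normal(size=d)
negatives = rng.normal(size=(k, d))

loss = sgns_loss(center, context, negatives)
```

Only the center vector, one context vector, and k negative vectors enter the loss, so a gradient step touches k + 1 output rows instead of the whole vocabulary.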

Property	Behaviour
Semantic similarity	Reflected in cosine distance
Vector arithmetic	`vec(king) - vec(man) + vec(woman) ≈ vec(queen)`
Polysemy	One vector per surface form — "bank" is averaged across senses
OOV	New words get no vector at all

Gotcha: word2vec produces one vector per vocabulary entry, full stop. If "bank" means a river bank in one sentence and a financial bank in another, both meanings get blended into the same embedding. That single limitation is why contextual models eventually took over.

GloVe and FastText

GloVe (Global Vectors, Stanford, 2014) stays in the same one-vector-per-word regime but trains on global co-occurrence statistics instead of local context windows. You build a word-by-word co-occurrence matrix over the whole corpus and fit word vectors whose dot products approximate the log co-occurrence counts, which amounts to a weighted factorization of that matrix. The end result is similar in quality to word2vec, but the optimization story is cleaner: you can write down what the loss is actually minimizing without invoking sampling tricks.
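
As a rough sketch of that loss: GloVe is usually described as a weighted least-squares fit of dot products (plus biases) to log co-occurrence counts. The function below follows that description with the weighting constants reported in the GloVe paper (x_max = 100, alpha = 0.75). The function name, the tiny count matrix, and the random vectors are assumptions made for illustration.

```python
import numpy as np


def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective over nonzero co-occurrence counts.

    W, W_ctx: (vocab, dim) word and context vectors
    b, b_ctx: (vocab,)     word and context biases
    X:        (vocab, vocab) co-occurrence counts
    """
    i, j = np.nonzero(X)
    x = X[i, j]
    # Down-weight rare pairs and cap the influence of very frequent ones.
    weight = np.minimum((x / x_max) ** alpha, 1.0)
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return float(np.sum(weight * (pred - np.log(x)) ** 2))


# Toy 3-word co-occurrence counts, invented for illustration.
X = np.array([[0.0, 4.0, 1.0],
              [4.0, 0.0, 2.0],
              [1.0, 2.0, 0.0]])
rng = np.random.default_rng(0)
W, W_ctx = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
b, b_ctx = np.zeros(3), np.zeros(3)
loss = glove_loss(W, W_ctx, b, b_ctx, X)
```

Zero-count pairs are skipped entirely, which is why the objective stays well defined despite the logarithm.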

FastText (Facebook, 2016) extends word2vec to character n-grams. Each word is represented as the sum of its subword vectors. That single change solves two real production headaches at once: out-of-vocabulary words still get a sensible vector from their subwords, and morphologically related forms like running, runner, and runs end up close in space because they share character n-grams. For languages with rich morphology (German compounds, agglutinative languages such as Turkish or Finnish, anything Slavic), FastText was the practical default before transformers arrived.
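
To make the subword idea concrete, here is a minimal sketch of character n-gram extraction. It assumes the setup described in the FastText paper (n-grams of length 3 to 6, with `<` and `>` as word-boundary markers); the helper names are hypothetical, and the real library additionally hashes n-grams into buckets and keeps a vector for the whole word.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


def shared_ngrams(a, b):
    """N-grams two words have in common; their vectors share these summands."""
    return char_ngrams(a) & char_ngrams(b)


overlap = shared_ngrams("running", "runs")
```

Because a word vector is the sum of its n-gram vectors, any overlap like this pulls the two words toward each other, and an unseen word still gets a vector as long as some of its n-grams were seen in training.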

In 2026, none of these three are what you reach for in a serious production NLP system. They still earn their keep as baselines and in low-resource settings where you cannot afford to fine-tune a 1B-parameter encoder.

Contextual embeddings: BERT

BERT-style embeddings are contextual: the same token gets a different vector depending on the sentence it sits in. bank in "I went to the bank for a loan" and bank in "on the bank of the river" land in different regions of the space, because BERT's self-attention layers let each token attend to every other token in the input.

[CLS] I went to the bank for a loan [SEP]
                      |
                    BERT
                      |
[CLS_emb, I_emb, went_emb, ..., bank_emb_financial, ...]

[CLS] on the bank of the river [SEP]
                      |
                    BERT
                      |
[..., bank_emb_river, ...]

Three usage patterns are worth knowing cold. For classification, grab the [CLS] vector and push it through a small linear head and softmax: the classic BERT fine-tuning recipe. For token-level tasks like NER, take the per-token embeddings and feed each into a classifier. For document-level retrieval, average-pooled or [CLS] vectors from a vanilla BERT perform poorly, because pretraining never optimized them for sentence-level similarity. That is the gap sentence encoders were built to fill.
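
The two pooling operations named above are simple to write down. The sketch below works on a plain array of per-token vectors, as you would get from an encoder's last hidden layer; no model is loaded, and the function names and toy numbers are assumptions for illustration.

```python
import numpy as np


def mean_pool(token_embs, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    token_embs:     (batch, seq_len, dim) per-token encoder outputs
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embs.dtype)
    summed = (token_embs * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def cls_pool(token_embs):
    """Take the first position, which holds [CLS] in BERT-style inputs."""
    return token_embs[:, 0, :]


# Two toy sequences of length 3 with dim 2; the first has one padding token.
token_embs = np.arange(12, dtype=float).reshape(2, 3, 2)
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

mean_vecs = mean_pool(token_embs, attention_mask)
cls_vecs = cls_pool(token_embs)
```

Forgetting the mask is a common bug: padding vectors then leak into the average and short sentences drift toward whatever the padding token encodes.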

Dimensionality: BERT-base is 768, BERT-large is 1024. Modern open encoders like E5 and BGE typically expose 384 to 1024 dimensions, sometimes with Matryoshka truncation so you can choose a smaller dimension after training, as long as the index and the queries use the same one.

Sentence embeddings

The Sentence-BERT paper (Reimers and Gurevych, 2019) showed that vanilla BERT [CLS] vectors are surprisingly weak for sentence similarity — sometimes worse than averaging GloVe vectors. The fix is to fine-tune with a siamese architecture on pairs of sentences with a contrastive or triplet loss, so the geometry actually reflects sentence-level meaning.

Family	Examples	Notes
Closed API	OpenAI `text-embedding-3-small/large`	Strong, billed per token, Matryoshka-friendly
Open general	`intfloat/e5-large`, `BAAI/bge-large-en`	Competitive on MTEB, easy to self-host
Open multilingual	`bge-m3`, `multilingual-e5-large`	Use when your corpus is not English-only
Specialized	BioBERT, FinBERT, code embeddings	Domain-tuned, much better in their niche

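from sentence_transformers import SentenceTransformer

# E5 models were trained with "query: " / "passage: " prefixes on their inputs;
# the model card recommends keeping them at inference time.
model = SentenceTransformer("intfloat/e5-large-v2")
embs = model.encode(
    ["query: sentence one", "query: sentence two"],
    normalize_embeddings=True,  # L2-normalize so dot product equals cosine similarity
)
# embs.shape == (2, 1024)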

Picking a model is a four-axis decision: language coverage, domain fit, budget (managed API vs. self-host), and context length. The MTEB leaderboard is a starting point, not an answer — always re-evaluate on a slice of your own data before you commit your entire vector store to one encoder.

Similarity search and vector stores

Once you have embeddings, the question is how to find nearest neighbours fast. For a few thousand documents, plain NumPy with cosine similarity is fine and you should not over-engineer it. Past roughly 100k vectors, you want a real index.
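
For the small-corpus case, exact search really is a few lines. The sketch below is one way to do it with NumPy; the function name and the toy 2-dimensional vectors are assumptions for illustration.

```python
import numpy as np


def top_k_cosine(query, corpus, k=3):
    """Exact top-k by cosine similarity; returns (indices, scores)."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    k = min(k, len(scores))
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k in O(n)
    idx = idx[np.argsort(-scores[idx])]        # order only those k
    return idx, scores[idx]


corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])
idx, scores = top_k_cosine(query, corpus, k=2)
```

In practice you would normalize the corpus once at load time rather than on every query; it is repeated here only to keep the function self-contained.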

Store	Best fit	Index type
FAISS	Self-hosted, batch workloads	HNSW, IVF, PQ
Chroma	Prototyping, Python-first apps	HNSW
Qdrant	Production self-host, filters	HNSW with payload filtering
pgvector	Already have Postgres	HNSW, IVFFlat
Pinecone, Vespa	Managed at scale	Various
Milvus	Billion-scale corpora	HNSW, IVF, DiskANN

Load-bearing trick: at scale you do not run exact nearest neighbour search. You use an ANN (Approximate Nearest Neighbours) index — HNSW or IVF — and accept that recall is 90 to 99% in exchange for millisecond latency. A serious answer to "how do you make vector search fast?" mentions the recall/latency knob explicitly.
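
The recall/latency knob is easiest to see in a toy inverted-file (IVF) index. The sketch below is a deliberately simplified illustration, not how FAISS or any vector store implements it: it clusters unit vectors with a few k-means-style steps, then scans only the `nprobe` closest clusters at query time. All names, sizes, and the random data are assumptions.

```python
import numpy as np


def build_ivf(vectors, n_lists, n_iter=10, seed=0):
    """Cluster unit vectors with a few k-means-style steps; return centroids and lists."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmax(vectors @ centroids.T, axis=1)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = np.argmax(vectors @ centroids.T, axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids score highest for the query."""
    probe = np.argsort(-(centroids @ query))[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    scores = vectors[cand] @ query
    return cand[np.argsort(-scores)[:k]]


rng = np.random.default_rng(1)
vectors = rng.normal(size=(2000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0]

centroids, lists = build_ivf(vectors, n_lists=16)
exact = np.argsort(-(vectors @ query))[:10]
fast = ivf_search(query, vectors, centroids, lists, k=10, nprobe=2)
full = ivf_search(query, vectors, centroids, lists, k=10, nprobe=16)
recall = len(set(fast) & set(exact)) / len(exact)
```

With `nprobe` equal to the number of lists the search degenerates to an exact scan; lowering it scans fewer vectors and may miss neighbours that landed in an unprobed cluster, which is the trade-off `recall` measures.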

Two more techniques separate a junior answer from a senior one. Hybrid search combines dense vector results with a sparse signal like BM25 or TF-IDF — vectors miss exact keyword matches (model numbers, error codes, person names), and sparse methods miss paraphrases. Reciprocal Rank Fusion (RRF) is the standard zero-tuning way to merge the two ranked lists. Reranking is the second move: take the top-K from your fast retriever and rerun them through a cross-encoder that scores (query, doc) jointly. Cross-encoders are 10 to 100x slower per pair, but applied to the top 50 or 100 they give a large precision lift and are now standard in production RAG.
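
Reciprocal Rank Fusion fits in a dozen lines. The sketch below uses the constant k = 60 from the original RRF paper; the function name and the document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense = ["d3", "d1", "d7", "d2"]   # hypothetical vector-search results
sparse = ["d1", "d9", "d3", "d4"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF uses only ranks, never raw scores, so you do not have to calibrate cosine similarities against BM25 scores before merging. Documents that appear in both lists, here `d1` and `d3`, rise to the top.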

Common pitfalls

A senior interviewer is mostly listening for the failure modes you have hit in production. The pitfalls below are the ones that show up in real post-mortems, and saying any of them out loud will move you up a level on the rubric.

Reaching for word2vec in 2026 for serious NLP. Static word vectors are a fine choice when you need a CPU-only baseline or you are constrained to a few megabytes of model. For almost anything else — semantic search, classification, RAG retrieval, paraphrase detection — a small sentence-transformer beats word2vec by a wide margin while still running on commodity hardware.

Treating raw BERT [CLS] as a sentence embedding. Without fine-tuning, [CLS] vectors from a pretrained-only BERT are notoriously bad at semantic similarity. If you find yourself doing this in a notebook, swap to a dedicated sentence encoder such as E5, BGE, or text-embedding-3-small before you wire it into anything users will see.

Skipping normalization before cosine similarity. The dot product equals cosine similarity only on unit-length vectors. Many models return non-normalized output by default, so always either pass normalize_embeddings=True or divide by the L2 norm yourself. Forgetting this while searching with a dot-product index silently turns "cosine" into an unnormalized dot product, and your top-K starts depending on vector length as well as direction.
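
A two-document toy case shows how the ranking flips. The vectors below are invented for illustration: one document points almost exactly where the query does but is short, the other is 45 degrees off but long.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = np.array([
    [0.9, 0.1],  # nearly the same direction as the query, short
    [5.0, 5.0],  # 45 degrees away from the query, but long
])

raw_dot = docs @ query
unit_docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
cosine = unit_docs @ (query / np.linalg.norm(query))

best_by_dot = int(np.argmax(raw_dot))        # picks the long vector
best_by_cosine = int(np.argmax(cosine))      # picks the aligned vector
```

The raw dot product ranks the long, misaligned document first; after L2 normalization the aligned one wins.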

Mixing embeddings from different models in the same store. Vectors from model A and model B do not live in the same space, even if the dimension matches. One store, one model. If you have to migrate, re-embed the entire corpus and version the index; do not interleave.

Ignoring domain shift. Generic encoders are mediocre on specialized corpora — clinical notes, contract clauses, e-commerce SKUs full of model numbers. Either fine-tune on in-domain pairs, pick a domain-specific encoder, or layer a cross-encoder reranker on top. Hybrid search with BM25 often closes most of the gap for free in keyword-heavy domains.

Upgrading the encoder and forgetting to reindex. When you change the embedding model — even a patch version of the same family — every vector in the database is now in a slightly different space. Treat the embedding model like a database schema: pin the version, store it alongside the vectors, and trigger a full re-embed on any change.
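
One lightweight way to enforce the schema analogy is to store a small spec next to the index and compare it before every query. The sketch below is an assumption about how such a guard could look; `EmbeddingSpec`, `IndexMismatchError`, and `check_compatible` are hypothetical names, not part of any vector store's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model_name: str
    model_version: str
    dim: int
    normalized: bool


class IndexMismatchError(RuntimeError):
    pass


def check_compatible(index_spec, encoder_spec):
    """Refuse to query an index with an encoder it was not built with."""
    if index_spec != encoder_spec:
        raise IndexMismatchError(
            f"index built with {index_spec}, encoder is {encoder_spec}; re-embed first"
        )


# Hypothetical version strings, for illustration only.
index_spec = EmbeddingSpec("intfloat/e5-large-v2", "rev-a", 1024, True)
same_encoder = EmbeddingSpec("intfloat/e5-large-v2", "rev-a", 1024, True)
new_encoder = EmbeddingSpec("intfloat/e5-large-v2", "rev-b", 1024, True)

check_compatible(index_spec, same_encoder)  # passes silently
try:
    check_compatible(index_spec, new_encoder)
    mismatch_detected = False
except IndexMismatchError:
    mismatch_detected = True
```

Failing loudly at query time is cheaper than debugging a slow decline in retrieval quality weeks after a silent model upgrade.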

If you want to drill questions like these against a timer, NAILDD ships 500+ ML and DS problems built around exactly this pattern.

FAQ

What embedding dimension should I pick?

The common sizes are 384 (small), 768 (base), 1024 (large), and 1536 (OpenAI legacy). Bigger usually means slightly better quality, but storage grows linearly with dimension, and search latency grows roughly linearly with it once your index gets large. For most retrieval workloads, 768 is the sweet spot. Move to 1024 or higher only when you have benchmark evidence on your own data that the extra dimensions pay for themselves.
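
The storage side of that trade-off is plain arithmetic. The sketch below assumes float32 vectors (4 bytes per value) and a hypothetical corpus of 10 million vectors, and it ignores index overhead such as HNSW graph links.

```python
def index_size_gb(n_vectors, dim, bytes_per_value=4):
    """Raw vector storage in GB (10**9 bytes), excluding index overhead."""
    return n_vectors * dim * bytes_per_value / 1e9


# 10M vectors is an assumed corpus size, chosen for illustration.
sizes = {dim: index_size_gb(10_000_000, dim) for dim in (384, 768, 1024, 1536)}
# sizes == {384: 15.36, 768: 30.72, 1024: 40.96, 1536: 61.44}
```

Doubling the dimension doubles the footprint, so the jump from 768 to 1536 costs roughly another 30 GB of memory at this corpus size before any quality gain is measured.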

Can I shrink an existing embedding without retraining?

Yes, three approaches, although the last one trains a new, smaller model. Matryoshka embeddings, supported natively in the newest OpenAI, Mistral, and Nomic models, let you truncate to the first N dimensions (re-normalize afterwards) and still get most of the quality. PCA on a representative sample gives you a learned projection that often beats naive truncation on older models. Knowledge distillation trains a smaller encoder to match the larger one's output; it is the heaviest option but produces the best quality-per-byte at scale.
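
The first two approaches can be sketched with NumPy alone. Truncation only preserves quality when the model was trained with a Matryoshka-style objective; the PCA variant makes no such assumption but needs a representative sample to fit on. The function names and the random stand-in embeddings are assumptions for illustration.

```python
import numpy as np


def truncate_and_renormalize(embs, n_dims):
    """Matryoshka-style shrink: keep the first n_dims, then restore unit length."""
    cut = embs[:, :n_dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)


def fit_pca(sample, n_dims):
    """Learn a PCA projection from a representative sample of embeddings."""
    mean = sample.mean(axis=0)
    _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
    return mean, vt[:n_dims].T  # (dim,), (dim, n_dims)


def apply_pca(embs, mean, components):
    """Project onto the learned components and re-normalize."""
    reduced = (embs - mean) @ components
    return reduced / np.linalg.norm(reduced, axis=1, keepdims=True)


rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 64))  # random stand-in for real embeddings
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

small_trunc = truncate_and_renormalize(embs, 16)
mean, components = fit_pca(embs, 16)
small_pca = apply_pca(embs, mean, components)
```

Whichever route you take, apply the same reduction to queries and to the stored corpus, and measure retrieval quality before and after on your own data.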

Embeddings, BM25, or hybrid for retrieval?

Hybrid wins on almost every public benchmark and most production corpora. Pure dense embeddings struggle on exact-keyword queries (product SKUs, error codes, legal citation numbers) because semantic models are trained to capture meaning rather than exact surface form. Pure BM25 struggles on paraphrase and synonym matching. Combining them with Reciprocal Rank Fusion typically lifts recall@10 by 5 to 15 points over either alone, with essentially zero tuning.

What is contrastive learning and why do modern encoders use it?

Contrastive learning trains an encoder so that positive pairs (semantically similar) end up close in vector space and negative pairs (unrelated) end up far apart. Concretely, you minimize a loss like InfoNCE over batches that mix one positive with many negatives. Sentence-BERT, SimCSE, E5, and BGE all use variants of this objective. It is the dominant paradigm because it directly optimizes the geometry that retrieval and similarity tasks actually care about, instead of optimizing a proxy like masked-token or next-token prediction.
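
A minimal NumPy sketch of InfoNCE with in-batch negatives follows. It computes the loss only (no gradients or training loop). The function name, the temperature of 0.05, and the one-hot toy vectors are assumptions for illustration.

```python
import numpy as np


def info_nce(queries, positives, temperature=0.05):
    """InfoNCE with in-batch negatives.

    queries, positives: (batch, dim), L2-normalized. Row i of each forms a
    positive pair; every other row of `positives` is a negative for query i.
    """
    logits = (queries @ positives.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Perfectly aligned pairs versus pairs matched to the wrong row.
aligned = info_nce(np.eye(4), np.eye(4))
shuffled = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

The loss is a softmax cross-entropy where the correct class is the matching row, so it is near zero when each query is closest to its own positive and large when another row in the batch scores higher.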

Multilingual or English-only encoder?

If your corpus is purely English, an English-only model like bge-large-en-v1.5 or e5-large-v2 will usually beat a multilingual model of the same size by a noticeable margin on English-only benchmarks. If you have any non-English text — even a small fraction — go straight to bge-m3 or multilingual-e5-large. The cross-lingual versions are intentionally trained so that "dog" in English and the same concept in another language land in nearby regions, which matters a lot once you query across locales.

Is this content official?

No. This is a study guide synthesized from the original papers (Mikolov 2013, Pennington 2014, Bojanowski 2016, Devlin 2019, Reimers 2019) and the public docs of sentence-transformers, FAISS, and the major vector databases. Verify model behaviour on your own data before shipping.