NLP for the data science interview
What interviewers actually ask
NLP shows up in nearly every modern data science loop, and the depth of the questioning scales with the role. A generalist DS candidate at Stripe or DoorDash will get subword tokenization, embeddings, and one BERT classification scenario. An NLP specialist at Meta or Anthropic will get pushed into attention math, fine-tuning trade-offs, and evaluation design. An applied LLM engineer at OpenAI or Notion will get prompt engineering, RAG pipeline failure modes, and alignment.
The load-bearing trick across all three levels is the same: the interviewer wants you to map a business task to the right method, not recite paper titles. If somebody asks "how would you classify support tickets by intent at 10M tickets a day?" the wrong answer is "I'd use GPT-4". The right answer walks through a fine-tuned encoder (DistilBERT or a small RoBERTa) for ~$200/month inference, with reasoning about latency and label budget.
Load-bearing trick: Memorize the method evolution (bag-of-words → word2vec → BERT → GPT) and the task matrix (classification, NER, QA, summarization). If you can draw both on the whiteboard inside two minutes, you have already passed the NLP screen at most companies.
Tokenization
Text becomes a sequence of integer IDs before any model sees it. The three families you need to name:
Word-level treats each whitespace-delimited word as its own token. Vocabulary balloons past 1M for any non-toy corpus, and any unseen word at inference time becomes the dreaded <UNK> token. Nobody ships this in 2026 outside of legacy systems.
Character-level treats each character as a token. Vocabulary stays tiny — a few hundred symbols — but sequences become 5-10x longer, which kills self-attention compute (it is quadratic in sequence length). Used mostly in niche settings like protein sequences or noisy OCR text.
Subword is what every modern model uses. Frequent words stay whole; rare words get split into recognizable sub-pieces. The three flavors that come up in interviews:
| Tokenizer | Algorithm | Used by | Vocab size |
|---|---|---|---|
| BPE | Iteratively merge the most frequent adjacent symbol pair (bytes in the byte-level variant) | GPT-2/3/4, LLaMA | 32k-128k |
| WordPiece | Merge by likelihood gain, not raw frequency | BERT, DistilBERT | ~30k |
| SentencePiece | Treats raw text as a stream, no pre-tokenization | T5, mBART, multilingual | 32k-256k |
The interview question is almost always "why subword?" The answer has three parts: it solves OOV (a new word like "tokenomics" splits into token + ##omics), it shrinks the vocabulary by 10-30x vs word-level, and it handles morphologically rich languages where a stem combines with dozens of suffixes.
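To make the BPE row of the table concrete, here is a minimal sketch of the merge loop in pure Python. The toy corpus and the function names (`get_pair_counts`, `merge_pair`, `train_bpe`) are illustrative assumptions, not any library's API; production tokenizers add byte-level handling, special tokens, and far faster data structures.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-split corpus."""
    # Start from characters: each word is a tuple of single-character symbols.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Hypothetical toy corpus, chosen only to show shared stems and suffixes.
corpus = "low low low lower lower newest newest newest widest"
merges, vocab = train_bpe(corpus, num_merges=4)
print(merges)
print(list(vocab))
```

After a handful of merges the frequent stem `low` and the suffix `est` become single symbols, while rarer words stay split into smaller pieces. That is the "frequent words stay whole, rare words get split" behavior in miniature.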
The embedding evolution
This is the question that separates candidates who read one blog post from those who actually understand the field. The method timeline matters because each step fixed a specific failure of the previous one.
| Era | Method | Core idea | Killer limitation |
|---|---|---|---|
| Pre-2013 | Bag-of-words / TF-IDF | Count words, weight by rarity | No notion of meaning; "great" and "excellent" are orthogonal |
| 2013-2014 | word2vec / GloVe | Dense vector per word from co-occurrence | One vector per word; "bank" means river-bank and money-bank simultaneously |
| 2018 | ELMo, BERT | Contextual embeddings from a deep encoder | Bidirectional, but expensive; not generative |
| 2018-now | GPT family | Causal decoder, scale-driven emergent abilities | Costly per-token; weaker on pure classification than a fine-tuned encoder |
The classic word2vec demonstration — vector("king") - vector("man") + vector("woman") ≈ vector("queen") — is still cited in interviews, but every production NLP stack today uses contextual embeddings, where the vector for "bank" depends on whether the surrounding tokens say "river" or "deposit". Plain word2vec survives as a fast baseline for retrieval and as a teaching example.
If you can articulate why contextual beats static embeddings in three sentences, you have a leg up on most candidates.
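The "great and excellent are orthogonal" limitation can be checked in a few lines. The sketch below compares cosine similarity under one-hot bag-of-words vectors and under dense vectors. The dense values are hand-picked for illustration, not real word2vec output.

```python
import math
from collections import Counter


def bow_vector(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vocab = ["great", "excellent", "terrible"]

# Bag-of-words: each word owns its own dimension, so synonyms share nothing.
bow_sim = cosine(bow_vector("great", vocab), bow_vector("excellent", vocab))

# Hypothetical dense vectors (made up for this example, not trained).
dense = {
    "great": [0.90, 0.80, 0.10],
    "excellent": [0.85, 0.90, 0.05],
    "terrible": [-0.80, -0.70, 0.20],
}
dense_sim = cosine(dense["great"], dense["excellent"])
opposite_sim = cosine(dense["great"], dense["terrible"])

print(bow_sim, round(dense_sim, 3), round(opposite_sim, 3))
```

Under bag-of-words the synonym pair scores exactly zero. A dense space can place synonyms close together and antonyms far apart, and that gap is what word2vec and GloVe closed.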
BERT and encoder models
BERT (Bidirectional Encoder Representations from Transformers) is encoder-only. Pretraining is Masked Language Modeling: ~15% of tokens are selected for prediction (most are replaced with [MASK]; a minority are swapped for a random token or left unchanged), and the model learns to predict the originals from both sides of context. The original paper also used Next Sentence Prediction, but RoBERTa showed that dropping NSP did not hurt downstream performance, and most modern variants omit it.
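As a rough sketch of the masking step: the BERT paper selects about 15% of positions for prediction and, of those, replaces 80% with [MASK], swaps 10% for a random token, and leaves 10% unchanged. The function name `mlm_mask` and the use of `None` for unscored positions are assumptions made for this example. Real data collators work on integer IDs and skip special tokens.

```python
import random

MASK = "[MASK]"


def mlm_mask(tokens, vocab, select_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, meaning the loss ignores that position.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


tokens = "the river bank was flooding after the storm".split()
vocab = sorted(set(tokens))
inputs, labels = mlm_mask(tokens * 50, vocab)
print(sum(label is not None for label in labels), "of", len(labels), "positions selected")
```

Keeping some selected tokens unmasked or randomized means the model cannot assume that every non-[MASK] input token is correct, which narrows the mismatch between pretraining and fine-tuning inputs.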
Encoders shine on tasks where you need a representation of an entire span and you can afford bidirectional attention:
- Text classification with a small head on the [CLS] token (sentiment, intent, spam).
- Named entity recognition with a per-token classification head.
- Extractive QA with two heads predicting answer-start and answer-end spans.
- Sentence-level embeddings via Sentence-BERT for retrieval and clustering.
The interview question is "why bidirectional?" The answer: understanding the word "bank" benefits from both the left context ("the river") and the right context ("was flooding"). A pure decoder like GPT only sees left context during training and ends up weaker on classification at a fixed parameter budget.
GPT and decoder models
GPT (Generative Pre-trained Transformer) is decoder-only with causal masking — each token attends only to previous tokens. Training objective: predict the next token. The scale arc is worth memorizing because interviewers love it:
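Causal masking and the next-token objective are easy to state in code. The sketch below builds the lower-triangular mask and the (prefix, target) training pairs for a short sequence. The helper names are illustrative; real implementations build the mask as a tensor and add it to the attention scores.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def next_token_pairs(token_ids):
    """(prefix, target) pairs: every position predicts the token that follows."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])

print(next_token_pairs([101, 7, 42, 9]))
```

Because the mask hides all future positions, one forward pass over a sequence yields a training signal at every position at once.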
| Model | Year | Parameters | Context window |
|---|---|---|---|
| GPT-1 | 2018 | 117M | 512 |
| GPT-2 | 2019 | 1.5B | 1,024 |
| GPT-3 | 2020 | 175B | 2,048 |
| GPT-4 | 2023 | undisclosed (reported as ~1T+ mixture-of-experts, unconfirmed) | 8k-128k |
| GPT-4o / Claude 3.5 / Gemini 1.5 | 2024 | undisclosed | 128k-2M |
Decoders fit naturally for generation, few-shot in-context learning, agentic tool use, and conversational interfaces. The trap candidates fall into: assuming "bigger model = always better". For a single-domain classification task with 100k labeled examples, a fine-tuned DistilBERT will usually match or beat GPT-4 on accuracy while winning clearly on latency and cost.
Gotcha: "We need to classify 50M support emails by topic" is not a job for GPT-4. The right answer is a fine-tuned encoder with ~$0.0001 per inference, not a frontier model at ~$0.01 per inference. Pick the model that fits the loop's economics, not the one that sounds impressive.
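The arithmetic behind that gotcha is worth doing out loud in an interview. The sketch below uses the rough per-inference figures quoted above. They are order-of-magnitude illustrations, not vendor prices.

```python
def batch_cost(num_items, cost_per_inference):
    """Total cost in dollars of running one inference per item."""
    return num_items * cost_per_inference


emails = 50_000_000
encoder_cost = batch_cost(emails, 0.0001)  # fine-tuned encoder, rough figure
frontier_cost = batch_cost(emails, 0.01)   # frontier LLM API, rough figure
ratio = frontier_cost / encoder_cost

print(f"encoder:  ${encoder_cost:,.0f}")
print(f"frontier: ${frontier_cost:,.0f}")
print(f"ratio:    {ratio:.0f}x")
```

At these assumed unit costs the gap is two orders of magnitude: about $5,000 against about $500,000 for the same 50M emails.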
Fine-tuning vs prompt engineering
Once you have a pretrained model, you have two roads to a task-specific system.
Fine-tuning updates model weights on your labeled data. The three variants:
- Full fine-tuning updates all parameters. Best quality, highest GPU cost. Practical up to ~10B parameters on a single A100/H100 node.
- LoRA / adapters insert small trainable matrices and freeze the base weights. ~0.1-1% of parameters trained, near-full-FT quality on most tasks. The default for any model over 7B.
- Prompt tuning / prefix tuning trains a small soft-prompt embedding and freezes everything else. Cheapest, weakest, useful for very narrow tasks.
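To see where LoRA's small trainable fraction comes from, consider one weight matrix of shape d_out x d_in. LoRA freezes it and trains two low-rank factors, B (d_out x r) and A (r x d_in), so the trainable count is r * (d_in + d_out) instead of d_out * d_in. The sketch below counts parameters and runs a toy forward pass. The dimensions and helper names are assumptions for illustration, and the overall fraction for a full model depends on which matrices are adapted.

```python
def full_params(d_out, d_in):
    """Parameters in a dense weight matrix."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters in the LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


def matvec(matrix, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


d, rank = 4096, 8  # hypothetical hidden size and LoRA rank
fraction = lora_params(d, d, rank) / full_params(d, d)
print(f"trainable fraction for one {d}x{d} matrix at rank {rank}: {fraction:.4%}")

# With B initialized to zeros, the adapted layer starts out identical to the base layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]
print(lora_forward(W, A, B_zero, [2.0, 3.0], alpha=2, rank=1))
```

Starting B at zero is the usual convention: training begins from the pretrained model's exact behavior and only gradually moves away from it.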
Prompt engineering leaves the model frozen and changes the input:
- Zero-shot: describe the task in the prompt, no examples.
- Few-shot: include 2-8 worked examples (in-context learning).
- Chain-of-thought: ask the model to "think step by step", which improves reasoning on math, multi-hop QA, and code.
- RAG (Retrieval-Augmented Generation): retrieve relevant passages from a vector DB and stuff them in the prompt. The default architecture for any question-answering product touching domain documents.
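The retrieve-then-prompt shape of RAG fits in a short sketch. To stay dependency-free, the example below substitutes bag-of-words cosine similarity for a real sentence encoder and an in-memory list for a vector DB. The documents, the prompt wording, and the function names are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts. A real system would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages, k=2):
    """Stuff the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical mini knowledge base.
docs = [
    "refunds are processed within five business days",
    "shipping takes two weeks for international orders",
    "passwords can be reset from the account settings page",
]
prompt = build_prompt("how long do refunds take", docs, k=1)
print(prompt)
```

The resulting string is what gets sent to the frozen LLM. Swapping `embed` for a dense encoder and the sorted list for an approximate nearest-neighbor index changes the quality and the scale, not the structure.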
The interview question is "when do you fine-tune vs prompt?" and the answer has three rows:
| Choose | When |
|---|---|
| Fine-tuning | Stable schema, ≥1k labeled examples, latency or unit-cost matters, narrow domain |
| Prompting + RAG | Frequently changing facts, no labels yet, latency budget allows 1-3s, broad domain |
| Hybrid (small FT + RAG) | Production NLP at scale, ~2024-2026 industry default |
Task-to-method matrix
This is the cheat sheet interviewers expect you to draw from memory. The matrix matches the canonical NLP tasks to the model family that fits, plus the evaluation metric you would actually report.
| Task | Best fit | Why | Standard metric |
|---|---|---|---|
| Text classification | Fine-tuned encoder (BERT, RoBERTa, DistilBERT) | Bidirectional context, cheap inference, fixed-shape output | F1 (macro for imbalance), PR-AUC |
| NER | Encoder + token classification head | Per-token labels, span boundaries matter | F1 over spans (exact-match) |
| Extractive QA | Encoder with start/end span heads | Answer lives inside the passage, no generation needed | Exact Match, F1 over tokens |
| Abstractive summarization | Encoder-decoder (T5, BART) or LLM | Output is new text, length-controlled | ROUGE-1/2/L, plus human eval |
| Open-domain QA | RAG: retriever + decoder LLM | External knowledge required, freshness matters | Retrieval recall@k, answer F1, faithfulness |
| Translation | Encoder-decoder (NLLB, mBART) or LLM | Source-to-target sequence mapping | BLEU, COMET, chrF |
| Chat / instruction following | Decoder LLM with RLHF/DPO | Open-ended generation, multi-turn | Human preference, MT-Bench, harm rate |
A senior candidate at Anthropic or Snowflake will be pushed further: what if classification labels are added monthly? (Hybrid: encoder for the frozen 90%, few-shot LLM gate for new labels.) What if QA documents update hourly? (RAG with a freshness budget on the retriever index.) The matrix is the starting point, not the answer.
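Two of the metrics in the matrix, Exact Match and token-level F1 for extractive QA, are simple enough to write from memory. The sketch below is a simplified version, assuming lowercase and whitespace tokenization only. The official SQuAD evaluation script also strips punctuation and articles before comparing.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the river bank", "river bank"))
print(round(token_f1("the river bank", "river bank"), 3))
```

Exact Match punishes the extra "the" with a zero, while token F1 gives partial credit. Reporting both shows how often the model is nearly right rather than exactly right.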
Common pitfalls
The mistake junior candidates make most often is ignoring language and domain mismatch. An English-only BERT can lose on the order of 30 percentage points of accuracy on French or Japanese tickets. The fix is a multilingual checkpoint (XLM-R, mBERT) or a language-specific variant. Domain mismatch is the same problem: a Wikipedia-pretrained BERT typically underperforms a domain-tuned variant on medical or legal corpora by 5-15 F1 points.
A second pitfall is using accuracy on imbalanced classification. If 98% of support tickets are "general inquiry", a model that always predicts that class scores 98% accuracy and is useless. Macro-F1 or per-class PR-AUC is the right reporting target. Weight the loss (class_weight='balanced') or oversample the minority, then report per-class precision and recall.
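The 98% example can be reproduced directly. The sketch below scores an always-majority classifier with plain accuracy and with macro-F1, written out by hand so the arithmetic is visible. The class names are made up, and in practice `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` does the same job.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating it as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# 98 "general" tickets, 2 "billing" tickets, and a model that always says "general".
y_true = ["general"] * 98 + ["billing"] * 2
y_pred = ["general"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.2f}, macro-F1 = {mf1:.3f}")
```

Accuracy comes out at 0.98, while macro-F1 lands just under 0.5 because the minority class contributes an F1 of zero.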
A third trap is ignoring context length limits. Vanilla BERT caps at 512 tokens — roughly 350-400 English words. Longer documents get truncated, and most candidates do not realize their model is silently losing the back half of every legal contract. Fixes: sliding windows, hierarchical encoders, or long-context architectures like Longformer (4k), BigBird (4k), or flash-attention encoders pushing 16k-32k.
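Of the fixes listed, the sliding window is the simplest to sketch. The function below splits a long token sequence into overlapping windows so nothing is silently dropped. The name `sliding_windows` and the default stride are assumptions for illustration. With a real BERT tokenizer you would also reserve room for [CLS] and [SEP], which lowers the usable window to about 510 tokens.

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Split token_ids into overlapping windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so neighbors overlap by
    max_len - stride tokens and every token appears in at least one window.
    """
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("need 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return windows


doc = list(range(1200))  # stand-in for a 1,200-token contract
windows = sliding_windows(doc)
print([(w[0], w[-1], len(w)) for w in windows])
```

Each window is then classified or encoded separately, and the per-window outputs are pooled, for example by taking the max or mean of the logits.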
A fourth one is reaching for an LLM when an encoder fits better. Asking GPT-4 to label 10M emails by sentiment is a way to spend $100k on a job a $200 fine-tuned RoBERTa would do better, faster, and more reproducibly. Use frontier LLMs where their strengths matter — open-ended generation, few-shot adaptation, complex reasoning — not where you have plenty of labels and a fixed schema.
A fifth, increasingly common pitfall is shipping a RAG system without measuring retrieval recall separately from generation. If the retriever misses the relevant passage, no LLM will recover. Measure recall@5 on a held-out QA set first; only then evaluate end-to-end faithfulness.
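Measuring the retriever on its own takes only a few lines. The sketch below assumes one common definition of recall@k: the fraction of a query's relevant passages that appear in the top k, averaged over queries. Some teams report a hit rate instead (whether any relevant passage appears), so state which one you mean. The passage IDs are made up.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant passage IDs found in the top-k retrieved IDs."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("each query needs at least one relevant passage")
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def mean_recall_at_k(runs, k=5):
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical held-out set: (ranked retriever output, gold passage IDs).
runs = [
    (["d3", "d7", "d1", "d9", "d2"], ["d1"]),        # relevant passage found
    (["d4", "d5", "d6", "d8", "d0"], ["d2"]),        # retriever missed it
    (["d2", "d1", "d5", "d3", "d4"], ["d1", "d9"]),  # one of two found
]
score = mean_recall_at_k(runs, k=5)
print(f"recall@5 = {score:.2f}")
```

On the second query the generator never sees the relevant passage, so any end-to-end failure there is a retrieval failure and not a generation failure.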
FAQ
What is attention in one paragraph?
Attention is a learned weighted average over a sequence. When the model processes a target token, it computes a similarity score (query against keys), normalizes those scores into weights with a softmax, and returns the weighted sum of the value vectors. Self-attention is the same operation with queries, keys, and values all derived from one sequence; stacking many such layers is what makes a transformer. It displaced RNNs largely because all positions in a sequence can be processed in parallel during training, so a GPU handles the whole sequence at once instead of stepping through time.
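The same description can be written as a minimal pure-Python sketch of scaled dot-product attention. A real transformer first applies learned projection matrices to produce queries, keys, and values. This example skips them and uses the raw inputs for all three, which is an assumption made for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q . k / sqrt(d_k)) applied to values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # one weight per key position, summing to 1
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy sequence of three 2-dimensional token vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)  # self-attention: q, k, v from one sequence
for row in weights:
    print([round(w, 3) for w in row])
```

The loop over queries has no dependency between iterations, which is why the whole computation collapses into a few matrix multiplications on a GPU.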
Word embeddings vs sentence embeddings — which do I want?
Word embeddings (word2vec, fastText) give one vector per word and are the right primitive for token-level tasks or features into a classical model. Sentence embeddings (Sentence-BERT, OpenAI text-embedding-3, Cohere embed-v3) give one vector per phrase or document — what you want for semantic search, deduplication, clustering, or any "is A similar to B" question. Production systems now use sentence-level contextual embeddings; raw word2vec is mostly a teaching artifact.
How much NLP project work do I need for a junior DS role?
Two end-to-end projects is the realistic bar at companies like Linear, Airbnb, or DoorDash for entry-level applied DS. One should be a fine-tuning project: pick a public dataset (AG News, IMDB, CoNLL-2003), fine-tune a DistilBERT, and report F1 with a confusion matrix. The second should be a RAG or embedding-search project: load a small corpus, build a FAISS index, and answer questions through retrieval + LLM. Having both on GitHub with a clear README credibly demonstrates that you can ship.
Is fine-tuning still relevant when LLMs are this good?
Yes, and it is becoming more important, not less. A fine-tuned 1B-parameter open-weight model on your own infrastructure costs roughly 50-200x less per inference than a frontier API call, and for narrow tasks it matches or beats the API on quality. The 2026 production pattern at DS-mature companies is hybrid: route easy / high-volume / well-labeled traffic to a fine-tuned small model, and route long-tail / novel / open-ended traffic to a frontier LLM through prompting and RAG.
Are these answers official?
No. This article is built from the canonical papers (Vaswani 2017 on attention, Devlin 2018 on BERT, Brown 2020 on GPT-3, plus the LoRA and RAG papers) and from candidate debriefs across applied DS and ML loops at large tech companies. Treat it as a study guide, not a substitute for the originals.