BERT vs GPT for the Data Science interview
Why this comes up in the loop
The canonical NLP opener at OpenAI, Anthropic, or Meta is "What is the difference between BERT and GPT, and when would you use each?" It separates candidates who memorized "attention is all you need" from candidates who can ship a model.
Both architectures are built from the same transformer block, which makes the lazy answer "they're basically the same, trained differently" tempting, but it fails. The encoder vs decoder split, the bidirectional vs causal attention mask, and the MLM vs next-token objective are three separate axes that senior interviewers probe one by one. By the end you should be able to justify a pick for a production NER service and explain why DistilBERT at ~67M params can beat GPT-4 via API on a latency-bound classification job.
Architecture: encoder vs decoder
BERT (Bidirectional Encoder Representations from Transformers) is encoder-only. Every token attends to every other token, in both directions. The model produces one hidden state per input token but has no autoregressive head — no way to emit a new token one at a time. Hand BERT a prompt and ask it to continue, and nothing useful happens.
GPT (Generative Pre-trained Transformer) is decoder-only. The self-attention mask is causal: token i attends to tokens 0..i, never to i+1. That mask is what makes the model generative — at inference you sample the next token, append it, and repeat. Classification, summarization, code, and chat are all expressed as next-token prediction.
```
BERT (encoder-only, bidirectional):
  [CLS] the quick [MASK] fox jumps [SEP]
  every token attends to every other token
  MLM head predicts [MASK] -> "brown"

GPT (decoder-only, causal):
  the quick brown fox
  causal mask: token i sees tokens 0..i only
  LM head predicts "jumps" from the prefix
```
T5 and BART keep both halves: a bidirectional encoder plus an autoregressive decoder with cross-attention between them. That fits seq2seq problems where input and output are different strings: translation, summarization, grammatical error correction.
| Property | BERT (encoder) | GPT (decoder) | T5 / BART (enc-dec) |
|---|---|---|---|
| Attention mask | Bidirectional | Causal (left-to-right) | Bidirectional + causal + cross-attention |
| Training objective | Masked LM (15% mask) | Next-token prediction | Span corruption (T5) / denoising (BART) |
| Generates text? | No native head | Yes, autoregressive | Yes, on decoder side |
| Strongest at | Classification, NER, extractive QA, embeddings | Chat, code, few-shot prompting | Translation, summarization |
| Typical sizes | 67M (DistilBERT) - 340M (BERT-large) | 7B - 1T+ | 220M - 11B |
| Inference cost | Single forward pass | One pass per generated token | Encoder once + one decoder pass per token |
Load-bearing trick: the architecture choice is mostly a question about the attention mask, not the transformer block itself. Bidirectional mask + MLM head = BERT family. Causal mask + LM head = GPT family. Everything else is a parameter count or a tokenizer swap.
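To make the mask distinction concrete, here is a minimal sketch in plain Python. The function name `attention_mask` is made up for illustration; real implementations build the same pattern as a tensor and apply it to the attention scores before the softmax.

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """Return a seq_len x seq_len grid; grid[i][j] == 1 means token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# BERT-style: every position sees every position.
bidirectional = attention_mask(4, causal=False)
# GPT-style: position i sees positions 0..i only (lower triangle).
causal = attention_mask(4, causal=True)

for name, mask in (("bidirectional", bidirectional), ("causal", causal)):
    print(name)
    for row in mask:
        print(" ", row)
```

The only difference between the two grids is the zeroed upper triangle, which is the point of the paragraph above: the block is shared, the mask is not.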
Pre-training objectives: MLM vs causal LM
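Masked Language Modeling (MLM) is BERT's objective. Pick 15% of tokens at random; of those, replace 80% with [MASK], replace 10% with a random vocab token, and leave 10% unchanged. The model predicts the original token at every selected position using both left and right context. The 80/10/10 split is deliberate: [MASK] never appears at fine-tuning or inference time, so if every selected position were replaced with [MASK] the model would only learn to build good representations around a token it never sees downstream. The random and unchanged cases force it to keep a useful representation of every real token.
Next Sentence Prediction (NSP) was BERT's second objective: a binary classifier on whether sentence B actually follows sentence A. RoBERTa (2019) found that dropping NSP matched or slightly improved downstream results and removed it. Most later BERT-style models follow RoBERTa; ALBERT is an exception that swaps NSP for sentence-order prediction.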
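The 80/10/10 rule can be sketched in a few lines of Python. This is a simplified illustration, not BERT's actual data pipeline: it selects each token independently with probability 0.15 rather than picking an exact 15% of positions, and the names `mlm_corrupt`, `MASK`, and `SPECIAL` are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # never selected for prediction


def mlm_corrupt(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels). labels[i] holds the original token at
    selected positions and None elsewhere (no loss at those positions)."""
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok  # the model must predict the original here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random vocab token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


tokens = ["[CLS]", "the", "quick", "brown", "fox", "jumps", "[SEP]"]
vocab = ["the", "quick", "brown", "fox", "jumps", "lazy", "dog"]
corrupted, labels = mlm_corrupt(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Note that the label is recorded for all three branches, so the loss also covers positions where the input token was left untouched or swapped for a random one.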
Causal LM is GPT's objective. Loss is -log P(token_i | token_<i) summed over positions. Span corruption (T5) replaces contiguous spans with sentinels. Permutation LM (XLNet) randomly permutes prediction order. Neither has displaced MLM or causal LM.
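As a worked example of the causal LM loss, the sketch below sums -log P(token_i | token_<i) over a toy sequence. The probability tables are invented for illustration; in a real model they would come from a softmax over the LM head's logits. The detail worth noticing is the shift by one: the distribution produced after reading position i is scored against the token at position i + 1.

```python
import math


def causal_lm_loss(token_ids, next_token_probs):
    """Sum of -log P(token_{i+1} | tokens_0..i).

    next_token_probs[i] is a dict {token_id: probability} giving the model's
    prediction after it has read token_ids[0..i].
    """
    loss = 0.0
    for i in range(len(token_ids) - 1):
        target = token_ids[i + 1]  # targets are the inputs shifted left by one
        loss += -math.log(next_token_probs[i][target])
    return loss


# Toy 3-token sequence with made-up predictions.
token_ids = [0, 1, 2]
probs = [
    {1: 0.5, 2: 0.5},    # after reading token 0, P(next = 1) = 0.5
    {1: 0.25, 2: 0.75},  # after reading tokens 0, 1, P(next = 2) = 0.75
]
loss = causal_lm_loss(token_ids, probs)
print(round(loss, 4))
```

Training frameworks usually report the mean over positions rather than the sum, which only changes the value by a constant factor for a fixed sequence length.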
On understanding benchmarks such as GLUE and SuperGLUE, MLM encoders win per parameter; causal-LM decoders win on generation. The gap narrows with scale: a 70B Llama can match a fine-tuned 110M BERT on classification, but with roughly 600x the parameters and a correspondingly higher inference cost.
Tokenization
Transformer models consume integer token IDs from a vocabulary built by a sub-word tokenizer that is trained on its own before pre-training and then frozen with the model. Get the tokenizer wrong and the weights are unusable.
Byte-Pair Encoding (BPE). Start from bytes; repeatedly merge the most frequent adjacent pair until the vocabulary reaches the target size. GPT-2 used BPE on UTF-8 bytes; GPT-4 and Llama still use BPE variants. The byte-level part matters — BPE can encode any Unicode string without an <unk> token.
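```python
# BPE merge trace over a tiny corpus (character-level for readability)
corpus = ["low", "lower", "newest", "widest"]
# several pairs tie at count 2; ties are broken in the order shown
# step 1: ("e", "s") -> "es"
# step 2: ("es", "t") -> "est"
# step 3: ("l", "o") -> "lo"
# "lower" now tokenizes as ["lo", "w", "e", "r"]
```
WordPiece (BERT). Same idea, but it picks merges by likelihood gain rather than raw pair frequency. SentencePiece (T5, XLNet, Llama 1 and 2) is a library that trains BPE or unigram models directly on raw text without pre-tokenization, which makes it a natural fit for languages without whitespace word boundaries. Tiktoken is OpenAI's fast Rust BPE implementation shipped for the GPT family; Llama 3 also moved to a tiktoken-style BPE vocabulary.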
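The merge procedure described above can be written out as a small runnable sketch. It is a simplification: it works on characters rather than bytes, ignores word frequencies, and breaks ties by picking the lexicographically smallest pair, which is an arbitrary choice made here so the result is deterministic. The function names are illustrative, not from any library.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across all words."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a list of words."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        words = [merge_pair(w, best) for w in words]
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


corpus = ["low", "lower", "newest", "widest"]
merges = train_bpe(corpus, num_merges=3)
print(merges)                   # [('e', 's'), ('es', 't'), ('l', 'o')]
print(encode("lower", merges))  # ['lo', 'w', 'e', 'r']
```

Encoding replays merges in the order they were learned, which is why the merge list has to ship with the model alongside the vocabulary.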
| Tokenizer | Used by | Vocab | Notes |
|---|---|---|---|
| WordPiece | BERT, DistilBERT, ELECTRA | ~30k | Whitespace + punctuation pre-tok |
| Byte-BPE | GPT-2, GPT-3, RoBERTa | 50k | Byte-level, never emits <unk> |
| SentencePiece | T5, XLNet, ALBERT, Llama 1/2 | ~30k-32k | Raw text, no whitespace assumption |
| Tiktoken | GPT-3.5, GPT-4, GPT-5 | 100k-200k | Optimized for English + code |
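A practical gotcha: the same text yields different token counts under different tokenizers. As a rough guide, a 1,000-character English paragraph is about 250 tokens in cl100k, 300 in Llama's SentencePiece, and 350 in BERT's WordPiece, with the exact figures depending on the text. Code diverges further: BERT's pre-tokenizer splits every punctuation character into its own token and discards whitespace, while Tiktoken's vocabularies include tokens for runs of spaces and common multi-character operators. Quoting "characters" instead of "tokens" for context window math is a red flag.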
Fine-tuning patterns
Classic BERT recipes are short. Classification: feed [CLS] sentence [SEP], take the position-0 hidden state, linear + softmax, cross-entropy loss. NER / token tagging: linear head over every position's hidden state. Extractive QA (SQuAD): feed [CLS] question [SEP] paragraph [SEP], two heads emit start_logit and end_logit per paragraph token.
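For the extractive QA recipe, the two heads only produce per-token logits; turning them into an answer requires a small search. A common approach, sketched here with a hypothetical `best_span` helper, is to pick the (start, end) pair with the highest summed logit subject to start <= end and a maximum answer length. The logits below are made up.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    over spans with s <= e and at most max_answer_len tokens."""
    best = None
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy logits over a 4-token paragraph.
start_logits = [0.1, 2.0, 0.3, 0.2]
end_logits = [0.0, 0.5, 3.0, 0.1]
print(best_span(start_logits, end_logits))  # (1, 2, 5.0)

# Taking the two argmaxes independently can give end < start;
# the joint search rules that out.
print(best_span([0.0, 5.0], [4.0, 1.0]))  # (1, 1, 6.0)
```

The second call shows why the constraint matters: the independent argmaxes would be start 1 and end 0, which is not a valid span.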
GPT-style models can be fine-tuned the same way, but in 2026 the dominant pattern is low-rank adapters instead of full fine-tuning.
```python
# LoRA: freeze base weights, learn low-rank A, B
# effective weight is W + scale * (B @ A) with rank r << d
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only A and B receive gradients
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```
With rank r = 8 and alpha = 16, LoRA on a 7B Llama trains roughly 0.1% of parameters, fits on a 24 GB GPU with QLoRA's 4-bit quantization, and on most tasks recovers roughly 95-99% of the quality of a full fine-tune. Most model-team interviewers in 2026 expect you to mention LoRA, QLoRA, or prefix tuning when the question is "how would you adapt this to our domain."
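The "~0.1% of parameters" figure is easy to sanity-check with arithmetic. The sketch below assumes a Llama-7B-like shape (hidden size 4096, 32 layers, about 6.74B parameters) and adapters on the square attention projections only; these values are assumptions for illustration, and the exact share depends on which matrices are adapted.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


# Assumed Llama-7B-like shape.
HIDDEN = 4096
LAYERS = 32
BASE_PARAMS = 6.74e9
R = 8


def trainable_fraction(adapted_per_layer: int):
    """Return (trainable parameter count, share of the base model)."""
    trainable = LAYERS * adapted_per_layer * lora_params(HIDDEN, HIDDEN, R)
    return trainable, trainable / BASE_PARAMS


qv_params, qv_frac = trainable_fraction(2)      # query and value projections
qkvo_params, qkvo_frac = trainable_fraction(4)  # all four attention projections

print(f"q,v only: {qv_params:,} params ({qv_frac:.3%})")
print(f"q,k,v,o:  {qkvo_params:,} params ({qkvo_frac:.3%})")
```

Under these assumptions the share lands between roughly 0.06% and 0.12%, which is consistent with the rounded figure quoted above.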
When to pick which
Reach for BERT or a descendant (RoBERTa, DistilBERT, DeBERTa-v3, ModernBERT) for closed-set classification, sequence labeling, extractive QA, or retrieval embeddings, especially when latency matters. DeBERTa-v3-base can serve at single-digit ms p50 on an A10 and costs effectively nothing per inference. The same task through GPT-4 costs on the order of $0.005 to $0.030 per request and is roughly 100x slower.
Reach for a decoder-only LLM — GPT-4, GPT-5, Claude, Gemini, Llama 3, Mistral Large, Qwen 2 — for generative tasks, open output schemas, few-shot adaptability, or when you'd otherwise maintain dozens of task-specific models. Chat, code, agents, and structured extraction with tool use fit this profile.
Reach for an encoder-decoder like T5, FLAN-T5, or BART for clean seq2seq with paired data: domain translation, fixed-format summarization, schema-to-schema rewrites.
Sanity check: if the spec is "1,000 QPS at 50ms p99 for sentiment classification," the answer is a fine-tuned encoder. If the spec is "user asks anything and we extract a JSON payload," the answer is a prompted decoder LLM with a schema validator. Most production NLP in 2026 uses both on different paths.
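The cost side of that sanity check is simple arithmetic. The sketch below takes the per-request API price range and the 1,000 QPS spec quoted in this section and assumes, for simplicity, that the load is sustained around the clock; real traffic is burstier, so treat the output as an upper-bound illustration.

```python
SECONDS_PER_DAY = 86_400


def daily_cost(qps: float, cost_per_request: float) -> float:
    """Daily spend in dollars, assuming the given QPS is sustained for 24 hours."""
    return qps * SECONDS_PER_DAY * cost_per_request


qps = 1_000
low = daily_cost(qps, 0.005)   # low end of the quoted API price range
high = daily_cost(qps, 0.030)  # high end of the quoted API price range

print(f"requests/day: {qps * SECONDS_PER_DAY:,}")
print(f"API cost/day: ${low:,.0f} to ${high:,.0f}")
```

At that volume the per-request price, not the model quality, decides the architecture.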
Modern model landscape
The 2018 originals are worth knowing by name, but the 2026 production menu is a smaller set of well-supported families.
| Family | Examples | Type | Why it shows up |
|---|---|---|---|
| BERT lineage | RoBERTa, DistilBERT, DeBERTa-v3, ModernBERT | Encoder | Classifiers, NER, re-rankers |
| Sentence encoders | sentence-transformers, BGE, E5, Cohere embed-v3 | Encoder | Vector search, RAG, semantic dedup |
| Open-weight decoders | Llama 3, Mistral, Mixtral, Qwen 2, Phi-4, DeepSeek-V3 | Decoder | Self-hosted chat, domain LLMs |
| Closed-weight decoders | GPT-4 / GPT-5, Claude, Gemini | Decoder | Frontier reasoning, agents |
| Encoder-decoder | T5, FLAN-T5, BART, mBART | Enc-Dec | Translation, summarization |
Seniority test: which would you pick for a resume-to-job-description matcher at 100k requests/day? Not GPT-5. Use BGE-large-en-v1.5 for embeddings plus a cross-encoder reranker such as bge-reranker-v2-m3. Self-hosted, that stack runs at roughly $0.0001 per request and will typically beat a prompted LLM on nDCG@10.
Common pitfalls
The most expensive mistake is using BERT for generation. BERT is encoder-only — no autoregressive head, no causal mask, no way to emit tokens one at a time. People reach for it because it is small and familiar, then spend a week on a hack that masks one token at a time and produces incoherent output. The fix is to pick the right family up front: anything ending in "and then we produce text" belongs on a decoder.
A subtle pitfall is comparing perplexity across architectures. PPL is well-defined for causal LMs. For MLM you can compute a pseudo-PPL by masking each token in turn, but it is not on the same scale and ranking BERT against GPT by PPL is meaningless. Compare on downstream task accuracy under a fixed protocol.
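A small sketch shows why the two numbers are not comparable. Perplexity itself is just the exponential of the mean negative log-likelihood, but the conditioning context differs: a causal LM scores each token given only its prefix, while pseudo-perplexity scores each token given everything except itself. The helper names are illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def causal_contexts(tokens):
    """What a causal LM conditions on at each position: the prefix only."""
    return [tokens[:i] for i in range(len(tokens))]


def pseudo_contexts(tokens, mask="[MASK]"):
    """What an MLM conditions on for pseudo-perplexity: all tokens except position i."""
    return [tokens[:i] + [mask] + tokens[i + 1:] for i in range(len(tokens))]


tokens = ["the", "quick", "brown", "fox"]
print(causal_contexts(tokens)[2])  # ['the', 'quick']
print(pseudo_contexts(tokens)[2])  # ['the', 'quick', '[MASK]', 'fox']

# A model assigning probability 0.25 to every token has perplexity ~4.
print(perplexity([math.log(0.25)] * 4))
```

Because the MLM sees right-hand context the causal model never gets, its per-token probabilities answer an easier question, and the resulting number sits on a different scale.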
Teams also fine-tune the full model when LoRA would do. Full fine-tuning a 7B model with mixed-precision Adam needs roughly 16 bytes per parameter for weights, gradients, and optimizer state, which is over 100 GB before activations and usually means multiple GPUs; QLoRA fits the same model in 24 GB. Full fine-tunes rarely beat LoRA by more than a point or two on standard benchmarks, and a LoRA adapter is megabytes instead of gigabytes, which makes it trivial to ship per-tenant. Default to LoRA and only escalate when it hits a measurable ceiling.
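The memory gap can be estimated with back-of-envelope arithmetic. The sketch below uses a common accounting for mixed-precision Adam (fp16 weights and gradients, plus fp32 master weights and two fp32 optimizer moments, 16 bytes per parameter in total) and ignores activations, so it gives a floor rather than a measurement. The function names are hypothetical.

```python
def full_finetune_bytes(n_params: float) -> float:
    """Model-state memory for mixed-precision Adam, excluding activations."""
    fp16_weights = 2
    fp16_grads = 2
    fp32_master_weights = 4
    adam_first_moment = 4
    adam_second_moment = 4
    per_param = (fp16_weights + fp16_grads + fp32_master_weights
                 + adam_first_moment + adam_second_moment)  # 16 bytes
    return n_params * per_param


def quantized_base_bytes(n_params: float, bits: int = 4) -> float:
    """Memory for frozen base weights stored at the given bit width."""
    return n_params * bits / 8


n_params = 7e9
full_gb = full_finetune_bytes(n_params) / 1e9
base_gb = quantized_base_bytes(n_params) / 1e9

print(f"full fine-tune model state: {full_gb:.0f} GB")  # 112 GB
print(f"4-bit frozen base weights:  {base_gb:.1f} GB")  # 3.5 GB
```

With the base weights at a few gigabytes, the rest of a 24 GB card is left for adapter weights, their optimizer state, and activations, which is why QLoRA fits where full fine-tuning does not.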
Another mistake is using raw [CLS] embeddings for similarity. Stock BERT's [CLS] was trained for NSP, not cosine similarity, and vectors are anisotropic. Use sentence-transformers or a BGE / E5 variant instead.
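For reference, many sentence-encoder checkpoints produce a sentence vector by averaging token states over non-padding positions and then compare vectors with cosine similarity; others use a contrastively trained [CLS] vector. The sketch below shows only the pooling and similarity mechanics on made-up 2-dimensional vectors. Pooling alone does not fix stock BERT; the gain comes from the similarity-oriented training those models add.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors where attention_mask is 1 (padding is skipped)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy hidden states: two real tokens and one padding position.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vec = mean_pool(token_vectors, attention_mask)
print(sentence_vec)                      # [0.5, 0.5]
print(cosine(sentence_vec, [1.0, 0.0]))  # about 0.707
```

Skipping padded positions matters in batched inference, where short inputs would otherwise be dragged toward whatever the padding states happen to be.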
Finally, candidates mix tokenizers across models. The tokenizer is part of the model, and vocab IDs are not portable. Feeding GPT-2 token IDs into BERT looks up unrelated embedding rows: the forward pass usually still runs and returns garbage, and it only crashes if an ID falls outside BERT's smaller vocabulary. Always load the tokenizer from the same checkpoint as the model.
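A toy example makes the failure mode visible. The two vocabularies below are invented; the point is only that the same integer ID means different things under different vocabularies, and that nothing fails unless an ID is out of range.

```python
# Two made-up vocabularies that share words but assign different IDs.
vocab_a = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}
vocab_b = {"sat": 0, "cat": 1, "the": 2, "[UNK]": 3, "mat": 4}


def encode(words, vocab):
    """Map words to IDs, falling back to the vocabulary's [UNK] ID."""
    return [vocab.get(w, vocab["[UNK]"]) for w in words]


def decode(ids, vocab):
    """Map IDs back to words using the given vocabulary."""
    inverse = {i: w for w, i in vocab.items()}
    return [inverse[i] for i in ids]


def ids_fit(ids, num_embeddings):
    """True if every ID indexes a valid embedding row."""
    return all(0 <= i < num_embeddings for i in ids)


words = ["the", "cat", "sat"]
ids_b = encode(words, vocab_b)

# Reading tokenizer B's IDs with model A's vocabulary: no error, wrong words.
print(ids_b)                   # [2, 1, 0]
print(decode(ids_b, vocab_a))  # ['cat', 'the', '[UNK]']

# The mismatch only becomes a hard failure when an ID is out of range.
print(ids_fit(encode(["mat"], vocab_b), len(vocab_a)))  # False
```

A range check like `ids_fit` catches only the loud half of the problem, which is why loading tokenizer and weights from the same checkpoint is the real safeguard.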
FAQ
What should I use for production text classification?
A fine-tuned DeBERTa-v3-base or DistilBERT served on a single GPU with ONNX Runtime or TorchServe. You get single-digit ms latency, deterministic outputs, and per-tenant fine-tunes that cost nothing to host. Prompted GPT-4 only wins when you have fewer than a few hundred labeled examples and need shipping by Friday — prompt now, label as you go, migrate to a fine-tuned encoder past a few thousand examples.
Does BERT handle languages other than English?
Vanilla BERT was trained on English Wikipedia and BookCorpus. mBERT covers ~100 languages but is mediocre on most. The strong modern multilingual encoders are XLM-RoBERTa, multilingual E5, and BGE-M3; all cover 90+ languages, often close to language-specific quality on the better-resourced ones. For high-stakes non-English apps, fine-tune a multilingual checkpoint on in-language examples and benchmark before committing.
Can I use GPT for named entity recognition?
Yes via prompting with an extraction schema and few-shot examples. For production NER on known entity types, a fine-tuned DeBERTa-v3 token classifier is typically 5-15 F1 points higher at a fraction of the cost. Prompted LLMs win when entity types are open-ended or change weekly.
What is the practical context window?
Stock BERT is 512 tokens, a limit baked into its learned position embeddings. For longer inputs, chunk and aggregate, or use Longformer (4,096 tokens with sliding-window attention) or BigBird (sparse attention mixing local, random, and global tokens). Modern decoders span 8k to 1M+ tokens depending on the model and its RoPE scaling. Plan input strategy before picking the model.
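"Chunk and aggregate" usually means overlapping windows over the token IDs, sized so that each window still fits once [CLS] and [SEP] are added. The sketch below is one way to do the chunking; the function name and the default overlap of 128 tokens are assumptions, not a standard.

```python
def chunk_ids(token_ids, max_len=512, overlap=128, num_special=2):
    """Split token_ids into overlapping windows that fit in max_len
    after num_special tokens ([CLS], [SEP]) are added to each window."""
    window = max_len - num_special
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    if not token_ids:
        return []
    step = window - overlap
    chunks, start = [], 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window reached the end of the input
        start += step
    return chunks


ids = list(range(1200))  # stand-in for a 1,200-token document
chunks = chunk_ids(ids)
print([(c[0], c[-1], len(c)) for c in chunks])
# [(0, 509, 510), (382, 891, 510), (764, 1199, 436)]
```

Each chunk is then scored separately and the per-chunk outputs are combined, for example by taking the maximum or the mean of the class logits. The overlap keeps an entity or answer that straddles a boundary intact in at least one window.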
LoRA, QLoRA, or full fine-tuning?
Start with QLoRA on a 4-bit base — fits the largest model on whatever GPU you have. Move to LoRA on a non-quantized base if you have the VRAM and need slightly better quality. Escalate to full fine-tuning only with 100k+ labeled examples, a clear ceiling LoRA cannot reach, and budget for full per-task checkpoint storage.
Is this content official?
No. Study guide based on the original BERT (Devlin 2018), GPT (Radford 2018), and Transformer (Vaswani 2017) papers plus Hugging Face docs and public model cards. For benchmark numbers and licensing, check upstream.