Transformer architecture in the DS interview
Why interviewers love Transformer questions
Since 2018, the Transformer has been the substrate of modern NLP, and since 2022 the substrate of nearly every large generative language model: GPT-4, Claude, Gemini, Llama. Walk into a DS loop at OpenAI, Anthropic, Meta, or a frontier startup in 2026 and you should expect to explain attention on a whiteboard. Junior gets the intuition question, mid-level gets math and shapes, senior gets optimization: KV-cache, FlashAttention, MoE routing.
The failure mode is predictable: candidate writes "fine-tuned BERT" on their resume, the interviewer asks "what is inside the encoder block?", and the answer collapses into "self-attention and some normalization". To survive an onsite, internalize three things: the shapes of Q, K, V, the 1/sqrt(d_k) scale, and the O(n^2) sequence cost that explains every long-context trick.
High-level architecture
The original Transformer (Vaswani et al., 2017) was an encoder-decoder for machine translation — a stack of identical blocks on each side:
[input tokens]  -> [Encoder stack] -> [memory]
                                         |
                                         v
[output tokens] -> [Decoder stack] -> [output logits]
Each block is the same four-step sandwich:
x -> Attention -> Add & LayerNorm -> FFN -> Add & LayerNorm -> x'
The attention sub-layer mixes information across positions. The FFN is a two-layer MLP applied independently per position, with hidden size 4 * d_model and a ReLU activation in the original paper (GELU or a gated variant in most later models). Residual plus LayerNorm is what makes 24-, 96-, and 120-layer stacks trainable.
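The four-step sandwich can be sketched in a few lines of NumPy. This is a minimal single-head, post-LayerNorm sketch under simplifying assumptions: no learned LayerNorm scale or bias, no biases, no dropout, and a ReLU FFN. The names `encoder_block`, `params`, and the weight keys are illustrative, not taken from any library.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (learned scale/bias omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, W_q, W_k, W_v, W_o):
    # Single-head self-attention: mixes information across positions.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ W_o

def ffn(x, W_up, W_down):
    # Position-wise two-layer MLP with ReLU, hidden size 4 * d_model.
    return np.maximum(x @ W_up, 0.0) @ W_down

def encoder_block(x, p):
    # x -> Attention -> Add & LayerNorm -> FFN -> Add & LayerNorm -> x'
    x = layer_norm(x + attention(x, p["W_q"], p["W_k"], p["W_v"], p["W_o"]))
    x = layer_norm(x + ffn(x, p["W_up"], p["W_down"]))
    return x

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
shapes = {
    "W_q": (d_model, d_model), "W_k": (d_model, d_model),
    "W_v": (d_model, d_model), "W_o": (d_model, d_model),
    "W_up": (d_model, 4 * d_model), "W_down": (4 * d_model, d_model),
}
params = {name: rng.normal(scale=d_model ** -0.5, size=shape) for name, shape in shapes.items()}
x = rng.normal(size=(seq_len, d_model))
y = encoder_block(x, params)
print(y.shape)
```

The output has the same shape as the input, which is what lets identical blocks be stacked.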
Load-bearing trick: residual connections, LayerNorm, and the 1/sqrt(d_k) scale together turn attention from a numerically fragile operation into something you can stack 100 layers deep. Remove any one and deep stacks tend to become unstable or diverge early in training.
In production today, three flavors matter:
- Encoder-only — BERT, RoBERTa, DeBERTa. Classification, NER, retrieval embeddings.
- Decoder-only — GPT, Llama, Claude, Mistral, Qwen. Generation, chat, code, agents.
- Encoder-decoder — T5, BART, mT5. Translation, summarization, structured rewriting.
Decoder-only models dominated 2022-2026 because instruction tuning made them general enough to absorb most encoder-only tasks, but encoder-decoder models can still win where input and output vocabularies diverge sharply.
Self-attention math
Self-attention lets every token look at every other token and pull in weighted information. Interviewers expect you to derive it from scratch in four steps.
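import math
import torch

batch, seq_len, d_model, d_k, d_v = 2, 5, 16, 8, 8  # toy sizes for a runnable example
X = torch.randn(batch, seq_len, d_model)  # (batch, seq_len, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d) for d in (d_k, d_k, d_v))  # projection weights
Q = X @ W_Q  # (batch, seq_len, d_k)
K = X @ W_K  # (batch, seq_len, d_k)
V = X @ W_V  # (batch, seq_len, d_v)
scores = Q @ K.transpose(-1, -2) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over key positions
output = weights @ V  # (batch, seq_len, d_v)
These four shapes are all you must memorize.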
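Why divide by sqrt(d_k)? If the components of q and k are independent with zero mean and unit variance, their dot product has variance d_k, so its typical magnitude grows as sqrt(d_k). At common head sizes such as d_k = 64 or 128, unscaled scores are large enough to saturate the softmax: the attention map collapses toward one-hot and gradients vanish. Scaling by 1/sqrt(d_k) keeps the pre-softmax scores at unit variance regardless of head size.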
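The variance argument is easy to check empirically. The sketch below assumes q and k have independent standard-normal components, which is an idealization of real activations. The helper name `score_variance` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20_000

def score_variance(d_k, scaled):
    # Draw many random (q, k) pairs and measure the variance of their dot product.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)
    if scaled:
        dots = dots / np.sqrt(d_k)
    return dots.var()

for d_k in (16, 64, 256):
    raw = score_variance(d_k, scaled=False)
    fixed = score_variance(d_k, scaled=True)
    print(f"d_k={d_k:4d}  unscaled var ~ {raw:7.1f}  scaled var ~ {fixed:.2f}")
```

The unscaled variance should track d_k, while the scaled variance stays near 1 for every head size.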
Self vs cross attention. Self-attention pulls Q, K, V from the same sequence. Cross-attention — in the decoder block of an encoder-decoder — pulls Q from the decoder and K, V from encoder memory.
Causal mask. In a decoder, position t must not see positions > t, otherwise the model cheats during next-token training. The mask is a -inf upper triangle added to scores before softmax. Forgetting it is the most common autoregressive training bug.
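A minimal NumPy sketch of the causal mask, assuming a single head and a single unbatched sequence. The function name `causal_attention` is illustrative.

```python
import numpy as np

def causal_attention(q, k, v):
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    # Strictly upper triangle marks future positions (key index > query index).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Stable softmax; exp(-inf) = 0, so masked positions get zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
out, weights = causal_attention(q, k, v)
print(np.round(weights, 2))  # lower-triangular: row t has zeros for columns > t
```

A useful sanity check in real training code is the same one used here: perturb a future token and confirm that earlier outputs do not change.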
Multi-head attention and parameter budget
A single head expresses one relation pattern. Multi-head attention runs 8 to 64 heads in parallel, each with its own Q, K, V projection of width d_model / h, then concatenates and projects through W_O:
head_i = Attention(Q W_Qi, K W_Ki, V W_Vi)
MultiHead = Concat(head_1, ..., head_h) @ W_O
Because each head is d_model / h wide, total FLOPs equal those of a single full-width head. You buy expressivity, not compute.
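In practice the per-head projections are stored as one d_model x d_model matrix per Q, K, V and split by reshaping. The sketch below assumes that layout, a single unbatched sequence, and no mask. The name `multi_head_attention` is illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    seq_len, d_model = x.shape
    d_head = d_model // h

    def split_heads(t):
        # (seq_len, d_model) -> (h, seq_len, d_head)
        return t.reshape(seq_len, h, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, seq_len, seq_len)
    heads = softmax(scores) @ v                           # (h, seq_len, d_head)
    # Concatenate heads back to (seq_len, d_model), then apply W_O.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 16, 4
x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out = multi_head_attention(x, W_q, W_k, W_v, W_o, h)
print(out.shape)
```

Each head sees only its own d_model / h columns of the projected Q, K, V, which is why the total matmul cost matches one full-width head.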
Here is the per-layer parameter breakdown interviewers ask about when sizing a model. Assume d_model = 1024, h = 16, FFN hidden 4096, no biases:
| Sub-layer | Matrices | Shape per matrix | Total params per layer |
|---|---|---|---|
| Q, K, V projections | 3 | d_model x d_model | 3 * 1024 * 1024 = 3.15M |
| Output projection W_O | 1 | d_model x d_model | 1.05M |
| FFN up-projection | 1 | d_model x 4*d_model | 4.19M |
| FFN down-projection | 1 | 4*d_model x d_model | 4.19M |
| LayerNorm (x2) | 2 | d_model scale + bias | ~4k |
| Per-layer total | | | ~12.6M |
A 24-layer model gives roughly 300M parameters in the blocks alone, plus embedding and unembedding matrices of vocab_size x d_model. BERT-base lands at 110M with d_model = 768 and 12 layers. GPT-3 lands at 175B with d_model = 12288 and 96 layers. The pattern scales quadratically in width and linearly in depth.
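The per-layer arithmetic can be reproduced with a few lines of Python. The sketch below follows the same assumptions as the table (no biases, FFN hidden size 4 * d_model, two LayerNorms with scale and bias) and ignores embeddings. The function name `block_params` is illustrative.

```python
def block_params(d_model, d_ff=None):
    d_ff = 4 * d_model if d_ff is None else d_ff
    attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff       # up-projection + down-projection
    norms = 2 * 2 * d_model        # two LayerNorms, scale + bias each
    return attn + ffn + norms

per_layer = block_params(1024)
print(per_layer)                   # 12587008, about 12.6M
print(24 * per_layer)              # 302088192, about 300M
print(96 * block_params(12288))    # about 174B, close to GPT-3's 175B
```

Doubling d_model roughly quadruples the per-layer count, while doubling depth only doubles the total.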
Positional encoding variants
Self-attention without positional information is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, so the set of outputs is unchanged. Since "the dog bit the man" is not the same sentence as "the man bit the dog", you have to inject position information somewhere.
Sinusoidal (2017 original) adds a fixed sin/cos pattern to the input embedding, with frequencies decaying geometrically as 1 / 10000^(2i / d_model) across dimension pairs i.
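A minimal NumPy sketch of the sinusoidal table, assuming an even d_model and the usual layout with sine in even columns and cosine in odd columns. The function name `sinusoidal_positions` is illustrative.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = positions / (10000 ** (2 * i / d_model))  # frequency falls as i grows
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(max_len=128, d_model=64)
print(pe.shape)
```

The table is added to the token embeddings once, before the first block, and has no trainable parameters.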
Learned absolute (BERT, GPT-2) trains a position vector for each index from 0 to max_len. Simple, but extrapolation past max_len is poor.
RoPE (rotary position embedding) rotates Q and K by an angle proportional to position before the dot product. Llama, Qwen, and GPT-NeoX use it because the resulting score depends only on the relative offset between query and key, and because context can be extended with YaRN or NTK-aware scaling tricks.
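The relative-position property of RoPE can be checked numerically. The sketch below rotates adjacent pairs of dimensions, which is one common convention; some implementations pair the first half of the vector with the second half instead. The helper names `rope` and `score` are illustrative.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (seq_len, d) with d even. Pair (x[2i], x[2i+1]) is rotated by
    # the angle position * base^(-2i / d).
    seq_len, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # (d / 2,)
    angles = positions[:, None] * inv_freq[None, :]   # (seq_len, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q_vec, k_vec = rng.standard_normal(8), rng.standard_normal(8)

def score(m, n):
    # Attention score between the same q placed at position m and k at position n.
    q = rope(q_vec[None, :], np.array([m]))
    k = rope(k_vec[None, :], np.array([n]))
    return float(q[0] @ k[0])

# Same offset (3), very different absolute positions: the scores agree.
print(score(5, 2), score(105, 102))
```

Because the score depends only on the offset, no position vector has to be added to the embeddings at all.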
ALiBi adds a per-head linear penalty to the scores, proportional to the distance between query and key. MosaicML's MPT, BLOOM, and a few research models use it to extrapolate beyond the training length without retraining.
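A minimal sketch of the ALiBi bias tensor, assuming the geometric slope schedule from the ALiBi paper for a power-of-two head count and a causal setting where future keys are masked separately. The function name `alibi_bias` is illustrative.

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    # Slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8): one fixed slope per head.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)  # (n_heads,)
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]      # query index minus key index
    distance = np.maximum(distance, 0)          # future keys are handled by the causal mask
    return -slopes[:, None, None] * distance[None]  # (n_heads, seq_len, seq_len)

bias = alibi_bias(seq_len=6, n_heads=8)
print(bias[0])  # head 0 has slope 0.5: penalty grows by 0.5 per step of distance
```

The bias is added to the scaled scores before the softmax, so distant keys are penalized more and nothing is learned or stored per position.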
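The classic question "you trained on 2k tokens and want to serve 32k — what do you do?" has a clean answer that depends on the encoding the model was trained with: for a RoPE model, apply YaRN or NTK-aware scaling, usually with a short long-context fine-tune; an ALiBi model extrapolates by design. Either way, validate on long-context evals.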
Encoder vs decoder vs encoder-decoder
The three families share the same building block but differ in masking and pretraining objective. Interviewers map them to use cases.
BERT family (encoder-only). Bidirectional self-attention: every token sees every other token. Pretrained with masked LM: select 15% of tokens, replace most of them with [MASK] (80% in the original recipe, with 10% swapped for a random token and 10% left unchanged), and predict the originals. Used for classification, NER, and retrieval via the [CLS] or pooled embedding.
GPT family (decoder-only). Causal self-attention — each token sees only the past. Pretrained with next-token prediction. Used for chat, code, agents. Instruction tuning plus RLHF turns the base model into a general assistant.
T5 / BART family (encoder-decoder). Encoder reads the input, decoder generates the output with cross-attention into encoder memory. Pretrained with span corruption. Used for translation, summarization, and structured rewriting.
For a deeper side-by-side, see BERT vs GPT in the Data Science interview.
Attention complexity and FlashAttention
The single most useful table in the entire post. Interviewers draw this when probing for senior-level depth — "how does FlashAttention change the asymptotics, and how does it change the constant?"
| Variant | Compute (FLOPs) | Memory (activations) | Quality vs vanilla | When to use |
|---|---|---|---|---|
| Vanilla attention | O(n^2 * d) | O(n^2) | baseline | Short context, n < 2k, reference implementation |
| FlashAttention 2 / 3 | O(n^2 * d) | O(n) | identical | Default for n in 2k-128k on A100, H100, B200 |
| Sparse / local attention | O(n * w * d) | O(n * w) | lossy, task-dependent | Longformer, BigBird, very long docs with local structure |
| Linear attention | O(n * d^2) | O(n * d) | lossy, often worse | Performer, Linformer, niche use cases |
The key insight is that FlashAttention does not change asymptotic FLOPs. It changes how attention uses the memory hierarchy: by tiling Q, K, V into blocks that fit in GPU SRAM and computing the softmax incrementally with a running max and running sum, it avoids materializing the O(n^2) attention matrix in HBM. Reported wall-clock speedups over a standard implementation are typically in the 2-4x range, and the memory saving lets you fit context lengths that would otherwise OOM. Sparse and linear approximations do change asymptotics but pay for it in quality, which is why most frontier models in 2026 still rely on exact attention with a FlashAttention-style kernel.
Gotcha: "FlashAttention is O(n) attention" is wrong and gets candidates dinged at senior level. It is still O(n^2) compute. It is O(n) activation memory.
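The streaming softmax at the heart of FlashAttention can be illustrated in NumPy. This is a sketch of the algorithmic idea only: it tiles over keys and values, keeps a running max and running sum per query, and never builds the full n x n matrix. The real kernel also tiles queries and runs fused in on-chip SRAM. The names `vanilla_attention` and `tiled_attention` are illustrative.

```python
import numpy as np

def vanilla_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])   # materializes the full (n, n) matrix
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=4):
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full(n, -np.inf)   # per-query max score seen so far
    running_sum = np.zeros(n)           # per-query softmax denominator so far
    acc = np.zeros((n, v.shape[-1]))    # unnormalized output so far
    for start in range(0, k.shape[0], block):
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        s = (q @ k_blk.T) * scale                    # only an (n, block) tile
        new_max = np.maximum(running_max, s.max(axis=-1))
        correction = np.exp(running_max - new_max)   # rescales earlier partial results
        p = np.exp(s - new_max[:, None])
        running_sum = running_sum * correction + p.sum(axis=-1)
        acc = acc * correction[:, None] + p @ v_blk
        running_max = new_max
    return acc / running_sum[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 8)) for _ in range(3))
print(np.allclose(tiled_attention(q, k, v), vanilla_attention(q, k, v)))
```

The loop still touches every query-key pair, so compute stays O(n^2); only the peak memory for scores drops from O(n^2) to O(n * block).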
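The related KV-cache question: during autoregressive generation the K and V of past tokens are fixed, so caching them turns per-token attention cost from O(n^2) to O(n). Cache size per sequence at fp16 is 2 * L * h_kv * d_k * seq_len * 2 bytes, where the leading 2 counts K and V, L is the number of layers, and h_kv is the number of key/value heads. For a Llama-70B-class model (80 layers, d_k = 128, 8 KV heads under grouped-query attention) at 32k context this is roughly 10 GiB per sequence; with 64 full KV heads it would be about 80 GiB. That is why long-context serving is memory-bound, not compute-bound. See GPT architecture in the Data Science interview for more.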
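The cache-size formula is worth being able to evaluate on the spot. The sketch below assumes a Llama-2-70B-style configuration (80 layers, head dimension 128, 8 key/value heads with grouped-query attention, 64 without) and fp16 values. The function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2, batch=1):
    # Leading 2 = one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value * batch

gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=32_768)
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_head=128, seq_len=32_768)
print(gqa / 2**30, mha / 2**30)  # 10.0 80.0 (GiB per sequence)
```

The result scales linearly with both context length and batch size, which is why grouped-query attention and cache quantization matter so much for serving.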
Common pitfalls
The first trap is conflating self-attention with the full Transformer block. Self-attention is the headline operation, but a deep stack of bare attention layers trains poorly: you need residual connections, LayerNorm, and the position-wise FFN to stabilize training and add non-linearity. Candidates who say "Transformer equals self-attention" lose the senior signal immediately. Always sketch the full block: attention, add-and-norm, FFN, add-and-norm.
A second trap is describing softmax(QK^T / sqrt(d_k))V as "average pooling". It is a weighted aggregation whose weights are computed from the input itself, as a softmax over learned projections. Average pooling has no parameters and cannot adapt to context, so calling attention average pooling signals that you do not understand why attention works.
A third trap is ignoring the KV-cache at inference. Without it, each generated token recomputes attention over the entire prefix, making per-token cost O(n^2) instead of O(n). On a 32k chat workload this is the difference between a usable assistant and a five-minute response. Every major serving stack (vLLM, TGI, SGLang, TensorRT-LLM) implements KV-cache by default, and interviewers expect you to bring it up when asked "how do you serve a 70B model under 200ms first-token latency". Be precise when you do: the cache speeds up every token after the first, while first-token latency is set by the prefill pass over the prompt.
A fourth trap is mixing up training context and serving context. A model trained on 4k tokens with learned absolute positions produces garbage past position 4k at serving time. The fix depends on the encoding: RoPE with YaRN or NTK-aware scaling extends context, usually after a short fine-tune; ALiBi extrapolates by design; learned absolute positions require continued pretraining on long context.
A fifth trap, increasingly common in 2026, is reaching for BERT for every classification problem. A zero-shot prompt to a frontier LLM, which needs no training data, or an embedding-plus-logistic-regression pipeline using bge-large, e5-mistral, or gte-Qwen, which needs only a small labeled set, often matches or beats a fine-tuned BERT-base. BERT-style fine-tuning still wins when you have 100k+ labels and need sub-10ms CPU latency, but that is a narrower niche than candidates assume.
Related reading
- BERT vs GPT in the Data Science interview
- GPT architecture in the Data Science interview
- NLP for the Data Science interview
- Deep learning for the Data Science interview
FAQ
How is GPT different from BERT in one sentence?
GPT is decoder-only with causal attention, pretrained on next-token prediction and optimized for generation. BERT is encoder-only with bidirectional attention, pretrained on masked language modeling and optimized for understanding and embeddings. The architectural difference is the mask; the pretraining objective determines what each one is good at.
What is FlashAttention and why does it matter?
FlashAttention is a memory-aware implementation of exact attention that tiles Q, K, V into blocks fitting in GPU SRAM and computes softmax in a numerically stable streaming pass. It produces the same output as vanilla attention, up to floating-point rounding, but with O(n) activation memory and a reported 2-4x wall-clock speedup on modern GPUs. Most training and serving stacks in 2026 use FlashAttention 2 or 3, or an equivalent fused kernel, by default.
Why did Transformers replace RNNs?
Two reasons. RNNs are sequential — token t requires the hidden state from t-1 — which is slow on GPUs that prefer wide parallel matmuls. Transformers process the whole sequence in parallel at training time. Attention also models long-range dependencies more reliably than an RNN hidden state, which compresses all past context into a fixed-size vector and forgets the early tokens.
What is the KV-cache and when is it used?
During autoregressive decoding the keys and values of past tokens are immutable, so a serving system caches them and only computes K, V, Q for the newest token. This drops per-token cost from O(n^2) to O(n) and is essential for low-latency chat. Cache size grows linearly with context length and is the dominant memory term in long-context serving.
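The equivalence between cached decoding and full recomputation can be shown for a single attention layer. This sketch assumes one head and no batching, and appends to Python lists for clarity, whereas real serving stacks preallocate or page the cache. The names `full_causal_attention` and `decode_with_cache` are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, W_q, W_k, W_v):
    # Reference: recompute everything for the whole sequence at once.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    n = x.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, scores)
    return softmax(scores) @ v

def decode_with_cache(x, W_q, W_k, W_v):
    k_cache, v_cache, outputs = [], [], []
    for x_t in x:                                    # one new token per step
        q_t = x_t @ W_q                              # only the newest token is projected
        k_cache.append(x_t @ W_k)
        v_cache.append(x_t @ W_v)
        K, V = np.stack(k_cache), np.stack(v_cache)  # (t, d_k), (t, d_v)
        scores = K @ q_t / np.sqrt(q_t.shape[-1])    # (t,): O(t) work per step
        outputs.append(softmax(scores) @ V)
    return np.stack(outputs)

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 16))
W_q, W_k, W_v = (rng.standard_normal((16, 8)) for _ in range(3))
print(np.allclose(decode_with_cache(x, W_q, W_k, W_v),
                  full_causal_attention(x, W_q, W_k, W_v)))
```

No causal mask is needed in the cached path because the cache only ever contains past and current tokens.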
What is Mixture of Experts and how does it relate to Transformers?
In an MoE Transformer the FFN is replaced with N parallel experts plus a router that picks the top k per token (often 1 or 2, though some models route to more). This gives a very large total parameter count but a much smaller active parameter count per token, so inference cost stays near a dense model of the active size. Mixtral and DeepSeek-V3 use this pattern, and GPT-4 is widely reported to, although OpenAI has not confirmed its architecture.
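A minimal sketch of top-k routing, assuming ReLU FFN experts and gate weights computed as a softmax over only the selected logits, which is one common choice. Load balancing and capacity limits are omitted. The names `moe_layer`, `W_router`, and `experts` are illustrative.

```python
import numpy as np

def moe_layer(x, W_router, experts, k=2):
    logits = x @ W_router                                  # (n_tokens, n_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -k:]          # k highest-scoring experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates = gates / gates.sum(axis=-1, keepdims=True)      # softmax over the selected k only
    out = np.zeros_like(x)
    for token in range(x.shape[0]):
        for slot in range(k):
            W_up, W_down = experts[top_idx[token, slot]]
            hidden = np.maximum(x[token] @ W_up, 0.0)      # only k experts run per token
            out[token] += gates[token, slot] * (hidden @ W_down)
    return out, top_idx, gates

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 5, 8, 16, 4
x = rng.standard_normal((n_tokens, d_model))
W_router = rng.standard_normal((d_model, n_experts))
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(n_experts)
]
out, top_idx, gates = moe_layer(x, W_router, experts, k=2)
print(out.shape, top_idx.shape)
```

With 4 experts and k = 2, every token pays for two expert FFNs even though the layer stores four, which is the total-versus-active parameter gap described above.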
Is this an official reference?
No. The post is a study guide synthesized from "Attention Is All You Need" (Vaswani et al., 2017), the FlashAttention papers, public model cards for BERT, GPT, Llama, and Mistral, and the Hugging Face documentation. Cite the primary sources in your own writing.