Deep learning for the data science interview
Why DL shows up in DS interviews
Many DS postings at companies like Google, Meta, Netflix, Stripe, or Airbnb list deep learning under "nice to have", and for teams touching CV, NLP, recsys, or fraud, it quietly moves to "required". The bar is not "publish at NeurIPS". The bar is to explain a forward and backward pass without notes, name an optimizer and say why you chose it, and recognize when a 50M-parameter network is overkill for a 5,000-row tabular dataset. That last judgement call separates a junior who quotes blog posts from a mid-level DS who has shipped models.
Depth scales with the team. Product analytics groups want the basics: what a neural net is, how it differs from logistic regression, why a transformer beats an LSTM on long text. Applied CV or NLP teams will grill you on batch normalization quirks, AdamW vs Adam, learning rate warmup, and why your loss exploded at epoch 3. Research teams at OpenAI or Anthropic go further — recent papers, scaling laws, parallelism tricks. Same vocabulary at every level; only the depth changes.
This primer covers the minimum viable surface area for a junior-to-mid-level DS round: MLP, CNN, RNN, Transformers, GNN, regularization, optimizers, and when to walk away from deep learning entirely.
Architectures by use case
Interviewers love the question "what architecture would you pick?" because it tests whether you understand the data, not just the model zoo. The honest answer is a small table of defaults.
| Data shape | Default architecture | Why | Common variant |
|---|---|---|---|
| Tabular, ≤1M rows | Gradient boosting (not DL) | Better signal-to-noise, interpretable | LightGBM, XGBoost, CatBoost |
| Images, video frames | CNN or Vision Transformer | Spatial locality, translation invariance | ResNet, EfficientNet, ViT |
| Sequential text, audio | Transformer (encoder or decoder) | Long-range attention, parallelizable | BERT, GPT, T5 |
| Short sequences, embedded edge | LSTM or GRU | Smaller footprint, no attention quadratic cost | Bidirectional LSTM |
| Graph data (social, molecules) | GNN | Message passing across edges | GCN, GraphSAGE, GAT |
CNNs still dominate when you need fast edge inference — a ViT needs more data and compute to match accuracy on a 224×224 classifier. RNNs are mostly legacy now, but interviewers still ask about LSTM cell state and gates because the answer reveals whether you understand vanishing gradients. GNNs are the rising star in fraud, recommendations, and biology; if you interview at Uber, DoorDash, or any payments team, expect at least one message-passing question.
Load-bearing rule: Match architecture to data shape first, dataset size second, latency budget third. Pick the smallest model that clears your accuracy target — every extra parameter is a future debugging session.
MLP, the often-forgotten baseline
A multilayer perceptron is the boring default: input, two or three hidden layers, output, all fully connected. It rarely wins on raw tabular data, but it shows up everywhere as a head on top of pretrained embeddings. Every transformer block contains an MLP (the feed-forward sublayer); every recsys two-tower model is two MLPs glued at the dot product. If you cannot derive its backward pass on a whiteboard, do not move on yet.
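As a whiteboard-style sketch of that backward pass, the snippet below runs one hidden ReLU layer with a scalar output and a squared-error loss, then checks one analytic gradient against a finite difference. The layer sizes, weights, and the loss are assumptions picked for illustration, not anything prescribed by this primer.

```python
def forward(x, W1, b1, w2, b2):
    """One hidden ReLU layer, scalar output. Returns (prediction, cache)."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [max(0.0, zi) for zi in z]                       # ReLU
    y_hat = sum(w * hj for w, hj in zip(w2, h)) + b2
    return y_hat, (z, h)

def loss_at(x, y, W1, b1, w2, b2):
    y_hat, _ = forward(x, W1, b1, w2, b2)
    return 0.5 * (y_hat - y) ** 2                        # squared error (assumed loss)

def backward(x, y, W1, b1, w2, b2):
    """Chain rule applied layer by layer, from the loss back to W1."""
    y_hat, (z, h) = forward(x, W1, b1, w2, b2)
    d_yhat = y_hat - y                                   # dL/dy_hat
    d_w2 = [d_yhat * hj for hj in h]                     # dL/dw2
    d_b2 = d_yhat                                        # dL/db2
    d_h = [d_yhat * w for w in w2]                       # dL/dh
    d_z = [dh if zi > 0 else 0.0 for dh, zi in zip(d_h, z)]  # ReLU passes or blocks
    d_W1 = [[dz * xi for xi in x] for dz in d_z]         # dL/dW1
    d_b1 = d_z                                           # dL/db1
    return d_W1, d_b1, d_w2, d_b2

# Illustrative values only.
x, y = [1.0, -2.0], 0.5
W1 = [[0.3, -0.1], [-0.4, 0.2], [0.5, 0.5]]
b1 = [0.1, 0.2, -0.1]
w2 = [0.7, -0.3, 0.2]
b2 = 0.05

d_W1, d_b1, d_w2, d_b2 = backward(x, y, W1, b1, w2, b2)

# Finite-difference check on a single weight, W1[0][1].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (loss_at(x, y, W1_plus, b1, w2, b2) - loss_at(x, y, W1_minus, b1, w2, b2)) / (2 * eps)
print(d_W1[0][1], numeric)
```

Two of the three hidden units have negative pre-activations here, so their rows in `d_W1` are zero: a small preview of the dead-ReLU failure mode.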
CNN, the spatial workhorse
Convolutional filters slide over the input, share weights, and exploit the fact that a cat's ear looks the same in the top-left and bottom-right of an image. Pooling downsamples; the receptive field grows with depth. ResNet introduced residual connections that let networks reach 152 layers without vanishing gradients — a trick now used everywhere, including inside transformers.
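To make weight sharing concrete, here is a minimal sketch of a single-channel "valid" convolution (stride 1, no padding, implemented as cross-correlation, which is what deep learning frameworks usually call convolution). The 2x2 pattern, the kernel, and the helper `place` are made up for the example.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def place(pattern, size, top, left):
    """Hypothetical helper: paste a small pattern into an empty size x size image."""
    img = [[0] * size for _ in range(size)]
    for a, row in enumerate(pattern):
        for b, value in enumerate(row):
            img[top + a][left + b] = value
    return img

pattern = [[1, 0], [0, 1]]        # a tiny diagonal "feature"
kernel = [[1, -1], [-1, 1]]       # one filter, reused at every position

top_left = conv2d_valid(place(pattern, 6, 0, 0), kernel)
bottom_right = conv2d_valid(place(pattern, 6, 4, 4), kernel)

# Same filter, same peak response, just at a shifted location.
print(top_left[0][0], bottom_right[4][4])
```

The same four weights detect the pattern in both corners; a fully connected layer would need separate weights for each location.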
Transformer, the new default
Self-attention lets every token look at every other token in one step. Multi-head attention runs several such computations in parallel, each learning a different relationship. The feed-forward block per position is an MLP. Layer normalization plus residual connections keep gradients flowing. Memorize the attention formula: softmax(QK^T / sqrt(d_k)) V.
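The formula reads more easily as code. This is a minimal single-head sketch on nested lists, with no masking, no batching, and no learned projections; the toy Q, K, V values are assumptions for illustration.

```python
import math

def softmax(row):
    m = max(row)                                   # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = len(K[0])
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    weights = [softmax(row) for row in scores]     # each row sums to 1 over the keys
    out = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return out, weights

# Two tokens, d_k = 2 (toy values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

out, weights = attention(Q, K, V)
print(weights[0], out[0])
```

Each output row is a weighted average of the rows of V, with weights decided by how well that token's query matches every key.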
GNN, the relational specialist
A graph neural network passes messages along edges and aggregates them at each node. For fraud, a single transaction is much less informative than the device-account-IP graph it sits in. GraphSAGE samples neighborhoods for scalability; GAT learns attention weights over neighbors. Expect "why not flatten the graph into features?" — neighborhood structure carries signal you cannot tabulate without losing higher-order paths.
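A stripped-down sketch of the aggregation step, assuming mean aggregation with a self-loop and no learned weights or nonlinearity (a real GCN or GraphSAGE layer adds both). The node names and the single "risk" feature are hypothetical.

```python
def message_passing_round(features, neighbors):
    """One round: each node averages its own feature with its neighbors' features."""
    updated = {}
    for node, value in features.items():
        messages = [features[n] for n in neighbors[node]] + [value]
        updated[node] = sum(messages) / len(messages)
    return updated

# Hypothetical graph: account_a -- device_1 -- account_b (known fraud).
neighbors = {
    "account_a": ["device_1"],
    "device_1": ["account_a", "account_b"],
    "account_b": ["device_1"],
}
features = {"account_a": 0.0, "device_1": 0.0, "account_b": 1.0}

after_one = message_passing_round(features, neighbors)
after_two = message_passing_round(after_one, neighbors)

# account_a is two hops from the fraud node, so it only picks up signal in round two.
print(after_one["account_a"], after_two["account_a"])
```

Each extra round (or layer) widens the neighborhood by one hop, which is the higher-order path information that is hard to recover from hand-built tabular features.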
How neural networks train
Training is forward pass, loss, backward pass, weight update — repeated millions of times. Backpropagation is the chain rule applied to a computation graph; autograd handles it. Interviewers still ask because the failure modes — vanishing gradients, exploding gradients, dead ReLUs — only make sense if you understand what flows backward.
The classic question is "what is a vanishing gradient and how do you fix it?" Clean answer: gradients shrink exponentially through many layers, especially with saturating activations like sigmoid. Fixes are ReLU or GELU activations, batch normalization, residual connections, and careful weight initialization (He or Xavier).
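The exponential shrinkage is easy to see numerically. The sketch below multiplies the local activation derivatives along one path of a 20-layer scalar chain, assuming every weight is 1 and picking the most favorable pre-activation for sigmoid (z = 0, where its derivative peaks at 0.25). The depth and inputs are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # never larger than 0.25

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def backprop_factor(act_grad, pre_activations, weight=1.0):
    """Product of per-layer local derivatives along one path of the chain."""
    factor = 1.0
    for z in pre_activations:
        factor *= weight * act_grad(z)
    return factor

depth = 20
sigmoid_factor = backprop_factor(sigmoid_grad, [0.0] * depth)   # best case for sigmoid
relu_factor = backprop_factor(relu_grad, [1.0] * depth)         # all units active

print(sigmoid_factor, relu_factor)
```

Even in its best case the sigmoid chain scales the gradient by 0.25 per layer, while an active ReLU path passes it through unchanged. A ReLU with a negative pre-activation contributes a factor of zero, which is the dead-ReLU problem.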
Optimizer choice matters more than you think
The optimizer is the loop that turns gradients into weight updates. The defaults have shifted twice in the last decade.
| Optimizer | When to use | Pitfall |
|---|---|---|
| SGD | Computer vision with long schedules, when you have time to tune | Needs careful learning rate, slow without momentum |
| SGD + Momentum | Same, plus smoother convergence on noisy gradients | Still needs schedule tuning |
| Adam | General-purpose default, NLP, recsys, fast prototyping | L2 weight decay is added to the gradient and rescaled by the adaptive step, so it regularizes unevenly |
| AdamW | Transformers, anything with regularization | None major; this is the modern default |
| Lion | Recent (2023+), strong for large transformers, memory-light | Less battle-tested, fewer recipes online |
The Adam-to-AdamW switch is small but load-bearing. Standard Adam couples weight decay to the adaptive learning rate, weakening it on parameters with large gradients. AdamW decouples them. For any modern transformer, AdamW is mandatory, not optional.
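A single-step sketch makes the coupling visible. It takes one scalar weight at the very first optimizer step, with moments starting from zero, and compares how much the decay term actually shrinks the weight under Adam with L2 versus the decoupled AdamW form. The hyperparameters and the large gradient value are assumptions chosen to exaggerate the effect; it is not a full optimizer.

```python
import math

def adam_step(w, grad, lr=1e-3, wd=0.01, beta1=0.9, beta2=0.999,
              eps=1e-8, decoupled=False):
    """First step (t = 1) for one scalar weight, moments initialized at zero."""
    g = grad if decoupled else grad + wd * w   # Adam + L2: decay enters the gradient
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)                    # bias correction at t = 1
    v_hat = v / (1 - beta2)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * wd * w                   # AdamW: decay applied directly to the weight
    return w_new

def decay_effect(w, grad, decoupled):
    """Extra shrinkage caused by weight decay in this one step."""
    with_decay = adam_step(w, grad, wd=0.01, decoupled=decoupled)
    without_decay = adam_step(w, grad, wd=0.0, decoupled=decoupled)
    return without_decay - with_decay

w, large_grad = 1.0, 10.0
adam_l2_effect = decay_effect(w, large_grad, decoupled=False)
adamw_effect = decay_effect(w, large_grad, decoupled=True)

print(adam_l2_effect, adamw_effect)
```

With a large gradient, the L2 term is swallowed by the normalization in Adam's update and contributes almost nothing, while the decoupled version shrinks the weight by `lr * wd * w` regardless of gradient scale.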
Learning rate schedules
A constant learning rate is almost always wrong. Three patterns to know cold: step decay (drop by a factor every N epochs, common in CV), cosine annealing (smooth cosine curve), and warmup plus cosine (linear ramp from zero for the first 1-5% of steps, then cosine decay — the transformer standard). Warmup matters because attention layers start with random Q, K, V projections and a high initial learning rate produces gradient explosions in the first few hundred steps.
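One way to write the warmup-plus-cosine pattern as a plain function of the step index. The function name, the peak learning rate, and the step counts are illustrative assumptions; frameworks ship their own scheduler classes with different interfaces.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear ramp from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps, warmup_steps, peak_lr = 1000, 100, 1e-3
schedule = [warmup_cosine_lr(s, total_steps, warmup_steps, peak_lr)
            for s in range(total_steps + 1)]

# Start at zero, peak at the end of warmup, decay to zero by the last step.
print(schedule[0], schedule[50], schedule[100], schedule[550], schedule[1000])
```

Warmup here is a deliberately long 10% of training so the ramp is easy to see when printed; the 1-5% range mentioned above is the more typical choice.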
Regularization
Deep networks overfit unless you fight back. Five techniques cover most cases.
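Dropout randomly zeroes a fraction p of neurons during training (typically p = 0.1 to 0.5). The network cannot rely on any single path, so it learns redundant representations. At inference dropout is turned off; with the usual inverted-dropout implementation, surviving activations are scaled by 1/(1 - p) during training, so no rescaling is needed at inference. Forgetting to switch the model to eval mode before serving is a bug most practitioners ship at least once.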
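A minimal sketch of inverted dropout, the variant most frameworks implement, where the rescaling happens at training time. The function signature and the explicit `rng` argument are choices made for this example.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero with probability p, scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)               # eval mode: identity, no scaling
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 20000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

train_mean = sum(train_out) / len(train_out)
print(train_mean, eval_out[:3])
```

The training-time mean stays close to the original activation, which is why the eval-mode path can simply return its input.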
Batch normalization standardizes activations within each batch and stabilizes training. The gotcha is small batch sizes: with batch size 4 or 8, the batch statistics are too noisy. Layer normalization is the usual alternative: it normalizes across features per sample, so it does not depend on batch size, which is one reason transformers use it (or the closely related RMSNorm) instead of batch norm.
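The difference comes down to which axis the mean and variance are computed over. The sketch below shows only the normalization step and leaves out the learnable scale and shift and batch norm's running statistics; the sample values are made up.

```python
import math

def standardize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [standardize(row) for row in batch]

sample = [1.0, 2.0, 3.0]
batch_a = [sample, [2.0, 4.0, 6.0]]
batch_b = [sample, [10.0, 0.0, -5.0]]

# Batch norm output for the same sample depends on its batch companions.
print(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
# Layer norm output does not.
print(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
```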
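Weight decay shrinks weights toward zero on every step. With plain SGD it is equivalent to an L2 penalty on the weights; with Adam the two differ, and AdamW implements the decoupled form. Typical values are 0.01 to 0.1 for transformers, 1e-4 for CNNs.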
Data augmentation beats every other regularizer when feasible. For images: random crops, flips, color jitter, MixUp, CutMix. For text: back-translation, synonym swap, span masking.
Early stopping halts training when validation loss stagnates for a patience window (commonly 5-10 epochs). Cheap and robust.
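The patience logic fits in a few lines. This sketch works on a precomputed list of validation losses; the function name, the `min_delta` parameter, and the loss values are assumptions for illustration. A real loop would also checkpoint the weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), 0-indexed.

    stop_epoch is None if patience never ran out.
    """
    best_loss = float("inf")
    best_epoch = None
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch        # new best: reset the clock
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch                   # patience exhausted
    return None, best_epoch

val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Note that matching the best loss without beating it does not reset the counter here; that is a design choice, and `min_delta` makes the threshold for "improvement" explicit.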
Transformers in 5 minutes
The transformer block has three parts: multi-head self-attention, a position-wise feed-forward network, and two residual-plus-LayerNorm wrappers. Positional encodings inject token order, since attention itself has no notion of position: permuting the input tokens simply permutes the outputs (permutation equivariance).
Three families dominate. BERT is encoder-only, trained with masked language modeling, used for classification, NER, and QA. GPT is decoder-only with causal masking, trained as a next-token predictor, used for generation. T5 is encoder-decoder and frames every task as text-to-text. Modern LLMs (GPT-4, Claude, Llama) are decoder-only with refinements like rotary positional embeddings, grouped-query attention, and RMSNorm — the core block is unchanged from 2017.
If the interviewer goes deeper, walk through one head: project input into queries, keys, values; compute QK^T / sqrt(d_k); mask if causal; softmax over keys; multiply by V; concatenate heads; project back.
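The "mask if causal" step is the one candidates most often hand-wave. As a sketch, the function below takes already-scaled scores (QK^T / sqrt(d_k), assumed computed elsewhere) and applies the causal mask before the softmax; the all-zero score matrix is just a convenient toy input.

```python
import math

def causal_attention_weights(scores):
    """Softmax over keys where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions get -inf so they receive exactly zero weight.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]       # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three tokens with identical scores, so only the mask shapes the result.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

The first token can only see itself, the second splits its attention over two positions, and so on; the upper triangle of the weight matrix is exactly zero.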
When DL is the wrong tool
A model that loses to LightGBM but trains for 3 days is a bad model. Skip DL when:
- Tabular data under 1M rows with standard features: gradient boosting usually wins on accuracy, training time, and interpretability. Kaggle tabular competitions have been dominated by boosted trees for years.
- A linear or logistic regression already meets your business metric.
- Tiny datasets (under ~10k samples for vision, under ~1k for text) without transfer learning — the model will memorize.
- Hard interpretability requirements (credit scoring, medical triage) — SHAP on a boosted tree is easier to defend than gradient attribution on a 100M-parameter net.
DL earns its keep on images, audio, free-form text, very large datasets, and problems with rich structure (graphs, sequences, multi-modal).
Common pitfalls
The first pitfall is forgetting to normalize inputs. Neural networks expect inputs in a small range around zero. Feed raw pixel values 0-255 or unscaled tabular features and your loss either diverges immediately or trains painfully slowly. Standardize tabular features to zero mean and unit variance, scale images to [0, 1] or [-1, 1], and the first epoch will look dramatically saner.
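A small sketch of both fixes, assuming a single numeric column and 8-bit pixel values. The helper names and data are hypothetical. The statistics are fit on the training split only and reused for validation, which avoids leaking validation data into preprocessing.

```python
import math

def fit_standardizer(column):
    """Compute mean and standard deviation on the training data only."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(var) or 1.0                 # guard against constant columns
    return mean, std

def standardize(column, mean, std):
    return [(v - mean) / std for v in column]

def scale_pixels(pixels, signed=False):
    """Map 0..255 to [0, 1], or to [-1, 1] when signed=True."""
    scaled = [p / 255.0 for p in pixels]
    return [2.0 * s - 1.0 for s in scaled] if signed else scaled

train_col = [10.0, 20.0, 30.0, 40.0]
val_col = [25.0, 50.0]

mean, std = fit_standardizer(train_col)
train_z = standardize(train_col, mean, std)
val_z = standardize(val_col, mean, std)         # reuse the training statistics

print(train_z, val_z)
print(scale_pixels([0, 128, 255]), scale_pixels([0, 255], signed=True))
```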
A second trap is using sigmoid or tanh in hidden layers because someone learned them first in a textbook. Both saturate; both cause vanishing gradients in deep networks. Use ReLU as a default, GELU for transformers, SiLU for modern CNNs. Reserve sigmoid for binary output heads.
The third pitfall is shipping a model without dropout, batch norm, or weight decay, then wondering why train accuracy is 99% and validation is 70%. Pick at least one regularizer from each category — stochastic (dropout), normalization (BN or LN), and explicit penalty (weight decay). Combine with augmentation and early stopping; generalization gaps shrink dramatically.
A fourth common error is using Adam where AdamW belongs. For transformers, Adam with weight decay 0.01 can apply far less regularization than the nominal value suggests on parameters with large gradients, because the decay term is rescaled by the adaptive learning rate. AdamW decouples the two. Default to AdamW for transformers.
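The fifth pitfall is tiny batch sizes without gradient accumulation. A batch of 4 gives batch norm noisy statistics and gives the optimizer a high-variance gradient estimate. If memory forces small batches, accumulate gradients across several mini-batches before stepping to recover a larger effective batch. Accumulation does not fix batch norm, whose statistics are still computed per mini-batch, so switch to LayerNorm or GroupNorm when normalization is the problem.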
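Gradient accumulation works because the gradient of a mean loss over a big batch equals the average of the gradients over equally sized micro-batches. The sketch below checks that on a one-parameter linear model with mean squared error; the model, data, and function names are assumptions for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each divided by the number of micro-batches."""
    starts = range(0, len(xs), micro_batch_size)
    n_micro = len(starts)
    total = 0.0
    for s in starts:
        micro_x = xs[s:s + micro_batch_size]
        micro_y = ys[s:s + micro_batch_size]
        total += mse_grad(w, micro_x, micro_y) / n_micro   # scale before accumulating
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
w = 0.5

full_batch = mse_grad(w, xs, ys)
accumulated = accumulated_grad(w, xs, ys, micro_batch_size=4)
print(full_batch, accumulated)
```

The equality holds exactly only when the micro-batches are the same size; with a ragged final micro-batch, weight each gradient by its sample count instead.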
FAQ
PyTorch or TensorFlow?
PyTorch is the default for research, almost every Hugging Face model, and many production teams. TensorFlow remains common in mature ML platforms at Google and in mobile or edge stacks via TFLite. If you are choosing a first framework, learn PyTorch: most papers and tutorials use it, and the API is closer to NumPy.
How much hands-on DL do I need for a junior DS role?
One or two end-to-end projects with a real neural network — a CNN image classifier on a non-trivial dataset, a fine-tuned BERT on a text classification task, or a recsys with embeddings. You should describe the architecture, the loss, the optimizer, the regularization, and one bug you hit.
Batch norm vs layer norm?
Batch normalization standardizes each feature across the batch dimension — it needs a reasonable batch size to estimate statistics. Layer normalization standardizes across the feature dimension per sample, so batch size does not matter. Transformers use layer norm because attention is sensitive to scale and batch norm breaks at the small batch sizes typical for long sequences.
Why do transformers use warmup?
At initialization, attention layers have random Q, K, V projections, so attention weights are nearly uniform and the gradient signal is large and noisy. A high initial learning rate causes loss spikes. Linear warmup over the first 1-5% of steps lets the optimizer ease in until AdamW's running statistics stabilize, then cosine decay takes over.
Is DL expected at every DS interview?
No. Product analyst and experimentation-focused roles often skip it or ask one screening question. Applied ML, NLP, CV, recsys, and ranking roles go deep — expect architecture questions, training-loop debugging, and at least one "how would you build X" system design.
Is this official guidance?
No. This primer reflects public material — original papers (LeCun, He, Vaswani, Kipf), framework docs, and patterns reported by candidates. Use it as scaffolding, not a substitute for hands-on practice.