ML inference latency optimization for the Data Science interview

Where latency actually lives

When an interviewer at Meta, OpenAI, or Stripe asks you to optimize an ML service, the wrong move is to jump straight to "quantize to INT8". The right move is to decompose end-to-end latency into stages and only then pick a lever. A request that takes 420 ms end-to-end usually breaks down across network, queueing, preprocessing, model forward, and postprocessing — and the dominant stage is rarely the one juniors assume.

The five stages to name out loud: network round trip, queue wait (relevant the moment you batch), preprocessing (tokenization, image resize, feature lookup), model forward, and postprocessing plus response serialization. A vision model serving 30 RPS on an A100 often spends 18 ms in forward and 45 ms in preprocessing because images are being decoded on the CPU. The fix is not a smaller model; it is moving JPEG decode onto the GPU with nvJPEG.
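A minimal sketch of per-stage timing, assuming everything runs in one Python process. `StageTimer` and `handle_request` are hypothetical names, and the `time.sleep` calls stand in for real work with made-up durations. Network round trip and queue wait happen outside the process, so they would need client-side or load-balancer timestamps instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock milliseconds per named stage."""

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0

    def dominant(self):
        """Return the name of the stage with the largest accumulated time."""
        return max(self.totals_ms, key=self.totals_ms.get)


def handle_request(timer):
    # Placeholder work: sleep durations are illustrative, not measurements.
    with timer.stage("preprocess"):
        time.sleep(0.03)
    with timer.stage("forward"):
        time.sleep(0.005)
    with timer.stage("postprocess"):
        time.sleep(0.001)


timer = StageTimer()
handle_request(timer)
for name, ms in sorted(timer.totals_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {ms:7.1f} ms")
print("dominant stage:", timer.dominant())
```

In a real service the same breakdown would be collected over many requests and reported as percentiles per stage rather than a single run.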

Load-bearing trick: profile before you optimize. Cite p50, p95, p99 separately. "Average latency" is a tell that you have not worked on a real serving stack.
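To see why the average misleads, here is a small sketch using only the standard library. The sample data is synthetic (97 fast requests and 3 slow outliers) and `latency_summary` is a hypothetical helper name.

```python
import statistics


def latency_summary(samples_ms):
    """Return mean, p50, p95, and p99 for a list of latencies in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Synthetic data: 97 requests at 20 ms, 3 outliers at 900 ms.
samples = [20.0] * 97 + [900.0] * 3
summary = latency_summary(samples)
for name, value in summary.items():
    print(f"{name}: {value:.1f} ms")
```

With this data the mean lands at 46.4 ms, a latency no request actually experienced, while p50 is 20 ms and p99 is 900 ms.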

The interviewer is testing whether you can reason about a latency budget the way an SRE reasons about an error budget. A search ranker has maybe 30 ms of total budget. A chat assistant first-token has 300 ms. A batch recommendation job has minutes. Naming the budget first changes every downstream answer.

Optimization techniques at a glance

Get this comparison table on the whiteboard before going deep. Numbers are typical ranges from published benchmarks (Hugging Face Optimum, NVIDIA TensorRT, vLLM, DeepSpeed). Interviewers will not pin you to exact figures, but they notice when your order of magnitude is wrong.

| Technique | Latency improvement | Accuracy cost | Typical use case |
|---|---|---|---|
| INT8 quantization | 2–4× faster forward | 0.5–1.5% drop | CNNs, BERT-class encoders on CPU/GPU |
| INT4 / GPTQ / AWQ | 3–5× faster, 4× smaller | 1–3% drop | LLMs at 7B–70B on a single GPU |
| Structured pruning | 1.5–2.5× faster | 1–4% drop | Over-parameterized transformers |
| Knowledge distillation | 2–10× faster (smaller student) | 1–5% drop | DistilBERT, TinyLlama, custom rankers |
| Dynamic batching | 5–30× higher throughput | none (latency trade-off) | High-RPS online inference |
| Continuous batching (vLLM) | 10–24× throughput on LLMs | none | LLM serving with variable lengths |
| KV cache | 5–50× faster per token after first | none | Any autoregressive decoder |
| Output / embedding cache | 100–1000× on hits | none (staleness risk) | Repeated queries, semantic search |
| TensorRT / ONNX Runtime | 1.5–3× over vanilla PyTorch | none | Production GPU serving |

The right answer is usually a stack of two or three of these, not a silver bullet. Quantize, then serve with continuous batching, then cache the top-1% of queries — that compounds.

Model size: quantization, pruning, distillation

Quantization swaps high-precision weights and activations for lower-precision integers. FP32 → FP16 is essentially free on modern GPUs and gives roughly 2× speedup with no measurable accuracy loss. FP16 → INT8 (post-training static with calibration data) gives another 1.5–2× at under 1% accuracy. For LLMs the modern recipe is INT4 weight-only via GPTQ or AWQ — a 13B model that needed 26 GB in FP16 fits in 7 GB at INT4 and runs on a single consumer GPU.
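The memory arithmetic and the basic idea of integer quantization can be sketched in a few lines. This is an illustration of symmetric per-tensor INT8 rounding only; it is not how GPTQ or AWQ choose their scales, and the function names and sample weights are hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes); ignores activations and KV cache."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


print("13B at FP16:", weight_memory_gb(13e9, 16), "GB")
print("13B at INT4:", weight_memory_gb(13e9, 4), "GB before scales and zero points")

weights = [0.02, -0.5, 0.31, 1.27, -1.0]  # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q, "max round-trip error:", max_error)
```

The raw INT4 weights come to 6.5 GB; per-group scales and zero points add overhead on top, which is consistent with the roughly 7 GB quoted above.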

Pruning removes weights you do not need. Unstructured pruning compresses storage but rarely speeds up dense GEMM kernels. Structured pruning removes whole attention heads, FFN channels, or layers and translates to real wall-clock wins. Typical recipe: prune 30–50% of heads, fine-tune for a few thousand steps to recover, ship 1.8× faster at 2% accuracy cost.

Distillation trains a small student to mimic a large teacher. DistilBERT is 40% smaller and 60% faster than BERT-base while retaining 97% of GLUE performance. For ranking, a 6-layer student distilled from a 24-layer cross-encoder often hits 95%+ of teacher NDCG@10 with a 5× latency reduction.
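The "mimic" part is usually a soft-target loss: the student is pushed toward the teacher's temperature-softened output distribution. A minimal pure-Python sketch for a single example follows; the logits and the temperature of 2.0 are illustrative assumptions, and in practice this term is typically mixed with the ordinary hard-label loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]   # made-up logits
good_student = [3.8, 1.1, 0.4]
bad_student = [0.5, 4.0, 1.0]
print("close student:", distillation_loss(good_student, teacher))
print("far student:  ", distillation_loss(bad_student, teacher))
```

The T squared factor keeps gradient magnitudes comparable across temperatures, following the common formulation of soft-target distillation.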

Sanity check: always quote the accuracy drop alongside the speedup. An interviewer who hears "4× faster" without "1.2% accuracy loss" assumes you have never run the calibration set.

Batching strategies

A GPU running one request at a time is the most expensive idle hardware in your stack. Static batching waits for a fixed batch_size (say 32) before launching the forward pass. It is simple, but tail latency suffers because request 1 waits for request 32. Dynamic batching (available in NVIDIA Triton once you enable the dynamic batcher in the model config) waits up to a tunable window, such as 5 ms, 10 ms, or 20 ms, and ships whatever it has. It is the standard choice for online inference of fixed-shape models like image classifiers.
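The window logic can be sketched as a small simulation. This is an illustrative model, not Triton's scheduler: it assumes the GPU is always free when a batch closes, and `dynamic_batches` and the arrival times are hypothetical.

```python
def dynamic_batches(arrival_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into batches.

    A batch opens when its first request arrives and closes when it is full
    or when max_wait_ms has passed since it opened, whichever comes first.
    Returns a list of (dispatch_time_ms, arrivals_in_batch).
    """
    batches = []
    current = []
    for t in arrival_ms:
        if current and t > current[0] + max_wait_ms:
            # The window expired before this request arrived.
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            # Batch is full: dispatch immediately.
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + max_wait_ms, current))
    return batches


arrivals = [0, 2, 3, 30, 31, 32, 33, 34, 70]  # made-up arrival times in ms
result = dynamic_batches(arrivals, max_batch_size=4, max_wait_ms=10)
for dispatch, batch in result:
    worst_wait = max(dispatch - t for t in batch)
    print(f"dispatch at {dispatch} ms, batch {batch}, worst wait {worst_wait} ms")
```

Note how the first request in a batch that never fills pays the whole window, while a burst that fills the batch dispatches after only a few milliseconds.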

Continuous batching (vLLM, TGI) is the LLM-era answer. Because autoregressive decoding produces one token at a time and requests have different output lengths, naive batching wastes a huge fraction of GPU cycles on padding. Continuous batching lets new requests join the batch mid-flight as old ones finish. On Llama-class models this typically delivers 10–24× throughput versus static batching at the same p95.
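A toy simulation makes the padding waste concrete. It counts decode steps only, assuming one token per running request per step, no prefill cost, and a step time that does not depend on how full the batch is. The function names and output lengths are hypothetical, and the result illustrates the mechanism rather than reproducing any published benchmark.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch occupies the GPU until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A finished request frees its slot for the next waiting request at once."""
    waiting = list(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request emits one token; drop the ones that are done.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lengths = [10, 200, 15, 20, 180, 12, 25, 30]  # made-up output lengths in tokens
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print("static:", static_steps, "steps; continuous:", continuous_steps, "steps")
```

With these lengths the static schedule needs 380 decode steps and the continuous one 200, because short requests no longer hold a slot while the 200-token request finishes.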

| Strategy | Best for | Throughput win | Tail latency risk |
|---|---|---|---|
| Static (size=N) | Offline jobs, training-style inference | High | High: last-in waits for batch fill |
| Dynamic (max wait W ms) | Online CV, ranking, embeddings | Moderate–high | Bounded by W |
| Continuous (in-flight) | LLM serving, variable-length output | Very high | Low: preemption keeps tails tight |

The interview frame is throughput-per-dollar subject to a p95 latency SLA. If your p95 budget is 200 ms and dynamic batching with a 20 ms window doubles throughput while moving p95 from 110 ms to 145 ms, great. If it pushes p95 to 240 ms, you broke the SLA and the win is illusory.
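That framing is a constrained choice: filter by the SLA first, then maximize throughput per dollar. A small sketch follows. The p95 values mirror the example above, while the throughput figures, the hourly cost, and the 60 ms window label are made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    p95_ms: float
    throughput_rps: float
    cost_per_hour: float


def best_under_sla(configs, p95_budget_ms):
    """Return the feasible config with the highest throughput per dollar, or None."""
    feasible = [c for c in configs if c.p95_ms <= p95_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.throughput_rps / c.cost_per_hour)


configs = [
    Config("no batching", p95_ms=110, throughput_rps=100, cost_per_hour=4.0),
    Config("dynamic, 20 ms window", p95_ms=145, throughput_rps=200, cost_per_hour=4.0),
    Config("dynamic, 60 ms window", p95_ms=240, throughput_rps=320, cost_per_hour=4.0),
]

choice = best_under_sla(configs, p95_budget_ms=200)
print("pick:", choice.name if choice else "nothing meets the SLA")
```

The widest window has the best raw throughput but is excluded before throughput is even compared, which is the point of stating the SLA first.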

Caching: KV, features, outputs

Caching is the cheapest speedup if you have any input repetition at all. Four tiers, ordered by interview frequency:

KV cache for autoregressive decoders is non-negotiable. Without it, every decoding step reruns the forward pass over the whole prefix, recomputing keys and values for all N-1 previous tokens just to produce token N, so total work grows quadratically with output length. With it, each new token computes only its own query, key, and value and attends once over the cached keys and values. A 2048-token generation goes from ~30 seconds to ~3 seconds on the same hardware. PagedAttention (vLLM) makes the cache memory-efficient enough to batch dozens of concurrent generations.
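A back-of-the-envelope count shows where the quadratic term comes from. The sketch counts token positions pushed through the model, not wall-clock time, and `positions_processed` is a hypothetical helper. GPUs process a prefix's positions in parallel, so the measured speedup is far smaller than the ratio of these counts.

```python
def positions_processed(prompt_len, num_new_tokens, use_kv_cache):
    """Count token positions run through the model to generate num_new_tokens."""
    if use_kv_cache:
        # Prefill the prompt once, then feed one new token per later step.
        return prompt_len + (num_new_tokens - 1)
    # No cache: step i reprocesses the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(num_new_tokens))


without_cache = positions_processed(prompt_len=1, num_new_tokens=2048, use_kv_cache=False)
with_cache = positions_processed(prompt_len=1, num_new_tokens=2048, use_kv_cache=True)
print("without cache:", without_cache, "positions")
print("with cache:   ", with_cache, "positions")
```

The uncached count grows as N(N-1)/2 plus the prompt term, while the cached count grows linearly in N.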

Feature cache sits between your feature store and the model. If a recommender needs 200 features per user and 180 change once a day, pre-compute those and look them up by user_id at request time. This is what Tecton, Feast, and in-house systems at Uber and DoorDash do — turning a 40 ms fetch into a 3 ms Redis GET.

Embedding cache for retrieval and semantic search — the same query text always produces the same embedding. Cache by a hash of the normalized query string with a TTL appropriate to your model version. On a search system with a heavy long tail of repeated queries, an embedding cache can absorb 30–60% of encoder load.

Output cache for deterministic pipelines. If your fraud scoring API is pure and called repeatedly for the same transaction, the response is cacheable. Every output cache key must include the model version, or you ship stale answers after a deploy.

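Gotcha: invalidation. Every cache layer needs a story for what happens on model deploy, feature drift, or upstream schema change. "We cache for 24 hours" is a vague answer; "we key by model_version, so a deploy changes every key, and we delete the old version's key prefix in Redis afterwards" is a hire signal. (FLUSHDB is only an option if the cache has a Redis database to itself, because it clears the whole database rather than one prefix.)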
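A minimal sketch of version-aware cache keys with a TTL, using an in-memory dictionary as a stand-in for a shared store such as Redis. `cache_key`, `TTLCache`, the version strings, and the normalization rule (lowercase plus collapsed whitespace) are all illustrative assumptions.

```python
import hashlib
import time


def cache_key(model_version, query):
    """Combine the model version with a hash of the normalized query."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{model_version}:{digest}"


class TTLCache:
    """In-memory stand-in for a shared cache; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl_seconds)


cache = TTLCache(ttl_seconds=3600)
cache.set(cache_key("ranker-v7", "  Red   Shoes "), [0.1, 0.2])
hit = cache.get(cache_key("ranker-v7", "red shoes"))
miss_after_deploy = cache.get(cache_key("ranker-v8", "red shoes"))
print("same version:", hit, "| new version:", miss_after_deploy)
```

Because the version is part of the key, a deploy makes every old entry unreachable immediately; the TTL or a prefix delete then only reclaims memory.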

Hardware and serving stack

Hardware choice is rarely the most interesting answer, but you must be fluent in the trade-offs. CPUs are fine for small models (sub-100M parameters, low QPS). GPUs dominate everywhere else; an H100 delivers higher transformer-inference throughput than an A100, at a higher rental cost on AWS or Lambda Labs. TPUs are Google-only and shine for very large dense models with JAX. Edge accelerators (Apple Neural Engine, Qualcomm Hexagon, Coral) matter on-device, where the drivers are privacy, offline use, or sub-10 ms latency.

The serving layer matters as much as the chip. TensorRT and ONNX Runtime apply fused kernels and graph optimization vanilla PyTorch leaves on the table — expect 1.5–3× on the same GPU. vLLM and TGI are standard for LLM serving. Triton Inference Server runs ONNX, TensorRT, PyTorch, and TensorFlow side by side with shared dynamic batching.

| Layer | Pick when |
|---|---|
| CPU + ONNX Runtime | Small model, low RPS, cost-sensitive |
| GPU + TensorRT | Online CV/NLP, fixed shapes, latency SLA |
| GPU + vLLM | LLM serving, variable-length generation |
| GPU + Triton | Multi-model, multi-framework production |
| Edge (NPU / mobile) | On-device, privacy, offline |

Common pitfalls

Candidates routinely confuse throughput and latency. Increasing batch size raises throughput and raises per-request latency at the same time — the GPU finishes more work per second, but each request waits longer. If the interviewer asked for lower p95 and you describe a fix that doubles batch size, you answered the wrong question. Anchor every proposal to a latency SLA or a cost-per-query target and say which one out loud.
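A toy cost model shows both numbers rising together. It assumes one forward pass costs a fixed overhead plus a per-request increment; the 8 ms and 0.5 ms constants are invented for illustration, and queue wait is left out, which would only widen the latency gap.

```python
def batch_stats(batch_size, fixed_ms=8.0, per_item_ms=0.5):
    """Return (forward_ms, throughput_rps) under a linear cost model."""
    forward_ms = fixed_ms + per_item_ms * batch_size
    throughput_rps = batch_size / forward_ms * 1000.0
    return forward_ms, throughput_rps


stats = {size: batch_stats(size) for size in (1, 8, 32, 64)}
for size, (forward_ms, rps) in stats.items():
    print(f"batch={size:3d}  forward={forward_ms:5.1f} ms  throughput={rps:7.1f} req/s")
```

Going from batch 1 to batch 64 raises throughput by more than an order of magnitude in this model, while every request in the batch now waits 40 ms for the forward pass instead of 8.5 ms.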

Another trap is optimizing the wrong stage. Quantizing a model whose forward pass is 12 ms when end-to-end p95 is 380 ms saves maybe 6 ms while leaving the other 368 ms of preprocessing, network, and feature fetch untouched. Profile first, optimize second: strong candidates pull up torch.profiler or NVIDIA Nsight and ask which stage dominates before proposing a lever.
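The arithmetic is Amdahl's law applied to a latency budget. A short sketch using the numbers from the paragraph above, under the simplifying assumption that stage latencies add up (percentiles do not strictly sum, so treat this as an estimate); `end_to_end_after_speedup` is a hypothetical helper.

```python
def end_to_end_after_speedup(total_ms, stage_ms, stage_speedup):
    """Amdahl-style estimate: only the optimized stage shrinks."""
    return total_ms - stage_ms + stage_ms / stage_speedup


# Halve the 12 ms forward pass inside a 380 ms request.
after_quantizing = end_to_end_after_speedup(total_ms=380, stage_ms=12, stage_speedup=2)

# Halve everything else instead (380 - 12 = 368 ms).
after_fixing_rest = end_to_end_after_speedup(total_ms=380, stage_ms=368, stage_speedup=2)

print("2x faster forward pass:", after_quantizing, "ms")
print("2x faster everything else:", after_fixing_rest, "ms")
```

The same 2× improvement is worth 6 ms on the forward pass and 184 ms on the rest of the pipeline.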

A subtle pitfall is ignoring tail latency when you batch. Dynamic batching with a 20 ms wait window raises throughput through better GPU utilization, and under load that can even lower median latency by shortening queues, but the first request in a new batch window pays close to the full 20 ms of wait time. If your service has a strict p99 SLA, that 20 ms wait must be accounted for in the budget. Quote p50, p95, and p99 separately and admit batching trades p99 for throughput.

Confusing weight quantization with activation quantization also costs candidates. INT4 weight-only (GPTQ, AWQ) keeps activations in FP16 and almost always preserves accuracy. Full INT8 static quantization, where both weights and activations are quantized using a calibration set, is more aggressive and can introduce accuracy regressions on tasks with long-tailed distributions. Name which kind you mean.

Finally, forgetting cold start in serverless or autoscaled setups. The first request after a scale-up can take 5–30 seconds while the model loads onto the GPU and the engine warms up. The fix is keeping a minimum of warm replicas or using a long-running serving framework instead of per-request invocation.

If you want to drill ML systems questions like this every day, NAILDD has a growing library of Data Science interview problems where these trade-offs come up over and over.

FAQ

Should I quote specific latency numbers in the interview?

Yes, but always paired with context. "INT8 quantization gives 2–4× speedup with under 1% accuracy drop on ImageNet-style classification" is far stronger than "quantization makes it faster". Interviewers at Meta, Google, and Anthropic look for candidates who reason in numbers — order-of-magnitude is enough, you do not need to memorize exact benchmarks. If unsure, give a range and say "in my last project" or "from the TensorRT release notes".

When is CPU inference actually a good answer?

When your model is small (sub-100M parameters), traffic is low or spiky, and you care about cost or avoiding GPU cold-start penalties. Modern CPUs with AVX-512 and ONNX Runtime can serve BERT-base at 30–50 ms per request — fine for many internal tools. CPU is also right for edge deployment where no GPU exists. The wrong place to pick CPU is high-RPS production transformer serving.

How do I decide between dynamic batching and continuous batching?

Dynamic batching is for fixed-shape models — image classification, text classification, embeddings — where every request takes roughly the same compute. Continuous batching is for variable-length autoregressive generation, primarily LLMs. For a ranker or image model: "dynamic batching with Triton". For Llama or Mistral: "continuous batching with vLLM or TGI". Mixing them up is a tell that you have read about LLM serving but not run it.

Does distillation always help over just using a smaller pre-trained model?

Not always. If a strong small model exists off the shelf — DistilBERT, TinyLlama, MobileNet — and your task is close to its pre-training distribution, just use it. Distillation pays off when your task has labelled data the small model has not seen and when the accuracy gap between teacher and off-the-shelf small model is large. A practical heuristic: try the off-the-shelf small model first, measure the gap, and only invest in distillation if the gap exceeds 3–5 percentage points on your target metric.

What is the single biggest mistake candidates make here?

Jumping to a technique before profiling. Strong candidates spend two or three minutes decomposing latency into stages and asking about the SLA, traffic pattern, model size, and deployment target. Only then do they propose a lever. Weak candidates lead with "I would quantize and batch" without establishing whether the bottleneck is in the forward pass. The signal is systems thinking, not memorized technique names.