Adversarial attacks for the Data Science interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

What is an adversarial attack

An adversarial attack is a tiny, often imperceptible perturbation added to a model's input that flips the prediction, often with high confidence. The classic demo, from Goodfellow et al. (2014): take a photo of a panda, add an FGSM perturbation with epsilon = 0.007 (about one 8-bit quantization step after the network's preprocessing), and a GoogLeNet that was 57.7% confident about "panda" is now 99.3% sure it sees a "gibbon", while the image still looks like a panda to any human grader. This is a structural property of high-dimensional decision boundaries, first reported by Szegedy et al. in 2013, and it shows up in image classifiers, speech models, tabular fraud detectors, and LLMs alike.

In a Data Science interview, this topic gets asked for three reasons. First, it tests whether you understand gradients flowing back to the input space, not just to the weights. Second, it is a proxy for production maturity — ship a fraud model without thinking about adversarial behavior and your false-negative rate gets gamed within weeks. Third, recent LLM safety work makes it a hot topic again: jailbreak prompts are a discrete-token cousin of FGSM, and frontier labs increasingly want a real answer.

The one-line answer to remember: an adversarial example is an input crafted so the model's loss is maximized inside a small perturbation budget around a clean input — usually an L-infinity ball of radius epsilon.
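Written out, the one-line answer is a constrained maximization. The notation below is a sketch that reuses the post's symbols (f for the model, L for the loss, y for the true label) and introduces delta for the perturbation, a name the text itself does not use:

```latex
\delta^{\star} = \arg\max_{\|\delta\|_{\infty} \le \epsilon} \; L\big(f(x + \delta),\, y\big),
\qquad
x_{\mathrm{adv}} = x + \delta^{\star}
```

Swapping the L-infinity norm for another norm changes the shape of the feasible set but not the structure of the problem.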

The rest of this post fills in the threat models you will be quizzed on and the talking points that separate a "memorized the slides" answer from a "shipped a robust model" answer.

Threat models interviewers test

Before you write any math, frame the threat model. Interviewers care less about which attack you name than whether you can scope the problem. A good answer specifies four things: goal, knowledge, perturbation budget, and access pattern.

| Dimension | Options | What it changes |
| --- | --- | --- |
| Goal | Untargeted vs targeted | Targeted attacks need a specific wrong class: harder, slower |
| Knowledge | White-box, gray-box, black-box | Drives whether attacker has gradients or only queries |
| Budget | L-infinity epsilon, L2 norm, or perceptual | Standard CIFAR-10 benchmark is epsilon = 8/255 in L-infinity (ImageNet commonly uses 4/255) |
| Access | One-shot vs adaptive vs query-limited | Production attackers usually get ~1,000 queries before rate limits |

If a candidate says "I would defend with adversarial training" without naming the threat model, a senior interviewer pushes back. Adversarial training is brittle outside the budget you trained against — train at epsilon = 8/255 and the model is still owned by an attacker who picks epsilon = 16/255 or switches the norm.

This is also why "robustness" is never a single number — it is a curve over budgets, attacks, and norms.
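To make the "curve, not a number" point concrete, here is a minimal sketch on a toy linear classifier, where robust accuracy under an L-infinity budget can be computed exactly without running any attack. The weights, points, and labels are made up for illustration, and the [0, 1] box constraint on inputs is ignored.

```python
def robust_accuracy(w, b, points, labels, epsilon):
    """Exact robust accuracy of sign(w.x + b) under an L-infinity budget.

    For a linear model the worst-case perturbation lowers the signed margin
    y * (w.x + b) by epsilon * ||w||_1, so a point stays correctly classified
    only if its margin exceeds that amount.
    """
    l1_norm = sum(abs(wi) for wi in w)
    robust = 0
    for x, y in zip(points, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin > epsilon * l1_norm:
            robust += 1
    return robust / len(points)


# Hypothetical toy data: two features, labels in {-1, +1}.
w, b = [1.0, -1.0], 0.0
points = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.45, 0.55]]
labels = [1, 1, -1, -1]

curve = {eps: robust_accuracy(w, b, points, labels, eps)
         for eps in (0.0, 0.04, 0.08, 0.2, 0.5)}
print(curve)  # {0.0: 1.0, 0.04: 1.0, 0.08: 0.75, 0.2: 0.5, 0.5: 0.0}
```

The same model is "100% robust" or "0% robust" depending on the budget you quote, which is why the threat model has to come first.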

FGSM — the one-step attack

The Fast Gradient Sign Method is the simplest and most-asked attack. You compute the gradient of the loss with respect to the input, take its sign, scale it by the perturbation budget, and add it to the input.

x_adv = x + epsilon * sign(grad_x L(f(x), y))

Three things matter when you explain it. Taking the sign instead of the raw gradient gives the largest first-order loss increase available inside the L-infinity ball: every pixel with a nonzero gradient moves by exactly epsilon. The attack is single-step, so it is fast but weak; on a well-trained model it succeeds 40-70% of the time at epsilon = 8/255. And the gradient is with respect to the input tensor, not the weights. Interviewers love asking candidates to write the autograd code, because it exposes whether you actually use PyTorch in anger.

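import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    """One-step L-infinity attack; assumes x holds pixel values in [0, 1]."""
    # Track gradients on a detached copy of the input, not on the weights.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    # autograd.grad returns d(loss)/dx without accumulating into parameter .grad
    (grad,) = torch.autograd.grad(loss, x)
    # Step epsilon along the gradient sign, then clamp to the valid pixel range.
    return torch.clamp(x + epsilon * grad.sign(), 0, 1).detach()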

The clamp at the end is the detail candidates skip — without it your adversarial pixels leave the [0, 1] valid image range, and the attack succeeds against a tensor that no camera could ever produce. That answer alone gets you a nod from an experienced ML interviewer.
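If the interviewer asks why FGSM uses the sign of the gradient rather than the gradient itself, a short first-order argument is usually enough. The derivation below is a sketch using the same symbols as the FGSM update, with g and delta introduced here as shorthand for the input gradient and the perturbation:

```latex
L\big(f(x + \delta), y\big) \approx L\big(f(x), y\big) + \delta^{\top} g,
\qquad g = \nabla_x L\big(f(x), y\big)

\max_{\|\delta\|_{\infty} \le \epsilon} \delta^{\top} g
  = \epsilon \, \|g\|_{1},
\qquad \text{attained at } \delta = \epsilon \, \operatorname{sign}(g)
```

Under the linear approximation, each coordinate contributes independently, so the best move is to push every coordinate to the edge of the budget in the direction of its gradient.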

PGD — the strong baseline

Projected Gradient Descent is FGSM run iteratively with projection back into the epsilon-ball after each step. It is the de-facto strongest first-order attack, the one Madry et al. used in 2017 to show that adversarial training against PGD generalizes to most weaker attacks.

x_adv = x    (or x + uniform noise in [-epsilon, epsilon] for a random start)
For step k in 1..K:
    x_adv = clip(x_adv + alpha * sign(grad_x L(f(x_adv), y)),
                 lower = x - epsilon,
                 upper = x + epsilon)
    x_adv = clip(x_adv, 0, 1)

Standard hyperparameters on CIFAR-10 are K = 20 steps, alpha = 2/255, epsilon = 8/255, usually with random initialization inside the epsilon-ball so the attack does not start from a point where the gradient is uninformative. Interviewers will ask why PGD and not Carlini-Wagner or AutoAttack. The honest answer: PGD is cheap, well-understood, and reproducible across labs. CW often finds smaller perturbations, but it minimizes perturbation size subject to misclassification rather than maximizing loss inside a fixed budget, so it does not map cleanly onto a fixed-epsilon benchmark. AutoAttack, an ensemble of two parameter-free PGD variants (APGD), the FAB attack, and the query-based Square Attack, is the 2020+ gold standard if you have the compute.

Load-bearing trick: restart PGD from multiple random points inside the epsilon-ball. A single trajectory can stall in a flat region; 10 random restarts routinely add 5-15 points of attack success.
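A minimal PyTorch sketch of L-infinity PGD with random starts and restarts, under the same assumptions as the FGSM function above (inputs in [0, 1], cross-entropy loss). The function name `pgd_linf`, its defaults, and the tiny linear model in the demo are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def pgd_linf(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=20, restarts=1):
    """L-infinity PGD with random starts; assumes x holds values in [0, 1]."""
    x = x.detach()
    best_adv = x.clone()
    for _ in range(restarts):
        # Random start: uniform noise inside the epsilon-ball, clipped to [0, 1].
        delta = torch.empty_like(x).uniform_(-epsilon, epsilon)
        x_adv = torch.clamp(x + delta, 0, 1)
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            (grad,) = torch.autograd.grad(loss, x_adv)
            # Ascent step, then project onto the epsilon-ball and the pixel range.
            x_adv = x_adv.detach() + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
            x_adv = torch.clamp(x_adv, 0, 1)
        # Keep, per example, any restart that ends in a misclassification.
        with torch.no_grad():
            fooled = model(x_adv).argmax(dim=1) != y
        best_adv[fooled] = x_adv[fooled]
    return best_adv


# Toy demo on random data with an untrained linear model (hypothetical setup).
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)).eval()
x = torch.rand(4, 3, 8, 8)
y = torch.randint(0, 10, (4,))
x_adv = pgd_linf(model, x, y, restarts=3)
```

Examples that no restart manages to fool are returned unchanged, so attack success is simply the fraction of rows where the model's prediction on `x_adv` differs from `y`.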

Black-box and transfer attacks

The interesting interview question is what happens when the attacker has no gradients. This is the production case — your model sits behind an API, an attacker can only send inputs and read predictions. Three patterns dominate.

Query-based attacks estimate gradients numerically from the model's outputs, either by perturbing one feature at a time (coordinate-wise finite differences, as in ZOO) or by probing random directions (NES and SPSA), then run a black-box PGD on the estimate. They need 5,000 to 100,000 queries per example, which is why production defenses focus on rate-limiting and anomaly detection on the query stream.
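A rough sketch of the NES-style estimator, in plain Python so the query accounting is easy to see. The black box here is a hypothetical linear loss with hidden weights; in a real attack `loss_fn` would wrap an API call, and `sigma` and `samples` are illustrative defaults.

```python
import random


def nes_gradient(loss_fn, x, sigma=0.01, samples=1000, rng=None):
    """Estimate d(loss)/dx from loss queries only, using antithetic sampling.

    Each sample draws a Gaussian direction u and queries the loss at
    x + sigma * u and x - sigma * u, so the total cost is 2 * samples queries.
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = loss_fn([xi + sigma * ui for xi, ui in zip(x, u)])
        minus = loss_fn([xi - sigma * ui for xi, ui in zip(x, u)])
        scale = (plus - minus) / (2.0 * sigma * samples)
        for i, ui in enumerate(u):
            grad[i] += scale * ui
    return grad


# Hypothetical black box: the attacker can call it but cannot read w_hidden.
w_hidden = [0.5, -2.0, 1.0]


def black_box_loss(x):
    return sum(wi * xi for wi, xi in zip(w_hidden, x))


estimate = nes_gradient(black_box_loss, [0.2, 0.4, 0.6])
signs = [1 if g > 0 else -1 for g in estimate]
```

Only the signs of the estimate feed a sign-based PGD step, which is part of why these attacks tolerate noisy estimates. The query cost still grows with input dimension, which is what makes rate limits bite.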

Transfer attacks are the cheaper, scarier cousin. The attacker trains a surrogate on similar data, crafts adversarials against the surrogate with PGD, and submits them to your model. On ImageNet, transfer rates between independently-trained ResNets sit around 40-60% without tuning, and ensemble surrogates push that above 80%. Even if your weights are locked in a vault, an attacker who knows your training distribution owns a large fraction of your inputs.

Decision-based attacks like Boundary Attack assume only access to the predicted class, no probabilities, and walk along the decision boundary by rejection sampling. Slow but catastrophic for systems that hide softmax outputs as a defense.

The takeaway: "hiding the model" is not a defense. Either build robustness into training, or accept that the attack surface includes anyone with API access.

Defenses that actually hold up

Most published defenses have been broken within months of release. The literature is a graveyard. Three categories survive scrutiny in 2026.

Adversarial training remains the strongest empirical defense. You generate PGD adversarials inside the training loop and train on a mix of clean and perturbed examples. The cost is real: training takes 3-7x longer, and clean accuracy drops by 2-10 percentage points. The benefit is real too — robust accuracy at epsilon = 8/255 on CIFAR-10 went from roughly 0% on a vanilla ResNet to 52-65% on Madry-style adversarially-trained models, and another 5-10 points with TRADES and MART variants.

Certified robustness via randomized smoothing gives mathematical guarantees instead of empirical numbers. Add Gaussian noise of standard deviation sigma to the input, take the majority vote over many samples, and Cohen et al. (2019) prove the smoothed classifier is robust within a radius proportional to sigma. Certified radii are tight only for L2 perturbations, and inference costs rise by 100-1000x.
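The radius in Cohen et al. (2019) has a closed form that is easy to state in an interview. The sketch below uses the simplified version that bounds the runner-up class probability by one minus the top-class probability; the function name and the example numbers are illustrative.

```python
from statistics import NormalDist


def certified_l2_radius(sigma, p_a_lower):
    """Certified L2 radius of a Gaussian-smoothed classifier (simplified form).

    p_a_lower is a lower confidence bound, strictly below 1, on the probability
    that the base classifier returns the top class under N(0, sigma^2 I) noise.
    With the runner-up bounded by 1 - p_a_lower, the radius is
    sigma * Phi^{-1}(p_a_lower), where Phi^{-1} is the standard normal quantile.
    """
    if p_a_lower <= 0.5:
        return 0.0  # abstain: no radius can be certified
    return sigma * NormalDist().inv_cdf(p_a_lower)


radius_small = certified_l2_radius(sigma=0.25, p_a_lower=0.9)
radius_large = certified_l2_radius(sigma=0.50, p_a_lower=0.9)
radius_none = certified_l2_radius(sigma=0.50, p_a_lower=0.5)
```

Doubling sigma doubles the radius at a fixed `p_a_lower`, but in practice more noise also lowers the base classifier's agreement rate, which is where the clean-accuracy cost comes from.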

Input transformations and detection are the third bucket, and the honest answer is that they mostly work only against non-adaptive attackers. JPEG compression, random resizing, and total-variation denoising all fell to adaptive attacks in Athalye et al.'s "Obfuscated Gradients" paper in 2018, which found that seven of nine ICLR 2018 defenses relied on obfuscated gradients and circumvented six of them completely and one partially.

| Defense | Threat covered | Clean acc cost | Inference cost | Status |
| --- | --- | --- | --- | --- |
| Adversarial training (PGD) | Empirical L-inf | 2-10 pp drop | 1x | Best practical option |
| TRADES / MART | Empirical L-inf | 2-8 pp drop | 1x | Marginal gains over PGD-AT |
| Randomized smoothing | Certified L2 | 5-15 pp drop | 100-1000x | Use for safety-critical |
| Input preprocessing | Non-adaptive only | Low | Low | Broken by adaptive attacks |
| Defensive distillation | None | Low | 1x | Broken in 2016; do not mention as a defense |

If you want one rule of thumb for a system design answer: train adversarially against PGD at the budget you actually care about, monitor query streams in production, and rate-limit aggressively. That is the architecture shipping at frontier labs today.

Common pitfalls

The first pitfall is claiming robustness without an adaptive attack. A model can look robust against FGSM and still fall over to PGD with restarts. Worse, it can resist all standard attacks because of gradient masking — a numerical artifact where gradients vanish or explode near the input, fooling the attacker but not changing the actual decision boundary. The fix is to evaluate with AutoAttack and a transfer attack from a separate model; if robust accuracy collapses under transfer, your defense is masking, not robustness.

The second pitfall is conflating perceptual and L-infinity budgets. An L-infinity epsilon of 8/255 is barely visible on natural images but can be dramatic on medical scans where pixel intensities encode tissue density. Interviewers in healthcare ML will press on this: your threat model should be domain-specific. In radiology, even a 1% perturbation can be clinically meaningful, so the right budget is closer to epsilon = 1/255 with much stricter detection layered on top.

The third pitfall is assuming adversarial training generalizes to new attack types. Train against L-infinity PGD and you do not get free robustness against L2 attacks, spatial transformations, or patch attacks. Each threat model needs its own training run, or a careful multi-norm training scheme like that of Tramèr and Boneh (2019). The honest answer in an interview is that robustness is plural — one number on one benchmark is rarely enough for a real product.

The fourth pitfall is ignoring the clean-vs-robust trade-off in product terms. A 5-point clean accuracy drop on a fraud detector is a measurable false-positive cost. Show the interviewer you have done the math: estimate the cost of one false positive, the cost of one missed adversarial fraud, and tune epsilon and the adversarial-clean mix to minimize expected loss.
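The expected-loss comparison can be sketched in a few lines. Every number below (rates, volumes, unit costs) is hypothetical and only there to show the shape of the calculation; in practice the rates would come from evaluating each candidate model on clean traffic and on adversarial fraud attempts.

```python
def expected_cost(fp_rate, fn_rate, n_legit, n_fraud, cost_fp, cost_fn):
    """Expected cost: false positives on legit traffic plus missed fraud."""
    return fp_rate * n_legit * cost_fp + fn_rate * n_fraud * cost_fn


# Hypothetical candidates: fp_rate on clean traffic, fn_rate on adversarial fraud.
candidates = {
    "standard training": {"fp_rate": 0.010, "fn_rate": 0.30},
    "adv. training, small eps": {"fp_rate": 0.015, "fn_rate": 0.15},
    "adv. training, large eps": {"fp_rate": 0.040, "fn_rate": 0.10},
}
# Hypothetical monthly volumes and per-event costs.
volumes = dict(n_legit=100_000, n_fraud=500, cost_fp=5.0, cost_fn=400.0)

costs = {name: expected_cost(**rates, **volumes)
         for name, rates in candidates.items()}
best = min(costs, key=costs.get)
```

With these made-up inputs the small-epsilon model wins: the large-epsilon model catches more fraud, but its extra false positives cost more than the fraud it prevents.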

The fifth pitfall is ignoring LLM-specific attack vectors. Token-level attacks like GCG (Greedy Coordinate Gradient) and prompt injection are the 2024-2026 frontier. At Anthropic, OpenAI, or Mistral, expect the interviewer to pivot from image FGSM to "how would you defend a chat model against jailbreaks". The same threat-model framework applies, but the perturbation space is discrete tokens, not pixels.

If you want to drill questions exactly like this one — threat models, FGSM derivations, defense trade-offs — NAILDD has 1,500+ ML interview problems organized by topic, with model answers from FAANG hiring managers.

FAQ

Is this an officially-endorsed technique?

No. The methods covered here come from peer-reviewed papers: Szegedy et al. 2013 for the original observation, Goodfellow et al. 2014 for FGSM, Madry et al. 2017 for PGD and adversarial training, Cohen et al. 2019 for randomized smoothing, and Croce & Hein 2020 for AutoAttack. Cite the papers, not blog posts, when an interviewer asks.

How important is adversarial robustness for non-security ML roles?

Less important for offline analytics and recommendation systems, very important for fraud, content moderation, biometrics, autonomous systems, and anything in the LLM safety space. If the role description mentions trust and safety, security ML, or model evaluation, expect at least one adversarial question. For a vanilla recommender role, a one-paragraph high-level answer is usually enough.

What is the difference between adversarial examples and out-of-distribution inputs?

Both fool the model, but adversarial examples are deliberately optimized, using the model's gradients or its query responses, to maximize loss inside a small budget around a clean input, while OOD inputs are natural samples from a different distribution. Adversarial defenses do not transfer well to OOD detection and vice versa. Mentioning this distinction unprompted scores points.

How do you measure adversarial robustness reliably?

Use a strong, standardized attack suite like AutoAttack at a fixed epsilon, report robust accuracy and clean accuracy together, and include at least one transfer attack from an independently-trained model to catch gradient masking. RobustBench publishes a leaderboard with this protocol — pointing to it is a clean interview answer.

Can you defend an LLM the same way you defend an image classifier?

Partially. The threat-model framework transfers — goal, knowledge, budget, access — but the optimization space is discrete tokens, which breaks PGD. Current LLM defenses lean on RLHF refusal training, system-prompt hardening, output filtering, and constitutional AI-style critique loops. None of them are airtight, which is why red-teaming is a full-time job at every frontier lab.

Should I bring up adversarial robustness in a system design round?

Yes, briefly, if the system is user-facing and decisions matter. A one-sentence mention — "we would adversarially train against PGD at epsilon matched to our threat model and rate-limit the API" — signals maturity without derailing the round.