Anomaly detection in the DS interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why anomaly detection shows up on every loop

If you interview for a Data Scientist role at Stripe, DoorDash, Meta, or any company touching payments, fraud, security, or infrastructure telemetry, expect at least one anomaly detection question. The topic hits applied ML, statistics, and product judgement in one prompt — interviewers love it because weak candidates either over-engineer (a deep autoencoder for 50 CSV rows) or under-engineer (a hardcoded threshold on a multivariate problem).

The interviewer is checking three things: can you name the method family that fits the problem, can you defend the trade-off (false positives cost analyst time, false negatives cost chargebacks), and can you handle the unsupervised case with no labels. Treat anomaly detection as an applied estimation problem, not a library call.

This guide covers the four method families to discuss fluently — statistical baselines, Isolation Forest, LOF, autoencoders — plus the evaluation traps and the five mistakes that tank candidates every week.

Types of anomalies and detection modes

Before any model, name the anomaly type. Interviewers will throw a vague scenario at you ("we see weird payment patterns") and the first move is classification:

  • Point anomaly. A single observation deviates from the rest. A $1M transaction on an account whose median is $50 is the canonical example. Z-score and IQR handle this case well in one dimension.
  • Contextual anomaly. The value is normal on its own but abnormal in context. A 5,000-request burst is fine at noon for a B2C app but pathological at 3am on Christmas Eve. You need time-of-day and seasonality features before any model will help.
  • Collective anomaly. No single point is unusual, but a sequence is. A DDoS attack consists of millions of valid HTTP requests; the anomaly is the joint pattern. This is where sequential autoencoders and HMMs earn their keep.
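To make the point-versus-contextual distinction concrete, here is a minimal sketch on made-up hourly traffic. The traffic shape, the 10% noise level, and the helper names `global_z` and `contextual_z` are illustrative assumptions, not taken from any real system. It scores the same 5,000-request burst three ways: against all hours pooled, against other noon hours, and against other 3am hours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical telemetry: 14 days of hourly request counts,
# busy (about 5,000/hour) from 09:00 to 21:00 and quiet (about 200/hour) otherwise.
hours = np.tile(np.arange(24), 14)
baseline = np.where((hours >= 9) & (hours <= 21), 5000.0, 200.0)
requests = baseline * (1 + 0.1 * rng.normal(size=hours.size))

# Global view: one mean and one standard deviation for everything.
global_mu, global_sd = requests.mean(), requests.std()

# Contextual view: a separate mean and standard deviation per hour of day.
hour_mu = np.array([requests[hours == h].mean() for h in range(24)])
hour_sd = np.array([requests[hours == h].std() for h in range(24)])

def global_z(value):
    return (value - global_mu) / global_sd

def contextual_z(value, hour):
    return (value - hour_mu[hour]) / hour_sd[hour]

burst = 5000.0
print("global z:", round(global_z(burst), 2))
print("z at noon:", round(contextual_z(burst, 12), 2))
print("z at 3am:", round(contextual_z(burst, 3), 2))
```

The pooled Z-score and the noon Z-score both stay well inside the three-sigma band, while the 3am score is far outside it. The model is identical in all three cases; only the conditioning feature changed.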

Then declare the detection mode:

Mode            | Labels available      | Typical use                           | Common method
Unsupervised    | None                  | Most real-world problems              | Isolation Forest, LOF, AE
Supervised      | Both classes labeled  | Mature fraud teams with feedback loop | XGBoost, LightGBM
Semi-supervised | Only "normal" labeled | Industrial sensors, log monitoring    | One-class SVM, AE

Most interview prompts are unsupervised by default — if the interviewer says "we don't have labels," that is the hint.

Statistical methods

These are your opening move. Always mention them before jumping to trees or neural nets, because they answer "what is normal?" in one line and serve as the honest baseline the interviewer expects.

Z-score. Treat each feature as approximately Gaussian, then flag anything more than three standard deviations from the mean.

import numpy as np
z = (x - x.mean()) / x.std()
anomalies = np.abs(z) > 3

IQR (interquartile range). Robust to outliers in a way Z-score is not. Flag points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
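The robustness claim is easy to demonstrate on a toy sample. The eight amounts below are invented for illustration. With the population standard deviation that NumPy uses by default, no |z| in a sample of n points can exceed sqrt(n - 1), so on eight points the three-sigma rule cannot fire at all, however extreme the outlier is.

```python
import numpy as np

# Hypothetical transaction amounts: seven ordinary values and one extreme one.
x = np.array([47.0, 48.0, 49.0, 50.0, 50.0, 51.0, 52.0, 1000.0])

# Z-score rule: the extreme value inflates both the mean and the standard deviation.
z = (x - x.mean()) / x.std()
z_flags = np.abs(z) > 3

# IQR rule: quartiles barely move when one value is extreme.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print("max |z|:", np.abs(z).max())
print("z-score flags:", z_flags)
print("IQR flags:", iqr_flags)
```

The Z-score rule flags nothing, because the outlier masks itself by inflating the standard deviation. The IQR rule flags only the 1000.0 entry.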

Mahalanobis distance. Extends the Z-score idea to multivariate data by accounting for the covariance structure. Useful when two features are individually fine but the combination is suspicious — high purchase amount with brand-new device fingerprint, for example.

import numpy as np
from scipy.spatial.distance import mahalanobis

mu = X.mean(axis=0)
inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))   # pinv tolerates a near-singular covariance
d = np.array([mahalanobis(row, mu, inv_cov) for row in X])
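A small synthetic check of the "individually fine, jointly suspicious" case, using NumPy only. The two correlated features and the appended point (2, -2) are assumptions made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two strongly correlated features.
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b])

# Append one point whose coordinates are each unremarkable
# but which breaks the correlation between the two features.
X = np.vstack([X, [2.0, -2.0]])

# Per-feature Z-scores for the appended point.
z_last = (X[-1] - X.mean(axis=0)) / X.std(axis=0)

# Mahalanobis distance of every row from the mean.
diff = X - X.mean(axis=0)
inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

print("per-feature z of last row:", z_last)
print("Mahalanobis of last row:", d[-1])
print("largest Mahalanobis among the other rows:", d[:-1].max())
```

Both per-feature Z-scores of the appended row sit near 2 in absolute value, so a univariate rule passes it. Its Mahalanobis distance is several times larger than that of any other row, because the point lies across the correlation rather than along it.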

Sanity check: if you can solve the prompt with Z-score or IQR and a single sensible feature, say so. Recommending Isolation Forest for a 1D problem with 200 rows is a classic over-engineering tell.

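The downsides are real, though. Z-score assumes roughly Gaussian data, and its mean and standard deviation are themselves dragged by the outliers it is meant to catch; IQR is univariate; Mahalanobis captures only linear relationships between features and breaks on small samples where the covariance matrix is ill-conditioned.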

Isolation Forest

Introduced by Liu et al. (2008), this is the algorithm interviewers most want you to explain in your own words. The trick is elegant: anomalies are easier to isolate via random partitioning than normal points, because they live in low-density regions of the feature space.

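The algorithm builds an ensemble of random trees, each on a small random subsample. At each node it picks a feature at random and a split value at random within that feature's range. Points that end up in shallow leaves, meaning a short isolation path, are anomalies. The anomaly score is derived from the average path length h(x) across the forest, normalised by c(n), the expected path length of an unsuccessful search in a binary search tree with n points: s(x) = 2^(-E[h(x)] / c(n)). Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points.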
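If you are asked to whiteboard the mechanism, a stripped-down sketch like the one below is usually enough. It is a simplification and not the scikit-learn implementation: it draws a fresh random path for the query point each time instead of building reusable trees, it skips subsampling and the height limit, and the function names are made up for illustration. The normalising constant follows the c(n) formula from Liu et al.

```python
import math
import numpy as np

def isolation_depth(x, X, rng, max_depth=50):
    """Number of random axis-aligned splits before x (a row of X) sits alone."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        j = rng.integers(X.shape[1])              # random feature
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo < hi:
            split = rng.uniform(lo, hi)           # random split value within the range
            keep = X[:, j] < split if x[j] < split else X[:, j] >= split
            X = X[keep]                           # follow x down its branch
        depth += 1                                # simplification: a constant feature still costs a level
    return depth

def expected_depth(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n):
    return 2.0 ** (-mean_depth / expected_depth(n))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(255, 2)), [6.0, 6.0]])   # 255 inliers plus one far-away point
inlier = X[np.argmin(np.linalg.norm(X[:-1], axis=1))]    # the row closest to the origin
outlier = X[-1]

depth_in = np.mean([isolation_depth(inlier, X, rng) for _ in range(200)])
depth_out = np.mean([isolation_depth(outlier, X, rng) for _ in range(200)])
score_in = anomaly_score(depth_in, len(X))
score_out = anomaly_score(depth_out, len(X))

print("mean depth  inlier:", depth_in, " outlier:", depth_out)
print("score       inlier:", score_in, " outlier:", score_out)
```

The far-away point is isolated after far fewer splits than the central one, so its score lands closer to 1. That gap in average depth is the whole algorithm.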

from sklearn.ensemble import IsolationForest

clf = IsolationForest(
    n_estimators=200,
    contamination=0.05,
    max_samples="auto",
    random_state=42,
)
clf.fit(X_train)
preds = clf.predict(X_test)              # -1 = anomaly, 1 = normal
scores = clf.decision_function(X_test)   # lower = more anomalous; negative values are the flagged ones

Properties to mention out loud:

  • No distributional assumption — works on non-Gaussian, mixed-type numeric features.
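  • Cheap to train and score. Training costs roughly O(t * psi * log(psi)), where t is trees and psi is subsample size, so it does not grow with N once psi is fixed; scoring N points costs roughly O(N * t * log(psi)), which is linear in N.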
  • Handles high dimensions better than density methods.
  • Tuning is mostly two knobs: contamination (expected anomaly fraction) and n_estimators. Defaults usually get you 80% of the way.

The current industry default for tabular unsupervised anomaly detection is Isolation Forest, full stop. If you forget the name on the spot, describe the mechanism — random splits, short paths, ensemble averaging — and you will still get most of the credit.

LOF (Local Outlier Factor)

LOF is density-based: a point is anomalous if its local density is meaningfully lower than the local density of its k-nearest neighbours. The output is a ratio — values close to 1 are normal, values much greater than 1 are anomalies.

from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
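preds = clf.fit_predict(X)               # -1 = anomaly, 1 = normal; scores the fitted data itself
scores = -clf.negative_outlier_factor_   # LOF values: ~1 is normal, much greater than 1 is anomalous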

Where LOF shines is clusters of varying density. Imagine a fraud dataset where one merchant has 10,000 transactions per day and another has 200 — a global threshold will misbehave. LOF compares each point to its own neighbourhood, so it adapts.

The cost is computational. LOF is roughly O(N²) without spatial indexing, which makes it painful past a few hundred thousand rows. In the interview, if you are asked "why not LOF in production?", that quadratic cost is your answer.
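Both properties, the density ratio and the quadratic cost, show up in a from-scratch version. The sketch below is a plain NumPy rendering of the standard LOF definition that ignores distance ties; the function name `lof_scores` and the two synthetic clusters are assumptions for illustration, and scikit-learn's implementation adds spatial indexing on top.

```python
import numpy as np

def lof_scores(X, k=5):
    n = len(X)
    # Full pairwise distance matrix: this is the O(N^2) step.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                   # a point is not its own neighbour
    nn = np.argsort(D, axis=1)[:, :k]             # indices of the k nearest neighbours
    k_dist = D[np.arange(n), nn[:, -1]]           # distance to the k-th neighbour
    # Reachability distance from p to neighbour o: max(k_dist(o), d(p, o)).
    reach = np.maximum(k_dist[nn], D[np.arange(n)[:, None], nn])
    lrd = 1.0 / reach.mean(axis=1)                # local reachability density
    return lrd[nn].mean(axis=1) / lrd             # neighbours' density over own density

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(50, 2))    # tight cluster
sparse = rng.normal(loc=10.0, scale=2.0, size=(50, 2))  # spread-out cluster
probe = np.array([[0.7, 0.7]])                          # just outside the tight cluster
X = np.vstack([dense, sparse, probe])

lof = lof_scores(X, k=5)
print("LOF of probe:", lof[-1])
print("median LOF of the rest:", np.median(lof[:-1]))
```

The probe sits less than one unit from the tight cluster, a gap that would look ordinary by the sparse cluster's standards, yet it gets the highest LOF in the dataset because its own neighbours are packed far more densely than it is.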


Autoencoders

The deep learning option. You train an autoencoder on data assumed to be mostly normal. After training, the network reconstructs normal points with low error and reconstructs anomalies poorly, so reconstruction error becomes the anomaly score.

import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(d_hidden, d_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

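# After training on normal data (model: a trained AE, x: float tensor of shape [n_rows, d_in]):
model.eval()
with torch.no_grad():                      # inference only, no gradients needed
    recon = model(x)
error = ((x - recon) ** 2).mean(dim=1)     # per-row reconstruction MSE
anomaly = error > threshold                # threshold tuned on held-out data, not hardcoded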
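To see reconstruction error working without a training loop, here is a deliberately simplified stand-in: a linear "autoencoder" computed with PCA via SVD. This is a swapped-in technique, not a neural network, chosen because a linear autoencoder trained with squared error recovers the same subspace. The data, the two-unit bottleneck, and the 99.5th-percentile threshold are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "normal" data: 5 observed features driven by 2 latent factors plus small noise.
W = rng.normal(size=(2, 5))
X_train = rng.normal(size=(500, 2)) @ W + 0.05 * rng.normal(size=(500, 5))

# Linear stand-in for an autoencoder: encode = project onto the top-2
# principal directions, decode = map back to the 5 original features.
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
components = Vt[:2]                                  # shape (2, 5): the bottleneck

def reconstruction_error(X):
    centred = X - mu
    recon = centred @ components.T @ components
    return ((centred - recon) ** 2).mean(axis=1)     # per-row MSE

# Threshold taken from the training-error distribution (the percentile is an arbitrary choice).
threshold = np.percentile(reconstruction_error(X_train), 99.5)

normal_point = mu + np.array([[1.0, -0.5]]) @ W      # follows the latent structure
odd_point = mu + 3.0 * Vt[-1][None, :]               # pushed off the learned subspace
errs = reconstruction_error(np.vstack([normal_point, odd_point]))

print("threshold:", threshold)
print("error normal:", errs[0], " error odd:", errs[1])
```

The point that respects the latent structure reconstructs almost perfectly; the point pushed off the subspace does not, and only that one crosses the threshold. A neural autoencoder applies the same logic with a non-linear encoder and decoder.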

Variants worth naming on a senior loop:

  • Denoising AE. Inject noise into inputs during training; forces the encoder to learn robust features.
  • VAE. Probabilistic — the latent representation is a distribution, and you can compute likelihood-based scores.
  • Sequential / LSTM AE. Built for time series and log sequences where collective anomalies matter more than point anomalies.

Autoencoders dominate when data is high-dimensional and structured — images, raw sensor traces, sequences. For 30-column tabular fraud data, Isolation Forest will almost always match them at a fraction of the engineering cost.

Method comparison

A clean way to answer "which method would you choose?" is to walk this table out loud:

Method           | Best for                     | N scaling       | Assumptions                   | Tuning effort
Z-score / IQR    | 1D, quick baselines          | O(N)            | Gaussian or near-symmetric    | Almost none
Mahalanobis      | Low-dim multivariate, linear | O(d²) per point | Multivariate normal           | Low
Isolation Forest | General tabular              | O(N · t)        | Almost none                   | Low–medium
LOF              | Variable-density clusters    | O(N²)           | Local density meaningful      | Medium
Autoencoder      | High-dim, sequences, images  | O(epochs · N)   | Enough "normal" training data | High
One-class SVM    | Small clean datasets         | O(N²) – O(N³)   | Boundary in feature space     | High

Load-bearing rule: state your baseline (statistical) and your production candidate (Isolation Forest by default, AE if the data is high-dimensional or sequential). Naming only one method is a red flag.

Evaluation metrics

With labels, evaluate like any imbalanced classifier:

  • Precision / Recall / F1 — pick based on cost asymmetry. Card fraud teams usually anchor on precision at fixed recall because every false positive triggers a customer service ticket.
  • ROC-AUC vs PR-AUC. PR-AUC is the honest metric when anomalies are ~1% or rarer, because ROC-AUC stays optimistic on extreme imbalance. Always quote both.
  • Precision @ K. If the fraud analyst team can manually review K = 500 alerts per day, then precision among the top 500 scored items is what management cares about. Beats threshold-tuned F1 in any operations-heavy team.
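The three metrics above fit in a few lines of NumPy, which also makes the ROC-versus-PR gap visible. The ranking below is synthetic and constructed by hand (1% positives, with 50 negatives scoring above every positive); the function names are illustrative, and scikit-learn offers equivalents if you prefer library calls.

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of true anomalies among the k highest-scored items."""
    top = np.argsort(-scores)[:k]
    return labels[top].mean()

def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties ignored)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true positive (a PR-AUC estimate)."""
    ranked = labels[np.argsort(-scores)]
    precision = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
    return precision[ranked == 1].mean()

# Synthetic ranking: 1,000 negatives, 10 positives, 50 negatives outscore every positive.
neg_scores = np.arange(1000) / 1000.0
pos_scores = 0.9495 + 1e-5 * np.arange(10)
scores = np.concatenate([neg_scores, pos_scores])
labels = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])

print("ROC-AUC:", roc_auc(scores, labels))
print("Average precision:", average_precision(scores, labels))
print("Precision@50:", precision_at_k(scores, labels, 50))
print("Precision@60:", precision_at_k(scores, labels, 60))
```

On this ranking ROC-AUC is 0.95, which sounds excellent, while average precision is roughly 0.10 and precision among the top 50 alerts is zero. An analyst team with a 50-alert budget would find nothing, which is exactly why the operational metric matters.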

Without labels, evaluation is genuinely hard. The practical answer is: manual review of the top-N suspected anomalies, comparison against a heuristic baseline, and stability over time. You will not get a clean PR curve, and saying so is a strength, not a weakness.

Common pitfalls

The interview is mostly won or lost here. These are the five mistakes that show up week after week.

Training on mixed normal-plus-anomaly data. Most candidates assume their training set is clean. In reality, fraud labels are noisy and a 1–5% contamination is already in the data. Isolation Forest tolerates this reasonably well because each tree sees only a small random subsample, and the contamination parameter then sets where the alert threshold falls. Autoencoders are more fragile: given enough contaminated examples they learn to reconstruct anomalies as easily as normal points and become useless. The fix is to filter known anomalies out of the training set using whatever labels exist, or use semi-supervised variants designed for contamination.

Picking a threshold arbitrarily. A scored output is not an alerting system. Candidates often say "I flag anything above 0.9" without explaining where 0.9 came from. The right answer is threshold-tuning on a validation set against an operational constraint — a target precision or a fixed daily alert budget.
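Both operational constraints translate directly into code. The sketch below assumes a small labeled validation set; the helper names and the eight example scores are made up for illustration, and ties between scores are ignored.

```python
import numpy as np

def threshold_for_budget(scores, budget):
    """Score of the budget-th highest item: flagging score >= this yields `budget` alerts."""
    ranked = np.sort(scores)[::-1]
    return ranked[budget - 1]

def threshold_for_precision(scores, labels, target):
    """Lowest threshold whose flagged set (score >= threshold) reaches the target precision."""
    order = np.argsort(-scores)
    hits = np.cumsum(labels[order])
    precision = hits / np.arange(1, len(scores) + 1)
    ok = np.nonzero(precision >= target)[0]
    return scores[order][ok[-1]] if len(ok) else None

# Hypothetical validation set: anomaly scores with confirmed labels (1 = fraud).
val_scores = np.array([0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40])
val_labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])

print(threshold_for_budget(val_scores, budget=3))                   # 0.9
print(threshold_for_precision(val_scores, val_labels, target=0.7))  # 0.8
```

With a budget of three alerts the cut-off is the third-highest score. With a 70% precision target the cut-off drops to 0.8, where three of the four flagged items are confirmed fraud. Either way, the number comes from a stated constraint rather than from intuition.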

Reaching for univariate methods on multivariate problems. Each individual feature passes Z-score, but the combination is suspicious. A purchase of $80 is fine. A new device fingerprint is fine. The two together at 4am from a country the account has never logged in from is fraud. Multivariate methods — Mahalanobis, Isolation Forest, AE — exist for this case. If the prompt hints at multiple features, do not stay in 1D.

Ignoring context features. Anomalies are usually contextual. A Friday-evening transaction differs from a Monday-morning one; a logged-in session differs from a guest checkout. Engineering hour-of-day, day-of-week, channel, and tenure features before the model often improves AUC more than swapping algorithms does. Highest-leverage move on most real fraud datasets.

No drift monitoring. Normal behaviour shifts over months. A February model silently degrades by June. The fix: recompute the score distribution weekly, compare against the training-time distribution with a KS test or PSI, alert on shift, retrain on a rolling window. Few candidates mention this. Doing so signals seniority.
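A minimal PSI check on the score distribution might look like this. The decile binning, the epsilon floor, and the synthetic "weekly" samples are assumptions for the sketch; the 0.1 and 0.25 cut-offs often quoted for PSI are rules of thumb rather than statistical guarantees.

```python
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`, on reference quantile bins."""
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_frac = np.bincount(np.searchsorted(inner_edges, reference), minlength=n_bins) / len(reference)
    cur_frac = np.bincount(np.searchsorted(inner_edges, current), minlength=n_bins) / len(current)
    ref_frac = np.clip(ref_frac, eps, None)       # avoid log(0) on empty bins
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, size=5000)    # score distribution at training time
stable_week = rng.normal(0.0, 1.0, size=5000)     # same behaviour, new sample
drifted_week = rng.normal(1.0, 1.0, size=5000)    # mean shifted by one standard deviation

psi_stable = psi(train_scores, stable_week)
psi_drifted = psi(train_scores, drifted_week)
print("PSI stable week:", psi_stable)
print("PSI drifted week:", psi_drifted)
```

The stable week produces a PSI close to zero and the shifted week a value far above the usual alert level. In production the same function would run on each week's scores and feed the retraining trigger.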

If you want to drill DS interview questions like this every day, NAILDD ships unsupervised-learning prompts, autoencoder design questions, and fraud-style case studies from real recent loops.

FAQ

If labels exist, is supervised always better than unsupervised?

Usually yes — a tuned XGBoost or LightGBM on labeled fraud will beat any unsupervised model on the labeled subset. The catch is that labels are almost always incomplete: you have confirmed fraud for cases the team caught, but the false negatives are invisible by definition. A hybrid approach is now the norm: supervised model for the well-understood patterns, unsupervised layer to catch novelty, manual review for the disagreements.

Is Isolation Forest sensitive to feature scaling?

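Barely. Each split is made on a single feature, at a value drawn uniformly within that feature's observed range, so rescaling a feature linearly (standardising, min-max) leaves the tree structure and the scores essentially unchanged. Among the other methods, LOF with Euclidean distance is strongly scale-sensitive, while Mahalanobis distance is scale-invariant by construction because it normalises by the covariance. What does change Isolation Forest results is a non-linear transform, such as taking the log of a heavy-tailed amount, because it changes where the uniform splits land. Standardising or robust-scaling is still a sensible default to mention in the interview: it costs nothing here and is required the moment LOF or a neural model shares the pipeline.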

How do you set the contamination parameter?

If you have a rough business estimate, such as "ops thinks ~1% of transactions are fraud", use that. Otherwise, start with 0.01 to 0.05 and validate by inspecting the top scored items. Setting contamination too high inflates false positives; too low makes the decision threshold conservative and misses real anomalies. In scikit-learn it only moves the cut-off applied to the scores and does not change the fitted trees, so treat it as a calibration knob, not a learned model parameter.
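A quick way to convince yourself that contamination is a threshold and not a modelling choice is to fit two forests that differ only in that argument. The sketch assumes scikit-learn is installed and uses synthetic Gaussian data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

# Same seed, same trees; only the contamination argument differs.
low = IsolationForest(n_estimators=100, contamination=0.01, random_state=42).fit(X)
high = IsolationForest(n_estimators=100, contamination=0.05, random_state=42).fit(X)

same_scores = np.allclose(low.score_samples(X), high.score_samples(X))
n_low = int((low.predict(X) == -1).sum())
n_high = int((high.predict(X) == -1).sum())

print("raw scores identical:", same_scores)
print("flagged at 1%:", n_low, " flagged at 5%:", n_high)
```

The raw scores from `score_samples` match, while the number of flagged rows tracks the contamination value, roughly 10 versus roughly 50 out of 1,000. In practice you can fit once and tune the cut-off on the scores directly.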

When should you actually reach for a deep autoencoder?

Three conditions need to hold: the data is high-dimensional (hundreds of features, or images, or raw sequences), you have lots of mostly-clean training data (tens of thousands of samples minimum), and simpler methods have demonstrably underperformed. For 95% of tabular interview prompts, the right answer is Isolation Forest with engineered context features.

How do you communicate the result to a non-technical PM?

Translate anomaly scores into alert counts and precision at the team's review capacity. "At our current threshold we generate 200 alerts per day, of which roughly 60 are confirmed fraud — a precision of 30%, which is in line with industry benchmarks for unsupervised flagging." This frames the model as an operational input, not a research artefact, and it is the kind of answer that gets staff-level candidates hired.