Self-supervised CV in data science interviews
Why SSL for vision shows up in DS interviews
An interviewer at a self-driving team or a medical imaging startup will almost always ask the same opener: you have 10 million unlabeled images and budget for only 5,000 labeled ones — what do you do? The wrong answer is "fine-tune a ResNet from ImageNet weights." The right answer leans on self-supervised pre-training, then a small supervised head. This is the load-bearing trick of modern computer vision in 2026, and interviewers test whether you understand why it works, not just the names of the papers.
The second reason SSL keeps appearing in loops at Meta, Tesla, OpenAI, and Anthropic is that DINOv2 features are now strong enough to be used frozen for downstream tasks (segmentation, depth estimation, retrieval) without any task-specific fine-tuning of the backbone, while MAE remains the cheap way to pre-train a large ViT that you intend to fine-tune. A candidate who can compare contrastive vs masked vs distillation objectives, and explain which one to pick for a noisy industrial dataset, signals senior judgement. Most candidates can't, which makes it a cheap filter.
Load-bearing trick: SSL works because pre-training on huge unlabeled data forces the encoder to learn invariances (to crop, color, occlusion) that supervised ImageNet pre-training is never explicitly trained for. The downstream labeled set is then only needed to train the task head, not the representation.
The five methods you must know
SimCLR — the contrastive baseline
Take one image, apply two random augmentations (crop, color jitter, blur), push both views through the same encoder, then pull their embeddings together while pushing all other images in the batch apart. The loss is InfoNCE with a temperature of roughly 0.1–0.5. The catch is that you need a very large batch: the original paper used 4,096, and quality drops off fast below 512 because a small batch supplies too few negatives, and most of them are easy. This is also why SimCLR fell out of fashion the moment people wanted to train on a single 8-GPU node.
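```python
import torch
import torch.nn.functional as F

# Conceptual InfoNCE for two views z_i, z_j of the same N images
def info_nce(z_i, z_j, tau=0.1):
    n = z_i.shape[0]
    # L2-normalize so the dot products below really are cosine similarities
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)  # 2N x d
    sim = (z @ z.T) / tau                                  # 2N x 2N similarity logits
    sim.fill_diagonal_(float("-inf"))                      # ignore self-similarity
    # Row k's positive is the other view of the same image: k + N, then k - N
    idx = torch.arange(n, device=z.device)
    targets = torch.cat([idx + n, idx])
    return F.cross_entropy(sim, targets)
```
MoCo v3 — contrastive without the batch-size pain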
MoCo replaces the in-batch negatives with a queue of momentum-encoded features from previous batches. The key encoder is a slow exponential moving average of the query encoder (momentum coefficient around 0.99), which keeps the queue features consistent over training. MoCo v3 drops the explicit queue, uses a ViT backbone, and adds a prediction head — it's effectively the production default whenever the team likes contrastive objectives.
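The momentum update described above can be sketched in a few lines. This is a minimal illustration, not MoCo's reference code: the function name `ema_update`, the toy `nn.Linear` encoder, and the artificial "+1.0" weight shift are all assumptions made for the example, and buffers such as BatchNorm running statistics are deliberately ignored.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(query_encoder: nn.Module, key_encoder: nn.Module, momentum: float = 0.99) -> None:
    """In place: key <- momentum * key + (1 - momentum) * query."""
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.mul_(momentum).add_(q_param, alpha=1.0 - momentum)

# Toy stand-in encoder (hypothetical; a real run would use a ResNet or ViT).
query_encoder = nn.Linear(4, 2)
key_encoder = copy.deepcopy(query_encoder)
for p in key_encoder.parameters():
    p.requires_grad_(False)  # the key encoder is never updated by backprop

# Pretend one optimizer step moved every query weight by +1.0.
with torch.no_grad():
    for p in query_encoder.parameters():
        p.add_(1.0)

ema_update(query_encoder, key_encoder, momentum=0.99)

# The key encoder closed only 1% of the gap, so the remaining gap is about 0.99.
gap = (query_encoder.weight - key_encoder.weight).abs().max().item()
print(f"remaining gap after one EMA step: {gap:.4f}")
```

With a momentum of 0.99 the key encoder averages over roughly the last 100 query updates, which is what keeps its features consistent from step to step.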
MAE — the masked autoencoder
Split an image into patches of 16×16 pixels, randomly mask 75% of them, pass only the visible 25% through a heavy ViT encoder, and reconstruct the masked patches with a lightweight decoder. The asymmetry is what makes it cheap: the encoder only ever sees a quarter of each image, so you can train a ViT-Huge on ImageNet-1k with a budget that would barely cover a SimCLR ViT-Base run. With a ViT-L, linear probe on ImageNet hits ~76% top-1 and full fine-tuning reaches ~86%.
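The masking step is easier to reason about in code. The sketch below shows one common way to do per-sample random masking by sorting random noise; the function name `random_masking`, the 768-dimensional patch embeddings, and the 224×224 input size are assumptions for illustration, and the encoder and decoder themselves are left out.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (B, L, D). Returns visible patches, a mask (1 = masked), and restore indices."""
    B, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = torch.rand(B, L, device=patches.device)
    ids_shuffle = torch.argsort(noise, dim=1)        # a random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation, used by the decoder

    # Keep the first len_keep patches of the shuffled order; only these reach the encoder.
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Build the mask in shuffled order, then map it back to the original patch order.
    mask = torch.ones(B, L, device=patches.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore

# A 224x224 image cut into 16x16-pixel patches gives 14 * 14 = 196 patches.
patches = torch.randn(2, 196, 768)
visible, mask, ids_restore = random_masking(patches)
print(visible.shape)     # the encoder input: 49 of 196 patches per image
print(mask.sum(dim=1))   # 147 masked patches per image
```

Because the encoder input shrinks from 196 to 49 tokens, self-attention cost in the encoder falls by far more than 4x, which is where the compute saving comes from.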
DINO and DINOv2 — self-distillation
A student network is trained to match the output of a teacher network whose weights are an EMA of the student. No labels, no negatives, no reconstruction: just a softmax cross-entropy between the teacher's and the student's outputs on two views of the same image, with the teacher output centered and sharpened (a low softmax temperature) to prevent collapse. DINOv2 scaled this to 142M curated images and produced the backbone most people actually use in production today; the emergent segmentation maps in the self-attention heads are what convinced everyone the method was special.
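A rough sketch of the self-distillation loss for a single pair of views follows. The function names, the temperatures `tau_s=0.1` and `tau_t=0.04`, the center momentum of 0.9, and the 16-dimensional output are illustrative assumptions; multi-crop and the EMA teacher update are omitted.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student."""
    # Centering subtracts a running mean; the low teacher temperature sharpens the target.
    teacher_probs = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    student_logp = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of teacher outputs, used to center the next batch."""
    return momentum * center + (1 - momentum) * teacher_logits.mean(dim=0, keepdim=True)

torch.manual_seed(0)
K = 16                                            # output dimension (assumed)
student_logits = torch.randn(8, K, requires_grad=True)
teacher_logits = torch.randn(8, K)                # teacher sees a different view
center = torch.zeros(1, K)

loss = dino_loss(student_logits, teacher_logits, center)
loss.backward()                                   # gradients reach the student only
center = update_center(center, teacher_logits)
```

Centering alone would push the teacher toward a uniform output and sharpening alone toward a single dominant dimension; the two are used together because they fail in opposite directions.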
BYOL — the "no negatives" surprise
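BYOL showed that you could drop the negatives entirely if you used a predictor head and a momentum target encoder. For a while the community was unsure why it didn't collapse to a constant. An early hypothesis was that batch normalization implicitly injects negatives, but follow-up work found that BYOL still trains with batch-independent normalization; the more widely accepted explanation is the asymmetry between the two branches, meaning the predictor on the online side and the stop-gradient on the momentum target. BYOL is rarely the production choice in 2026, but interviewers love it because the "why does it not collapse?" discussion separates surface knowledge from real understanding.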
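For the whiteboard version of this discussion, it helps to have the loss itself in mind. The sketch below is a minimal, assumption-laden version: `byol_loss` is a hypothetical name, the inputs stand in for the online predictor output and the momentum target's projection, and the symmetrized second term (swapping the two views) is left out.

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    """Normalized MSE between online prediction and target projection: 2 - 2 * cosine."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)  # target is a constant: no gradient flows
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

torch.manual_seed(0)
p = torch.randn(4, 8, requires_grad=True)  # stand-in for predictor(online_encoder(view_1))
z = torch.randn(4, 8, requires_grad=True)  # stand-in for target_encoder(view_2)

loss = byol_loss(p, z)
loss.backward()

same = byol_loss(p, p).item()       # identical directions give the minimum, about 0
opposite = byol_loss(p, -p).item()  # opposite directions give the maximum, about 4
```

Note that there is no term pushing different images apart; the only things standing between this loss and a constant solution are the predictor on one branch and the detached, slowly moving target on the other.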
Method comparison
The table interviewers want you to be able to sketch on the whiteboard:
| Method | Objective | Needs negatives? | Typical batch | Linear probe (IN-1k) | Best for |
|---|---|---|---|---|---|
| SimCLR | Contrastive (InfoNCE) | Yes (in-batch) | 4,096+ | ~69% (RN50) | Teaching the idea; small experiments |
| MoCo v3 | Contrastive + momentum | Yes (queue / in-batch) | 1,024–4,096 | ~76% (ViT-B) | Contrastive on ViT with limited GPUs |
| MAE | Masked reconstruction | No | 1,024–4,096 | ~76% (ViT-L) | ViT pre-training when fine-tuning later |
| DINO / DINOv2 | Self-distillation (EMA) | No | 1,024+ | ~86% (ViT-L, DINOv2) | Frozen features, dense prediction |
| BYOL | Bootstrap (no negatives) | No | 4,096 | ~74% (RN50) | Whiteboard discussion; rarely production |
A few values to keep in your head: DINOv2 ViT-L frozen beats ImageNet supervised ViT-L fine-tuned on most dense-prediction transfer benchmarks. MAE is the cheapest large-ViT pre-training because the encoder only sees 25% of the input. SimCLR is the only method here that is genuinely batch-size hungry.
Sanity check: if your team is on a single 8×A100 node and wants a ViT backbone for downstream segmentation, the default in 2026 is MAE pre-train then DINO-style fine-tune, or just download DINOv2 weights and freeze them.
Production workflow
The standard 2026 pipeline at companies like Tesla, Snowflake's vision team, or any medical imaging startup with <50k labeled images looks like this. First, you either download DINOv2 weights or run a domain-specific SSL pre-training on 1M–10M unlabeled in-domain images — satellite tiles, chest X-rays, retail shelf photos. The compute for the second option is roughly 2,000–8,000 GPU-hours for a ViT-L on MAE, which translates to $4k–$16k on spot pricing. Second, you fine-tune a small task head on the labeled set, often with the backbone frozen for the first few epochs to avoid catastrophic forgetting. Third, you evaluate against the supervised-only baseline; if SSL pre-training doesn't beat it by at least 2–3 absolute points on your metric, your unlabeled data is probably not representative.
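The budget figures above are simple arithmetic and worth being able to redo live. The helper below assumes a spot price of about $2 per GPU-hour (the rate implied by 2,000–8,000 GPU-hours costing $4k–$16k) and an 8-GPU node for the wall-clock estimate; both are assumptions, not quotes from any provider.

```python
def ssl_budget(gpu_hours: float, usd_per_gpu_hour: float = 2.0, gpus: int = 8):
    """Return (cost in USD, wall-clock days) for a pre-training run."""
    cost = gpu_hours * usd_per_gpu_hour
    days = gpu_hours / gpus / 24
    return cost, days

for gpu_hours in (2_000, 8_000):
    cost, days = ssl_budget(gpu_hours)
    print(f"{gpu_hours:>5} GPU-hours -> ${cost:,.0f}, about {days:.0f} days on 8 GPUs")

low_cost, low_days = ssl_budget(2_000)
high_cost, high_days = ssl_budget(8_000)
```

The wall-clock number is the one interviewers tend to probe: a run that is cheap in dollars can still block a small team for weeks if it has to fit on a single node.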
The shift since 2023 is that frozen DINOv2 features are now strong enough that many teams skip domain pre-training and backbone fine-tuning entirely and use the public backbone as a feature extractor with a linear or MLP head. This is the single biggest practical change in computer vision: the encoder is no longer the thing you train. That alone has changed how interviewers grade the answer to "how would you build an image classifier."
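In code, "the encoder is no longer the thing you train" looks roughly like this. The `backbone` here is a tiny hypothetical stand-in so the example runs anywhere; in practice it would be a pre-trained ViT loaded from public weights, and the image size, feature width, and class count are likewise placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained encoder that maps images to 64-d features.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.GELU())
for p in backbone.parameters():
    p.requires_grad_(False)   # frozen: no gradients, no optimizer state
backbone.eval()               # also fixes dropout / normalization behavior

head = nn.Linear(64, 5)       # the only trainable part
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 5, (16,))

backbone_before = [p.clone() for p in backbone.parameters()]
for _ in range(3):
    with torch.no_grad():     # features can be computed once and cached in practice
        feats = backbone(images)
    loss = F.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

backbone_unchanged = all(
    torch.equal(a, b) for a, b in zip(backbone_before, backbone.parameters())
)
```

Because the features never change, they can be extracted once and stored, which turns head training into a problem that fits on a laptop.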
Common pitfalls
The first pitfall is treating SSL pre-training as a free lunch. If your unlabeled images are from a different distribution than your labeled task — say, generic web photos pre-training a model for industrial defect detection — the representations may be worse than ImageNet supervised weights. The fix is to either curate the unlabeled set to match the downstream domain or use continued pre-training: start from DINOv2 weights and run a short SSL phase on your domain images. Most candidates skip this step and pay for it at evaluation time.
A second trap is overweighting linear-probe numbers from papers. Linear probe measures how linearly separable the features are, but in production you almost always fine-tune the head, sometimes the last few transformer blocks, and occasionally the whole backbone. Fine-tuned accuracy is what matters, and the ranking of methods can flip — MAE looks weak on linear probe relative to DINO, but fine-tunes to roughly the same place on ImageNet and often higher on dense prediction. Quoting linear-probe numbers in an interview without this caveat signals that the candidate has read the abstracts but not run the code.
A third pitfall, common at startups, is picking SimCLR because it's the most-cited paper. SimCLR needs batch sizes in the thousands to work well, which means large GPU memory; plain gradient accumulation is not a substitute, because each micro-batch only contrasts against its own negatives and computes its own BatchNorm statistics. Teams without 8+ A100s should default to MoCo v3 or MAE; SimCLR's quality at batch 256 is poor enough that it's actively misleading as a baseline. The interviewer is checking whether you know this constraint, not whether you can name the loss.
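A quick count makes the batch-size constraint concrete. With two views per image, an in-batch contrastive loss gives each anchor every other embedding in the batch as a negative except its own positive. The helper name below is hypothetical, and the count assumes the loss is computed within a single device batch with no cross-device gathering.

```python
def negatives_per_anchor(batch_size: int) -> int:
    """2N embeddings in the batch, minus the anchor itself and its one positive."""
    return 2 * batch_size - 2

for n in (256, 512, 4096):
    print(f"batch {n:>4}: {negatives_per_anchor(n):>5} negatives per anchor")

# Accumulating 16 micro-batches of 256 reaches 4,096 images per optimizer step,
# but each loss term is still computed inside one micro-batch.
accumulated = negatives_per_anchor(256)
full_batch = negatives_per_anchor(4096)
```

This is the gap that a queue of past features, as in the earlier MoCo versions, was designed to close: the number of negatives is decoupled from what fits in memory at once.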
A fourth, subtler trap is mishandling augmentations. SSL methods are extremely sensitive to the augmentation pipeline: strong color jitter and aggressive cropping are essential for SimCLR and MoCo because they define what the encoder learns to be invariant to. For domain data (medical, satellite, document) the ImageNet-style augmentations are often harmful: flipping a chest X-ray swaps left and right, which can change the finding, and heavy color jitter can erase stain information in a histopathology slide. The fix is domain-aware augmentations, and the failure mode is silent: your loss goes down fine, but downstream accuracy is flat versus ImageNet pre-training.
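One way to make augmentations domain-aware is to expose each invariance as an explicit switch rather than inheriting a default pipeline. The sketch below is deliberately tiny and uses plain tensor operations; `make_views`, its parameters, and the single-channel 64×64 input are assumptions for illustration, and a real pipeline would add cropping and other transforms.

```python
import torch

def make_views(image: torch.Tensor, allow_hflip: bool = True, brightness: float = 0.4):
    """image: (C, H, W) with values in [0, 1]. Returns two independently augmented views."""
    def augment(x: torch.Tensor) -> torch.Tensor:
        # Only flip when left-right orientation carries no meaning in this domain.
        if allow_hflip and torch.rand(()).item() < 0.5:
            x = torch.flip(x, dims=[-1])
        # Brightness jitter: scale by a random factor in [1 - brightness, 1 + brightness].
        factor = 1.0 + brightness * (2 * torch.rand(()).item() - 1)
        return (x * factor).clamp(0.0, 1.0)
    return augment(image), augment(image)

image = torch.rand(1, 64, 64)

# Web-photo style: flips and intensity changes are treated as nuisance factors.
v1, v2 = make_views(image)

# Orientation- and intensity-sensitive domain: both invariances switched off.
s1, s2 = make_views(image, allow_hflip=False, brightness=0.0)
```

Every switch left on is a statement that the downstream label does not depend on that factor, so each one is worth checking with a domain expert before a long pre-training run.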
FAQ
Should I learn all five methods or just one?
For a senior DS or ML engineering loop targeting a vision-heavy team, you should be able to discuss DINO/DINOv2 and MAE in depth and recognize SimCLR, MoCo v3, and BYOL by their key trick. The depth matters for DINOv2 because it's the production default in 2026; MAE matters because it's the cheapest large-ViT pre-training. The others are mostly historical context and useful for the "why doesn't BYOL collapse?" discussion.
Is contrastive SSL dead?
No, but its center of gravity has moved. MoCo v3 is alive and well for ViT pre-training, and CLIP-style contrastive (image-text pairs) is the dominant multimodal recipe. What's dead is the assumption that contrastive is automatically better than reconstructive — MAE and the distillation methods caught up by 2022, and DINOv2 surpassed contrastive on most downstream benchmarks by 2024.
How does CLIP fit into this taxonomy?
CLIP is contrastive SSL where the two views are an image and its caption rather than two augmentations of the same image. The same InfoNCE loss applies, but the negatives come from other image-caption pairs in the batch. Interviewers sometimes ask whether CLIP is "really" self-supervised — the practical answer is yes, because the captions are scraped at web scale with no manual labeling, even though they're technically a second modality.
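The image-caption variant of the loss can be sketched as a symmetric cross-entropy over one similarity matrix. This is an illustration under assumptions: `clip_style_loss` is a hypothetical name, the temperature is fixed at 0.07 for simplicity (CLIP learns it during training), and random tensors stand in for the outputs of the image and text encoders.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE: row i of each embedding matrix is a matched image-caption pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / tau            # (N, N) cosine similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # each image picks its caption
    loss_t2i = F.cross_entropy(logits.T, targets)    # each caption picks its image
    return 0.5 * (loss_i2t + loss_t2i)

torch.manual_seed(0)
emb = torch.randn(8, 32)

aligned = clip_style_loss(emb, emb)                      # captions match their images
misaligned = clip_style_loss(emb, emb.roll(1, dims=0))   # every caption shifted by one
print(aligned.item(), misaligned.item())
```

Compared with the two-augmentation case, there is no diagonal to mask out: the image and text embeddings come from different encoders, so the matched pair sits on the diagonal as the positive.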
What's a reasonable SSL budget for a startup?
If you have 1–10 million in-domain unlabeled images and an 8-GPU node, you can MAE-pre-train a ViT-L for a spot cost of roughly $4k–$8k; at the roughly $2 per GPU-hour implied by the production-workflow estimate, that is 2,000–4,000 GPU-hours, or about 10–20 days of wall-clock time on eight GPUs. DINOv2-style training is more expensive because of the multi-crop augmentation and EMA teacher, closer to $10k–$20k. For most teams the better answer is to start from public DINOv2 weights and do a short continued pre-training on domain data for $500–$1,500.
When should I just use ImageNet supervised weights?
When you have a very small unlabeled set (under ~100k images) that's not meaningfully different from web photos, ImageNet supervised pre-training is still competitive and saves you the SSL engineering. Once you cross ~500k in-domain unlabeled images, SSL pre-training (or continued pre-training from DINOv2) almost always wins. The crossover point is fuzzy, so the practical answer in an interview is "I'd run both and compare on a held-out validation set" — interviewers like candidates who refuse to commit to a method before seeing the data.
Does SSL work for video?
Yes, with extra machinery. VideoMAE extends the patch-masking idea to space-time tubes, and self-distillation objectives have been extended to video in a similar way. The trick is that temporal redundancy lets you mask even more aggressively (90%+ masking ratios are common for video), but the compute scales with frame count, and most teams downsample to 16 frames per clip. This is its own interview topic and rarely comes up in a generalist DS loop.