ML data versioning in the data science interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why data versioning shows up in DS loops

Picture the prompt verbatim from a senior DS loop at Stripe last quarter: "Model v3 was trained on data snapshot from March 14. We retrained on April data and accuracy dropped 5 points. Walk me through how you'd debug — and how you'd prevent this from being a mystery in the first place." The interviewer is not asking about hyperparameters. They are asking whether you have ever owned a model whose training data was a moving target, and whether you reach for content-addressed storage before you reach for model.fit().

Data versioning lives at the intersection of MLOps and reproducibility, and that is exactly why it is a favorite signal at Meta, Anthropic, Databricks, and any shop where models ship to production. Code versioning is a solved problem — git did it 20 years ago. Data versioning is not solved, because a 200 GB parquet dump does not fit in .git/objects and a feature table mutates while you sleep. Candidates who can articulate the gap between git commit and dvc commit get pulled to senior bands; candidates who say "I just dump CSVs to S3 with a date suffix" get a polite no.

The frame to internalize: a model artifact is a function of (code, data, environment, seed, hyperparameters), and any one of those five drifting silently breaks reproducibility. Most teams version code and hyperparameters via git + MLflow. Data and environment are where the bodies are buried. This post is the cheat sheet for the data half.
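
One way to make that five-part frame concrete is to write all five inputs into a single manifest stored next to the model artifact. The sketch below is illustrative only: the `RunManifest` class, its field names, and the placeholder values are assumptions made for this example, not part of any tool mentioned in this post.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the five inputs that determine a model artifact."""
    code_commit: str        # git commit hash of the training code
    data_hash: str          # content hash of the training data
    environment: str        # e.g. a lock-file hash or an image digest
    seed: int
    hyperparameters: dict

    def fingerprint(self) -> str:
        # Serialize with sorted keys so identical inputs always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


baseline = RunManifest(
    code_commit="7f3a2c1",
    data_hash="placeholder-march-snapshot-hash",
    environment="placeholder-lockfile-hash",
    seed=42,
    hyperparameters={"learning_rate": 0.1, "max_depth": 6},
)

# Same code, environment, seed, and hyperparameters; only the data differs.
retrained = replace(baseline, data_hash="placeholder-april-snapshot-hash")

# List the manifest fields that differ between the two runs.
changed = [name for name, value in asdict(baseline).items()
           if asdict(retrained)[name] != value]
print(changed)
```

If two runs disagree on metrics, diffing their manifests narrows the search to the inputs that actually changed before anyone opens a notebook.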

Tool landscape: DVC, LakeFS, Pachyderm, Delta Lake, Git-LFS

Every interviewer has a favorite tool, and you do not need to be an expert in all five — you need to know which one fits which scenario. The table below is what I would draw if asked to compare them on a virtual whiteboard.

Tool | Storage model | Best for | Branch / diff | Watch out for
--- | --- | --- | --- | ---
DVC | Pointer files in git, blobs in S3/GCS/Azure | Small-to-mid ML teams; tight git workflow | Git branches via .dvc pointers | Slow on millions of tiny files
LakeFS | Object store overlay (S3-compatible) | Lakehouse stacks; branch-per-experiment | Native git-like branches over the lake | Extra service to operate
Pachyderm | Content-addressed object store + pipelines | Reactive pipelines, data lineage as a first-class citizen | Commits per dataset, automatic re-runs | Heavier infra; Kubernetes-native
Delta Lake | Parquet + JSON transaction log | Spark / Databricks shops; ACID on the lake | Time travel by version or timestamp | Versioning is a side effect of ACID, not the point
Git-LFS | Pointers in git, blobs on LFS server | Single large files (model weights, fixtures) | Git branches | Not designed for ML datasets; expensive at TB scale

Load-bearing trick: match the tool to the failure mode you want to prevent. DVC prevents "what data did I train on?" Delta Lake prevents "did the upstream table mutate mid-job?" LakeFS prevents "can two teams experiment on the lake without stepping on each other?"

If the interviewer asks for a one-liner ranking, I usually say this: DVC for the model artifact, Delta Lake or Iceberg for the source-of-truth table, LakeFS when the lake itself needs branches. Pachyderm is the strongest answer if they explicitly ask about lineage and automatic re-runs. Git-LFS is the wrong answer for anything bigger than a handful of GB.

That ranking is mine, not a vendor recommendation — your shop may have different gravity wells.

A worked DVC example you can mimic on the whiteboard

Interviewers love when you write five lines of shell instead of waving your hands. Here is the minimum loop that demonstrates you have actually used DVC:

# initialize once per repo
dvc init
dvc remote add -d storage s3://acme-ml-data/dvcstore

# version a training file
dvc add data/train.parquet
git add data/train.parquet.dvc data/.gitignore
git commit -m "data: train.parquet v1 (2026-05-18 snapshot)"
dvc push

# someone else, three weeks later
git checkout 7f3a2c1
dvc pull   # pulls the exact blob this commit pointed to

The interview move is to verbalize what each step buys you. dvc add computes a content hash of the file and writes a tiny .dvc pointer to git. dvc push uploads the blob to the configured remote. git checkout + dvc pull is the magic — you get a bit-exact reconstruction of the dataset that produced a given commit, with no risk of someone having overwritten an S3 key.
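
To see why a small pointer plus a content hash is enough to get the old bytes back, here is a toy content-addressed store in plain Python. It is not DVC's implementation: the function names, the JSON pointer format, the cache layout, and the use of SHA-256 are all simplifying assumptions for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(data_file: Path, cache_dir: Path) -> Path:
    """Copy the file into the cache under its hash and write a small pointer file."""
    file_hash = content_hash(data_file)
    blob = cache_dir / file_hash[:2] / file_hash[2:]
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():  # identical content is stored only once
        shutil.copyfile(data_file, blob)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"hash": file_hash, "path": data_file.name}))
    return pointer


def toy_checkout(pointer: Path, cache_dir: Path) -> None:
    """Restore the exact bytes that the pointer refers to."""
    meta = json.loads(pointer.read_text())
    blob = cache_dir / meta["hash"][:2] / meta["hash"][2:]
    shutil.copyfile(blob, pointer.with_name(meta["path"]))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    train = root / "train.csv"

    train.write_text("id,label\n1,0\n2,1\n")
    pointer_v1 = toy_add(train, cache)
    v1_pointer_text = pointer_v1.read_text()

    train.write_text("id,label\n1,0\n2,1\n3,1\n")  # the data changes
    toy_add(train, cache)                            # pointer now refers to v2

    pointer_v1.write_text(v1_pointer_text)  # stands in for checking out the old pointer
    toy_checkout(pointer_v1, cache)
    restored = train.read_text()

print(restored)
```

The pointer is the only thing that needs to live in git; the blob can sit in any storage backend, because its name is derived from its content and so cannot be silently overwritten with different bytes.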

For a pipeline with multiple stages, you escalate to dvc.yaml:

stages:
  features:
    cmd: python src/build_features.py
    deps:
      - data/train.parquet
      - src/build_features.py
    outs:
      - data/features.parquet
  train:
    cmd: python src/train.py
    deps:
      - data/features.parquet
      - src/train.py
    params:
      - learning_rate
      - max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics.json

dvc repro      # runs only the stages whose deps actually changed
dvc metrics diff HEAD~1

The second command is what wins the round: dvc metrics diff lets you compare model metrics across git commits without re-running anything. That is the answer to "how do you know v4 is better than v3" — you do not eyeball notebooks.

Reproducibility checklist

Data versioning is necessary but not sufficient. The full checklist a senior DS should rattle off:

  • Code: pinned commit hash, ideally tagged.
  • Data: content-hashed pointer (DVC, LakeFS commit, Delta version, Iceberg snapshot ID).
  • Environment: requirements.txt with hashes, or a Dockerfile with pinned base image digest, or a uv.lock / poetry.lock.
  • Seeds: random, numpy, torch, tf (all four if you touch deep learning). Set PYTHONHASHSEED in the environment before the interpreter starts if anything depends on hash ordering.
  • Hardware: GPU model and CUDA version matter; non-deterministic kernels exist.
  • Hyperparameters: tracked via MLflow / Weights & Biases / DVC params.
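
The seeds item in the checklist above is the easiest one to turn into code. The helper below is a minimal sketch: the name `seed_everything` is an assumption for this example, and numpy, torch, and tensorflow are seeded only if they happen to be installed.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed every RNG that is importable; libraries that are not installed are skipped."""
    random.seed(seed)
    # This only affects child processes: the running interpreter
    # already read PYTHONHASHSEED when it started.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Seeding makes the random streams repeatable; it does not by itself make GPU kernels deterministic, which is why hardware is a separate item on the checklist.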

Sanity check: if you cannot reproduce yesterday's metrics within 0.1% absolute on the same hardware, something in the list above is uncontrolled. Hunt it down before you ship.
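
That sanity check can run as an automated gate instead of a manual comparison. The sketch below assumes metrics are stored as flat name-to-value dictionaries on a 0 to 1 scale, so a tolerance of 0.001 corresponds to 0.1% absolute; the function name and the sample numbers are made up for illustration.

```python
def drifted_metrics(reference: dict, rerun: dict, tol: float = 0.001) -> dict:
    """Return the metrics that are missing or differ by more than tol (absolute)."""
    drifted = {}
    for name, expected in reference.items():
        actual = rerun.get(name)
        if actual is None or abs(actual - expected) > tol:
            drifted[name] = (expected, actual)
    return drifted


# Made-up numbers: accuracy reproduces within tolerance, auc does not.
yesterday = {"accuracy": 0.912, "auc": 0.874}
today = {"accuracy": 0.9125, "auc": 0.861}

report = drifted_metrics(yesterday, today)
print(report)
```

An empty report means the rerun reproduced; anything else names the metric to investigate first.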

For research publication, NeurIPS and ICML reproducibility checklists now require most of the above. For production ML, your auditor or compliance team will eventually require it too. Either way, the cheapest moment to build the muscle is before the model is in front of users.

Common pitfalls

The first trap is treating S3 versioning as a substitute for a versioning tool. S3 bucket versioning keeps old object versions, but it does not give you a commit graph, does not associate a dataset state with a code commit, and does not let your colleague reproduce your run with a single command. It is a safety net against accidental deletion, not a versioning system. If your team's answer to "what data did this model train on?" is "we have S3 versioning enabled", you do not have data versioning — you have a backup.

A second trap is versioning the wrong layer of the stack. People version the raw landing zone where the schema is unstable, then build features off a downstream table that nobody versions. The right cut is to version the artifact at the boundary where the model contract lives — usually the feature table or the train/eval split. Versioning everything is expensive and noisy; versioning nothing is reckless; versioning the contract layer is the cheap middle. Pair this with a feature store and you get genuine reuse across models.

A third pitfall, especially with Git-LFS, is storage cost surprise. LFS bandwidth and storage are billed by the host (GitHub, GitLab), and a team that pushes a 5 GB dataset on every iteration can rack up a four-figure monthly bill in a quarter. DVC with an S3 remote costs roughly the same as raw S3 — pennies per GB-month — because DVC is just storing content-addressed blobs in your own bucket. Pick the tool that scales with the storage backend you already pay for.

The fourth pitfall is non-determinism creeping in through data. You version train.parquet, you set the seed, you re-run the training script, and you still get a different model. Nine times out of ten the culprit is row order: nothing guarantees that rows come back in the same order when a dataset is spread over multiple parquet files or written by a distributed engine such as Spark, and downstream stochastic operations (mini-batch sampling, k-fold splits) depend on row order. The fix is to sort by a stable key before the split, or to checkpoint the post-split arrays themselves as artifacts.
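
The sort-before-split fix fits in a few lines. This is a minimal sketch on plain Python dictionaries; the `stable_split` name, the `id` column, and the 80/20 ratio are assumptions for the example, and the same idea carries over to a dataframe sorted on a primary key.

```python
import random


def stable_split(rows, key, train_fraction=0.8, seed=42):
    """Split rows the same way regardless of the order they were read in."""
    ordered = sorted(rows, key=key)   # remove any dependence on read order
    rng = random.Random(seed)         # local RNG, so global state is untouched
    rng.shuffle(ordered)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


rows = [{"id": i, "label": i % 2} for i in range(10)]
reordered = list(reversed(rows))      # simulates a reader returning a different order

train_a, eval_a = stable_split(rows, key=lambda r: r["id"])
train_b, eval_b = stable_split(reordered, key=lambda r: r["id"])
print(train_a == train_b and eval_a == eval_b)
```

The key must be unique and stable across snapshots; sorting on a column with ties reintroduces the original ordering among the tied rows.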

The fifth and most career-relevant pitfall is conflating data versioning with model versioning. They are not the same axis. A model version is "the binary I would deploy"; a data version is "what I trained on". Tools like MLflow track the former, DVC tracks the latter, and MLflow can log the DVC hash as a tag — that is the integration pattern interviewers want to hear. Keep the two layers separated in your mental model and you will not get tripped up by "okay but how does that interact with MLflow?"
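
A rough sketch of that integration pattern follows. It assumes the .dvc pointer file contains an `md5:` entry like the sample text below (the hash value is made up), and it takes the tagging function as a parameter so the example runs without MLflow installed. The helper names and the `dvc_md5` tag key are hypothetical.

```python
import re


def dvc_pointer_hash(pointer_text: str) -> str:
    """Pull the md5 entry out of the text of a .dvc pointer file."""
    match = re.search(
        r"^\s*-?\s*md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$",
        pointer_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no md5 entry found in pointer text")
    return match.group(1)


def tag_run_with_data_version(pointer_text: str, set_tag) -> str:
    """Record the data hash through any set_tag(key, value) callable."""
    data_hash = dvc_pointer_hash(pointer_text)
    set_tag("dvc_md5", data_hash)  # hypothetical tag key
    return data_hash


# Sample pointer contents; the hash value is invented for this sketch.
example_pointer = """\
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 1048576
  path: train.parquet
"""

recorded = {}
tag_run_with_data_version(example_pointer, recorded.__setitem__)
print(recorded)
```

Inside an active MLflow run, passing `mlflow.set_tag` as the `set_tag` argument would attach the same hash to the run, so the model version and the data version stay linked without being merged into one concept.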

If you want to drill MLOps and data-science questions like this every day, NAILDD is launching with 1,500+ interview problems across the full DS loop — SQL, ML, system design, behavioral.

FAQ

Should I learn DVC or LakeFS first?

DVC has a softer learning curve and you can use it on a laptop with a local remote or a free-tier S3 bucket. Start there if you are interviewing for IC roles and need to demonstrate hands-on muscle. LakeFS is more relevant if you are interviewing at a shop that runs a lakehouse (Databricks, Snowflake + Iceberg, or a self-managed Spark stack), because it solves the multi-team branching problem at the lake layer. For most candidates the right answer is DVC for projects, LakeFS as a concept you can explain.

What's the difference between Delta Lake time travel and DVC?

Delta Lake time travel is a property of the storage format: every write to a Delta table appends a transaction log entry, and you can query the table as of version N or timestamp T. It is excellent for "show me yesterday's snapshot of this table" inside Spark, as long as the files behind that version have not been removed by VACUUM. DVC is a workflow layer that ties a content-hashed dataset state to a git commit. The two are complementary: a typical pattern is to pin a Delta version number in your DVC pipeline so the pipeline reads the same table version every time it reruns.

Is Git-LFS ever the right answer in an interview?

Yes, for two narrow cases: fixtures under a few hundred MB that change rarely (golden test data, small reference embeddings), and single large binary artifacts like a model weight file you want versioned alongside code. Beyond that, the moment a dataset is in the GB-and-growing range, mention LFS only to dismiss it — interviewers will be relieved you noticed.

How do I version streaming data?

Streaming is the hard case and the honest answer is "you don't version every event, you version the materialized state at well-defined checkpoints." Use Kafka log compaction or a streaming-to-lakehouse pattern (Delta Live Tables, Iceberg streaming writes) and version the materialized table snapshot, not the stream. Pair this with feature store online/offline consistency checks so the model sees the same features at training and serving time.

Does data versioning matter for fine-tuning LLMs?

More than ever. Fine-tuning datasets are typically small (thousands to low millions of rows) but high-impact — a single bad batch of labels can poison alignment. Version the exact JSONL you fine-tuned on, store its content hash next to the resulting model card, and keep the cleaning script versioned alongside. When the model misbehaves three months later and someone asks "what was in the training set?", you want a one-command answer, not a Slack archaeology expedition.

How does this connect to the EU AI Act and similar regulations?

The EU AI Act places data governance and technical documentation obligations on high-risk AI systems, which in practice means being able to show where the training data came from and how it was prepared. Whether or not you are subject to the Act today, the engineering pattern that supports this (content-hashed datasets tied to model artifacts tied to code commits) is the same pattern that lets you debug a 5-point accuracy drop on a Tuesday. Build it for your own sanity and you get much of the compliance groundwork as a side effect.