Data lineage on a DE interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why interviewers ask about lineage

The moment a warehouse crosses 100+ tables, lineage stops being a nice-to-have and starts being the only thing between you and a 3am Slack ping that says "the revenue dashboard is wrong, can you look?". Hiring managers at Snowflake, Stripe, Airbnb have all lived that incident where nobody could tell which upstream job feeds the broken column. So when they sit across from a DE candidate, they want to know: do you understand column-level lineage well enough to debug a production fire, and have you ever actually rolled out a lineage tool.

Expect questions framed three ways. The blunt one — "how do you track where a column comes from?" — is a vocabulary check. The design one — "design lineage for a hybrid Airflow + dbt + Spark stack" — is the real signal, because hybrid stacks are where table-level tracking falls apart. The forensic one — "a number in BI jumped 30% overnight, walk me through your debug" — checks whether you reach for lineage or start grepping SQL files. Strong candidates answer the third by describing impact analysis upstream.

What data lineage actually is

Data lineage is a dependency graph: where data came from, what transformations it passed through, and where it ended up. The simplest mental picture is a chain that mirrors the medallion layers most warehouses use:

source.orders.csv → bronze.orders → silver.orders → gold.daily_revenue → BI dashboard

At each node you want three things attached: the transformation that produced it (a dbt model, a Spark job, an Airflow task), the owner (a team or an on-call rotation), and the tests that gate it (uniqueness, freshness, accepted ranges). A graph without owners is a museum exhibit — pretty, useless when something breaks.

Load-bearing distinction: lineage is not the same as a data catalog. A catalog tells you what exists; lineage tells you how it connects. Tools like DataHub do both, which is why people conflate them — but on the interview, name them separately.

Levels of lineage

There are three levels, and an interviewer will probe whether you can name all three and explain when each is enough.

Pipeline-level is the coarsest: a graph of which job produced which output. Airflow's DAG view is essentially free pipeline-level lineage — it tells you process_orders_dag ran before build_daily_revenue_dag, but says nothing about which tables flowed through. Useless for "which column got corrupted".

Table-level is the working minimum for production warehouses. Nodes are tables; edges say "silver.orders depends on bronze.orders and bronze.products". This is what dbt gives you from {{ ref() }} calls, and what DataHub builds when pointed at Snowflake's query log.

Column-level is the gold standard and the answer interviewers want for "how do you debug a wrong number in a BI dashboard". It traces individual columns through SQL expressions:

gold.daily_revenue.country = silver.orders.country (via join on dim_country.country_code)
gold.daily_revenue.revenue_usd = SUM(silver.orders.amount * fx_rate.rate)

Column-level is harder to extract because you need to parse SQL ASTs, not just track reads. DataHub, OpenLineage's Spark agent, and dbt's --column-level-lineage flag do this — each with gaps on dynamic SQL and stored procedures.

Level What it tracks Where you get it free When it's enough
Pipeline Job → job dependencies Airflow, Dagster, Prefect UI Debugging orchestration only
Table Table → table dependencies dbt docs, DataHub on query log Most teams under ~200 tables
Column Column → column expressions OpenLineage Spark, dbt CLL, DataHub Compliance, debugging gold metrics

What lineage is good for

The interview-friendly framing is six concrete use cases. Hit at least three when you answer.

Impact analysis. "If I change silver.orders.amount to be in cents instead of dollars, what breaks?" Lineage gives you the downstream list — every table, dashboard, and ML feature that touches that column. Without lineage, you ship the change, then find out in production when a payments dashboard goes negative.

Root cause analysis. A BI dashboard shows weird revenue. You walk upstream node by node, checking row counts and test results, until you find the first broken layer. A 4-hour incident drops to 15 minutes when lineage replaces grep.

Compliance. GDPR, CCPA, HIPAA all require you to know where PII lives. "Show me every table that touches user.email" should be a 2-second query, not a half-day audit. Column-level lineage answers this directly.

Debugging. "This column is null" — first node where it becomes null is where the bug lives. Lineage shortens the search from O(tables) to O(depth).

Documentation. Auto-generated, always current. A wiki page documenting your warehouse is a lie within a week. A lineage graph regenerated nightly is the only honest documentation.

Cost optimization. Who actually reads this table? If the answer is "nothing downstream", you can archive it. At cloud-warehouse pricing, dropping unused tables saves $5-20k/month for mid-size data orgs.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Tools: DataHub, Amundsen, OpenLineage

The interviewer wants to hear three to four names with crisp positioning.

DataHub (open-sourced by LinkedIn) is the most common pick in 2026. It does both catalog and lineage, supports column-level via SQL parsers, integrates with Snowflake, BigQuery, dbt, Airflow, and Spark, and ships a usable UI. Self-host on Kubernetes or pay for DataHub Cloud. This is the answer that gets nods at most companies.

Amundsen (Lyft) is older, focused on discovery — search-first UI, less aggressive on column-level lineage. Still alive but losing mindshare to DataHub and OpenMetadata.

OpenMetadata is the third horse in the race. Active community, opinionated schema, good dbt and Airflow integrations. Pick it if you want metadata-as-code with strong type definitions.

OpenLineage is not a tool, it's a standard — and that distinction matters on the interview. Pipeline tools (Airflow, dbt, Spark, Flink) emit events in OpenLineage's JSON format; collectors like DataHub, Marquez, or Astronomer Cosmos ingest them. Saying "we use OpenLineage" without naming a collector is incomplete.

dbt docs is the freebie. If your stack is 100% dbt, the built-in dbt docs generate gives you a queryable lineage graph with no extra infrastructure. Once you have non-dbt jobs in the picture, you need a real catalog.

Marquez (WeWork-originated) is the reference implementation of OpenLineage. Good for prototypes, less polished as an end-user UI than DataHub.

Tool Catalog Table lineage Column lineage Best for
DataHub Yes Yes Yes (via SQL parser) Hybrid stacks, large orgs
OpenMetadata Yes Yes Partial Teams that want strong typing
Amundsen Yes Yes Limited Search-heavy discovery
dbt docs Models only Yes Optional flag dbt-only stacks
Marquez Minimal Yes Via OpenLineage Reference / prototyping
OpenLineage n/a (standard) n/a n/a Wire protocol, paired with collector

Wiring lineage into the pipeline

This is where interviewers separate readers from operators. Name the integration points explicitly.

dbt is the easiest case. Lineage between models comes free from {{ ref('upstream_model') }} and {{ source('schema', 'table') }} references — dbt builds the DAG at parse time, no runtime needed. Column-level lineage requires the --column-level-lineage flag on dbt docs generate (parses SQL with sqlglot under the hood). This is also why dbt-heavy stacks adopt lineage faster than pure-Spark ones.

Airflow needs the openlineage-airflow provider. Once installed and configured with a transport (usually pointing at DataHub or Marquez), it emits a START/COMPLETE event per task with inputs and outputs. The catch: extractors only exist for common operators (BigQueryOperator, SnowflakeOperator, PythonOperator with hints). Custom operators need a custom extractor or you get empty inputs.

Spark uses the OpenLineage Spark agent — a Java agent attached at submit time. It introspects Spark's logical query plan, which means it captures column-level lineage natively, even for complex joins. The downside is plan parsing for non-trivial queries can miss columns; you'll want to validate against a known SQL.

Custom Python jobs — when you have a requests.get()pandasdf.to_sql() script that nobody can wrap in a framework, emit OpenLineage events manually:

from openlineage.client import OpenLineageClient
from openlineage.client.event_v2 import RunEvent, RunState, Job, Run, Dataset
from uuid import uuid4
from datetime import datetime, timezone

client = OpenLineageClient(url="http://datahub:8080")
run_id = str(uuid4())

client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=Job(namespace="etl", name="process_orders"),
    inputs=[Dataset(namespace="source", name="orders")],
    outputs=[Dataset(namespace="warehouse", name="silver_orders")],
    producer="https://example.com/etl"
))

DataHub (or Marquez) is deployed separately — Kubernetes Helm chart for self-host, or DataHub Cloud if you don't want to babysit Elasticsearch. It then ingests from your OpenLineage stream, your dbt artifacts (manifest.json and catalog.json), and your warehouse's query history.

Sanity check: if your lineage graph shows zero edges from your largest table, you've either misconfigured the ingestion or your analysts are bypassing your tracked layer with ad-hoc SQL. Both are real problems.

If you want to drill data engineering interview patterns like this every day — column-level lineage debugging, OpenLineage extractor design, dbt + Spark integration tradeoffs — NAILDD is launching with 500+ DE problems across exactly these scenarios.

Common pitfalls

The most common failure mode is stopping at pipeline-level lineage and thinking you're done. Airflow's DAG view gives you task-to-task arrows for free, which feels like lineage but answers none of the questions that matter — which column, which transformation, which downstream report. The fix is to commit to at least table-level from day one, treat pipeline-level as orchestration metadata, not lineage, and budget for column-level once you hit the 200-table mark.

A close cousin is maintaining lineage manually in a wiki. A Confluence page titled "Data Warehouse Diagram" is out of date within two sprints. The fix is automated extraction — dbt parse, OpenLineage events, query-log scanning — even if the coverage is incomplete at first. An 80%-accurate auto-generated graph beats a 100%-accurate diagram that's six months stale.

Ignoring ad-hoc SQL is the third trap. Analysts and PMs write queries outside dbt, hit gold tables directly, and your lineage tool sees none of it. The fix is SQL parsing on the warehouse's query log — Snowflake's QUERY_HISTORY, BigQuery's INFORMATION_SCHEMA.JOBS, Redshift's STL_QUERY. DataHub and OpenMetadata both have ingestors for this. Without it, your "downstream consumers" list is missing the loudest stakeholder in the company.

Lineage without owners is just a pretty graph. When a table breaks, the question isn't "what depends on this?" — it's "who do I page?" If your catalog doesn't enforce an owner on every node (team, individual, or rotation), the graph stops being actionable. The fix is making owner a required field at table creation and failing CI if it's missing.

Not wiring lineage into alerting is the final mistake. Upstream breaks, downstream consumers find out from a slack thread two hours later. The fix is automatic notification: when a freshness or schema test fails on silver.orders, ping every owner of every downstream node within one hop. Modern catalogs (DataHub, OpenMetadata) ship this as a built-in feature — turn it on.

FAQ

Should I roll out lineage from day one of a new warehouse?

Not day one, but earlier than most teams do. If your pipeline has fewer than 10 tables and one owner, lineage is overkill — you can hold the graph in your head. Past 30-50 tables or once a second team onboards, the cost of not having lineage dominates. The pragmatic path: install dbt docs the moment you adopt dbt (free), then add DataHub or OpenMetadata when you bring in a non-dbt source or job.

Is dbt's built-in lineage enough on its own?

Only if your entire transformation stack is dbt and you don't care about source-system or BI-layer visibility. The moment you have an Airflow DAG calling a Python script, a Spark job feeding dbt, or a Looker explore reading gold tables, dbt docs covers a fraction of true lineage. Hybrid stacks need a real catalog — DataHub, OpenMetadata, or a paid equivalent. Those catalogs ingest dbt artifacts natively, so you augment dbt's lineage rather than replacing it.

How do I get column-level lineage out of legacy stored procedures?

With difficulty. SQL parsers (sqlglot, OpenLineage's, DataHub's) handle clean SELECT statements but struggle with dynamic SQL, recursive CTEs, and procedural T-SQL. Realistic options: rewrite the worst offenders into dbt or Spark SQL, accept table-level lineage for procedures and column-level for everything else, or buy a commercial parser like Manta for big legacy migrations.

Does column-level lineage slow down the pipeline?

Generation does, ingestion doesn't. The OpenLineage Spark agent adds 5-15% overhead to job submission depending on query complexity. dbt's --column-level-lineage makes dbt docs generate 2-3x slower but doesn't touch runtime. Lineage queries hit the catalog, not your warehouse, so they're free at read time.

Who owns the lineage tool — DE, platform, or analytics?

Whoever owns the data platform usually ends up owning it, because lineage tools are infrastructure — Kubernetes, Elasticsearch, Kafka, on-call rotation. The pattern that works: platform team owns the tool, DE owners maintain their nodes, analytics consumes the graph. Pushing ownership to analytics never works long-term — they're customers of the graph, not operators.