Data contracts in the data engineering interview


Why this question keeps showing up

Picture the Monday morning when the orders.events Kafka topic silently changes amount from decimal(10,2) to a string because a backend engineer wanted to attach a currency symbol. Downstream, the finance dashboard reads zeros for six hours, the ML model that places fraud holds approves $40k of obvious abuse, and the on-call data engineer wakes up to a Slack channel full of red. Data contracts exist because this exact failure mode keeps happening at companies that scale past a handful of producers.

When a Stripe, Airbnb, or Snowflake interviewer asks "how do you coordinate schema between teams?", they are not asking for a buzzword. They want to hear that you have lived through the Monday morning above and have an opinion on schema, SLA, ownership, and breaking-change policy as one coupled problem. The senior signal is treating the contract as the API of your data, with the same versioning discipline you would apply to a public REST endpoint — not a wiki page nobody reads.

The shape of the answer matters more than the specific tool. If you say "we use Soda" and stop, that is a junior answer. If you say "we treat each domain table as a versioned product with an owning team, a CI gate on the producer side, runtime checks on the consumer side, and a 90-day deprecation window," that is the senior answer regardless of which tool you name.

What a data contract actually contains

A contract is the union of six things, and an interviewer will gently probe to see if you can list them without prompting.

Schema is the obvious one: column names, types, nullability, uniqueness, primary keys, and any structured constraints like enums or value ranges. Quality SLA covers freshness, completeness, and accuracy thresholds, for example data no more than 1 hour stale, 99.9% completeness, and 100% uniqueness on order_id. Ownership names the producing team and an accountable engineer; without a name, no one fixes anything. Versioning records the current version and a history so consumers can pin against a known shape. Breaking-change policy spells out the notification period (usually 30 to 90 days), the deprecation path, and what counts as breaking versus additive. Compliance flags mark PII, financial data, and any regulated fields so downstream consumers can apply the right masking, retention, and access controls.

| Component | Purpose | Example value |
| --- | --- | --- |
| Schema | Shape of the data | order_id bigint not null unique |
| Quality SLA | What "good" means | freshness 1h, completeness 99.9% |
| Ownership | Who fixes it at 3am | orders-team, oncall: @alice |
| Versioning | Pinnable history | 1.2, with 1.1 deprecated |
| Breaking-change policy | How consumers learn | 90-day deprecation, semver |
| Compliance | Regulatory metadata | contains_pii: true, retention 7y |

Load-bearing trick: If you only remember one thing — ownership without an on-call rotation is decorative. The contract is real when there is a human whose pager rings on SLA breach.

The contract format interviewers want to see

Most production setups store contracts as YAML in git, reviewed through pull requests by both the producer team and at least one consumer team. The YAML lives next to the producer's code or in a central data-contracts repository depending on org size. A clean version looks like this:

name: orders
domain: commerce
owner: orders-team
oncall: "@orders-oncall"
version: "1.2"  # quoted so YAML keeps it a string instead of parsing the float 1.2

schema:
  - name: order_id
    type: bigint
    nullable: false
    unique: true
    description: "Primary key, monotonically increasing."
  - name: customer_id
    type: bigint
    nullable: false
    references: customers.customer_id
  - name: amount_usd
    type: decimal(10,2)
    nullable: false
    range: [0, 1000000]
  - name: created_at
    type: timestamp
    nullable: false
    timezone: UTC

sla:
  freshness: 1 hour
  completeness: 99.9%
  uniqueness: 100%
  accuracy:
    - amount_usd >= 0  # consistent with the range lower bound in the schema

compliance:
  contains_pii: false
  retention_days: 2555  # 7 years for financial records

deprecation:
  notice_period_days: 90
  breaking_changes:
    - "drop column"
    - "type narrowing"
    - "nullability tighten"

The reason interviewers like seeing YAML in code rather than a Confluence page is reviewability. When a producer opens a PR that changes amount_usd from decimal(10,2) to decimal(8,2), the diff is in front of every consumer in the CODEOWNERS file. The conversation happens before deploy, not during the incident.

Enforcement: where contracts stop being theatre

A contract that nobody enforces is a wiki page. The story you want to tell in the interview has three enforcement layers that work together.

The first layer is CI checks at producer time. When a producer opens a PR that touches the schema, a CI job parses the diff, classifies it as additive or breaking, and either approves it or requires explicit consumer sign-off. The diff classifier is often only about 50 lines of Python: a new nullable column is fine, while a dropped column or a type narrowing requires a separate PR with the deprecation window applied first. Buf does this for Protobuf, Confluent Schema Registry's compatibility checks do it for Avro, and many teams roll a small linter for YAML or JSON Schema.
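To make the classifier less abstract, here is a minimal sketch, assuming the contract's schema list has already been parsed from YAML into Python dicts. The function names, the handful of type rules, and the choice to treat a new non-nullable column as breaking are all illustrative assumptions, not a standard.

```python
import re


def _decimal_parts(type_name):
    """Return (precision, scale) for 'decimal(p,s)', else None."""
    m = re.fullmatch(r"decimal\((\d+),\s*(\d+)\)", type_name)
    return (int(m.group(1)), int(m.group(2))) if m else None


def _is_safe_type_change(old, new):
    """True if every value of type `old` still fits in type `new`."""
    if old == new or (old, new) == ("int", "bigint"):
        return True
    o, n = _decimal_parts(old), _decimal_parts(new)
    if o and n:
        # Widening keeps at least as many fractional and integer digits.
        return n[1] >= o[1] and (n[0] - n[1]) >= (o[0] - o[1])
    return False


def classify_schema_change(old_cols, new_cols):
    """Return (kind, reasons) where kind is 'breaking', 'additive' or 'none'."""
    old = {c["name"]: c for c in old_cols}
    new = {c["name"]: c for c in new_cols}
    breaking, additive = [], []

    for name, col in old.items():
        if name not in new:
            breaking.append(f"drop column: {name}")
            continue
        new_type = new[name]["type"]
        if not _is_safe_type_change(col["type"], new_type):
            breaking.append(f"type change: {name} {col['type']} -> {new_type}")
        elif col["type"] != new_type:
            additive.append(f"type widening: {name}")
        if col.get("nullable", True) and not new[name].get("nullable", True):
            breaking.append(f"nullability tighten: {name}")

    for name, col in new.items():
        if name not in old:
            if col.get("nullable", True):
                additive.append(f"new nullable column: {name}")
            else:
                # Policy assumption: old rows cannot satisfy NOT NULL.
                breaking.append(f"new non-nullable column: {name}")

    if breaking:
        return "breaking", breaking
    return ("additive", additive) if additive else ("none", [])
```

In CI, a "breaking" result would block the merge until the required consumer sign-off is present. Note that a rename shows up here as a drop plus an add, so it is classified as breaking.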

The second layer is runtime data quality checks on the consumer side or in the lake. Frameworks like Great Expectations, Soda Core, and dbt tests express the contract as runnable assertions: row counts, null rates, value ranges, and freshness measured from the latest created_at. These run on each batch or micro-batch and either fail the pipeline or page someone, depending on severity.
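The kind of assertions those frameworks run can be sketched in plain Python. This is a hand-rolled illustration over a list of row dicts, not the API of Great Expectations, Soda Core, or dbt; the function name and default thresholds are assumptions taken from the example contract.

```python
from datetime import datetime, timedelta, timezone


def run_contract_checks(rows, now, max_staleness=timedelta(hours=1),
                        min_completeness=0.999):
    """Return a list of human-readable failures; an empty list means pass."""
    if not rows:
        return ["no rows in batch"]
    failures = []

    # Completeness: share of rows with a non-null amount_usd.
    non_null = sum(1 for r in rows if r.get("amount_usd") is not None)
    completeness = non_null / len(rows)
    if completeness < min_completeness:
        failures.append(f"completeness {completeness:.4f} < {min_completeness}")

    # Uniqueness: order_id must not repeat within the batch.
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate order_id")

    # Range: amount_usd must sit inside the contract's [0, 1000000] range.
    bad = [r["order_id"] for r in rows
           if r.get("amount_usd") is not None
           and not (0 <= r["amount_usd"] <= 1_000_000)]
    if bad:
        failures.append(f"amount_usd out of range for order_id {bad}")

    # Freshness: the newest created_at must be within the allowed staleness.
    newest = max(r["created_at"] for r in rows)
    if now - newest > max_staleness:
        failures.append(f"stale batch: newest row is {now - newest} old")

    return failures


# Example batch: one fresh, valid row.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
batch = [{"order_id": 1, "amount_usd": 19.99,
          "created_at": now - timedelta(minutes=5)}]
print(run_contract_checks(batch, now))
```

Whether a non-empty result fails the pipeline or only pages someone is the severity decision described above; the sketch just reports.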

The third layer is a schema registry at the streaming edge. For Kafka, producers serialise through a registry-aware Avro or Protobuf serializer: a record that does not fit the schema fails to serialise, and a new schema version that violates the subject's compatibility rule is rejected at registration. That is the closest thing to a hard runtime guarantee. Confluent Schema Registry, AWS Glue Schema Registry, and Apicurio are the common names. The registry is what stops a producer from quietly serialising garbage at 3am.
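As a toy stand-in for that write-time rejection, the sketch below validates a record against a registered shape before it would be produced. It is not the Confluent, Glue, or Apicurio client API; the schema dict, the exception, and the function name are hypothetical and only illustrate the idea, using the amount field from the opening example.

```python
from decimal import Decimal

# Hypothetical registered schema: field name -> (Python type, nullable).
REGISTERED_ORDERS_SCHEMA = {
    "order_id": (int, False),
    "amount": (Decimal, False),
}


class SchemaViolation(Exception):
    """Raised instead of letting a malformed record reach the topic."""


def validate_before_produce(record, schema):
    """Return the record unchanged if it fits the schema, else raise."""
    for field, (expected_type, nullable) in schema.items():
        if field not in record:
            raise SchemaViolation(f"missing field: {field}")
        value = record[field]
        if value is None:
            if not nullable:
                raise SchemaViolation(f"null in non-nullable field: {field}")
            continue
        if not isinstance(value, expected_type):
            raise SchemaViolation(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    extra = set(record) - set(schema)
    if extra:
        raise SchemaViolation(f"unregistered fields: {sorted(extra)}")
    return record
```

With this gate in place, a record such as {"order_id": 1, "amount": "$12.50"} raises SchemaViolation at the producer rather than surfacing hours later as zeros on a dashboard.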

Most mature stacks layer all three: CI catches the obvious, the registry catches the sneaky, and runtime checks catch the semantic drift (like amount_usd staying decimal(10,2) but the upstream service starting to write zeros for European orders because of a localisation bug).


Tooling landscape

You do not need to memorise every vendor, but you should be able to name a representative from each category and have an opinion on when each fits.

| Layer | Tool | When it fits |
| --- | --- | --- |
| Contract spec | Soda Contracts (YAML) | dbt-centric stacks, batch ELT |
| Contract spec | Great Expectations suites | Python-heavy teams, ML pipelines |
| Contract spec | dbt tests + dbt-expectations | Already on dbt, low overhead |
| Streaming registry | Confluent Schema Registry | Kafka with Avro/Protobuf |
| Streaming registry | AWS Glue Schema Registry | AWS-native streaming |
| Lake-level | Apache Iceberg schema evolution | Open lakehouse, large tables |
| Lineage + contracts | OpenLineage + Marquez | Cross-tool lineage tracking |
| Observability | Monte Carlo, Bigeye | Anomaly detection on top of contracts |

The honest interview answer is: most teams stitch two or three of these together rather than buy a single platform. A common stack at a Series B startup might be dbt tests for warehouse-side checks, Confluent Schema Registry for Kafka, and a thin Python CI script that classifies breaking changes in the contract YAML. A FAANG-scale stack adds a lineage layer and a custom approval workflow on top.

Sanity check: If the interviewer asks "what would you build first?" at a small startup, the right answer is dbt tests plus a CODEOWNERS rule on the schema files — not a vendor procurement. Ship the discipline before the tooling.

Common pitfalls

The first trap is treating the contract as documentation rather than executable code. A markdown table in Confluence that nobody runs against the data is a hope, not a contract. The fix is to write the spec in a format that a CI job and a runtime check both consume (YAML, JSON Schema, Avro, or Protobuf) and to keep the source of truth in git with PR review, not in a wiki where edits go unnoticed.

A second pitfall is breaking changes without a deprecation window. Teams under pressure ship a "small" type change on Friday afternoon and break three downstream pipelines by Monday. The cultural fix is a policy: additive changes ship immediately, while breaking changes require a 30-to-90-day window during which the old and new columns coexist. The technical fix is a CI check that refuses to merge a breaking change unless a deprecation_started_at field is at least N days in the past.
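That CI gate can be sketched in a few lines. The version below assumes deprecation_started_at is stored as an ISO date under the contract's deprecation block, next to notice_period_days; that placement and the function name are assumptions for illustration.

```python
from datetime import date


def may_merge_breaking_change(contract, today, min_notice_days=None):
    """Return (allowed, reason) for a PR already classified as breaking."""
    deprecation = contract.get("deprecation", {})
    notice = (min_notice_days if min_notice_days is not None
              else deprecation.get("notice_period_days", 90))

    # Hypothetical field: ISO date on which consumers were first notified.
    started = deprecation.get("deprecation_started_at")
    if started is None:
        return False, "no deprecation_started_at: open the deprecation PR first"

    elapsed = (today - date.fromisoformat(started)).days
    if elapsed < notice:
        return False, f"only {elapsed} of {notice} notice days elapsed"
    return True, f"{elapsed} days elapsed, notice period satisfied"


# Example: a 90-day window that started on 2024-01-01.
contract = {"deprecation": {"notice_period_days": 90,
                            "deprecation_started_at": "2024-01-01"}}
print(may_merge_breaking_change(contract, date(2024, 3, 1)))
```

Passing today in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.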

A third pitfall is owner-less data. The contract names a team but no on-call rotation, so when the freshness alarm fires at 2am, the page bounces between three Slack channels and nothing happens until business hours. The fix is to require a real on-call handle in the contract and to wire the alert to PagerDuty or OpsGenie, not to a channel.
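One way to enforce this at review time is a small lint over the parsed contract. The sketch below assumes the owner and oncall keys from the example YAML, and treats a handle starting with "#" as a chat channel; that convention is a hypothetical one chosen for illustration.

```python
def lint_ownership(contract):
    """Return a list of ownership problems; an empty list means the lint passes."""
    problems = []
    if not contract.get("owner"):
        problems.append("missing owner team")

    oncall = contract.get("oncall", "")
    if not oncall:
        problems.append("missing oncall handle")
    elif oncall.startswith("#"):
        # Assumed convention: '#name' is a chat channel, '@name' is a rotation.
        problems.append(f"oncall {oncall!r} looks like a chat channel, "
                        "not a paging target")
    return problems


print(lint_ownership({"owner": "orders-team", "oncall": "#orders-alerts"}))
```

A lint like this cannot prove the handle is wired to a pager, so it complements the alert routing rather than replacing it.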

A fourth pitfall is over-spec'ing the contract to the point that producers stop updating it. If your contract YAML is 600 lines per table with twelve quality checks, producers will route around it. Keep the spec to what is genuinely load-bearing — typically schema, three to five SLA checks, ownership, and compliance flags — and let the data observability layer catch the long tail of anomalies.

A fifth pitfall is forgetting consumers exist. A producer ships a "non-breaking" rename from amount to amount_usd because the old name was ambiguous, but the consumer's dbt model has select amount from orders hardcoded. A rename is a breaking change even though semantically nothing changed. The fix is to ship a rename as a two-PR sequence: first add the new column alongside the old one, then deprecate and later drop the old column, bumping the contract version each time.

If you want to drill questions like this every day with feedback on your answer, NAILDD is launching with hundreds of data engineering interview problems across exactly this pattern.

FAQ

Is a data contract the same as a schema?

No. A schema is one part of a data contract — the shape of the columns and their types. The contract wraps the schema with SLA, ownership, versioning, breaking-change policy, and compliance metadata. A schema tells you what the data looks like; a contract tells you what promises the producer is making about it and what happens when those promises break.

Do I need a vendor product to do data contracts?

Not at small scale. A YAML file per table in git, a CODEOWNERS rule that forces consumer review on schema changes, and a thin Python script that classifies breaking changes in CI will get a Series A or Series B company most of the way there. Vendors like Soda, Monte Carlo, and Bigeye add value when you have hundreds of tables and dozens of producer teams; below that scale, the discipline matters more than the platform.

How is a data contract different from data observability?

Data contracts are declarative — they say what the data should look like. Data observability is detective — it watches the data and flags when something looks off. The two complement each other: the contract is the spec, the observability layer is the alarm system. A mature stack has both, with the contract feeding the observability tool the expected SLAs so it knows what to alert on.

What is a "breaking change" exactly?

The conservative definition is anything a downstream consumer's query could break on: dropping a column, renaming a column, narrowing a type, tightening nullability from nullable to not-null on an existing column, removing an enum value, or changing semantic meaning (like the amount example earlier). Additive changes — new nullable columns, new enum values added at the end, new tables in the same domain — are generally safe and ship immediately.

How do contracts interact with CDC and streaming?

For CDC pipelines built on Debezium plus Kafka, the contract usually lives in two places: a schema-registry-backed Avro or Protobuf contract for the wire format, and a warehouse-side YAML or dbt test contract for the downstream materialised tables. The two are kept in sync by either generating one from the other or by a CI job that compares them on every PR. The single source of truth question — registry first or warehouse first — depends on whether your producers or your consumers move faster.
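The CI comparison between the two copies can be sketched as follows. It assumes a flat (non-nested) Avro record schema fetched as JSON and a warehouse contract parsed into a list of column dicts; the type mapping table and function names are hypothetical and would need to match your warehouse's type names.

```python
import json

# Hypothetical mapping from Avro primitive types to warehouse column types.
AVRO_TO_WAREHOUSE = {"long": "bigint", "int": "int", "string": "string",
                     "boolean": "boolean", "double": "double"}


def avro_columns(avro_schema_json):
    """Flatten a non-nested Avro record schema into {name: (type, nullable)}."""
    cols = {}
    for field in json.loads(avro_schema_json)["fields"]:
        ftype, nullable = field["type"], False
        if isinstance(ftype, list):  # union such as ["null", "string"]
            nullable = "null" in ftype
            ftype = [t for t in ftype if t != "null"][0]
        if isinstance(ftype, dict):  # annotated type, e.g. a logical type
            if ftype.get("logicalType") == "decimal":
                ftype = f"decimal({ftype['precision']},{ftype.get('scale', 0)})"
            else:
                ftype = ftype.get("logicalType", ftype["type"])
        cols[field["name"]] = (AVRO_TO_WAREHOUSE.get(ftype, ftype), nullable)
    return cols


def contract_drift(avro_schema_json, contract_cols):
    """List every column where the wire schema and warehouse contract disagree."""
    wire = avro_columns(avro_schema_json)
    warehouse = {c["name"]: (c["type"], c.get("nullable", True))
                 for c in contract_cols}
    drift = []
    for name in sorted(set(wire) | set(warehouse)):
        if name not in warehouse:
            drift.append(f"{name}: in registry schema only")
        elif name not in wire:
            drift.append(f"{name}: in warehouse contract only")
        elif wire[name] != warehouse[name]:
            drift.append(f"{name}: registry {wire[name]} "
                         f"vs warehouse {warehouse[name]}")
    return drift
```

A non-empty result would fail the PR. Which side gets corrected is the registry-first or warehouse-first decision discussed above.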

Who owns the contract — the producer or the consumer?

The producer owns the spec and the SLA promises. Consumers own their read-side expectations (their dbt tests, their downstream alerting). The contract sits between the two and changes through a joint review process where both producer and at least one consumer sign off. When ownership is unclear, the failure mode is mutual finger-pointing during incidents, which is why explicit team-level ownership in the contract YAML is non-negotiable.