Schema evolution in Data Engineering interviews

Why interviewers love this topic

Schema evolution is one of the most common sources of 2 AM pages on long-running data pipelines. A product engineer adds a field, a downstream Spark job throws ClassCastException, the analytics dashboard goes blank, and someone on call is reading Stack Overflow at 02:14. That story is universal (at Stripe, at Netflix, at every Series B startup), which is exactly why senior Data Engineering interviews probe it.

You will get one of three question flavors: "add a field to a production event without breaking consumers", "explain backward vs forward compatibility", or "why do we need a schema registry". All three test the same thing — whether you understand that schemas are contracts, with a lifecycle separate from the code that produces them.

Load-bearing trick: A breaking change is anything where a reader using one schema cannot make sense of bytes written by a writer using another. The compatibility mode you pick is a policy about which direction of "make sense" you guarantee.

Backward, forward, full compatibility

The three modes confuse everyone on first pass because the names are written from the reader's perspective, not the writer's. Memorize the reader-centric phrasing and you stop getting it backward in interviews.

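| Mode     | Reader uses | Can read data written with | Use case                                     |
|----------|-------------|----------------------------|----------------------------------------------|
| Backward | New schema  | Old schema                 | Consumers upgrade before producers (default) |
| Forward  | Old schema  | New schema                 | Producers upgrade before consumers           |
| Full     | Either      | Either                     | Independent upgrade order                    |
| None     | Anything    | Anything                   | You like fire drills                         |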

What counts as a safe change depends on the mode. Adding a field with a default is both backward and forward compatible: old readers ignore it, and new readers see the default when reading old records. Removing a field is always backward compatible, because new readers simply skip the extra field in old records, but it is forward compatible only if the field had a default that old readers can fall back on. Renaming a field is almost always a break in Avro and Parquet, because both treat the name as the identity (Iceberg is the exception; more on that below).
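The add/remove rules can be sketched as a tiny checker. This is a simplified illustration, not a registry implementation: schemas are plain dicts, the helper names (`can_read`, `is_backward`, `is_forward`) are made up for the example, and type changes are ignored entirely.

```python
def fields(schema):
    """Map field name -> field definition for a record schema given as a dict."""
    return {f["name"]: f for f in schema["fields"]}


def can_read(reader, writer):
    """True if `reader` can decode records written with `writer`.

    Simplified rule: every reader field must either exist in the writer
    schema or carry a default. Type changes are ignored in this sketch.
    """
    written = fields(writer)
    return all(name in written or "default" in f
               for name, f in fields(reader).items())


def is_backward(old, new):
    # Backward: the NEW schema reads data written with the OLD schema.
    return can_read(reader=new, writer=old)


def is_forward(old, new):
    # Forward: the OLD schema reads data written with the NEW schema.
    return can_read(reader=old, writer=new)


v1 = {"fields": [{"name": "id", "type": "long"}]}
v2_with_default = {"fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": None},
]}
v2_no_default = {"fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
]}

# Adding a field with a default: safe in both directions.
print(is_backward(v1, v2_with_default), is_forward(v1, v2_with_default))
# Adding a field without a default: new readers cannot fill it for old data.
print(is_backward(v1, v2_no_default), is_forward(v1, v2_no_default))
# Removing a field that had no default: old readers cannot fill it for new data.
print(is_backward(v2_no_default, v1), is_forward(v2_no_default, v1))
```

The whole check reduces to one question asked in two directions, which is usually the cleanest way to explain the modes on a whiteboard.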

Changing a type is the spiciest case. int → long is a widening and usually safe; long → int is a narrowing that loses bits and breaks. string → int breaks unconditionally — the bytes have nothing in common. Making a field required is the silent killer: every old record without that field is suddenly invalid.
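As a rough illustration of widening versus narrowing, the lookup below encodes the primitive promotions listed in the Avro specification. The function name is hypothetical, and other formats and engines apply their own rules, so treat this as a sketch of the idea rather than a universal table.

```python
# Promotions permitted by the Avro specification: writer type -> reader types.
AVRO_PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def reader_can_promote(writer_type, reader_type):
    """True if data written as `writer_type` can be read as `reader_type`."""
    if writer_type == reader_type:
        return True
    return reader_type in AVRO_PROMOTIONS.get(writer_type, set())


print(reader_can_promote("int", "long"))    # widening
print(reader_can_promote("long", "int"))    # narrowing
print(reader_can_promote("string", "int"))  # unrelated types
```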

In a Schema Registry you pick a compatibility level per subject (by default, one subject for each topic's key and one for its value). The registry then refuses to register a new schema version that violates the policy. This is the entire value proposition: a CI check that runs in production for your event contracts.

Schema evolution in Avro

Avro supports evolution through the writer schema + reader schema model. The writer schema is whatever produced the bytes; the reader schema is whatever the consumer is currently running. Avro's decoding library resolves the two — matching by field name, applying defaults for missing fields, ignoring fields the reader doesn't know about.

Adding a nullable field with a default is the canonical safe change:

// v1 (the record name "User" is illustrative; Avro requires records to be named)
{"type": "record", "name": "User", "fields": [{"name": "id", "type": "long"}]}

// v2: added email with default
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "long"},
  {"name": "email", "type": ["null", "string"], "default": null}
]}

V1 data is readable by v2 code (email resolves to null). V2 data is readable by v1 code (the field is silently dropped). That is full compatibility.

v1 writer → v1 reader: trivial
v1 writer → v2 reader: backward OK, email = null
v2 writer → v1 reader: forward OK, email dropped
v2 writer → v2 reader: trivial
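The matrix above can be reproduced with a small resolution sketch. Real Avro libraries resolve schemas while decoding bytes; here the record is assumed to be already decoded into a dict with the writer schema, and `resolve` is a hypothetical helper that only handles add and drop by name.

```python
def resolve(record, writer_schema, reader_schema):
    """Project a record decoded with the writer schema onto the reader schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = record[name]        # matched by name
        elif "default" in field:
            resolved[name] = field["default"]    # missing: use reader default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved  # writer-only fields are dropped


V1 = {"fields": [{"name": "id", "type": "long"}]}
V2 = {"fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": None},
]}

# v1 writer -> v2 reader: email falls back to the default.
print(resolve({"id": 7}, V1, V2))
# v2 writer -> v1 reader: email is silently dropped.
print(resolve({"id": 7, "email": "a@example.com"}, V2, V1))
```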

Reordering fields is fine in Avro: resolution is by name, not position. Removing a field that had no default breaks forward compatibility; an old reader that still expects the field literally cannot construct the record. Type widening (int → long, float → double) is allowed by the spec, but not every library implements it consistently, so test before you ship.

Gotcha: Avro union ordering matters when defaulting. ["null", "string"] with default: null is correct; ["string", "null"] with default: null will fail validation because the default must match the first branch of the union.
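A minimal sketch of that rule, assuming the first-branch behaviour described above. The helper name is made up, only a handful of primitive branch types are covered, and real validators do considerably more.

```python
def default_matches_first_branch(field):
    """Check a union-typed field's default against the first union branch only."""
    union = field["type"]
    if not isinstance(union, list) or "default" not in field:
        return True  # not a union, or nothing to validate
    checks = {
        "null": lambda v: v is None,
        "string": lambda v: isinstance(v, str),
        "boolean": lambda v: isinstance(v, bool),
        # bool is a subclass of int in Python, so exclude it explicitly
        "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "long": lambda v: isinstance(v, int) and not isinstance(v, bool),
    }
    check = checks.get(union[0])
    if check is None:
        raise NotImplementedError(f"branch type {union[0]!r} not covered by this sketch")
    return check(field["default"])


good = {"name": "email", "type": ["null", "string"], "default": None}
bad = {"name": "email", "type": ["string", "null"], "default": None}

print(default_matches_first_branch(good))
print(default_matches_first_branch(bad))
```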

Schema evolution in Parquet

Parquet is a file format, not a streaming protocol. Each file embeds its own schema in the footer. You don't evolve a schema — you evolve a table that happens to be a directory of Parquet files with potentially different schemas.

Parquet readers handle the easy cases gracefully. When older code reads newer files, unknown columns are skipped because the reader projects only what it asked for. When newer code reads older files, missing columns are returned as NULL if the reader's schema marks them nullable.

What plain Parquet does not support without rewriting data: rename column (most engines treat names as identity, so the old column quietly becomes NULL); non-widening type change (anything beyond int → long, int → double, or similar fails at read time); and drop-then-re-add of the same name (engine-dependent, never trust it).
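To see why a rename quietly turns into NULL, here is a toy model of name-based projection. Lists of dicts stand in for Parquet files and `read_table` is a hypothetical reader, so this illustrates only the matching rule, not any engine's actual behaviour.

```python
def read_table(files, requested_columns):
    """Read rows from files that each carry their own column set.

    Columns are matched by name only; a column missing from a file comes
    back as None, like a nullable column absent from an older file.
    """
    rows = []
    for file_rows in files:
        for row in file_rows:
            rows.append({col: row.get(col) for col in requested_columns})
    return rows


old_file = [{"id": 1, "ts": 1700000000}]        # written before the rename
new_file = [{"id": 2, "event_ts": 1700000500}]  # written after ts -> event_ts

rows = read_table([old_file, new_file], ["id", "event_ts"])
for row in rows:
    print(row)  # the old file's timestamp is lost under the new name
```

Nothing fails here, which is exactly the problem: the query succeeds and the historical values are simply missing.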

This is why plain Parquet on object storage with no table format is a trap for evolving schemas. Metadata lives at the file level, and nothing enforces compatibility centrally. Modern shops solve this with Iceberg or Delta on top of Parquet.

Schema Registry (Confluent)

A Schema Registry is a central HTTP service that stores schemas, assigns each one a numeric ID, and enforces a compatibility policy when new versions are registered. Confluent's open-source registry is the de facto standard; AWS Glue Schema Registry and Apicurio are functional alternatives.

The workflow on the wire:

  1. Producer registers its schema. Registry checks compatibility against previous versions.
  2. If the check passes, registry returns a schema_id (typically a 4-byte int).
  3. Producer writes each Kafka record as [magic_byte][schema_id][payload].
  4. Consumer reads the schema_id, fetches the schema from the registry (cached after first lookup), and decodes the payload.
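Steps 3 and 4 can be sketched with the standard library. This assumes the Confluent framing of a zero magic byte followed by a 4-byte big-endian schema ID; `frame` and `unframe` are illustrative names, and the payload is treated as opaque bytes.

```python
import struct

MAGIC_BYTE = 0  # assumed Confluent wire-format marker


def frame(schema_id, payload):
    """Prefix an encoded payload with the magic byte and schema ID."""
    # ">bI" = big-endian, 1-byte magic, 4-byte unsigned schema ID (5 bytes total)
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def unframe(message):
    """Split a framed message into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[5:]


message = frame(42, b"\x0e")
print(message)
print(unframe(message))
```

A consumer would use the returned schema ID as the cache key for its registry lookup before decoding the payload.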

Compatibility levels you can set per subject:

| Level               | Meaning                                                            |
|---------------------|--------------------------------------------------------------------|
| BACKWARD (default)  | New schema can read data written with the previous version        |
| BACKWARD_TRANSITIVE | New schema can read data written with all previous versions       |
| FORWARD             | Previous schema can read data written with the new schema         |
| FORWARD_TRANSITIVE  | All previous schemas can read data written with the new schema    |
| FULL                | Both backward and forward against the previous version            |
| FULL_TRANSITIVE     | Both directions across all history                                 |
| NONE                | No checks. Reserved for genuine emergencies.                      |

The _TRANSITIVE variants matter more than people realize. Plain BACKWARD only guarantees compatibility with the immediately previous schema — so over five versions you can drift to something completely incompatible with v1 while every individual hop is "safe". Use BACKWARD_TRANSITIVE for any topic where you cannot guarantee all consumers upgrade in lockstep.
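A three-version example makes the drift concrete. The checker below is a deliberately simplified stand-in for the registry's logic (field presence and defaults only), and the function names are invented for the sketch.

```python
def can_read(reader, writer):
    """True if every reader field exists in the writer schema or has a default."""
    written = {f["name"] for f in writer["fields"]}
    return all(f["name"] in written or "default" in f for f in reader["fields"])


def passes_backward(history, candidate):
    # BACKWARD: check only against the latest registered version.
    return can_read(candidate, history[-1])


def passes_backward_transitive(history, candidate):
    # BACKWARD_TRANSITIVE: check against every registered version.
    return all(can_read(candidate, old) for old in history)


v1 = {"fields": [{"name": "id"}]}
v2 = {"fields": [{"name": "id"}, {"name": "email", "default": None}]}
v3 = {"fields": [{"name": "id"}, {"name": "email"}]}  # default removed

print(passes_backward([v1], v2))                 # hop v1 -> v2
print(passes_backward([v1, v2], v3))             # hop v2 -> v3
print(passes_backward_transitive([v1, v2], v3))  # v3 against all history
```

Each hop passes on its own, yet a v3 reader has no way to fill `email` for a v1 record, which only the transitive check catches.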

If your team runs Kafka without a registry, an interviewer reading your resume will assume you're either pre-product-market-fit or about to have an outage.

Iceberg and Delta schema evolution

Lakehouse table formats — Apache Iceberg and Delta Lake — treat schema evolution as a first-class operation, not an emergent property of files in a directory. This is the largest practical reason to adopt them over plain Parquet.

Iceberg supports the full menu in pure DDL, with no data rewrite:

ALTER TABLE events ADD COLUMN device_type STRING;
ALTER TABLE events RENAME COLUMN ts TO event_ts;
ALTER TABLE events ALTER COLUMN id TYPE BIGINT;
ALTER TABLE events DROP COLUMN deprecated_flag;
ALTER TABLE events ALTER COLUMN email DROP NOT NULL;

The trick behind rename without rewrite: Iceberg assigns every column a stable integer ID at creation and references it in every data file. The display name lives only in table metadata. Renaming is a metadata update — zero Parquet bytes touched. This alone has saved teams on petabyte tables from week-long migrations.
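The indirection is easy to model. The toy `Table` class below is purely illustrative and is not Iceberg's API: rows are stored keyed by field ID, and the name-to-ID mapping lives only in the table object, standing in for table metadata.

```python
class Table:
    def __init__(self, columns):
        # name -> field ID mapping; stands in for table metadata
        self.name_to_id = {name: i for i, name in enumerate(columns, start=1)}
        self.data_files = []  # each stored row is {field_id: value}

    def append(self, row):
        self.data_files.append({self.name_to_id[k]: v for k, v in row.items()})

    def rename_column(self, old, new):
        # Metadata-only change: the field ID stays the same.
        self.name_to_id[new] = self.name_to_id.pop(old)

    def scan(self):
        # Translate field IDs back to current names at read time.
        id_to_name = {fid: name for name, fid in self.name_to_id.items()}
        return [{id_to_name[fid]: v for fid, v in row.items()}
                for row in self.data_files]


events = Table(["id", "ts"])
events.append({"id": 1, "ts": 100})
events.rename_column("ts", "event_ts")

print(events.data_files)  # stored rows untouched by the rename
print(events.scan())      # old data visible under the new name
```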

Iceberg supports add (with NULL default), drop (soft delete in metadata), rename, reorder, type widening (int → long, float → double, decimal precision growth), and making columns optional. Making a column required is generally not supported, since old records may lack the field.

Delta Lake offers a similar menu but is name-based by default. Enable column mapping (set 'delta.columnMapping.mode' to 'name' or 'id') to unlock Iceberg-style rename support. Without it, Delta cannot rename or drop a column without rewriting the data.

Sanity check: If your interview answer ends with "and that's why we put Iceberg on top of Parquet", you have a defensible architecture argument. If it ends with "and we just rewrite the table", you're describing a system that doesn't scale past a few hundred GB.

Common pitfalls

The mistakes here tend to be operational rather than conceptual, which is exactly why interviewers ask scenario questions — they want to hear that you've felt the pain, not just read the docs.

Shipping a breaking change without a version bump. The producer adds a required field, the registry was set to NONE, four downstream services crash. The fix is layered: set the registry to BACKWARD_TRANSITIVE, require contract tests in CI for both ends, and review every schema change like a DB migration. Treat the schema as production infra, not as code.

Renaming a column in plain Parquet and assuming readers figure it out. Spark, Athena, Trino, and DuckDB will typically return NULL for the new name on every file written before the rename, without raising an error. The fix is to migrate to Iceberg first, or add the new column, dual-write for one cycle, and drop the old. Renaming in plain Parquet is never safe without a rewrite.

Changing a type without checking promotion rules. int → long looks innocuous but breaks on engines that don't auto-promote in predicate pushdown. string → int is the cinematic version — every existing record fails to decode. Always run a backfill query against a sample first.

Forgetting that defaults are evaluated by the reader, not the writer. When a v2 reader decodes a v1 record, the reader inserts the default from the v2 schema because the v1 record has no bytes for the new field. If you change the default in v3, every old record decoded by v3 differs from the same record decoded by v2. Defaults are part of the contract; bumping them is itself a schema change.
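A short sketch of the effect, using a hypothetical `plan` field whose default changes between v2 and v3. The stored record is modelled as a dict and `decode` is an illustrative helper, not a library call.

```python
def decode(stored, reader_schema):
    """Decode a stored record, filling absent fields from the reader's defaults."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        # The stored record carries no value for fields added later, so the
        # value comes from whichever reader schema is doing the decoding.
        out[name] = stored[name] if name in stored else field["default"]
    return out


v1_record = {"id": 7}  # written before `plan` existed

V2 = {"fields": [{"name": "id"}, {"name": "plan", "default": "free"}]}
V3 = {"fields": [{"name": "id"}, {"name": "plan", "default": "trial"}]}

print(decode(v1_record, V2))
print(decode(v1_record, V3))
```

The bytes on disk never changed, yet the two readers disagree about the same record, which is why a changed default deserves the same review as any other schema change.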

Treating BACKWARD like BACKWARD_TRANSITIVE. The registry will let you drift across five versions until v6 is incompatible with v1, even though every hop passed. Pick _TRANSITIVE for any long-lived topic where consumer lag can stretch past a release cycle.

FAQ

Protobuf vs Avro for schema evolution — which is friendlier?

Both support evolution but with different defaults. Protobuf (proto3) treats every field as optional with implicit zero-value defaults, making "add a field" almost frictionless — but also making it easy to ship a logical breaking change without the wire format noticing. Avro forces explicit defaults in the schema, which is more ceremony but catches more bugs at registration. Most Kafka shops at scale end up on Avro plus Confluent Schema Registry; most internal RPC shops end up on Protobuf because gRPC pairs with it natively.

Can I roll back a schema after registering a new version?

Conceptually no — schema IDs are append-only and the registry won't let you delete versions in normal operation. Practically, you re-register the older schema as a new version, provided it passes the compatibility check against the current head. The cleanest rollback is to fix forward: ship a new version that restores the old behavior. The escape hatch for genuine emergencies is a new topic with a fresh subject — never touch a topic with active consumers.

What's the practical difference between BACKWARD and BACKWARD_TRANSITIVE?

BACKWARD checks only against the previous version. Over many releases, you can legally drift to a schema incompatible with anything older than N-1, even though every individual step was approved. BACKWARD_TRANSITIVE checks against the entire version history — slower to compute, but guarantees a v8 consumer can read a v1 record. For long-lived topics with consumer lag stretching past a release cycle, transitive is the safer default.

How does Iceberg rename a column without rewriting the data?

Every Iceberg column has a stable integer field ID assigned at creation and embedded in every Parquet data file's metadata. The human-readable name used in queries lives in the table metadata. Renaming changes that metadata entry; Parquet files still reference the same field ID. Readers translate the new name to the field ID at plan time. Plain Parquet has no such indirection (names are the identity), which is why rename without rewrite is impossible there.

Is schema evolution different in Snowflake or BigQuery?

Both have their own DDL semantics. Snowflake supports add column, type widening, and drop column natively. BigQuery supports add and drop column and type relaxation (REQUIRED → NULLABLE) with limited type changes. Both can read external Iceberg or Delta tables — evolution rules then come from the table format, not the warehouse. The interview answer: warehouse-native is simpler day-to-day; external Iceberg gives portability across engines.