May 7, 2026·11 min read

Lakehouse, Iceberg, and Delta for DE interviews

Contents:

Lake vs lakehouse vs DWH
Why table formats exist
Apache Iceberg
Delta Lake
Apache Hudi
Picking a format
Common pitfalls
Related reading
FAQ

Lake vs lakehouse vs DWH

If a Snowflake or Databricks interviewer opens with "walk me through your storage layer", the answer they want to hear involves three terms — data warehouse, data lake, and lakehouse — and a clear story about which problems each one solves. Get the framing wrong and the rest of the loop becomes an uphill battle.

A data warehouse stores structured tables with strong schemas, ACID guarantees, and tight query latency. Think Snowflake, BigQuery, Redshift, ClickHouse. It is expensive per terabyte, but indexing and statistics make analytical queries fly. A data lake is the opposite extreme: raw files in object storage (S3, GCS, ADLS, HDFS), any format, schema-on-read. Cheap, but no transactions, no UPDATE/DELETE, no schema enforcement. A lakehouse is a data lake plus a metadata-aware table layer that gives you ACID, schema evolution, and time travel on top of the same cheap object storage.

The pain a lakehouse solves is concrete. Two Spark jobs writing to the same Parquet directory at once can clobber each other's output, and a reader that lists the directory mid-write sees a torn mix of old and new files. Iceberg and Delta fix this by making writes atomic at the table level and exposing snapshots instead of raw file listings.

Interview answer: "Lakehouse is an architecture that combines cheap object storage with warehouse-grade table semantics via an open table format — Iceberg, Delta, or Hudi. You get ACID, schema evolution, and time travel on S3 without paying warehouse storage prices."

Why table formats exist

Without a table format, a table is just a folder of Parquet files:

```
s3://bucket/orders/
  year=2026/month=05/day=01/part-0000.parquet
  year=2026/month=05/day=01/part-0001.parquet
  ...
```

That layout has four structural problems. No transactions means a Spark job that dies halfway leaves half-written files behind, and concurrent readers see corrupted state. No UPDATE or DELETE means a single-row fix requires rewriting the entire partition. Schema evolution is manual — add a column and older files become incompatible with new readers. And there is no time travel, so yesterday's snapshot is gone the moment the next batch lands.

A table format adds a metadata layer that solves all four:

```
s3://bucket/orders/
  data/                       ← parquet files as before
  metadata/                   ← manifests, versions, snapshots
    snap-1234.avro
    v0.metadata.json
    ...
```

The metadata enumerates which data files belong to the table at a given version. Writing becomes "append a new metadata version that references new data files", and the swap is atomic. Readers always see a consistent snapshot. This is the single mental model that separates a senior DE answer from a junior one.
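The commit protocol described above can be sketched as a toy in-memory model. This is an illustration only, not how any particular format implements it: `Snapshot`, `ToyTable`, `commit_append`, and the `current` pointer are hypothetical names standing in for metadata files and a catalog.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable table version: the full list of data files it contains."""
    version: int
    files: tuple


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a metadata layer plus catalog pointer."""
    snapshots: dict = field(default_factory=lambda: {0: Snapshot(0, ())})
    current: int = 0  # the "pointer" a real catalog would swap atomically

    def read(self) -> Snapshot:
        # Readers resolve the pointer once and then use that frozen snapshot.
        return self.snapshots[self.current]

    def commit_append(self, base_version: int, new_files: list) -> bool:
        # Compare-and-swap: the commit succeeds only if nobody else committed
        # since the writer read base_version. A real writer would retry.
        if base_version != self.current:
            return False
        base = self.snapshots[base_version]
        new = Snapshot(base_version + 1, base.files + tuple(new_files))
        self.snapshots[new.version] = new
        self.current = new.version
        return True


table = ToyTable()
reader_view = table.read()                             # reader pins version 0
ok_a = table.commit_append(0, ["part-0000.parquet"])
ok_b = table.commit_append(0, ["part-0001.parquet"])   # stale base, rejected

print(ok_a, ok_b)           # True False
print(reader_view.files)    # () because the pinned snapshot never changes
print(table.read().files)   # ('part-0000.parquet',)
```

The second writer loses the race and would have to re-read the current version and retry, while the reader that started earlier keeps a consistent view throughout.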

Apache Iceberg

Apache Iceberg is an open-source table format originally built at Netflix and donated to Apache. It is supported by Spark, Trino, Flink, Snowflake, BigQuery, ClickHouse, and DuckDB — the broadest engine support of any format in 2026.

Key properties worth memorizing for the loop:

Snapshot-based — every write produces a new snapshot pointing at a list of manifest files
ACID on a single table even on plain S3, via atomic catalog pointer swaps
Hidden partitioning — the partition is derived from a column expression like days(event_time), so query authors do not need to know the partition column
Schema evolution — add, drop, rename, reorder columns with no file rewrites
Partition evolution — change partitioning strategy without migrating existing data
Time travel via FOR VERSION AS OF or FOR TIMESTAMP AS OF
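Hidden partitioning is easier to explain with the transform written out. The sketch below assumes a day transform defined as whole days since 1970-01-01 in UTC and timezone-aware timestamps; `days_transform`, `partitions_for_range`, and the `files_by_day` layout are hypothetical names for illustration, not Iceberg APIs.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def days_transform(event_time: datetime) -> int:
    """Days since 1970-01-01 (UTC), analogous to a days(event_time) partition value."""
    return (event_time.astimezone(timezone.utc).date() - EPOCH).days


def partitions_for_range(start: datetime, end: datetime) -> range:
    """Partition values that a filter start <= event_time <= end can touch."""
    return range(days_transform(start), days_transform(end) + 1)


# Hypothetical file listing keyed by the derived partition value.
may_1 = days_transform(datetime(2026, 5, 1, 13, 30, tzinfo=timezone.utc))
files_by_day = {
    may_1 - 1: ["a.parquet"],
    may_1: ["b.parquet"],
    may_1 + 1: ["c.parquet"],
}

# The query author filters on event_time only; the engine derives the partitions.
wanted = partitions_for_range(
    datetime(2026, 5, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2026, 5, 1, 23, 59, tzinfo=timezone.utc),
)
scanned = [f for day, fs in files_by_day.items() if day in wanted for f in fs]
print(scanned)  # ['b.parquet']
```

The point to make in an interview is that the filter mentions only `event_time`; the mapping from predicate to partition is the format's job, not the query author's.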

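```sql
-- Spark SQL on Iceberg
CREATE TABLE catalog.db.orders (
    order_id   BIGINT,
    user_id    BIGINT,
    amount     DECIMAL(18, 2),
    event_time TIMESTAMP
) USING iceberg
PARTITIONED BY (days(event_time));  -- hidden partitioning: derived from event_time

-- Row-level UPDATE typically requires the Iceberg Spark SQL extensions
UPDATE catalog.db.orders SET amount = 0 WHERE order_id = 123;

SELECT * FROM catalog.db.orders.history;  -- snapshot log
-- Time travel: read the table as it was at a point in time
SELECT * FROM catalog.db.orders FOR TIMESTAMP AS OF '2026-05-01 00:00:00';
```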
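How a timestamp-based time travel query picks its snapshot can be sketched in a few lines. This is a simplified model that assumes the rule "latest snapshot committed at or before the requested time"; the `history` list and `snapshot_as_of` function are hypothetical and only mimic what a snapshot log lookup does.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical snapshot log: (commit time, snapshot id), sorted by commit time.
history = [
    (datetime(2026, 4, 29, 2, 0), 101),
    (datetime(2026, 4, 30, 2, 0), 102),
    (datetime(2026, 5, 1, 2, 0), 103),
]


def snapshot_as_of(log, ts):
    """Return the id of the latest snapshot committed at or before ts."""
    times = [committed_at for committed_at, _ in log]
    idx = bisect_right(times, ts)
    if idx == 0:
        raise ValueError(f"no snapshot exists at or before {ts}")
    return log[idx - 1][1]


print(snapshot_as_of(history, datetime(2026, 5, 1, 0, 0)))  # 102
print(snapshot_as_of(history, datetime(2026, 5, 1, 2, 0)))  # 103
```

A query for midnight on May 1 lands on the April 30 snapshot, because the May 1 batch had not been committed yet at that moment.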

The catalog is the single source of transactional truth. Common implementations are Hive Metastore, AWS Glue, Nessie, and services built on Iceberg's REST catalog spec (Polaris, Tabular, Unity Catalog). Standardize on one catalog per environment; mixing them is the fastest way to produce ghost tables and silent data loss.

Delta Lake

Delta Lake is the table format that ships with Databricks and was open-sourced under the Linux Foundation. It has the deepest Spark integration of the three and a very large install base inside Databricks customers.

Distinguishing features:

Transaction log lives in _delta_log/ as ordered JSON commit files, plus periodic Parquet checkpoints, alongside the data
ACID, MERGE, UPDATE, DELETE all supported
Time travel by version or timestamp
Schema evolution via mergeSchema or explicit ALTER TABLE
Z-ordering — multi-dimensional clustering via a space-filling curve, improves pruning across several predicates at once
OPTIMIZE — compact small files into target-sized ones

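```python
# PySpark + Delta (assumes an existing SparkSession `spark` with Delta
# configured and a DataFrame `df` that has an event_date column)
from delta.tables import DeltaTable

# Write the DataFrame as a Delta table partitioned by event_date
(df.write
    .format("delta")
    .partitionBy("event_date")
    .save("s3://bucket/orders/"))

# Load the table by path and update a single row in place
delta_t = DeltaTable.forPath(spark, "s3://bucket/orders/")
delta_t.update(condition="order_id = 123", set={"amount": "0"})
```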

Delta stores metadata in _delta_log/ next to the data, so a Hive Metastore is optional, not required. Historically Delta was tightly coupled to Spark and the Databricks runtime. Delta 3.0+ narrowed that gap: UniForm writes Iceberg-compatible metadata alongside the Delta log so Iceberg readers can query the same data files, and connectors for Trino, Flink, and Athena read Delta tables directly.
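Reading a Delta table starts with replaying that log. The sketch below is a deliberately trimmed model: it assumes commit files named by zero-padded version that contain one JSON action per line, and it handles only `add` and `remove` actions. The `commits` dictionary and `live_files` function are hypothetical; real commits carry many more fields, and real readers also use checkpoints.

```python
import json


def commit_name(version: int) -> str:
    """Commit files are named by zero-padded version number."""
    return f"{version:020d}.json"


# Hypothetical, heavily trimmed log contents keyed by commit file name.
commits = {
    commit_name(0): '{"add": {"path": "part-0000.parquet"}}\n'
                    '{"add": {"path": "part-0001.parquet"}}',
    commit_name(1): '{"remove": {"path": "part-0000.parquet"}}\n'
                    '{"add": {"path": "part-0002.parquet"}}',
}


def live_files(log, as_of_version=None):
    """Replay commits in order and return the files that make up the table."""
    live = set()
    for name in sorted(log):
        version = int(name.split(".")[0])
        if as_of_version is not None and version > as_of_version:
            break  # time travel: stop replaying at the requested version
        for line in log[name].splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


print(live_files(commits))                   # ['part-0001.parquet', 'part-0002.parquet']
print(live_files(commits, as_of_version=0))  # ['part-0000.parquet', 'part-0001.parquet']
```

The same replay explains time travel by version: stopping early yields the file set of an older table state, as long as those files have not been vacuumed.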

Apache Hudi

Apache Hudi, built at Uber, is the third major format. Its differentiator is a clean split between two storage modes:

Copy-on-Write (CoW): every write rewrites the affected base files. Optimized for read-heavy workloads where queries dominate.
Merge-on-Read (MoR): writes append changes to row-based log files next to the columnar base files, and periodic compaction merges them into new base files. Optimized for write-heavy, near-real-time pipelines.

Hudi is the go-to choice for high-throughput upsert pipelines — ride-share trip updates, marketplace order state machines, IoT telemetry. In most interview loops you only need to articulate the CoW vs MoR trade-off and explain when each one wins; deep Hudi internals are rarely asked unless the team is already running Hudi in production.
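The CoW vs MoR trade-off can be made concrete with a toy model. Here each "file" is just a dictionary of key to value, and only updates to existing keys are modeled; `cow_upsert`, `mor_upsert`, and `mor_read` are hypothetical helpers, not Hudi APIs.

```python
def cow_upsert(base_files, updates):
    """Copy-on-Write: rewrite every base file that contains an updated key."""
    new_files, rewritten = [], 0
    for f in base_files:
        touched = {k: v for k, v in updates.items() if k in f}
        if touched:
            f = {**f, **touched}  # a whole new copy of the file
            rewritten += 1
        new_files.append(f)
    return new_files, rewritten


def mor_upsert(log, updates):
    """Merge-on-Read: append the change to a log and leave base files alone."""
    return log + [dict(updates)]


def mor_read(base_files, log):
    """Readers pay the cost instead: merge base files with the log at query time."""
    merged = {}
    for f in base_files:
        merged.update(f)
    for entry in log:
        merged.update(entry)
    return merged


base = [{1: "created", 2: "created"}, {3: "created", 4: "created"}]
updates = {2: "paid"}

cow_files, rewritten = cow_upsert(base, updates)
log = mor_upsert([], updates)

print(rewritten)            # 1 base file rewritten to change one row
print(len(log))             # 1 small log entry appended, no base file touched
print(mor_read(base, log))  # {1: 'created', 2: 'paid', 3: 'created', 4: 'created'}
```

Both paths end in the same logical table. CoW pays at write time by rewriting a whole file for one row; MoR pays at read time until compaction folds the log back into base files.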

Picking a format

| Attribute | Iceberg | Delta | Hudi |
| --- | --- | --- | --- |
| Governance | Apache | Linux Foundation | Apache |
| Engine support | Broad, multi-engine | Spark first-class, Trino/Flink via 3.0+ | Spark, Flink |
| Schema evolution | Strict, no rewrite | Strict, no rewrite | Flexible |
| Partition evolution | Yes | No | No |
| Hidden partitioning | Yes | No | No |
| Time travel | Yes | Yes | Yes |
| `MERGE` / `UPDATE` / `DELETE` | Yes | Yes | Yes |
| Read-heavy workloads | Strong | Strong | CoW |
| Write-heavy near-real-time | OK | OK | MoR (best) |

When the interviewer asks "which one would you pick?", give a workload-driven answer rather than a flavor-of-the-month answer:

Iceberg — multi-engine stack (Spark + Trino + Athena + Snowflake external tables), strong open-source posture, need for partition evolution
Delta — Spark/Databricks-centric stack, want the simplest path with first-class engine support
Hudi — heavy upsert volume with sub-minute freshness targets, CDC ingest from operational databases at high QPS

In 2026 the practical landscape is Iceberg vs Delta, with Iceberg gaining momentum in open-source-first shops and Delta dominating Databricks accounts. Hudi keeps a strong niche in streaming-upsert workloads.

Common pitfalls

The most common production failure is small-file blowup. Streaming ingestion into Iceberg or Delta produces thousands of tiny files per hour, and after a few weeks reads slow to a crawl because the engine spends more time opening files than reading data. The fix is to schedule compaction — Iceberg's rewrite_data_files action or Delta's OPTIMIZE — and treat it as part of the pipeline, not an afterthought.
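What a compaction job decides can be sketched as a bin-packing plan. This is a generic first-fit-decreasing illustration, not the planner used by `rewrite_data_files` or `OPTIMIZE`; the `plan_compaction` function and the 128 MB target are assumptions made for the example.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into bins of at most target_mb (first-fit decreasing)."""
    groups = []
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    for size in small:
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            groups.append([size])  # no existing bin has room, open a new one
    # A group with a single file gains nothing from being rewritten.
    return [g for g in groups if len(g) > 1]


sizes = [4, 8, 120, 60, 60, 2, 512, 30]
print(plan_compaction(sizes))  # [[120, 8], [60, 60, 4, 2]]
```

Six small files collapse into two files close to the target size, while the 512 MB file and the lone 30 MB file are left alone. Running a plan like this on a schedule keeps file counts flat as streaming writes continue.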

A second trap is infinite time travel storage. Every snapshot holds onto its data files, so retention grows linearly with write frequency. Without expire_snapshots on Iceberg or VACUUM on Delta, storage bills balloon and metadata listings get slow. Pick a retention policy (often 7 to 30 days) and enforce it in scheduled maintenance.
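The retention logic is worth being able to whiteboard. The sketch below assumes a simple rule (drop snapshots older than the retention window, always keep the latest, delete only files that no kept snapshot references); `plan_expiry` and the sample snapshots are hypothetical and do not reproduce the exact behavior of `expire_snapshots` or `VACUUM`.

```python
from datetime import datetime, timedelta


def plan_expiry(snapshots, now, retention):
    """Return (kept snapshots, data files that are safe to delete).

    snapshots: list of (committed_at, set_of_data_files), oldest first.
    The latest snapshot is always kept, whatever its age.
    """
    cutoff = now - retention
    older, latest = snapshots[:-1], snapshots[-1:]
    kept = [s for s in older if s[0] >= cutoff] + latest
    expired = [s for s in older if s[0] < cutoff]
    still_referenced = set().union(*(files for _, files in kept))
    candidates = set().union(*(files for _, files in expired))
    return kept, sorted(candidates - still_referenced)


snapshots = [
    (datetime(2026, 5, 1), {"a.parquet", "b.parquet"}),
    (datetime(2026, 5, 10), {"b.parquet", "c.parquet"}),  # a was rewritten into c
    (datetime(2026, 5, 20), {"c.parquet", "d.parquet"}),
]
kept, deletable = plan_expiry(
    snapshots, now=datetime(2026, 5, 21), retention=timedelta(days=14)
)
print(len(kept), deletable)  # 2 ['a.parquet']
```

Only `a.parquet` can go: `b.parquet` also appears in the expired snapshot, but a retained snapshot still references it, so deleting it would break time travel inside the retention window.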

A third pitfall is assuming partition evolution works everywhere. Iceberg supports it natively — you can switch from daily to hourly partitioning without rewriting old data. Delta does not. If you anticipate changing the partition strategy as data volume grows, that single feature can be the deciding factor between formats.

The fourth trap is catalog drift. Pointing Iceberg writers at one catalog (say Glue) and readers at another (say Hive Metastore) produces split-brain tables where writes are invisible to half your consumers. Pick one catalog per environment, lock it down with IAM, and document it on the team wiki.

The fifth, and most operationally dangerous, is mutating files outside the format. Someone runs aws s3 cp to add a file directly into data/, the manifest is never updated, and the file is invisible to readers. Worse: someone runs aws s3 rm on a file that the latest snapshot still references and reads start failing. All writes must go through the format API, full stop.
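Both failure modes show up as a mismatch between what storage lists and what the metadata references. A minimal audit, assuming you can obtain both lists, is a pair of set differences; `audit_table` and the sample paths are hypothetical.

```python
def audit_table(listed_files, referenced_files):
    """Compare an object-store listing with the files the current snapshot references."""
    listed, referenced = set(listed_files), set(referenced_files)
    return {
        # Present in storage but unknown to the metadata: invisible to readers.
        "orphans": sorted(listed - referenced),
        # Referenced by the metadata but gone from storage: reads will fail.
        "missing": sorted(referenced - listed),
    }


listed = ["data/part-0000.parquet", "data/part-0001.parquet", "data/manual-copy.parquet"]
referenced = ["data/part-0000.parquet", "data/part-0001.parquet", "data/part-0002.parquet"]

print(audit_table(listed, referenced))
# {'orphans': ['data/manual-copy.parquet'], 'missing': ['data/part-0002.parquet']}
```

Orphans only waste storage and can be cleaned up (Iceberg ships a `remove_orphan_files` procedure for that case). Missing files are the dangerous half, because the only fix is restoring the file or rolling the table back to a snapshot that does not need it.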

Treating the lakehouse as a drop-in replacement for the warehouse is the sixth and most strategic mistake. For interactive BI with sub-second latency, dashboards with hundreds of concurrent users, and tight SLAs, a dedicated warehouse (Snowflake, BigQuery, ClickHouse) is still faster and easier to operate. The realistic 2026 pattern is warehouse for hot interactive data, lakehouse for batch ETL, ML feature stores, and cold archival.

FAQ

Can I migrate a Parquet table to Iceberg in place?

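Yes. In Spark, the `system.migrate` procedure replaces an existing table with an Iceberg table over the same data files, and `system.add_files` registers existing Parquet files in an Iceberg table you created yourself. Existing files are not rewritten; only a metadata layer is added. The catch is that all subsequent writes must go through Iceberg: files added or removed directly in storage are never reflected in the manifests, so the metadata and the data drift apart.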

Will lakehouse fully replace the warehouse?

For batch ETL, ML training data, and cold archival the answer is increasingly yes. For interactive BI with sub-second response times, concurrent dashboard load, and complex joins on hot data, dedicated warehouses still win on latency and operational simplicity. Most large data teams in 2026 run a hybrid — warehouse for hot, lakehouse for cold, with the same table format readable from both.

What is the difference between a catalog and a metastore?

The Hive Metastore is the original Hive-era catalog with limited transactional semantics. Newer catalogs (AWS Glue, Nessie, Apache Polaris, Databricks Unity Catalog) add capabilities such as fine-grained access control, lineage, and, in Nessie's case, multi-table commits, but the feature set varies by catalog. Iceberg defines a catalog interface with many implementations; Delta historically relied on the Hive Metastore or the Databricks-specific Unity Catalog.

Is streaming into Iceberg or Delta production-ready?

Yes, with caveats. Spark Structured Streaming writes to both formats reliably, and Flink has first-class Iceberg support. For freshness below one minute, Hudi MoR or specialized real-time stores (ClickHouse, Apache Druid, Apache Pinot) are usually a better fit. Always pair streaming writes with a compaction job — the small-file problem is unavoidable otherwise.

What is Z-ordering in Delta?

Z-ordering linearizes data along several columns using a space-filling curve, so files cluster by multiple dimensions at once. The benefit is multi-column pruning: a query filtering by both user_id and event_date skips far more files than with single-column sorting. Iceberg's closest equivalents are a table-level write sort order and the z-order sort option of the rewrite_data_files compaction procedure.
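The pruning benefit can be demonstrated with a textbook Morton code, which interleaves the bits of two columns. This is a sketch of the general technique on a tiny 4x4 grid, not Delta's actual implementation; `z_value`, `files_scanned`, and the four-rows-per-file layout are assumptions for illustration.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def linear_key(row):
    """Plain sort by (user_id, day)."""
    return row


def z_key(row):
    """Sort by the interleaved bits of user_id and day."""
    return z_value(row[0], row[1])


def files_scanned(rows, key, col, value, rows_per_file=4):
    """Sort rows, cut them into files, count files whose min/max range admits value."""
    ordered = sorted(rows, key=key)
    files = [ordered[i:i + rows_per_file] for i in range(0, len(ordered), rows_per_file)]
    return sum(
        min(r[col] for r in f) <= value <= max(r[col] for r in f) for f in files
    )


rows = [(user_id, day) for user_id in range(4) for day in range(4)]

print(files_scanned(rows, linear_key, col=1, value=2))  # 4: a day filter prunes nothing
print(files_scanned(rows, z_key, col=1, value=2))       # 2
print(files_scanned(rows, linear_key, col=0, value=1))  # 1
print(files_scanned(rows, z_key, col=0, value=1))       # 2
```

The linear sort is ideal for the leading column and useless for the second one; the Z-order layout gives up a little on the first column to make both columns prunable.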

Is this official documentation?

No. This guide is a synthesis of the Apache Iceberg, Delta Lake, and Apache Hudi public documentation plus common lakehouse production practice. Always validate behavior against the format version you actually run.