Lakehouse, Iceberg, and Delta for DE interviews
Lake vs lakehouse vs DWH
If a Snowflake or Databricks interviewer opens with "walk me through your storage layer", the answer they want to hear involves three terms — data warehouse, data lake, and lakehouse — and a clear story about which problems each one solves. Get the framing wrong and the rest of the loop becomes an uphill battle.
A data warehouse stores structured tables with strong schemas, ACID guarantees, and tight query latency. Think Snowflake, BigQuery, Redshift, ClickHouse. It is expensive per terabyte, but indexing and statistics make analytical queries fly. A data lake is the opposite extreme: raw files in object storage (S3, GCS, ADLS, HDFS), any format, schema-on-read. Cheap, but no transactions, no UPDATE/DELETE, no schema enforcement. A lakehouse is a data lake plus a metadata-aware table layer that gives you ACID, schema evolution, and time travel on top of the same cheap object storage.
The pain a lakehouse solves is concrete. Two Spark jobs writing to the same Parquet directory at once can leave partially overwritten files behind, and readers see torn state. Iceberg and Delta fix this by making each commit atomic at the table level and by exposing snapshots instead of raw file listings.
Interview answer: "Lakehouse is an architecture that combines cheap object storage with warehouse-grade table semantics via an open table format — Iceberg, Delta, or Hudi. You get ACID, schema evolution, and time travel on S3 without paying warehouse storage prices."
Why table formats exist
Without a table format, a table is just a folder of Parquet files:
s3://bucket/orders/
  year=2026/month=05/day=01/part-0000.parquet
  year=2026/month=05/day=01/part-0001.parquet
  ...

That layout has four structural problems. No transactions means a Spark job that dies halfway leaves half-written files behind, and concurrent readers see corrupted state. No UPDATE or DELETE means a single-row fix requires rewriting the entire partition. Schema evolution is manual: add a column and older files become incompatible with new readers. And there is no time travel, so yesterday's snapshot is gone the moment the next batch lands.
A table format adds a metadata layer that solves all four:
s3://bucket/orders/
  data/        ← parquet files as before
  metadata/    ← manifests, versions, snapshots
    snap-1234.avro
    v0.metadata.json
    ...

The metadata enumerates which data files belong to the table at a given version. Writing becomes "append a new metadata version that references new data files", and the swap is atomic. Readers always see a consistent snapshot. This is the single mental model that separates a senior DE answer from a junior one.
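The "new metadata version plus atomic swap" idea can be sketched in a few lines of plain Python. This is a toy model, not how Iceberg or Delta store metadata: the ToyTable and Snapshot classes and their fields are hypothetical names chosen for illustration, and the in-memory pointer stands in for whatever the real format swaps atomically.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: an id plus the data files it references."""
    snapshot_id: int
    data_files: frozenset


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table format's metadata layer."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, frozenset())])
    current: int = 0  # the pointer that readers resolve first

    def append(self, new_files):
        """Commit = build a new snapshot, then move the pointer in one step."""
        base = self.snapshots[self.current]
        snap = Snapshot(len(self.snapshots), base.data_files | frozenset(new_files))
        self.snapshots.append(snap)       # new metadata is written first
        self.current = snap.snapshot_id   # then published by a single pointer swap
        return snap.snapshot_id

    def scan(self, snapshot_id=None):
        """Readers list files from a snapshot, never from the directory."""
        sid = self.current if snapshot_id is None else snapshot_id
        return sorted(self.snapshots[sid].data_files)


table = ToyTable()
v1 = table.append(["data/part-0000.parquet"])
v2 = table.append(["data/part-0001.parquet"])
print(table.scan())    # latest snapshot: both files
print(table.scan(v1))  # pinned older snapshot: only the first file
```

Because a reader resolves one snapshot id and works only from that file list, a half-finished write is never visible, and reading an older id is all that time travel amounts to in this model.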
Apache Iceberg
Apache Iceberg is an open-source table format originally built at Netflix and donated to Apache. It is supported by Spark, Trino, Flink, Snowflake, BigQuery, ClickHouse, and DuckDB — the broadest engine support of any format in 2026.
Key properties worth memorizing for the loop:
- Snapshot-based: every write produces a new snapshot pointing at a list of manifest files
- ACID on a single table even on plain S3, via atomic catalog pointer swaps
- Hidden partitioning: the partition is derived from a column expression like days(event_time), so query authors do not need to know the partition column
- Schema evolution: add, drop, rename, reorder columns with no file rewrites
- Partition evolution: change partitioning strategy without migrating existing data
- Time travel via FOR VERSION AS OF or FOR TIMESTAMP AS OF
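To make hidden partitioning less abstract, here is a minimal sketch of what a days(event_time) transform computes and how an engine could use it to prune partitions from a plain timestamp filter. It assumes timezone-aware UTC timestamps; days_transform and prune_partitions are hypothetical helper names, not Iceberg APIs.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def days_transform(event_time: datetime) -> int:
    """Days since 1970-01-01 for a timezone-aware timestamp, evaluated in UTC."""
    return (event_time.astimezone(timezone.utc).date() - EPOCH).days


def prune_partitions(partitions, start: datetime, end: datetime):
    """Keep only partition values that can contain rows in [start, end]."""
    lo, hi = days_transform(start), days_transform(end)
    return [p for p in partitions if lo <= p <= hi]


event = datetime(2026, 5, 1, 13, 30, tzinfo=timezone.utc)
day = days_transform(event)
print(day)  # partition value derived from the timestamp

# The query author filters on event_time; the partition value is never mentioned.
kept = prune_partitions(
    [day - 1, day, day + 1],
    datetime(2026, 5, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2026, 5, 1, 23, 59, tzinfo=timezone.utc),
)
print(kept)
```

The point of the sketch is that the filter is written against event_time, and the mapping to partition values happens inside the engine rather than in the query text.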
-- Spark SQL on Iceberg
CREATE TABLE catalog.db.orders (
    order_id   BIGINT,
    user_id    BIGINT,
    amount     DECIMAL(18, 2),
    event_time TIMESTAMP
) USING iceberg
PARTITIONED BY (days(event_time));

UPDATE catalog.db.orders SET amount = 0 WHERE order_id = 123;

SELECT * FROM catalog.db.orders.history;  -- snapshot log
SELECT * FROM catalog.db.orders FOR TIMESTAMP AS OF '2026-05-01 00:00:00';

The catalog is the single source of transactional truth. Common implementations are Hive Metastore, AWS Glue, Nessie, and the new REST catalog spec (Polaris, Tabular, Unity Catalog). Standardize on one catalog per environment; mixing them is the fastest way to produce ghost tables and silent data loss.
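Why the catalog matters for transactions is easier to see with a toy optimistic-commit loop. The sketch below assumes a catalog that offers a compare-and-swap on the table's metadata pointer; ToyCatalog, commit, and the metadata file names are all hypothetical, and real catalogs implement the swap with their own mechanisms.

```python
import threading


class ToyCatalog:
    """Hypothetical catalog: maps a table name to its current metadata location."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Move the pointer only if nobody committed since `expected` was read."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


def commit(catalog, table, make_metadata, max_retries=3):
    """Optimistic commit: re-read the base and retry when the swap loses a race."""
    for _ in range(max_retries):
        base = catalog.current(table)
        if catalog.compare_and_swap(table, base, make_metadata(base)):
            return True
    return False


catalog = ToyCatalog()
first = catalog.compare_and_swap("db.orders", None, "v1.metadata.json")

stale_base = catalog.current("db.orders")  # writer A reads v1
catalog.compare_and_swap("db.orders", stale_base, "v2.metadata.json")  # writer B wins
lost = catalog.compare_and_swap("db.orders", stale_base, "v2-from-A.metadata.json")

retried = commit(catalog, "db.orders", lambda base: "v3.metadata.json")
print(first, lost, retried, catalog.current("db.orders"))
```

Two catalogs pointing at the same table would each hold their own pointer, so the compare-and-swap in one would never see commits made through the other, which is the split-brain failure the paragraph above warns about.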
Delta Lake
Delta Lake is the table format that ships with Databricks and was open-sourced under the Linux Foundation. It has the deepest Spark integration of the three and a very large install base inside Databricks customers.
Distinguishing features:
- Transaction log lives in _delta_log/ as JSON commit files alongside the data
- ACID, MERGE, UPDATE, DELETE all supported
- Time travel by version or timestamp
- Schema evolution via mergeSchema or explicit ALTER TABLE
- Z-ordering: multi-dimensional clustering via a space-filling curve, improves pruning across several predicates at once
- OPTIMIZE: compact small files into target-sized ones
# PySpark + Delta (assumes an active SparkSession `spark` and a DataFrame `df`)
from delta.tables import DeltaTable

# Write the DataFrame as a Delta table partitioned by event_date
(df.write
    .format("delta")
    .partitionBy("event_date")
    .save("s3://bucket/orders/"))

# Load the table by path and zero out a single order in place
delta_t = DeltaTable.forPath(spark, "s3://bucket/orders/")
delta_t.update(condition="order_id = 123", set={"amount": "0"})

Delta stores metadata in _delta_log/ next to the data, so a Hive Metastore is optional, not required. Historically Delta was tightly coupled to Spark and the Databricks runtime, but Delta 3.0+ with UniForm dramatically improved interop with Trino, Flink, Athena, and even Iceberg readers.
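How a reader turns the JSON commit files in _delta_log/ into a table state can be sketched as a simple log replay. This is a simplified illustration that handles only add and remove actions and ignores checkpoints, protocol, and metadata actions; replay_delta_log and the sample commits are made up for the example.

```python
import json


def replay_delta_log(commits):
    """Fold ordered commit files into the set of live data files.

    `commits` is a list of strings, one per commit file, each holding
    newline-delimited JSON actions. Only `add` and `remove` are handled.
    """
    live = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Two hypothetical commits: an initial write, then an UPDATE that rewrites one file.
commit_0 = "\n".join([
    json.dumps({"add": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0001.parquet"}}),
])
commit_1 = "\n".join([
    json.dumps({"remove": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0002.parquet"}}),
])

version_0 = replay_delta_log([commit_0])
version_1 = replay_delta_log([commit_0, commit_1])
print(sorted(version_0))
print(sorted(version_1))
```

Replaying only a prefix of the log yields an older version of the table, which is the intuition behind time travel by version.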
Apache Hudi
Apache Hudi, built at Uber, is the third major format. Its differentiator is a clean split between two table types:
- Copy-on-Write (CoW) — every write rewrites affected files. Optimized for read-heavy workloads where queries dominate.
- Merge-on-Read (MoR) — writes append to delta log files, periodic compaction merges them into base files. Optimized for write-heavy, near-real-time pipelines.
Hudi is the go-to choice for high-throughput upsert pipelines — ride-share trip updates, marketplace order state machines, IoT telemetry. In most interview loops you only need to articulate the CoW vs MoR trade-off and explain when each one wins; deep Hudi internals are rarely asked unless the team is already running Hudi in production.
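The CoW vs MoR trade-off is mostly about where the rewrite cost lands, and a toy model makes that visible. The two classes below are hypothetical and keep a single "file" as a Python dict; they are not Hudi's storage layout, only a sketch of the write and read paths described above.

```python
class CopyOnWriteTable:
    """Toy CoW table: an upsert rewrites the whole base file immediately."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.rows_rewritten = 0

    def upsert(self, key, value):
        new_base = dict(self.base)       # rewrite the affected file in full
        new_base[key] = value
        self.rows_rewritten += len(new_base)
        self.base = new_base

    def read(self):
        return dict(self.base)           # reads touch the base file only


class MergeOnReadTable:
    """Toy MoR table: upserts append to a log, reads merge, compaction folds."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.log = []
        self.rows_rewritten = 0

    def upsert(self, key, value):
        self.log.append((key, value))    # cheap append, no rewrite

    def read(self):
        merged = dict(self.base)
        merged.update(self.log)          # later log entries win
        return merged

    def compact(self):
        self.base = self.read()
        self.rows_rewritten += len(self.base)
        self.log = []


rows = {i: "created" for i in range(1000)}
cow, mor = CopyOnWriteTable(rows), MergeOnReadTable(rows)
for order_id in range(10):
    cow.upsert(order_id, "shipped")
    mor.upsert(order_id, "shipped")

mor_before_compaction = mor.rows_rewritten
same_result = cow.read() == mor.read()
mor.compact()
print(cow.rows_rewritten, mor_before_compaction, mor.rows_rewritten, same_result)
```

In this model CoW pays the rewrite on every upsert, while MoR defers it to compaction and pays a merge cost on each read until then.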
Picking a format
| Attribute | Iceberg | Delta | Hudi |
|---|---|---|---|
| Governance | Apache | Linux Foundation | Apache |
| Engine support | Broad, multi-engine | Spark first-class, Trino/Flink via 3.0+ | Spark, Flink |
| Schema evolution | Strict, no rewrite | Strict, no rewrite | Flexible |
| Partition evolution | Yes | No | No |
| Hidden partitioning | Yes | No | No |
| Time travel | Yes | Yes | Yes |
| MERGE / UPDATE / DELETE | Yes | Yes | Yes |
| Read-heavy workloads | Strong | Strong | CoW |
| Write-heavy near-real-time | OK | OK | MoR — best |
When the interviewer asks "which one would you pick?", give a workload-driven answer rather than a flavor-of-the-month answer:
- Iceberg — multi-engine stack (Spark + Trino + Athena + Snowflake external tables), strong open-source posture, need for partition evolution
- Delta — Spark/Databricks-centric stack, want the simplest path with first-class engine support
- Hudi — heavy upsert volume with sub-minute freshness targets, CDC ingest from operational databases at high QPS
In 2026 the practical landscape is Iceberg vs Delta, with Iceberg gaining momentum in open-source-first shops and Delta dominating Databricks accounts. Hudi keeps a strong niche in streaming-upsert workloads.
Common pitfalls
The most common production failure is small-file blowup. Streaming ingestion into Iceberg or Delta produces thousands of tiny files per hour, and after a few weeks reads slow to a crawl because the engine spends more time opening files than reading data. The fix is to schedule compaction — Iceberg's rewrite_data_files action or Delta's OPTIMIZE — and treat it as part of the pipeline, not an afterthought.
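What a compaction job does before it rewrites anything is decide which small files to merge together. The sketch below is a greedy grouping under assumed sizes and a 128 MB target; plan_compaction and the file names are hypothetical, and real implementations such as rewrite_data_files or OPTIMIZE use their own planning logic.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group small files into rewrite tasks of roughly target_bytes.

    file_sizes maps path -> size in bytes. Files already at or above the
    target are left alone. Returns a list of groups (lists of paths).
    """
    groups, current, current_size = [], [], 0
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if size >= target_bytes:
            continue
        if current and current_size + size > target_bytes:
            if len(current) > 1:         # a lone file gains nothing from a rewrite
                groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if len(current) > 1:
        groups.append(current)
    return groups


MB = 1024 * 1024
sizes = {
    "a.parquet": 8 * MB,
    "b.parquet": 16 * MB,
    "c.parquet": 40 * MB,
    "d.parquet": 70 * MB,
    "e.parquet": 200 * MB,
}
plan = plan_compaction(sizes, target_bytes=128 * MB)
print(plan)
```

Here the three smallest files end up in one rewrite task, the 70 MB file is left for a later run, and the 200 MB file is never touched.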
A second trap is infinite time travel storage. Every snapshot holds onto its data files, so retention grows linearly with write frequency. Without expire_snapshots on Iceberg or VACUUM on Delta, storage bills balloon and metadata listings get slow. Pick a retention policy (often 7 to 30 days) and enforce it in scheduled maintenance.
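The interaction between retention and storage is easier to reason about with a small model of snapshot expiration. This sketch assumes each snapshot carries a commit time and the full set of files it references; expire_snapshots here is a hypothetical Python function that only shares a name with the Iceberg procedure, and the dates and file names are invented.

```python
from datetime import datetime, timedelta, timezone


def expire_snapshots(snapshots, now, retention):
    """Split snapshots into kept and expired, and find files safe to delete.

    snapshots: list of (committed_at, set_of_data_files), oldest first.
    The newest snapshot is always kept, whatever its age.
    """
    cutoff = now - retention
    kept = [s for s in snapshots[:-1] if s[0] >= cutoff] + snapshots[-1:]
    expired = [s for s in snapshots[:-1] if s[0] < cutoff]
    still_referenced = set().union(*(files for _, files in kept))
    candidates = set().union(*(files for _, files in expired))
    return kept, sorted(candidates - still_referenced)


def utc(day):
    return datetime(2026, 5, day, tzinfo=timezone.utc)


snaps = [
    (utc(1), {"a.parquet"}),
    (utc(20), {"a.parquet", "b.parquet"}),
    (utc(30), {"b.parquet", "c.parquet"}),
]
kept_14, deletable_14 = expire_snapshots(snaps, utc(31), timedelta(days=14))
kept_7, deletable_7 = expire_snapshots(snaps, utc(31), timedelta(days=7))
print(len(kept_14), deletable_14)
print(len(kept_7), deletable_7)
```

With 14 days of retention the oldest snapshot expires but frees nothing, because a.parquet is still referenced by a retained snapshot; only the 7-day policy actually releases storage.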
A third pitfall is assuming partition evolution works everywhere. Iceberg supports it natively — you can switch from daily to hourly partitioning without rewriting old data. Delta does not. If you anticipate changing the partition strategy as data volume grows, that single feature can be the deciding factor between formats.
The fourth trap is catalog drift. Pointing Iceberg writers at one catalog (say Glue) and readers at another (say Hive Metastore) produces split-brain tables where writes are invisible to half your consumers. Pick one catalog per environment, lock it down with IAM, and document it on the team wiki.
The fifth, and most operationally dangerous, is mutating files outside the format. Someone runs aws s3 cp to add a file directly into data/, the manifest is never updated, and the file is invisible to readers. Worse: someone runs aws s3 rm on a file that the latest snapshot still references and reads start failing. All writes must go through the format API, full stop.
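Both failure modes reduce to a mismatch between what the metadata references and what storage actually holds. A minimal audit sketch, assuming you can obtain the referenced file list from the table metadata and a listing from object storage; audit_table_files and the paths are hypothetical.

```python
def audit_table_files(referenced, listed):
    """Compare what the current snapshot references with what storage holds.

    Returns (orphans, missing): orphans exist in storage but are invisible to
    readers; missing files are referenced by metadata but gone, so reads fail.
    """
    referenced, listed = set(referenced), set(listed)
    return sorted(listed - referenced), sorted(referenced - listed)


referenced_files = ["data/part-0000.parquet", "data/part-0001.parquet"]
storage_listing = ["data/part-0000.parquet", "data/copied-by-hand.parquet"]

orphans, missing = audit_table_files(referenced_files, storage_listing)
print(orphans)  # added outside the format: never read
print(missing)  # deleted outside the format: reads break
```

A check like this only detects the damage; the fix remains routing every write and delete through the format API.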
Treating the lakehouse as a drop-in replacement for the warehouse is the sixth and most strategic mistake. For interactive BI with sub-second latency, dashboards with hundreds of concurrent users, and tight SLAs, a dedicated warehouse (Snowflake, BigQuery, ClickHouse) is still faster and easier to operate. The realistic 2026 pattern is warehouse for hot interactive data, lakehouse for batch ETL, ML feature stores, and cold archival.
FAQ
Can I migrate a Parquet table to Iceberg in place?
Yes, via the Spark system.migrate procedure, or by creating an Iceberg table and registering the existing Parquet files with system.add_files. Existing files are not rewritten; only a metadata layer is added. The catch is that all subsequent writes must go through Iceberg: files added or deleted directly in storage are never reflected in the manifests, so the table no longer matches its metadata.
Will lakehouse fully replace the warehouse?
For batch ETL, ML training data, and cold archival the answer is increasingly yes. For interactive BI with sub-second response times, concurrent dashboard load, and complex joins on hot data, dedicated warehouses still win on latency and operational simplicity. Most large data teams in 2026 run a hybrid — warehouse for hot, lakehouse for cold, with the same table format readable from both.
What is the difference between a catalog and a metastore?
The Hive Metastore is the original Hive-era catalog with limited transactional semantics. Modern catalogs such as AWS Glue, Nessie, Apache Polaris, and Databricks Unity Catalog add, to varying degrees, features like multi-table transactions (Nessie, for example), fine-grained access control, and lineage. Check which of these your catalog actually provides rather than assuming all of them. Iceberg defines a catalog interface with many implementations; Delta historically relied on the Hive Metastore or the Databricks-specific Unity Catalog.
Is streaming into Iceberg or Delta production-ready?
Yes, with caveats. Spark Structured Streaming writes to both formats reliably, and Flink has first-class Iceberg support. For freshness below one minute, Hudi MoR or specialized real-time stores (ClickHouse, Apache Druid, Apache Pinot) are usually a better fit. Always pair streaming writes with a compaction job — the small-file problem is unavoidable otherwise.
What is Z-ordering in Delta?
Z-ordering linearizes data along several columns using a space-filling curve, so files cluster by multiple dimensions at once. The benefit is multi-column pruning: a query filtering by both user_id and event_date skips far more files than with single-column sorting. Iceberg's closest equivalents are a table-level write sort order and the zorder sort option of rewrite_data_files, which applies the same idea during compaction.
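The pruning benefit can be demonstrated with the textbook Morton (bit-interleaving) curve on a tiny grid. This is a sketch of the general technique, not Delta's implementation; z_value, chunk, and files_scanned are hypothetical helpers, and each "file" is just a list of (x, y) points with min/max statistics derived on the fly.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x takes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y takes the odd bit positions
    return z


def chunk(points, size):
    """Split an ordered list of points into fixed-size 'files'."""
    return [points[i:i + size] for i in range(0, len(points), size)]


def files_scanned(files, x_range, y_range):
    """Count files whose min/max column stats overlap both filter ranges."""
    def overlaps(values, lo, hi):
        return min(values) <= hi and max(values) >= lo
    return sum(
        overlaps([x for x, _ in f], *x_range) and overlaps([y for _, y in f], *y_range)
        for f in files
    )


points = [(x, y) for x in range(4) for y in range(4)]
linear_files = chunk(sorted(points), 4)                           # sorted by x, then y
z_files = chunk(sorted(points, key=lambda p: z_value(*p)), 4)     # sorted along the curve

both_linear = files_scanned(linear_files, (0, 1), (0, 1))
both_z = files_scanned(z_files, (0, 1), (0, 1))
y_only_linear = files_scanned(linear_files, (0, 3), (0, 1))
y_only_z = files_scanned(z_files, (0, 3), (0, 1))
print(both_linear, both_z, y_only_linear, y_only_z)
```

On this 4x4 grid, the single-column sort gives no pruning at all for a filter on y alone, while the Z-ordered layout skips half the files, and a filter on both columns touches one file instead of two.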
Is this official documentation?
No. This guide is a synthesis of the Apache Iceberg, Delta Lake, and Apache Hudi public documentation plus common lakehouse production practice. Always validate behavior against the format version you actually run.