ClickHouse as a DWH in DE interviews
Contents:
What interviewers actually probe
Data warehouses are the backbone of every Data Engineering loop. When a hiring manager at Stripe, Uber, or DoorDash asks "walk me through your DWH stack," they want to see whether you understand the OLAP vs OLTP split, whether you can defend a MergeTree sort key on the whiteboard, and whether you know when columnar storage is the wrong call. The checklist covers storage classes, a deep dive into one OLAP engine — ClickHouse is the most common probe in 2026 because it ships open source, scales to petabytes on commodity hardware, and powers production analytics at Cloudflare, Uber, eBay, and Sentry.
The second axis is modeling: star schema vs Data Vault vs 3NF, dbt materializations, slowly changing dimensions. The third is the trade-off question — "we have 100 GB of events per day, what would you reach for?" That single prompt forces you to weigh ingest rate, query shape, team skill, and budget in one breath. Most candidates fail it by reciting features instead of comparing constraints.
Load-bearing rule: if you can defend one sort key on one MergeTree table for one realistic workload, you will outperform 80% of candidates who try to sound encyclopedic.
OLAP vs OLTP at the storage layer
OLTP (Online Transaction Processing) systems — Postgres, MySQL, Oracle, CockroachDB — are tuned for short, point-style transactions: insert one order, update one cart, delete one row. They use row-based storage, B-tree indexes on lookup columns, full ACID, and they shine when you read a handful of rows by primary key.
OLAP (Online Analytical Processing) systems — ClickHouse, Snowflake, BigQuery, Redshift, Databricks SQL Warehouse — are tuned for long, aggregation-heavy scans across millions or billions of rows. They use columnar storage, sparse or sort-key indexes, and they often relax ACID in exchange for bulk-ingest throughput.
| Dimension | OLTP | OLAP |
|---|---|---|
| Query shape | Short, point lookups | Long, aggregation scans |
| Write pattern | Single-row INSERT/UPDATE | Bulk INSERT, batched |
| Storage layout | Row-based | Column-based |
| Indexes | B-tree on lookup cols | Sparse, sort key |
| Typical size | GB to a few TB | TB to PB |
| ACID | Full | Often relaxed |
| Concurrent users | Thousands | Tens to hundreds |
The interview follow-up is predictable: "why not just analyze in Postgres?" The honest answer has two parts. On billions of rows, row-based storage forces the engine to read every column even when you only need three, and B-tree indexes degrade to full scans for unselective predicates. Columnar engines read only the requested columns and compress them by 10x to 50x because adjacent values repeat (categories, dates, integers in known ranges). The second part is concurrency model — OLTP tunes for many small transactions, OLAP tunes for fewer, longer queries with heavy CPU and memory.
Why columnar wins for analytics
Columnar storage flips the on-disk layout: instead of storing rows together, all values for a single column live contiguously. This unlocks three compounding wins.
Compression is the headline. A column of timestamps or category codes compresses to a fraction of its raw size because adjacent values are similar. ClickHouse defaults to LZ4 and offers ZSTD for cold data — typical compression ratios sit at 5x to 20x for event tables. This is also why columnar warehouses can store years of history that would bankrupt an OLTP cluster.
Selective reads mean SELECT user_id, event_count FROM events reads only those two column files instead of pulling 50 columns from disk. On a 200-column event table this is a 25x reduction in I/O before any filtering happens.
Vectorized aggregation is the third trick. SUM, COUNT, AVG, and GROUP BY run on packed column blocks using SIMD, often on still-compressed data. The cost: per-row INSERT is slow, UPDATE is expensive or unsupported, full ACID is rare. Every columnar engine — ClickHouse, Snowflake, BigQuery, Vertica, plus Parquet — makes this same bet.
ClickHouse engines and MergeTree
ClickHouse is the workhorse OLAP system in this interview. It is open source, runs anywhere, and the engineering teams at Uber, Cloudflare, GitLab, and Sentry have all blogged about replacing legacy warehouses with it.
The MergeTree family
The default engine is MergeTree. Data lands as sorted parts on disk, and a background merger periodically compacts them into bigger parts. Three settings dominate every interview question:
- PARTITION BY — coarse buckets, usually by month or date. Partition pruning is the first filter applied to every query.
- ORDER BY — the sort key. Data is physically sorted by this tuple inside each part. A bad sort key is the #1 reason ClickHouse queries are slow.
- PRIMARY KEY — usually equal to ORDER BY. ClickHouse builds a sparse index with one entry per 8,192 rows by default.
CREATE TABLE events (
event_date DATE,
user_id UInt64,
event_type LowCardinality(String),
properties Map(String, String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);MergeTree variants
The family extends to handle update-like semantics without true UPDATEs:
- ReplacingMergeTree — on merge, keeps the last row per ORDER BY tuple. Great for dedup or current-state tables.
- SummingMergeTree — sums numeric columns on merge. Perfect for pre-aggregated counters.
- AggregatingMergeTree — stores partial aggregate states via
AggregateFunction. Used with materialized views for rollups. - CollapsingMergeTree — uses a
signcolumn (+1 / -1) to logically delete rows on merge. - VersionedCollapsingMergeTree — same trick with a version column, safer in distributed setups.
The interview question is reliable: "when do you reach for ReplacingMergeTree vs SummingMergeTree?" Replacing answers "give me the latest state per key" (user profile, latest order status). Summing answers "give me a running total per key" (daily counters, billing rollups). If both make sense, AggregatingMergeTree with materialized views usually wins on flexibility.
Distributed tables and replication
For multi-node clusters, ClickHouse uses Distributed tables as a routing layer — the Distributed table holds no data, it forwards reads and writes to shards.
CREATE TABLE events_distributed AS events
ENGINE = Distributed(cluster_name, current_database, events, rand());Pair this with ReplicatedMergeTree for high availability — replicas coordinate through ZooKeeper or ClickHouse Keeper.
Sanity check: if your ClickHouse query is slow, the answer is almost always the sort key, not the cluster size. Re-read the ORDER BY before scaling hardware.
Snowflake, Databricks, and MPP alternatives
A senior interviewer will push beyond ClickHouse. Snowflake separates storage (S3) from compute (warehouses), so you scale clusters per workload and pay per second of compute. It dominates US enterprise analytics. BigQuery is Google's serverless equivalent — no clusters, just queries, with a different cost model based on bytes scanned. Databricks SQL Warehouse runs on top of Delta Lake and shines when ML and analytics share the same lakehouse.
Older MPP engines — Greenplum, Vertica, Teradata — parallelize SQL across segments and handle multi-way joins better than ClickHouse, which is famously weak at large joins. The shorthand: ClickHouse for event analytics, Snowflake for warehouse-of-record with messy joins, Databricks when ML and analytics share a lakehouse.
| Engine | Sweet spot | Watch out for |
|---|---|---|
| ClickHouse | Wide event tables, fast aggregations | Multi-table joins, frequent UPDATEs |
| Snowflake | Mixed workloads, easy scaling | Cost discipline, query bytes |
| BigQuery | Serverless, ad-hoc analytics | Slot contention, scan billing |
| Databricks | ML + analytics on lakehouse | Cluster startup time, ops overhead |
| Redshift | AWS-native warehouse-of-record | Cluster sizing, vacuum / sort keys |
Modeling the warehouse
Kimball star schema is still the default for analytics marts. Fact tables hold events with foreign keys to dimensions — small, slowly-changing reference tables. One fact, many dims, joins on PK/FK. Almost every BI tool assumes this shape.
Inmon 3NF is fully normalized — better for integrating data from many source systems, but query-side cost is high because every analytical question becomes a multi-join.
Data Vault 2.0 breaks the world into Hubs (business keys), Links (relationships), and Satellites (attributes with history). It's resilient to source-system change but unfriendly to analysts, so teams usually expose Kimball marts on top.
dbt is the transformation layer of choice in 2026. The materializations that matter on the whiteboard:
- view — recomputed on every query, free to build, slow to read.
- table — full rebuild each run, simple and predictable.
- incremental — appends or merges new rows, mandatory for fact tables over ~100 million rows.
- ephemeral — inlined CTE, no warehouse object.
The interview prompt: "when do you pick incremental vs table?" Incremental for append-mostly fact tables where rebuilding would burn warehouse credits. Table for dimensions and small daily aggregates where simplicity beats marginal cost.
Picking the right storage for the workload
Gotcha: the trade-off question is almost never about features. It's about constraints — data shape, query shape, ingest rate, team skill, and budget.
A defensible answer maps four constraints to a recommendation:
- Volume — under 1 TB, Postgres with good indexes is usually fine. Past 10 TB, you need a columnar engine.
- Query shape — wide aggregations point to ClickHouse or BigQuery. Multi-way joins favor Snowflake, Redshift, or Databricks.
- Ingest — streaming from Kafka? ClickHouse with the Kafka engine or Databricks structured streaming. Batch only? Any cloud warehouse with Airflow or dbt Cloud.
- Team and budget — Snowflake costs two to five times more than self-hosted ClickHouse but eliminates infra ops. The right answer for a 3-person data team is rarely the right answer for a 50-person platform org.
If you want to drill exactly this kind of system-design and SQL question every day, NAILDD is launching with 500+ DE-style problems built around real interview loops.
Common pitfalls
Reaching for ClickHouse on an OLTP-shaped workload is the most expensive mistake new teams make. UPDATEs and DELETEs in ClickHouse rewrite parts in the background, and frequent point mutations destroy performance. If the workload needs single-row updates by primary key, that's an OLTP problem — use Postgres with proper analytics replicas instead.
Skipping partitioning on multi-terabyte tables is the #2 reason ClickHouse clusters fall over. Without PARTITION BY, every query scans every part, and partition pruning — the cheapest filter ClickHouse has — never fires. Pick partitions that match the typical query window, usually month or day.
Ignoring ORDER BY is the cousin mistake. A MergeTree table without a thoughtful sort key behaves like an unindexed Postgres table — every query becomes a full scan over compressed columns. Sort by the columns you actually filter on, ordered from low cardinality to high.
Building a star schema directly on transactional Postgres is a third trap. Star schema is a logical model that needs a columnar engine to fly. On a row-oriented DB, every aggregation triggers heavy joins, which is exactly the workload Postgres is worst at.
Heavy user-defined functions is the final foot-gun. ClickHouse and BigQuery are vectorized around built-in operators — custom UDFs break vectorization and run 10x to 100x slower than the native equivalent. If a transformation can be expressed with built-ins, do it that way.
Related reading
- ClickHouse MergeTree — DE interview
- ClickHouse projections — DE interview
- ClickHouse vs PostgreSQL for analysts
- DWH layers — DE interview
- DWH modeling — DE interview
- dbt incremental models — DE interview
FAQ
Is ClickHouse enough, or should I also know Snowflake and BigQuery?
Know one OLAP engine deeply and the others conceptually. ClickHouse is a strong default because the internals — sort keys, parts, merges, materialized views — translate directly to interview signal. Snowflake and BigQuery come up constantly in US loops, especially at Stripe, DoorDash, and Notion, but the questions focus on cost model, virtual warehouses or slots, and how separation of storage and compute changes scaling. Two days of hands-on Snowflake plus a strong ClickHouse base will cover most rooms.
What about Iceberg and Delta Lake?
Lakehouse formats — Apache Iceberg, Delta Lake, Apache Hudi — store data as Parquet on object storage and add ACID transactions, schema evolution, and time travel on top. Most new platforms at Netflix, Apple, and Databricks customers default to Iceberg. For the interview, know the three reasons teams adopt them: cheap object storage as the source of truth, engine-agnostic access (Spark, Trino, ClickHouse, Snowflake can all read the same tables), and reliable schema evolution.
How much hands-on ClickHouse do I need?
One real project with a sensibly designed MergeTree table — meaningful PARTITION BY, a defended ORDER BY, and at least one materialized view — beats five tutorials. You should be able to read EXPLAIN PIPELINE output, explain why a query is slow, and propose a fix. Most candidates fail not because they lack scale, but because they cannot defend a single design decision on a single table.
Is dbt expected for a Data Engineer role?
In 2026, yes — dbt is the standard transformation layer at most data-mature companies. Expect prompts about materializations, incremental strategies (append, merge, delete+insert), tests, and how to organize models into staging, intermediate, and marts layers. You should be able to design a small dbt project end to end on a whiteboard.
Is this an official certification guide?
No. This is a field-tested writeup based on public ClickHouse, Snowflake, and dbt documentation plus patterns observed in real Data Engineering interview loops.