ClickHouse MergeTree on a Data Engineering interview
Contents:
Why MergeTree shows up on every DE loop
ClickHouse is the default OLAP engine at most analytics-heavy companies outside the Snowflake/BigQuery garden — Uber, Cloudflare, eBay, GitLab all run multi-petabyte clusters on it. On a Data Engineering loop at Stripe, Linear, Vercel or Databricks, you will get the MergeTree question. Not as trivia, as a 20-minute system design probe where the interviewer wants to hear you reason about ORDER BY choice, partition granularity, and the engine variant for upserts.
The common screen-out is the candidate who answers "MergeTree is the ClickHouse table engine" and stops. The interviewer is fishing for the sparse index trick, the eventual dedup semantics of ReplacingMergeTree, and the moment you realize ORDER BY (id) on an event table will full-scan every dashboard query. Get those three beats right and you pass. This is the whole shape of the interview answer.
Load-bearing trick: ClickHouse skips data by reading the sparse primary index (one entry per index_granularity rows, default 8192), then jumps to the mark file to find the byte offset inside the column file. Everything else — partitions, projections, materialized views — is a layer on top of that.
How MergeTree actually works
MergeTree is structurally an LSM-tree variant — write-optimized, sort-on-disk, merge in the background. An INSERT does not modify an existing file; it creates a new directory called a part holding the sorted rows of that batch. A background thread continuously merges small parts into larger ones, which is where the engine gets its name.
A part contains one file per column (columnar, LZ4 or ZSTD compressed), the primary.idx sparse index, and a .mrk2 mark file per column mapping index entries to byte offsets. On SELECT, ClickHouse scans the sparse index to find matching granules (index_granularity-row blocks), uses the marks to seek into the column files, and returns only the columns you projected. That is the entire read path.
CREATE TABLE events (
event_date DATE,
user_id UInt64,
event_type LowCardinality(String),
amount Float64,
properties Map(String, String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
SETTINGS index_granularity = 8192;Four properties to recite on a whiteboard: columnar storage, sorted within a part by the ORDER BY tuple, sparse primary index (8192 rows per entry by default, not per-row like Postgres B-trees), and background merges keeping read amplification bounded.
ORDER BY vs primary key
Two thirds of candidates conflate these. ORDER BY is the physical sort order of rows within every part. Primary key is the prefix of ORDER BY that gets loaded into the sparse primary index in memory. They default to identical; you only split them to save RAM on very wide sort keys.
The rule for choosing ORDER BY columns is counter-intuitive: lowest-cardinality, most-frequently-filtered columns first, then narrow down. For an event log queried by date range and then by user, ORDER BY (event_date, user_id) is correct. WHERE event_date = '2026-05-18' AND user_id = 42 reads roughly one granule. WHERE event_type = 'click' with no date predicate is a full table scan — the index cannot help when the column is not a prefix.
-- Wide ORDER BY, narrower PRIMARY KEY to save RAM
CREATE TABLE pageviews (
event_date DATE,
user_id UInt64,
url String,
referrer String
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id, url, referrer)
PRIMARY KEY (event_date, user_id);Here the on-disk sort uses all four columns (better compression for url and referrer since adjacent rows share prefixes), but only the first two enter the sparse index. Index lookup stays cheap; range queries on event_date, user_id still benefit from data locality.
Partitioning done right
PARTITION BY splits the table into separate directories on disk, one per partition key value. The canonical choice for event tables is monthly: PARTITION BY toYYYYMM(event_date). Each partition gets its own set of parts; merges never cross partition boundaries.
The wins are concrete. Partition pruning — WHERE event_date BETWEEN '2026-04-01' AND '2026-04-30' reads one partition directory before the index even runs. Instant drop — ALTER TABLE events DROP PARTITION '202504' is a directory unlink, the standard way to enforce retention. Independent merges — old partitions stop merging once stable, freeing CPU for hot data.
The trap is going too granular. PARTITION BY toDate(event_date) over five years creates 1,825 partitions each holding a thin slice; metadata bloats and the merge scheduler thrashes. Aim for partitions in the tens of gigabytes range. Daily partitions are appropriate only for very high-volume tables (hundreds of billions of rows per year); monthly is the safe default.
| Partition strategy | Partitions over 3 years | Per-partition size (1B rows/yr) | Verdict |
|---|---|---|---|
toYYYYMM(date) |
36 | ~28 GB | Default — works for most |
toMonday(date) |
156 | ~6 GB | Weekly, OK for high volume |
toDate(date) |
1,095 | ~800 MB | Only for >100B rows/year |
(toYYYYMM(date), country) |
36 × N countries | Variable | Only if every query filters by country |
| No partition | 1 | ~85 GB | Cannot drop old data efficiently |
The MergeTree engine family
This table is the one to memorize before the interview. The interviewer will ask "which engine for upserts?" and "which engine for pre-aggregated rollups?" and expect the right name on the first try.
| Engine | What merge does | When to use | Gotcha |
|---|---|---|---|
MergeTree |
Just merges sorted parts | Fact tables, append-only event logs | None — the baseline |
ReplacingMergeTree(version) |
Keeps the row with max version per ORDER BY key |
CDC sinks, dimension upserts, slowly-changing data | Eventual dedup — FINAL forces it but is 2–10× slower |
SummingMergeTree(metrics) |
Sums numeric columns per ORDER BY key | Pre-aggregated counters: clicks, impressions, GMV | Only works for additive metrics; non-numeric kept as first value |
AggregatingMergeTree |
Merges AggregateFunction states |
uniq, quantile, avg rollups via -State/-Merge |
Reads must use -Merge combinator; rookies forget and get raw states |
CollapsingMergeTree(sign) |
Cancels +1/-1 row pairs |
Mutable rows with strict ordering guarantees | Requires app-side discipline; hard to get right |
VersionedCollapsingMergeTree(sign, ver) |
Same as Collapsing plus version tie-break | Concurrent updates from multiple writers | Niche; most teams pick Replacing instead |
ReplacingMergeTree is the workhorse for CDC ingestion via Debezium. You stream change events keyed by primary key, sorted by updated_at, and the engine eventually compacts to the latest version per key. The eventual part trips juniors: between merges, old and new rows live in different parts, so a naive SELECT returns duplicates. Correct patterns: SELECT ... FINAL (forces dedup at query time, expensive) or argMax(value_col, updated_at) GROUP BY key (cheaper, more flexible, the default to reach for).
AggregatingMergeTree powers real-time dashboards. Insert uniqState(user_id) values; on read, uniqMerge(state) finishes the aggregation. Approximate uniq over a billion-row month materializes in under 200 ms on a modest cluster because the heavy work happened on insert.
Sanity check: if your team is reaching for CollapsingMergeTree, stop and ask whether ReplacingMergeTree plus argMax would do the job. In nine cases out of ten it will.
Replicated and Distributed
Independent layers that almost always ship together. ReplicatedMergeTree wraps any MergeTree-family engine with ZooKeeper- or ClickHouse-Keeper-coordinated replication. Each INSERT is logged in Keeper; sibling replicas pull the part. Reads stay local.
CREATE TABLE events_local ON CLUSTER prod
(
event_date DATE,
user_id UInt64,
amount Float64
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/events',
'{replica}'
)
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);Distributed is a virtual table that fans queries across shards. On INSERT it hashes the sharding key to pick a target; on SELECT it pushes the query to every shard and merges partial results. Textbook production layout: ReplicatedMergeTree storage on each node, thin Distributed on top for clients.
CREATE TABLE events ON CLUSTER prod AS events_local
ENGINE = Distributed('prod', 'default', 'events_local', cityHash64(user_id));Sharding key matters as much as ORDER BY. Hash by user_id if every query filters by user (co-locates a user's data on one shard). Hash randomly when writes are skewed.
When ClickHouse is the wrong tool
A good interview answer mentions limits. OLTP workloads — frequent single-row UPDATE/DELETE — are expensive because mutations rewrite entire parts; Postgres is correct here. Strict ACID transactions across multiple tables do not exist; atomic single-partition INSERT is the strongest guarantee.
High-QPS single-row lookups ("give me user 12345's profile, 50,000 times per second") are an anti-pattern. ClickHouse is built for scans over millions of rows; one row at a time pays columnar overhead with no upside. Redis or DynamoDB instead. Heavy multi-way JOINs on large tables are also not its strength. Idiomatic answer: denormalize into wide tables, or GLOBAL JOIN with a small dimension. If your workload is fundamentally normalized analytics on terabytes, Snowflake or BigQuery will be less painful.
Common pitfalls
The first and most expensive mistake is ORDER BY (id) on an event-shaped table. Every dashboard query filters on event_date, none of them filter on id, and the sparse index becomes useless. The fix is ORDER BY (event_date, user_id) from day one — and yes, rebuilding the table to fix it later means a full re-ingest. Ask about query patterns before choosing the sort key.
A close second is micro-batching INSERTs. Every INSERT creates a part; the background merger has finite throughput. If your producer inserts a thousand single-row batches per minute, you cross the 300-parts-per-partition threshold and hit Too many parts errors that block writes. Batch on the producer side — 10,000 to 100,000 rows per INSERT is the healthy range. If batching is impossible, put Kafka + a Materialized View between the producer and the table, or use asynchronous inserts (async_insert=1).
Third, ALTER TABLE ... UPDATE on a large table is a mutation, which rewrites every affected part. On a 500 GB table it can run for hours, blocking merges and burning disk. The correct pattern is ReplacingMergeTree for upsert semantics or a soft-delete flag plus periodic partition rebuilds. Mutations are an escape hatch, not a workflow.
Fourth, the FINAL trap. Juniors learn SELECT * FROM table FINAL to deduplicate ReplacingMergeTree and put it in their dashboard queries. Now every query is 2–10× slower because ClickHouse merges parts on the read path. The right pattern is argMax(value, version) GROUP BY key — same result, no penalty, and it works in materialized views.
Fifth, partitioning too granularly as covered above. Daily partitions on a five-year history create 1,825 directories and the merge scheduler grinds. Monthly is the default; go finer only when you measure that you need to.
Related reading
- Apache Iceberg deep dive
- Materialized views for DE interviews
- ClickHouse vs PostgreSQL for analysts
- Table partitioning in SQL
- MERGE and upsert patterns
- EXPLAIN and query plans
If you want to drill ClickHouse and Data Engineering questions like this every day, NAILDD is launching with hundreds of system-design and SQL problems targeted at exactly this loop.
FAQ
How does a materialized view differ from a regular view in ClickHouse?
A ClickHouse materialized view is not a saved query like in Postgres — it is an INSERT trigger that writes into a real target table whenever rows land in the source. The target is a regular MergeTree (often AggregatingMergeTree or SummingMergeTree) and you query it directly. This is why MVs power real-time rollup dashboards: the aggregation happens on write, the read is a normal scan.
What is a projection and when does it beat a materialized view?
Projections (stable since CH 21.x) are alternate sort orders or pre-aggregations stored alongside the main table. The optimizer picks a projection automatically when predicates match — application SQL does not change. MVs require querying the target table explicitly. Projections are simpler for single-table rollups; MVs win for cross-table fan-out and complex transforms.
Why is the default index granularity 8192?
It is an empirical balance between index size in RAM and pruning precision. Smaller granules give finer skipping but inflate the primary index; larger granules save memory but read more excess data per query. 1024 is appropriate for tables with frequent point lookups and small row sizes; 16384 suits long sequential scans on wide rows. Most workloads are fine with the default — measure before tuning.
Can ClickHouse actually delete rows?
Three ways. ALTER TABLE ... DELETE WHERE ... is a mutation — expensive, rewrites parts, suitable for occasional GDPR deletes. CollapsingMergeTree with paired +1/-1 rows is the engine-level solution for mutable data. Most common in production: append-only with a deleted flag and filter it out in views. Fast writes, trivial reads, storage cost is negligible at columnar compression rates.
Is async insert safe for production?
async_insert=1 buffers small INSERTs server-side and flushes them as one part. Solves too-many-parts at the cost of a small ack delay (default flush window: one second). For analytics ingestion it is the right default. For read-your-own-write within milliseconds, stick to client-side batching or use wait_for_async_insert=1.
What is the right way to handle CDC into ClickHouse?
Stream the change feed (Debezium, Kafka Connect, custom) into a ReplicatedReplacingMergeTree(updated_at) keyed by the source primary key. Do not deduplicate at write time. On the read side, build views that use argMax(col, updated_at) GROUP BY pk rather than SELECT ... FINAL — same correctness, an order of magnitude faster, and the pattern composes inside materialized views for downstream rollups.