ClickHouse projections on the DE interview

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why projections show up in DE interviews

When a recruiter at Snowflake, Databricks, or a high-throughput shop like Uber or DoorDash schedules a ClickHouse round, projections are one of the four or five features that almost always come up. They sit at the intersection of three things interviewers care about: storage layout, query optimizer behavior, and write amplification trade-offs. If you can explain projections without hand-waving, you have implicitly demonstrated that you understand MergeTree parts, sort keys, and how analytical engines reuse pre-computed data.

The second reason is practical. Most teams that adopt ClickHouse hit the same wall around month six: one table is queried five different ways, the original sort key helps only one of them, and copying data into separate materialized views feels painful. Projections were added in version 21.3 to solve exactly that. If you have ever sat through a query-tuning standup, this is the feature people reach for before adding a second table.

The answer also separates seniors from juniors quickly. Anyone who read the docs can recite "projections are pre-computed alternative orderings." A senior candidate goes further: automatic optimizer rewriting, the cost on INSERT, the failure modes when projections diverge during merges, and when an explicit materialized view is still the right tool.

What a projection actually is

A projection is a second physical copy of a table's data, sorted differently or pre-aggregated, stored inside the same MergeTree parts as the parent table. Think of it as a sibling index that stores full rows (or aggregated rows) rather than pointers. When a query arrives, the optimizer inspects the parent table and every projection, then picks whichever minimizes rows scanned.

The mental model: a MergeTree table has exactly one sort key, so it is fast for queries that filter on the leading columns of that key and progressively slower for everything else. A projection lets you keep the parent sorted one way while serving certain queries from a copy sorted another way. Data stays in lockstep because writes go through both representations atomically inside the same part.

Two flavors matter on the interview. A normal projection stores the same rows with a different sort order, useful when you have multiple high-traffic filter columns. An aggregating projection stores GROUP BY results, so count() by user_id can be answered from a tiny pre-aggregated structure instead of scanning billions of raw events.

Load-bearing trick: projections live inside the parent table's parts. They are merged, mutated, and dropped together. That is what makes them automatic — and what makes them more expensive on INSERT than you might expect.

Creating and materializing a projection

The syntax is two statements: one to declare the projection, one to backfill historical data.

ALTER TABLE events
  ADD PROJECTION p_by_user
  (
    SELECT
      user_id,
      event_type,
      count()
    FROM events
    GROUP BY user_id, event_type
  );

ALTER TABLE events MATERIALIZE PROJECTION p_by_user;

The first statement registers the projection definition in the table metadata and starts applying it to all new inserts. The second one walks every existing part and builds the projection for the historical rows. On a large table this second step is the expensive one — it is essentially a full rewrite of all data into the projection layout, and it runs in the background by part.

For a non-aggregated projection you simply omit the GROUP BY:

ALTER TABLE events
  ADD PROJECTION p_sort_by_country
  (
    SELECT *
    ORDER BY country, event_time
  );

This is the one to reach for when several teams filter the same table by country but the parent table is sorted by event_time.

Projection vs materialized view

Both features pre-compute data. The differences are where they live and who picks them.

Aspect Projection Materialized view
Physical location Inside parent table parts Separate table
Query rewriting Automatic by optimizer Manual — query the MV directly
INSERT cost Pays for every projection per part Pays for every MV's transformation
Schema flexibility Same row shape, limited transforms Arbitrary SELECT, joins, dictionaries
Backfill MATERIALIZE PROJECTION rewrites parts POPULATE or manual INSERT SELECT
Drop / rebuild Atomic with parent table Independent lifecycle
Cross-table sources No Yes — can read from any source
Best for Multiple sort orders, simple roll-ups Joins, denormalization, fan-out

The one-line summary interviewers want: projections are automatic and tightly coupled, materialized views are explicit and flexible. If your workload is "same table, multiple access patterns," reach for a projection. If it is "combine three sources into one consumption layer," reach for a materialized view.

This is also why most production ClickHouse deployments end up with both — projections for the hot single-table access patterns, materialized views for the denormalized consumption tables that BI tools hit.

How the optimizer picks a projection

The optimizer runs a cost estimate for the parent table and each projection, then chooses the cheapest plan. "Cheapest" here is dominated by rows-to-scan, which in turn depends on the sort key. A query whose WHERE clause matches the leading columns of a projection's sort key can skip most parts; a query that does not match falls back to the parent table.

Here is what that looks like in practice. Imagine events is sorted by event_time, with a projection p_by_user sorted by user_id:

-- Filters on user_id → projection p_by_user wins.
SELECT count()
FROM events
WHERE user_id = 42;

-- Filters on event_time → parent table wins.
SELECT count()
FROM events
WHERE event_time > '2026-05-01';

-- Aggregates by user_id → aggregating projection wins (already pre-aggregated).
SELECT user_id, count()
FROM events
GROUP BY user_id;

You do not need to rewrite the query, hint the optimizer, or change client code. The same SELECT count() FROM events WHERE user_id = 42 that used to take seconds drops to milliseconds the moment the projection is materialized. That zero-touch speedup is the headline benefit and the answer interviewers are listening for.

To verify, run EXPLAIN indexes = 1 or check the system.query_log for the projections column. Both will show which projection (if any) served the query.

Sanity check: if EXPLAIN does not mention your projection, the optimizer judged it more expensive than the parent table. That is usually because the projection's leading sort columns do not match the predicate, or the table is so small that the projection lookup cost is not worth it.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Use cases that earn the offer

There are four patterns interviewers reward when you walk through them.

Multiple high-traffic sort orders. A single events table queried by user_id, country, and device_type cannot serve all three filters quickly from one sort key. Two projections covering the two non-primary filters give you three fast access paths with one writable table.

Pre-aggregated dashboards. A daily active users dashboard that scans a billion-row events table every refresh can be served from an aggregating projection in milliseconds. The projection stores (date, user_id) unique pairs, the dashboard hits the projection, the parent table is left alone.

Late-arriving filter columns. A team adds a new analyst dimension six months in — say, experiment_bucket. Instead of resorting the entire parent table (expensive, disruptive), add a projection sorted by experiment_bucket, event_time. New queries get the speedup; old queries are unaffected.

Materializing expensive computed columns. A projection definition can include calculated expressions — extracted JSON fields, hashed user IDs, normalized timestamps. The expression runs once on INSERT, and every query that uses that column reads the pre-computed value.

Common pitfalls

When candidates lose points on this question, it is almost always one of the following.

The first trap is underestimating INSERT amplification. Every projection adds work to every INSERT — the engine computes its columns, sorts them, and writes them alongside the parent part. A table with three heavy projections can see INSERT throughput drop by 40-60%. Budget projections deliberately: keep them few, and only for queries that are both frequent and slow on the parent table.

The second pitfall is projecting too soon. New ClickHouse users add projections on day one and discover at scale that they have committed to a layout that is hard to change. Unlike a materialized view, you cannot easily backfill a modified projection — you drop it and rebuild from scratch, a full table rewrite. Wait for a real query pattern with measurable pain before committing.

The third trap is assuming projections handle every query shape. Projections only help when the optimizer can prove the projection covers the query. Subqueries with JOINs, complex UNION ALL, and queries using non-deterministic functions often fall back to the parent table. Always confirm with EXPLAIN rather than assuming.

The fourth pitfall is forgetting that DROP PARTITION drops projection data too. A team retains 90 days of raw events but wants a year of pre-aggregated daily counts. That does not work — the projection is bound to the same parts, so dropping a part drops its projection rows. For independent retention you need a materialized view.

The fifth trap is mixing projections with ALTER UPDATE mutations. Mutations rewrite parts and must rebuild every projection in the affected parts. On a table with several large projections, what looks like a small UPDATE WHERE can take hours. Batch updates into rebuilds when possible.

Optimization and operational tips

A few patterns separate teams that run projections smoothly from teams that fight them.

Use system.projection_parts to monitor projection size and merge state. A projection consistently larger than the parent columns it covers signals an over-wide definition — narrow it to only what queries need. Aim for 10-40% of the parent column footprint, not 100%.

Materialize projections off-hours when possible. MATERIALIZE PROJECTION on a multi-terabyte table can saturate disk IO and noisy-neighbor live queries. Throttle with mutations_sync = 0 and background_pool_size tuning.

For new tables, declare projections in the original CREATE TABLE rather than adding them later — the backfill is then incremental as data flows in.

When a projection seems unused, check system.query_log.projections. If it stays empty, the sort key or aggregation does not match query patterns. Drop it; an unused projection is pure cost.

Finally, treat projection definitions as code. Version them in your schema migration tool, review them in PRs, and track ownership. Projections drift silently otherwise, and at year two no one remembers why p_by_device_country exists.

If you want to drill ClickHouse and DE questions every day, NAILDD ships 500+ problems calibrated to the same patterns FAANG interviewers use.

FAQ

When should I add a projection versus a materialized view?

Reach for a projection when the work stays inside one table and the win is a different sort order or a simple roll-up of the same rows. The optimizer picks it automatically, ops stays simple, and you do not have to teach analysts a new table name. Reach for a materialized view when you need cross-table joins, dictionary lookups, denormalization across sources, or independent retention. The trade-off is concrete: projections optimize a single table's access patterns; materialized views build new shapes from multiple sources.

How much does a projection slow down inserts?

A rule of thumb from production deployments: each projection adds 10-25% to INSERT latency for the rows it covers, depending on how much sorting and aggregation it does. Three projections will usually cost you 40-60% of raw INSERT throughput. This is workload-dependent — wide rows and complex projection expressions can cost more — but the takeaway is that projections are not free and the cost compounds linearly with how many you have.

Can projections work with ReplicatedMergeTree?

Yes. Projections are part of the table's data, so they replicate with the parent parts. ALTER TABLE ... ADD PROJECTION propagates across the cluster, and MATERIALIZE PROJECTION runs on each replica. The one caveat is that during a long materialization, replicas can be temporarily inconsistent — pin queries to a single replica if that matters.

Do projections support TTL?

A projection inherits the TTL of the parent part it lives in. When the parent part is deleted by TTL, the projection rows in that part go with it. There is no way to give a projection its own retention — that is one of the things materialized views are for. If your interviewer asks about independent TTL for a roll-up, the correct answer is "materialized view, not projection."

What happens if a query is ambiguous between two projections?

The optimizer estimates rows-to-scan for each candidate and picks the smallest. In practice it is rare for two projections to match a query equally well; usually one has a better leading sort column or a pre-aggregation that wins decisively. Inspect the decision with EXPLAIN indexes = 1. Two overlapping projections is a schema smell — collapse them.

Are projections supported on non-MergeTree engines?

No. Projections are a MergeTree family feature — they rely on the parts mechanism that engines like Log, Memory, and Distributed do not provide. On a Distributed table you define projections on the underlying MergeTree tables on each shard. The query routes to the shards, and each shard's optimizer decides locally whether to use a projection.