ClickHouse vs PostgreSQL for analysts
Contents:
Why this comparison matters
PostgreSQL is the default OLTP engine at almost every Series A startup and the dialect you will hit in 80% of analyst interviews. ClickHouse is the default OLAP engine at companies that outgrew their nightly Postgres aggregation — Uber's logistics analytics, Cloudflare's request log warehouse, and a long list of late-stage startups running on billions of rows per day.
The probe at a staff interview is not "do you know SQL" — they assume that. It is whether you understand why SELECT SUM(amount) FROM events WHERE day = today() returns in 180 milliseconds on ClickHouse with 12 billion rows and times out on a Postgres replica with the same data. The answer is column storage, sparse indexes, and merge trees, and it shows up in the way you write queries and pick engines for dashboards.
The architectural split
PostgreSQL stores data row by row. Every row of an orders table lives in one contiguous block on disk, so a point lookup by primary key pulls all 50 columns in a single I/O. That layout, combined with MVCC and per-row locking, is exactly what an OLTP application needs to let thousands of concurrent users update their own orders cheaply.
ClickHouse stores data column by column. Each column lives in its own compressed parts, sorted by the table's ORDER BY key. A query like SUM(amount) WHERE day = '2026-05-17' reads two columns and ignores the other 48. With 10:1 compression on low-cardinality columns and SIMD aggregation over packed integers, the same machine that struggled with a 4-minute Postgres scan finishes in single-digit seconds. The trade-off is that updating one row means rewriting parts of multiple column files, which is why ClickHouse treats updates as asynchronous mutations.
Load-bearing trick: column storage is fast for analytics because you read 2 of 50 columns, not because the disk is magic. If your query touches every column (SELECT *), the column engine has no edge.
Feature comparison table
| Dimension | PostgreSQL | ClickHouse |
|---|---|---|
| Storage layout | Row-oriented (heap + indexes) | Column-oriented (MergeTree parts) |
| Workload sweet spot | OLTP, CRUD, mixed read/write | OLAP, aggregations, time-series scans |
| Aggregation on 100M rows | 30s to several minutes | Sub-second to seconds |
| Single-row INSERT latency | <1 ms, fully transactional | Discouraged; batch inserts of 1k+ rows |
| UPDATE / DELETE | First-class, row-level | Async mutations, expensive |
| Point lookup by PK | <1 ms with B-tree index | Slower; sparse index, ~8k row granule |
| JOIN of two 1B-row tables | Slow, often impractical | Fast but memory-hungry; not idiomatic |
| Concurrency model | MVCC, thousands of writers | Append-heavy, single-writer per shard ideal |
| Schema flexibility | ALTER TABLE is cheap | ALTER on huge tables can be expensive |
| Approximate functions | None native | uniq, quantile, topK, uniqHLL12 |
| Materialized views | Manual REFRESH | Auto-incremental on insert |
| Replication | Streaming WAL replicas | Built-in ReplicatedMergeTree + Keeper |
The row worth memorizing is single-row INSERT: Postgres handles per-row writes all day long, ClickHouse accepts them but the merge tree generates a new part per insert and the compactor falls behind within minutes. The standard fix is to buffer writes — a Buffer table, a Kafka engine table, or an application-side batch of 10k to 100k rows per insert. Engineers who learn this the hard way usually learn it during their first on-call rotation.
Syntax differences that bite
The dialect drift between the two engines is small in volume but high in friction. The functions you reach for daily — date truncation, conditional aggregation, distinct counts, string aggregation — all have different names. Memorize the top ten and your code reviews will go faster.
-- PostgreSQL
SELECT DATE_TRUNC('month', created_at) AS month,
COUNT(*) FILTER (WHERE status = 'paid') AS paid_orders,
SUM(amount) FILTER (WHERE status = 'paid') AS paid_revenue,
COUNT(DISTINCT user_id) AS active_users
FROM orders
WHERE created_at >= NOW() - INTERVAL '90 days'
GROUP BY 1
ORDER BY 1;-- ClickHouse equivalent
SELECT toStartOfMonth(created_at) AS month,
countIf(status = 'paid') AS paid_orders,
sumIf(amount, status = 'paid') AS paid_revenue,
uniq(user_id) AS active_users
FROM orders
WHERE created_at >= now() - INTERVAL 90 DAY
GROUP BY month
ORDER BY month;Three small changes, three gotchas. toStartOfMonth returns a Date rather than a Timestamp, which matters if you later join on the column. countIf and sumIf are conditional aggregates baked into the engine — they vectorize better than FILTER. uniq is an approximate count distinct backed by HyperLogLog with roughly 2% relative error; for exact answers you opt in with uniqExact, which is slower and memory-hungry.
-- Approximate vs exact distinct in ClickHouse
SELECT uniq(user_id) AS approx_users, -- HLL, fast, ~2% error
uniqExact(user_id) AS exact_users, -- exact, slower
uniqHLL12(user_id) AS tunable_hll_users -- 12-bit HLL, smaller memory
FROM events
WHERE event_date = today();ClickHouse's array model is its other superpower for analysts. You collect values with groupArray, transform with arrayMap, filter with arrayFilter, explode with arrayJoin. Postgres has arrays too, but they are not idiomatic for analytics — Postgres analysts reach for STRING_AGG and lateral joins, ClickHouse analysts reach for groupArray and arrayJoin daily.
-- ClickHouse: build per-user event sequences and pick top 5 actions
SELECT user_id,
arraySlice(arrayReverseSort(groupArray(event_type)), 1, 5) AS top_actions
FROM events
WHERE event_date >= today() - 30
GROUP BY user_id;-- ClickHouse: explode an array column back into one row per element
SELECT user_id, arrayJoin(product_ids) AS product_id
FROM user_baskets
WHERE event_date = today();Window functions look identical on the surface — ROW_NUMBER, RANK, LAG, LEAD, SUM(...) OVER. The trap is ROWS BETWEEN n PRECEDING AND CURRENT ROW: ClickHouse supports the common shapes, but for unusual frame definitions or session-style logic you often switch to array tricks. In practice, 90% of the window queries you wrote in Postgres work unchanged in ClickHouse.
What interviewers actually ask
Five questions cover most of the territory at a senior data interview. The phrasing varies, the underlying ideas are stable.
Why is a columnar store faster for analytics? Analytical queries touch a small subset of columns. A 50-column events table aggregated on amount and day reads two files in column storage versus 50 in row storage. Add compression — 5:1 on high-cardinality columns, 30:1 on low-cardinality columns like country — and the I/O bill drops by an order of magnitude. SIMD aggregation multiplies the gap.
Why does ClickHouse discourage UPDATE and DELETE? Each mutation rewrites large chunks of the merge tree and runs as an async job, which means eventual consistency and contention with normal merges. The idiomatic patterns are ReplacingMergeTree (latest row per sort key wins), CollapsingMergeTree (paired insert with sign = -1), or VersionedCollapsingMergeTree for CDC pipelines.
When would you pick Postgres over ClickHouse on a large table? Three cases. Point lookup by primary key — Postgres B-trees return in under a millisecond, ClickHouse pays the 8,192-row granule cost. Row-level updates as core workload — orders, inventory. ACID multi-statement transactions, which ClickHouse does not support at the row level.
What is MergeTree? The default storage engine family. Data is written in immutable parts sorted by ORDER BY key; a background process merges small parts into larger ones, LSM-style. Variants to know: ReplacingMergeTree (deduplication), AggregatingMergeTree (incremental rollups), SummingMergeTree (sums on merge), ReplicatedMergeTree (multi-replica with Keeper).
Write a query that behaves differently in each engine. The classic: SELECT DISTINCT ON (user_id) user_id, amount FROM orders ORDER BY user_id, created_at DESC is Postgres-only. The ClickHouse equivalent is SELECT user_id, argMax(amount, created_at) FROM orders GROUP BY user_id — and it is faster on a 100M-row table because it skips the full sort.
PostgreSQL to ClickHouse cheat sheet
| PostgreSQL | ClickHouse |
|---|---|
DATE_TRUNC('month', dt) |
toStartOfMonth(dt) |
DATE_TRUNC('day', dt) |
toDate(dt) |
EXTRACT(YEAR FROM dt) |
toYear(dt) |
dt + INTERVAL '7 days' |
addDays(dt, 7) or dt + INTERVAL 7 DAY |
AGE(dt1, dt2) |
dateDiff('day', dt2, dt1) |
COUNT(*) FILTER (WHERE x) |
countIf(x) |
SUM(a) FILTER (WHERE x) |
sumIf(a, x) |
AVG(a) FILTER (WHERE x) |
avgIf(a, x) |
COUNT(DISTINCT col) |
uniq(col) or uniqExact(col) |
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) |
quantile(0.5)(col) |
STRING_AGG(col, ',') |
arrayStringConcat(groupArray(col), ',') |
ARRAY_AGG(col) |
groupArray(col) |
JSONB_EXTRACT_PATH_TEXT(j, 'k') |
JSONExtractString(j, 'k') |
DISTINCT ON (col) |
argMax(other_col, order_col) |
GENERATE_SERIES(1, 100) |
numbers(100) |
UNNEST(arr) |
arrayJoin(arr) |
NOW() |
now() |
CURRENT_DATE |
today() |
COALESCE(a, b) |
coalesce(a, b) or ifNull(a, b) |
CASE WHEN ... THEN ... END |
identical |
Pin this to your wall, you will look at it more often than you expect.
Common pitfalls
The most expensive mistake new ClickHouse users make is single-row INSERT loops. A naive ETL that reads from Kafka one message at a time creates one merge tree part per insert and surfaces as Too many parts (300). Merges are processing significantly slower than inserts within an hour. The fix is to batch 10,000 to 100,000 rows per insert, or put a Buffer table or Kafka engine table between producer and storage.
A close second is treating ClickHouse JOIN like Postgres JOIN. ClickHouse defaults to broadcast hash joins where the right-hand table is loaded into memory on every node — fine for a 10M-row dimension against a 10B-row fact, deadly if you swap the order. Put the smaller table on the right, use GLOBAL JOIN for distributed setups, or denormalize into a wide table with LowCardinality(String) for repeated values.
Time zone handling is another quiet trap. PostgreSQL's TIMESTAMP WITH TIME ZONE stores UTC and converts at display. ClickHouse's DateTime is timezone-naive by default; you declare DateTime('UTC') per column, and mixing columns with different declared timezones in a WHERE clause silently uses the server's setting. Standardize on DateTime('UTC') across the schema and convert at the dashboard layer.
A subtler pitfall is relying on uniq for billing-grade counts. uniq is HyperLogLog with about 2% relative error — perfect for product dashboards, dangerous for invoices. If finance asks "how many unique customers paid us last month," reach for uniqExact even though it is slower. Approximate is a feature, but the trade-off has to be explicit.
Schema design for ClickHouse demands more thought than Postgres because the ORDER BY key cannot change after creation without rewriting the table. Picking ORDER BY (user_id, event_time) versus ORDER BY (event_time, user_id) swings query speed by an order of magnitude depending on whether dashboards filter by user or date. Match the most common WHERE clauses on the largest tables and put low-cardinality columns first.
Related reading
- SQL window functions interview questions
- SQL for data engineer interview
- Apache Iceberg deep dive for data engineering interview
- Materialized views for data engineering interview
- Table partitioning in SQL
- EXPLAIN and query plan for data engineering interview
If you want a drill bench that includes both PostgreSQL and ClickHouse-style aggregation questions, NAILDD is launching with 500+ SQL problems graded against real interview rubrics.
FAQ
Does an analyst really need to know ClickHouse?
For junior roles, no — PostgreSQL covers the syntax bar at almost every interview. For mid and senior roles at companies past Series C, yes: at minimum the syntax differences (toStartOfMonth, countIf, uniq, argMax), the row-vs-column distinction, and why you batch inserts. Deep MergeTree knowledge shows up in staff-level system design rounds.
Can I write PostgreSQL syntax in a ClickHouse interview?
Usually yes, as long as you flag it. Most interviewers evaluate logic over dialect, so writing DATE_TRUNC in a ClickHouse shop will not fail you. Saying "I would write this as toStartOfMonth in ClickHouse, but the logic is the same" signals that you know the gap exists. The opposite — ClickHouse-only syntax in a Postgres shop — is riskier because the interviewer may not catch your intent.
Which engine should I learn first?
PostgreSQL, every time. Its syntax is the closest mainstream dialect to standard SQL, it is the default on every coding platform, and the concepts — joins, window functions, CTEs, query plans — transfer almost completely. Learn ClickHouse second, when you already have a Postgres reflex.
What about BigQuery, Snowflake, or Redshift?
All three are column-oriented analytical engines like ClickHouse, but they are managed cloud services with different pricing models and dialects. The mental model — columnar storage, vectorized aggregation, sparse indexes, approximate functions — transfers directly. If your target company runs on Snowflake, learn Snowflake syntax; the underlying intuition is the same.
How big does my data have to be before ClickHouse pays off?
Roughly, when a Postgres aggregation on the table you care about takes more than 30 seconds even with the right index. That typically lands between 100 million and 1 billion rows for an events table, depending on row width and how many columns the query touches. Below that threshold, a well-indexed Postgres with pg_stat_statements is almost always the right answer.