Table partitioning in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why partition a table

Your team at Stripe, DoorDash, or Notion lets an orders table grow and one Monday the on-call analyst opens a Slack thread reading "the daily revenue dashboard takes 14 minutes". Schema and indexes look fine, the plan is doing an index range scan. The catch is the index is 80 GB, the heap is 1.2 TB, and even "fast" lookups touch enough disk pages to fall out of buffer cache. More indexes only make writes slower.

Partitioning fixes this at the storage layer. The single physical table becomes a collection of smaller child tables sharing the same schema. A query filtering on the partition key reads only the children that can contain matching rows; the optimizer skips the rest before opening a file. This is partition pruning, and on time-series tables it routinely turns a 14-minute scan into a six-second one without changing application code.

An orders table with 1 billion rows partitioned by month gives you 24 children at 40 million rows each. A query like WHERE created_at >= '2026-05-01' touches only the current month, so the engine reads one twenty-fourth of the data and the working set fits in RAM.

When partitioning is worth it

Not every big table should be partitioned. Complexity has a real cost: more DDL, more vacuum work, more metadata. Reach for partitioning when the table is at least 50-100 million rows and growing predictably. Below that, a well-tuned b-tree index on the filter column almost always wins.

Query pattern matters as much as size. Partitioning helps when most queries filter on a single column, usually a timestamp or tenant identifier. If queries hit the table uniformly across all dimensions, pruning never fires. Inspect the slow query log first: if the top 10 queries all have WHERE created_at BETWEEN clauses, you have a strong candidate.

Lifecycle is the second driver. A table where data older than 90 or 180 days is archived or dropped is a perfect fit, because DROP TABLE orders_2024_q1 takes a second, while DELETE FROM orders WHERE created_at < ... can run for hours and shred replication lag. The same logic applies to bulk loads: inserting into an empty partition is dramatically faster than inserting into a 600 GB table with seven secondary indexes.

The three partitioning schemes

There are three schemes you will see in interviews and in real systems, and the first covers about 80% of cases.

Range partitioning splits the data by ranges of the partition key. The textbook case is one partition per month or per quarter on a timestamp column. Range partitions match how time-series data is queried and dropped, which is why every analytics warehouse defaults to this scheme.

List partitioning assigns rows based on a discrete list of values: one partition per region, business unit, or data source. Useful when you have a small fixed set of values with very different query and retention rules, and unhelpful otherwise.

Hash partitioning runs the partition key through a hash function and assigns rows to a fixed number of buckets. It does nothing for selective queries but spreads write load and helps parallel scans.

PostgreSQL syntax

Declarative partitioning landed in Postgres 10 and matured through 12-13. By Postgres 14 it is solid enough for primary OLTP tables at companies running tens of thousands of transactions per second. The minimal setup defines a parent with a partition key and attaches children:

-- Parent declares the partition key but holds no rows itself
CREATE TABLE orders (
    order_id BIGSERIAL,
    user_id BIGINT,
    amount NUMERIC(10, 2),
    created_at TIMESTAMPTZ NOT NULL,
    status VARCHAR(20)
) PARTITION BY RANGE (created_at);

-- Child partitions, one per quarter
CREATE TABLE orders_2026_q1 PARTITION OF orders
    FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');

CREATE TABLE orders_2026_q2 PARTITION OF orders
    FOR VALUES FROM ('2026-04-01') TO ('2026-07-01');

CREATE TABLE orders_2026_q3 PARTITION OF orders
    FOR VALUES FROM ('2026-07-01') TO ('2026-10-01');

Every INSERT INTO orders is automatically routed to the correct child based on created_at, and every SELECT with a compatible filter only touches the relevant partitions. Application code does not need to know that partitioning exists.

One consequential rule: rows whose partition key falls outside any existing partition cause the insert to fail. Most teams pre-create partitions several months ahead with a scheduled job.

Automating partition creation

Creating next month's partition by hand is a chore that gets forgotten the one week it actually matters. Two approaches are battle-tested.

The first is pg_partman, a PostgreSQL extension that handles partition creation, retention, and sub-partitioning. A single call configures monthly partitions and maintains three months of forward-dated partitions:

SELECT partman.create_parent(
    p_parent_table => 'public.orders',
    p_control      => 'created_at',
    p_type         => 'native',
    p_interval     => 'monthly',
    p_premake      => 3
);

A background worker runs daily, creates partitions that do not yet exist, and optionally detaches partitions older than the retention window. This is the lowest-risk approach for a team that does not want to maintain its own scheduling layer.

The second approach is a small in-house script. A Python or PL/pgSQL function that runs from cron, GitHub Actions, or any scheduler is fine and gives you full control over naming and retention. Either way the rule is the same: the partition for the next 60-90 days must exist before the first row that needs it arrives, and monitoring should alert when that invariant breaks.

Querying a partitioned table

Application code does not change. Reads and writes target the parent and the planner picks the right children:

-- Postgres reads only orders_2026_q2
SELECT order_id, user_id, amount
FROM orders
WHERE created_at >= '2026-04-15'
  AND created_at <  '2026-04-20';

Confirm pruning by running EXPLAIN. The plan should show an Append node listing only the partitions that survive the filter, not every child. If every child appears even with a targeted filter, the partition key was not used correctly or the filter wraps the column in a function the planner cannot push down.

Partitioning never speeds up a WHERE user_id = 42 query unless you partitioned on user_id. The classic interview question is "we partitioned by date, why is this user_id query still slow", and the answer is that pruning needs the partition key in the predicate.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Archiving and deleting old data

This is the operational superpower of partitioning. Deleting data from a giant unpartitioned table is one of the most painful operations in any database. DELETE FROM orders WHERE created_at < '2025-01-01' on a billion-row table can run for hours, generate massive WAL traffic, bloat tables, and block autovacuum.

With partitions the same archival is a metadata operation:

-- Detach so the partition becomes an independent table
ALTER TABLE orders DETACH PARTITION orders_2024_q1;

-- Snapshot to cold storage, copy to Snowflake, whatever you need
-- Then drop when you no longer need it locally
DROP TABLE orders_2024_q1;

DROP TABLE on a partition completes in milliseconds because it only updates catalog entries and unlinks the underlying files. No row-by-row work, no WAL flood, no autovacuum debt. This alone is the biggest lifecycle win for any team running OLTP workloads with regulatory retention rules.

Partitioning in ClickHouse

ClickHouse handles partitioning differently because the engine is column-oriented and was designed for analytics from the start. The partition expression is part of CREATE TABLE and the engine manages partitions automatically:

CREATE TABLE events (
    event_time DateTime,
    user_id    UInt64,
    event_name String,
    value      Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);

PARTITION BY toYYYYMM(event_time) produces one partition per month with no manual DDL. ORDER BY defines the primary index that sorts rows inside each partition, so the engine combines partition pruning with sparse-index lookups. A query filtering on month plus user gets two layers of skipping for free.

Practical advice from production ClickHouse deployments: keep monthly granularity unless data volume warrants daily. Daily partitions on a fast-growing table inflate part counts and slow merges.

Hash partitioning

When there is no natural range key but you still need to spread write or scan load, hash partitioning fits:

CREATE TABLE users (
    user_id BIGINT,
    name    TEXT,
    email   TEXT
) PARTITION BY HASH (user_id);

CREATE TABLE users_0 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE users_1 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE users_2 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE users_3 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Rows distribute evenly across the four buckets. A lookup by specific user_id still touches only one partition because the planner computes the hash and prunes. A bulk job processing users in parallel can fan four workers across four partitions without contention. Hash partitioning will not speed up a range scan, so reach for it when concurrency is the bottleneck, not selectivity.

Common pitfalls

The biggest pitfall is partitioning on the wrong column. A team partitions on country because that felt natural at design time, then every meaningful query filters on created_at, and pruning never fires. Look at the slow query log before choosing a key, not the schema diagram. If your top queries all filter on created_at, partition on created_at.

The second pitfall is creating too many partitions. Thousands of tiny partitions look great in isolation but planning overhead, file-handle pressure, and metadata bloat become real costs once you cross a few hundred. Aim for 10-100 active partitions and resist the urge to partition by hour just because you can.

The mirror image is creating too few. Splitting a billion-row table into two pieces gives you 500-million-row children, each of which is still slow. The split needs to be aggressive enough that any single query touches a small minority of the data. If you are partitioned but still scanning everything, the granularity is wrong.

Forgetting to create the next partition is the textbook outage cause for self-managed partitioning. The insert that arrives on the first of the month with no destination partition errors out and someone gets paged at 3 a.m. Use pg_partman or a scheduled job, and monitor when the latest partition is less than 30 days into the future.

Index maintenance is subtler than it looks. In Postgres 11+ indexes defined on the parent are automatically created on every partition, but unique indexes must include the partition key. Engineers run into this when they try to add UNIQUE (email) to a table partitioned by created_at and get a confusing error. Either include the partition column in the unique constraint or enforce uniqueness at the application layer with an idempotency token.

Partitioning vs sharding

Interviewers love this question because candidates conflate the two terms constantly. Partitioning splits a table into pieces inside a single database instance. All partitions share the same buffer pool, the same WAL, the same postgresql.conf. The engine does the routing and there are no distributed transactions.

Sharding splits data across multiple database instances behind a routing layer like Vitess, Citus, or an in-house proxy. It introduces cross-shard queries, distributed transactions, rebalancing, and a much larger operational surface. Companies start with partitioning and only move to sharding when a single host can no longer keep up on CPU, memory, or IOPS.

If you want to drill schema-design and query-optimization questions like this every day, NAILDD is launching with 500+ SQL problems built from real interview loops.

FAQ

Is Postgres partitioning production-ready in 2026?

Yes, comfortably. Declarative partitioning in Postgres 10 was usable but limited. Postgres 12 and 13 closed the major gaps around foreign keys, default partitions, and runtime pruning. Any project on Postgres 14 or newer can use partitioning for primary OLTP tables, and most large fintech and marketplace platforms now treat it as the default for tables above 100 million rows.

How many partitions are too many?

The practical ceiling in Postgres is between 1,000 and 5,000 active partitions before planner overhead, lock contention, and metadata bloat start to matter. ClickHouse is more forgiving on storage but more sensitive to part count. If you find yourself wanting tens of thousands of partitions, you almost certainly need either coarser partitions or actual sharding.

Can I add partitioning to an existing table without downtime?

Not directly. Postgres does not let you convert a regular table into a partitioned table in place. The standard pattern is to create a new partitioned table beside the old one, backfill historical data, dual-write briefly to verify, and swap names in a single transaction. For a billion-row table that is a multi-day project.

Is ClickHouse better than Postgres for huge tables?

For analytical workloads on billions of rows, ClickHouse is typically 10-100x faster because the storage engine is column-oriented and partitioning is baked into the writer path. For transactional workloads with updates and strong constraints, Postgres still wins. Most platforms run both: Postgres for OLTP and ClickHouse for dashboards.

How does partitioning interact with indexes?

Indexes created on the parent in Postgres 11+ are automatically created on every partition, so you get familiar single-table semantics. The wrinkle is unique constraints, which must include the partition column to be enforceable. Local indexes are useful when only some children need a particular index, for example a full-text index that only matters on recent data.