Database sharding for the DE interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why sharding shows up in DE interviews

A senior DE at Stripe or DoorDash will tell you the same thing: a single Postgres instance hits a wall somewhere between 5 and 20 TB of hot data, depending on workload and write throughput. The cheapest fix is read replicas, then partitioning. Beyond that, you shard — split data across independent servers that share nothing. That's when your design interview stops being about query plans and starts being about shard keys, rebalancing, and cross-shard joins.

Interviewers like sharding for two reasons. First, the wrong answer can be confidently wrong — "we'll shard by user_id" is the equivalent of "we'll add an index" without thinking about cardinality. Second, it stress-tests how you reason about failure modes that don't exist on a single box: a celebrity user concentrating 30% of writes, a region failover that needs transactional guarantees, a backfill that moves a petabyte without breaking SLA.

This post walks the layouts that come up most — hash, range, geographic, directory — then covers hotspots, resharding, and tooling. The goal is to know which trade-off each scheme buys, so you can pick one out loud without flinching.

Hash sharding

Hash sharding is the default when the interviewer says "uniform write load, no obvious access pattern." Hash the shard key, modulo the shard count:

shard_id = hash(user_id) % N_shards

The win is even distribution. With a good hash function (MurmurHash, xxHash, MD5 truncated) and a high-cardinality key, every shard sees within a few percent of the same traffic. The losses are three, and you should know all three cold.

First, range queries become scatter-gather. SELECT * FROM users WHERE created_at BETWEEN ... fans out to every shard. For analytics this is fine; for OLTP it inflates tail latency.

Second, resharding is brutal. Naive % N reshuffles roughly (N-1)/N of the data when you add a shard. Consistent hashing earns its keep here: with a ring of virtual nodes you only move 1/N of the data. Cassandra, DynamoDB, and Riak all use variants of this trick.

Third, no locality. Two users who signed up the same minute end up on different shards. Cross-shard joins inherit that cost forever.

Load-bearing trick: when an interviewer asks "what if you need to add a shard?", the correct answer is consistent hashing with virtual nodes — not "we'll just rehash everything."

Range sharding

Range sharding splits the keyspace into contiguous slices.

Users 1-1_000_000        -> Shard 1
Users 1_000_001-2_000_000 -> Shard 2
Users 2_000_001-3_000_000 -> Shard 3

HBase, BigTable, and CockroachDB use this by default. The trade is the inverse of hash: range queries are cheap (often single-shard), locality is preserved, but distribution skews easily. Sequential keys (auto-increment IDs, timestamps) write everything to the last shard.

Property Hash sharding Range sharding Geographic sharding
Write distribution Even by design Skewed without auto-split Depends on user geo
Range query cost Scatter-gather across all shards Single shard if range is small Single shard per region
Resharding cost 1/N with consistent hashing Split-and-move per range Re-route per tenant
Locality None Strong (sequential keys adjacent) Strong within region
Best for OLTP point lookups, KV stores Time-series, log scans, analytics Multi-region SaaS, GDPR

The fix for range hotspots is auto-splitting — when a range exceeds a size or QPS threshold, the engine cuts it in half. CockroachDB and Spanner do this transparently. Doing it yourself in application-level sharding is a project that takes three engineers six months and never finishes.

Range sharding pairs well with time-based partitioning inside each shard, so you can drop old partitions cheaply.

Geographic and directory sharding

Geographic sharding assigns shards by user region: EU to an EU cluster, US to a US cluster. The wins are latency (database close to user) and compliance (data residency by design). The cost is that cross-region operations become expensive — a global leaderboard query has to hit every region.

Directory sharding keeps a lookup table mapping each tenant to a specific shard. Used heavily in B2B SaaS for tenant isolation: Acme Corp on Shard 4, Initech on Shard 7, noisy neighbours moved without touching anyone else. The cost is the lookup itself — every query needs the directory, so it becomes a critical path you cache aggressively.

Scheme Locality Isolation Operational cost
Geographic Strong (per region) Per-region failures contained Cross-region joins expensive
Directory Per tenant Tenant-level blast radius Directory cache must be hot
Hybrid (geo + directory) Region + tenant Both Highest, but most flexible

For a B2B fintech at Stripe's scale, the production answer is hybrid — directory at the top, geo for residency, hash or range inside each region. You should be able to draw that on a whiteboard.

Hotspots and how to kill them

A hotspot is when one shard absorbs disproportionate traffic — say 45% of writes hitting one of 10 shards. Three common causes:

The celebrity user problem. One user (a viral creator, a tier-one merchant, a scraping bot) generates orders of magnitude more traffic than the median. Hash sharding doesn't save you — all that user's data lives on one shard by construction.

The time-based key problem. If your shard key is a monotonic timestamp or sequential ID, every new write goes to the same shard until you split. Classic "write to the tail" antipattern.

The skewed natural key problem. Sharding tweets by country_code looks fine until you remember the US accounts for a large fraction of activity while you have 200 country shards. Cardinality is fine; distribution isn't.

Sanity check: before picking a shard key, look at the actual distribution of values in production. A key that looks balanced in theory often has a Zipfian tail nobody noticed until 2 a.m.

Mitigations follow the same patterns regardless of cause. Salting — appending a random suffix (user_id || (random() % 10)) — spreads a single entity across multiple physical shards, at the cost of read fan-out. Read replicas absorb read traffic but do nothing for writes. Dedicated shards for whales — promoting the top 0.1% of accounts to their own infrastructure — is what payment processors do for enterprise customers.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Resharding without downtime

Resharding is the hardest operation in any distributed database. The interview-ready playbook has five stages:

1. Provision new empty shards.
2. Background-copy historical data to new locations.
3. Open a dual-write window (write to old + new).
4. Cut reads over to the new shards.
5. Decommission the old layout after a verification period.

Each step has a failure mode. The copy stage takes days for terabyte data, during which the source is still taking writes — so you capture and replay the delta via CDC on the write-ahead log. The dual-write window is the dangerous one: any bug causes silent divergence between old and new, the kind of incident that ends careers. You verify with continuous shadow reads — read both, compare, alert on mismatch.

A realistic timeline for 10 TB across 8 shards going to 16 at a well-staffed shop:

Phase Duration Risk level
Plan + design review 2-3 weeks Low
Provision and bulk copy 1-2 weeks Medium
Dual-write + shadow reads 2-4 weeks High
Cutover + soak 1 week Highest
Decommission old layout 1-2 weeks Low

Six to twelve calendar weeks for one resharding. Engineers who've lived through one tend to design the next system to never need a second.

Tooling — Vitess, Citus, Mongo

Three names that should be on the tip of your tongue:

Vitess — open-sourced from YouTube, used by Slack, GitHub, Shopify. Sits in front of MySQL with sharding, query routing, online resharding, and a topology-hiding query layer. The default answer for "giant MySQL that needs to scale."

Citus — a Postgres extension (Microsoft-owned, ships in Azure Database for PostgreSQL) that distributes tables across worker nodes. Declare SELECT create_distributed_table('orders', 'customer_id') and Citus handles routing. Strong for multi-tenant SaaS.

MongoDB sharded clusters — Mongo's native option. Config servers hold metadata, mongos routers dispatch queries, shards hold data. Hash or range selectable per collection. Operational complexity is the trade.

Knowing which tool fits which workload — Vitess for MySQL OLTP, Citus for multi-tenant Postgres, Mongo for documents — separates a senior from a staff candidate in interview.

If you want to drill questions like this end-to-end, NAILDD has a growing set of Data Engineering system design problems built around exactly this pattern.

Common pitfalls

The shard-key-as-an-afterthought trap. Teams pick a shard key in week one, ship, and spend year three trying to change it. The shard key locks in every access pattern cheaply, so it deserves more design time than almost any other schema decision. The fix is to enumerate the top five queries by frequency and the top three by latency before the first migration, and verify the shard key serves all of them.

The cross-shard transaction illusion. Most sharded databases either don't support distributed transactions, or support them with caveats that rule out hot paths. Candidates hand-wave "we'll use 2PC" without acknowledging that two-phase commit doubles latency and adds a coordinator failure mode. The fix is to design so anything transactional lives on one shard — meaning the shard key is the entity that needs atomicity (the order, the account, the workspace), not the one that feels semantically natural (the user).

The resharding-is-routine fantasy. Engineers who've never done a live resharding sketch the dual-write phase in two sentences. The reality is that dual-write divergence bugs are the most common cause of data corruption in sharded systems. The fix is to budget for shadow reads, an explicit reconciliation pass, and a rehearsed rollback plan.

The hot-shard whack-a-mole loop. A team notices Shard 3 is hot, moves the heaviest tenant, and three weeks later the next-heaviest tenant makes Shard 3 hot again. Reactive means always one celebrity user behind. The fix is continuous monitoring of per-shard traffic with an alert when any shard exceeds 1.5x the cluster median, plus a documented redistribution playbook that takes hours, not weeks.

The scatter-gather-everywhere anti-pattern. Once sharded, every aggregate query becomes scatter-gather. Teams paper over with stale caches and lagging materialized views. The fix is to push analytical workloads to a separate columnar store — ClickHouse, BigQuery, Snowflake — and keep the sharded OLTP system focused on point operations.

FAQ

How do I pick a shard key in a system design interview?

Start with the access pattern, not the data model. List the top three queries by frequency — if they filter by the same column, that's your shard key candidate. Check that the column has high cardinality (at least 1000x the shard count) and roughly uniform distribution. If a single value carries more than 5% of traffic, you have a hotspot risk that needs salting or a dedicated shard. Sanity-check by asking what cross-shard queries the design forces — if the answer is "everything important," reconsider.

What's the difference between sharding and partitioning?

Partitioning splits a table into pieces on the same database instance — same server, same storage, used for query pruning and easier retention. Sharding splits data across independent instances that share nothing — different servers, different storage, often different physical locations. Partitioning is a single-machine optimisation; sharding is horizontal scale-out. Many systems use both: sharded across servers by tenant, partitioned within each shard by date.

When should I shard versus just add read replicas?

Add read replicas first. Replicas handle read-heavy workloads cheaply, preserve single-database properties, and require almost no application changes. You shard when writes are the bottleneck — your primary can't keep up with throughput, storage exceeds what one machine can hold, or single-region latency is a compliance problem. Heuristic: if read-write ratio is above 10:1 and you haven't run out of disk, replicas are usually enough.

How does consistent hashing actually work?

You map both shards and keys onto the same ring (typically a 32- or 64-bit hash space). For each key, walk clockwise until you hit the first shard — that's where the key lives. Adding a shard moves only the keys between the new shard and its clockwise neighbour, roughly 1/N of the data. To smooth distribution, each physical shard registers many virtual nodes on the ring (100-200 per shard), so removing one spreads its load across many neighbours rather than dumping it on one.

Is sharding ever the wrong answer?

Yes — refusing to shard when not needed is a good signal. If the scenario fits on a beefy Postgres box with read replicas, say so. Sharding adds operational complexity, cross-shard query cost, and a permanent tax on every future schema change. Strong candidates reach for sharding only after vertical scaling, read replicas, caching, partitioning, and archiving cold data have been ruled out. Reaching for the biggest hammer first is a junior tell.