CAP theorem for the systems analyst interview
Contents:
Why interviewers ask
Every distributed system you will ever ship — a payment processor, a marketplace, a ride-hailing backend, a feature flag service — is a continuous compromise between correctness and availability. The systems analyst writes the line in the spec that says "the system must continue to serve requests when one node is unreachable", and a few sprints later the team ships either a fragile synchronous cluster or a fast eventually-consistent mess. Which one you got depends on whether anyone understood CAP before the spec was written.
The pain without CAP is common. The analyst writes "strict consistency across all nodes at all times", engineering builds synchronous multi-region replication, the first cross-region blip takes half the cluster offline. Or the inverse: a spec says "available always", the team ships an AP store for account balances, and a partition causes a double-withdraw bug that lands in a postmortem.
Load-bearing trick: CAP is a forced choice only during a network partition. In normal operation you can have both C and A — the question is what the system does the moment a split happens.
On the SA interview CAP gets asked at almost every mid-level loop, usually right after ACID and right before "design a notification service".
The three CAP properties
Consistency in CAP means every node returns the same data at the same instant. A write to node A is immediately visible on node B. This is linearizable consistency, the strictest form — not the "C" in ACID, which is about constraint preservation. Mixing these up is the single most common interview blunder.
Availability means every request gets a non-error response. There is no promise the data is the freshest version — only that the system answered.
Partition tolerance means the system keeps operating when network connectivity between subsets of nodes fails. Packets get dropped, switches die, regions disconnect from each other — partition tolerance is the property that says "we still serve traffic when this happens".
Brewer's theorem (2000) states that no distributed system can simultaneously guarantee all three. When a partition occurs, you must choose between C and A. In any real production system partitions are inevitable — networks fail, hardware dies, cloud providers have bad days — so the live choice is between CP (favor consistency, sacrifice availability during a split) and AP (favor availability, accept inconsistency during a split).
A useful mental model: CAP is not a triangle you draw on a whiteboard. It is a behavior the system exhibits the moment the network breaks.
CP, AP, CA — what to pick
CP systems block operations during a partition rather than risk serving stale data. The minority partition refuses writes (and often reads) until quorum returns. Examples: HBase, MongoDB with majority writes, Postgres with synchronous replication, ZooKeeper, etcd, Consul. Natural fit: bank balances, medical records, legal documents, distributed locks, service discovery.
AP systems keep accepting requests on both sides of the partition, knowing replicas will diverge and reconcile later. Conflicts get resolved through last-write-wins, vector clocks, or CRDTs. Examples: Cassandra, DynamoDB with eventual reads, CouchDB, Riak. Natural fit: social feeds, product catalogs, DNS, metrics and counters, S3-style blob stores.
CA does not exist in a distributed setting. If you have more than one node, the network will eventually partition, and you must pick. A single-node Postgres labeled "CA" just means "not distributed". On the interview, "we'll build a CA system" signals you haven't thought about this seriously.
Sanity check: If the question is "what does your system do when the network between us-east and eu-west breaks for 30 seconds?" and your answer is "that won't happen", you have failed the CAP question regardless of what you said about C, A, and P.
BASE and eventual consistency
BASE — basically available, soft state, eventual consistency — is the AP-system counterpart to ACID. It is not a property you turn on; it is a description of how an AP store behaves.
- Basically Available — the system responds, even if the response uses stale data
- Soft State — state can change without an explicit client request, because background replication is happening
- Eventual Consistency — once writes stop, all replicas will converge to the same value
In practice eventual consistency converges in tens to hundreds of milliseconds for in-region replication, and single-digit seconds for cross-region. A write to Cassandra lands in one rack, replicates to two more within 50–500 ms, and reaches the other region a moment later. During that window a reader can see the old value, the new value, or nothing — depending on which replica they hit.
Read-your-own-writes guarantees that after a client writes X, that same client's next read returns X (other clients may still see the old value). Implemented via session stickiness or overlapping read/write quorums.
Quorum math is worth memorizing. With N replicas, W = write quorum, R = read quorum, the rule R + W > N guarantees any read intersects with the latest write. Cassandra's classic setup is N=3, W=2, R=2 — 2 + 2 > 3. Drop W to 1 for throughput and you lose the guarantee.
Distributed database examples
| Database | CAP type | Notes |
|---|---|---|
| Postgres (single node) | CA (by elimination) | Not distributed, so CAP doesn't really apply |
| Postgres + sync replica | CP | Primary waits for replica ack; partition stops writes |
| MongoDB (W=majority) | CP | Majority write concern; primary down triggers reelection |
| Cassandra | AP | Eventual by default; tunable per request via consistency level |
| DynamoDB | Tunable | ConsistentRead=true is CP; default is AP/eventual |
| HBase | CP | Master + Region Servers; partition kills availability for affected regions |
| Redis Cluster | AP | Master/replica async; split-brain can drop writes |
| etcd / ZooKeeper | CP | Raft / ZAB consensus, used for config and locks |
| Spanner | CP-with-low-latency | TrueTime atomic clocks make global linearizability practical |
| Kafka | CP (per partition) | ISR + min.insync.replicas; loses availability if quorum lost |
The classic interview trap is "is MongoDB CP or AP?" The good answer: "by default CP with majority write concern; configure w=1 for AP-like behavior — it depends on write concern." That signals you know Mongo's behavior is a knob, not a fixed point.
A second trap is Spanner. People hear "globally distributed SQL" and assume AP. Spanner is CP — Google made it CP by inventing a hardware-backed clock system that makes consistency cheap.
PACELC — the extended model
PACELC (Daniel Abadi, 2010) extends CAP with the trade-off that exists outside of partitions:
- Partition: under a partition, choose between A and C
- Else (normal operation): trade off between Latency and Consistency
The insight is that even when nothing is broken, strict consistency costs you latency — every write has to wait for acks from other nodes, every linearizable read has to confirm it's not stale. So a complete characterization includes both the partition behavior and the normal-operation behavior.
| System | PACELC | Reading |
|---|---|---|
| Spanner | PC/EC | CP under partition, prefers consistency over latency normally |
| Cassandra | PA/EL | AP under partition, low-latency-over-consistency normally |
| MongoDB | PA/EC | AP-ish under partition, strict read concern available |
| Redis Cluster | PA/EL | Eventual between master and replica, optimized for speed |
| DynamoDB | PA/EL | AP under partition, eventual reads by default |
PACELC is the senior-level answer on the interview. Mid-level needs CAP; if you bring PACELC unprompted and explain why a Spanner read costs more than a Cassandra read in healthy operation, you have closed the loop the interviewer was hoping you'd close.
Common pitfalls
Designing for "CA" is the first big trap. If your system has more than one node communicating over a network, CA is not on the menu — partitions will happen. The fix is to decide upfront whether the spec needs CP or AP, write that decision into the requirements, and call out the behavior during a split explicitly. "On primary failover the system rejects writes for up to 10 seconds while a new primary is elected" is a real requirement; "the system is always consistent and always available" is wishful thinking.
Treating eventual consistency as "broken" is the second. Most consumer products on earth — social feeds, marketplaces, ad-serving, content recommendations — run on eventual consistency and nobody notices. The fix is to write the staleness budget into the spec. "Feed posts may take up to 5 seconds to appear for other users" is a clean requirement that engineering can validate; "feed must always be up to date" is a trap that drags you back into a CP architecture that you didn't actually need.
Ignoring write concern and consistency level trips up plenty of mid-level candidates. In Cassandra, MongoDB, DynamoDB, and Kafka the consistency knob lives on the operation, not on the cluster. The fix is to specify the level in the spec: "writes use majority concern, reads use local quorum". Saying just "we use MongoDB" tells engineering nothing about the guarantees the business needs.
Applying ACID instincts to distributed systems is the most common architectural sin. Two-phase commit across regions is slow, fragile, and locks resources for the duration of the slowest participant. The fix is Saga, outbox, event sourcing, or CDC-based projections — patterns that accept temporary inconsistency in exchange for resilience.
Not pricing in the cost of consistency is the sneaky one. Every strong-consistency read or write means at least one cross-node round trip — a few milliseconds in-region, 80–200 ms cross-region. The fix is to budget consistency the way you budget latency. "This endpoint must respond in under 100 ms" means you cannot afford a synchronous quorum across two continents.
Defaulting to AP everywhere is the inverse trap. Anywhere money, contracts, inventory commitments, or legal records live, AP can mean double-spend, oversold inventory, or duplicate charges. The fix is the boring one: AP for feeds and counters and catalogs, CP for ledgers and balances and locks, and a clear line in the spec about which is which.
Related reading
- ACID and isolation levels for the systems analyst interview
- Cache strategies for the systems analyst interview
- Kafka for the systems analyst interview
- API gateway vs BFF for the systems analyst interview
- Case interview for systems analysts
If you want to drill systems-analyst questions like this every day, NAILDD is launching with hundreds of SA problems on exactly this pattern — CAP, ACID, caching, messaging, and the trade-offs interviewers actually probe.
FAQ
Who proved the CAP theorem?
Eric Brewer proposed the conjecture in a 2000 keynote. It was formalized and proven by Seth Gilbert and Nancy Lynch at MIT in 2002. Brewer later (2012) wrote a clarifying piece pointing out that the theorem is often misread — CAP describes behavior under partition, not a permanent architectural choice. Citing both the original and the 2012 update on an interview signals you've actually read the source.
Is MongoDB CP or AP?
It depends on write concern and read concern, which are configurable per operation. With writeConcern: majority and readConcern: majority it behaves as CP — writes wait for acknowledgment from a majority of replicas, reads see only majority-committed data. With writeConcern: 1 you only wait for the primary, which is closer to AP behavior because a primary failover can lose unreplicated writes. Default-config modern MongoDB leans CP, but the honest answer on an interview is "depends on write concern".
What's the difference between sharding and replication in the CAP context?
Sharding splits data — different shards hold different rows. Replication copies data — multiple nodes hold the same rows. CAP applies to replication (multiple copies of the same data can diverge during a partition). Sharding by itself doesn't create a CAP trade-off, but distributed systems almost always combine the two, so in practice CAP shows up everywhere. A sharded-but-not-replicated store loses availability for a shard when the shard's node dies, with no consistency question to answer.
How do I write CAP into a spec without sounding academic?
Don't say "CAP" — say what the system does in plain language. AP: "the system continues to serve reads when a node is unreachable; changes may take up to 5 seconds to propagate." CP: "when the primary is unreachable, writes are blocked until a new primary is elected (typically 5–15 seconds); reads continue from replicas." The interviewer wants to see you translate the trade-off into a requirement engineering can build against.
What is split-brain?
A split-brain is a partition where multiple nodes each believe they are the primary at the same time. On AP systems it's tolerated — both sides keep accepting writes, and a reconciliation mechanism (last-write-wins, vector clocks, CRDTs) merges the results when connectivity returns. On CP systems split-brain is the failure mode you must prevent — that's the job of quorum-based leader election (Raft, Paxos, ZAB) and fencing tokens. If a candidate proposes leader election without quorum, they have built a split-brain generator.
Is this content official?
No. This article distills publicly documented sources: Brewer's CAP conjecture (2000), Gilbert & Lynch's proof (2002), Brewer's 2012 follow-up in IEEE Computer, Abadi's PACELC paper (2010), and the official docs of the databases referenced. Vendor behavior changes between versions; confirm against the docs for the version you're shipping.