2PC vs Saga in the systems analyst interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why interviewers ask this

The moment an interviewer at Stripe, DoorDash, or Uber sketches an order-checkout flow, the next question is almost always "how do you keep the four services consistent?" Microservices broke the ACID guarantees most analysts learned on a monolith, and the 2PC vs saga trade-off is one of the cleanest signals of whether a systems analyst can design realistic integrations or is still mentally living inside a single Postgres database.

This is a must-have topic for any mid-to-senior systems analyst (SA) loop. Hiring managers want you to reason about partial failure, not recite definitions. The classic order → payment → inventory → shipping flow keeps coming back because it forces you to talk about locks, retries, idempotency, and observability at once. Candidates who stumble are the ones who write "wrap everything in a transaction" and then look surprised when the tech lead asks which of the four databases that transaction lives in.

The pain of getting this wrong is concrete: you ship a requirements doc that assumes atomicity, the team gets two sprints in, and the architect throws it back because the flow cannot exist across four service boundaries.

The distributed transaction problem

Inside a monolith, ACID gives you everything for free: either every statement commits, or none do. Step outside one database and three things break at once. Each service owns its own data store. No service can synchronously roll back changes already committed in another service. And the network is hostile — requests hang, get dropped, or arrive twice.

Take the canonical Amazon-style checkout. A user clicks "Place order" and four things must happen:

  1. Create the order in the Orders service.
  2. Reserve stock in the Inventory service.
  3. Charge the card in the Payment service.
  4. Schedule the shipment in the Shipping service.

If step 3 fails, what do you do with the order that's already created and the stock that's already reserved? Leaving it pending is a bug with a polite name. You need a protocol that drives the workflow to completion or unwinds it cleanly. Three families exist: 2PC, saga, and eventual consistency built on top of saga.

Property                          | Monolith ACID | 2PC                            | Saga
Atomicity                         | Free          | Yes (with locks)               | No, only "eventually"
Isolation                         | Free          | Yes                            | No, semantic locks only
Latency cost                      | Low           | High (sync coordination)       | Medium (async steps)
Failure radius                    | Single DB     | Coordinator + all participants | One step at a time
Realistic for REST microservices? | N/A           | Almost never                   | Yes, the default

Two-phase commit (2PC)

A coordinator plus N participants agree on a single outcome through two synchronous rounds.

Phase 1 — prepare: the coordinator asks each participant PREPARE?. Each participant checks whether it could commit, reserves the resource, holds a lock, and replies YES or NO. The lock is held until the coordinator returns with the verdict — that single sentence is the source of every 2PC pain point.

Phase 2 — commit or abort: if every participant said YES, the coordinator sends COMMIT to all. A single NO flips the whole thing to ABORT.

Coordinator → All: PREPARE?
All → Coordinator: YES
Coordinator → All: COMMIT

Elegant on paper, brutal in production. Locks block other writers from the moment a participant votes YES in phase 1 until the coordinator's verdict arrives in phase 2. The coordinator is a single point of failure: if it dies between phases, participants are stuck holding locks. The protocol cannot handle long operations, since a one-minute lock window is a disaster at scale. And REST microservices have no native 2PC support; you'd bolt on XA or hand-roll a protocol, and almost nobody does.
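The two phases can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production protocol: the Participant class, its can_commit flag, and the two_phase_commit function are hypothetical names invented for the example, and real participants would be remote resource managers with durable logs and timeouts.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical in-memory participant; real ones are remote resource managers."""
    name: str
    can_commit: bool = True
    locked: bool = False
    state: str = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote. A YES vote means the lock is held until the verdict arrives.
        if self.can_commit:
            self.locked = True
            self.state = "prepared"
            return True
        self.state = "refused"
        return False

    def commit(self) -> None:
        self.state = "committed"
        self.locked = False

    def abort(self) -> None:
        self.state = "aborted"
        self.locked = False


def two_phase_commit(participants: list[Participant]) -> str:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: unanimous YES commits; a single NO aborts everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.abort()
    return "ABORT"


services = [Participant("orders"), Participant("payment", can_commit=False)]
print(two_phase_commit(services))
```

Note where the pain lives: if the coordinator crashed after the prepare calls and before phase 2, every participant that voted YES would stay locked with no way to decide on its own.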

Load-bearing rule: 2PC trades availability for consistency and assumes participants are fast and trustworthy. Production microservices fail both assumptions, which is why saga wins outside the database boundary.

In practice, 2PC lives inside a single distributed database (sharded Postgres via FDW, Greenplum, some XA-enabled JEE stacks) and almost nowhere else. 3PC — a three-phase variant that adds a pre-commit round to avoid blocking — is a textbook curiosity that essentially never ships.

Saga and compensating transactions

A saga splits one long business transaction into a sequence of local transactions, each with a compensating transaction that undoes its effect.

T1: create order         → C1: cancel order
T2: reserve stock        → C2: release stock
T3: charge card          → C3: refund card
T4: create shipment      → C4: cancel shipment

If T3 fails, the saga runs C2 and C1. T4 never happens. The compensations execute in reverse order of the forward steps.
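A small sketch of that unwinding, assuming each step is just a pair of Python callables. The run_saga function and the step names are illustrative only; a real saga would persist its progress so it can resume after a crash.

```python
from typing import Callable

# (name, forward action, compensating action)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run forward steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        name, forward, _ = step
        try:
            forward()
        except Exception:
            log.append(f"{name}: failed")
            for done_name, _, compensate in reversed(done):
                compensate()
                log.append(f"{done_name}: compensated")
            return log
        done.append(step)
        log.append(f"{name}: ok")
    return log


def charge_card() -> None:
    raise RuntimeError("card declined")  # simulate T3 failing


def noop() -> None:
    pass


checkout = [
    ("create order", noop, noop),
    ("reserve stock", noop, noop),
    ("charge card", charge_card, noop),
    ("create shipment", noop, noop),
]
print(run_saga(checkout))
```

The log shows the two successful steps, the failed charge, and then the compensations for stock and order in that order; the shipment step never appears.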

Three saga properties matter for the interview. It is not atomic — between steps the system is in a partial state, so your data model needs explicit status enums (pending, reserved, paid, shipped, compensating, failed). It is not isolated — concurrent sagas see in-flight orders, handled via semantic locks where pending blocks other workflows from touching the same aggregate. It is durable, because each local transaction commits immediately and releases its locks before the next step runs.
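One way to make the partial states explicit is a status enum with a transition map. The sketch below uses the status names from the paragraph above; the allowed transitions and the choice to treat every in-flight status (not only pending) as a semantic lock are assumptions for illustration, since the real rules depend on the business.

```python
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"
    RESERVED = "reserved"
    PAID = "paid"
    SHIPPED = "shipped"
    COMPENSATING = "compensating"
    FAILED = "failed"


# Assumed transition map; the real one depends on the business rules.
ALLOWED = {
    OrderStatus.PENDING: {OrderStatus.RESERVED, OrderStatus.COMPENSATING},
    OrderStatus.RESERVED: {OrderStatus.PAID, OrderStatus.COMPENSATING},
    OrderStatus.PAID: {OrderStatus.SHIPPED, OrderStatus.COMPENSATING},
    OrderStatus.COMPENSATING: {OrderStatus.FAILED},
    OrderStatus.SHIPPED: set(),
    OrderStatus.FAILED: set(),
}

# Statuses that act as a semantic lock: the saga is still in flight.
IN_FLIGHT = {
    OrderStatus.PENDING,
    OrderStatus.RESERVED,
    OrderStatus.PAID,
    OrderStatus.COMPENSATING,
}


def transition(current: OrderStatus, target: OrderStatus) -> OrderStatus:
    """Return the new status, or raise if the saga may not move there."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def can_edit(status: OrderStatus) -> bool:
    """Other workflows may touch the order only once the saga is terminal."""
    return status not in IN_FLIGHT
```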

Gotcha: A compensating transaction is rarely the literal inverse of the forward step. A refund is not "uncharge" — it's a new accounting entry. In a ledger you'll see both the original charge and the refund. Auditors prefer this; "magic erase" is what fraud looks like.

Choreography vs orchestration

Once you've committed to saga, there are two implementation styles, and interviewers will absolutely ask you to compare them.

Choreography. Every service subscribes to events and reacts; no central brain. Orders publishes OrderCreated, Inventory consumes and emits ItemReserved, Payment emits PaymentProcessed, Shipping emits ShipmentCreated. A *Failed event triggers compensations. The win is loose coupling — add a fraud check by subscribing it without touching anyone else. The cost is visibility: no single place to see "where is order #123 right now?" You end up building an event tracker, and business logic smears across every service.
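A toy version of that event chain, assuming an in-memory bus in place of a real broker. EventBus, the handler functions, and the StockReleased and OrderCancelled event names are hypothetical; only the forward event names come from the description above.

```python
from collections import defaultdict, deque
from typing import Callable

Handler = Callable[[dict], list[tuple[str, dict]]]


class EventBus:
    """Tiny in-memory stand-in for a broker such as Kafka; not a real client."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Handler]] = defaultdict(list)
        self.history: list[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        queue = deque([(event_type, payload)])
        while queue:
            etype, data = queue.popleft()
            self.history.append(etype)
            for handler in self.handlers[etype]:
                queue.extend(handler(data))  # handlers return follow-up events


def inventory(order: dict) -> list[tuple[str, dict]]:
    return [("ItemReserved", order)]


def payment(order: dict) -> list[tuple[str, dict]]:
    if order["amount"] > order["card_limit"]:
        return [("PaymentFailed", order)]
    return [("PaymentProcessed", order)]


def shipping(order: dict) -> list[tuple[str, dict]]:
    return [("ShipmentCreated", order)]


def release_stock(order: dict) -> list[tuple[str, dict]]:
    return [("StockReleased", order)]  # compensation for ItemReserved


def cancel_order(order: dict) -> list[tuple[str, dict]]:
    return [("OrderCancelled", order)]  # compensation for OrderCreated


bus = EventBus()
bus.subscribe("OrderCreated", inventory)
bus.subscribe("ItemReserved", payment)
bus.subscribe("PaymentProcessed", shipping)
bus.subscribe("PaymentFailed", release_stock)
bus.subscribe("StockReleased", cancel_order)
bus.publish("OrderCreated", {"id": 123, "amount": 50, "card_limit": 20})
print(bus.history)
```

The workflow exists only as the set of subscriptions. The history list is a convenience of the toy bus; in a real deployment nothing records the whole path unless you build that tracker yourself.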

Orchestration. A central saga orchestrator owns the workflow definition and tells each service what to do.

Orchestrator:
  Order.create()
  Inventory.reserve()
  Payment.charge()
  Shipping.create()

On failure the orchestrator knows the exact compensations to run. The trade-off: visibility and control on one side, a new critical component on the other. Production sagas of more than a few steps typically use orchestration via Temporal, AWS Step Functions, Netflix Conductor, or Camunda, because choreography's observability cost grows with every additional step.
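To make the visibility argument concrete, here is a rough in-process sketch of an orchestrator that keeps a per-order history. CheckoutOrchestrator and where_is are hypothetical names, and the state lives in a plain dict; engines such as Temporal or Step Functions persist the equivalent state durably and are not modeled here.

```python
from typing import Callable

Action = Callable[[int], None]


class CheckoutOrchestrator:
    """Hypothetical in-process orchestrator holding the workflow and its state."""

    def __init__(self, steps: list[tuple[str, Action, Action]]) -> None:
        self.steps = steps                     # (name, forward, compensate)
        self.state: dict[int, list[str]] = {}  # order_id -> step history

    def run(self, order_id: int) -> str:
        history = self.state.setdefault(order_id, [])
        completed: list[tuple[str, Action]] = []
        for name, forward, compensate in self.steps:
            try:
                forward(order_id)
            except Exception as exc:
                history.append(f"{name}: FAILED ({exc})")
                for done_name, undo in reversed(completed):
                    undo(order_id)
                    history.append(f"{done_name}: COMPENSATED")
                return "FAILED"
            history.append(f"{name}: DONE")
            completed.append((name, compensate))
        return "COMPLETED"

    def where_is(self, order_id: int) -> str:
        """Answer 'where is order N right now?' from orchestrator state alone."""
        history = self.state.get(order_id)
        return history[-1] if history else "unknown"


def ok(order_id: int) -> None:
    pass


def decline(order_id: int) -> None:
    raise RuntimeError("card declined")


orchestrator = CheckoutOrchestrator([
    ("Order.create", ok, ok),
    ("Inventory.reserve", ok, ok),
    ("Payment.charge", decline, ok),
    ("Shipping.create", ok, ok),
])
print(orchestrator.run(123), "|", orchestrator.where_is(123))
```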

Dimension      | Choreography                   | Orchestration
Coupling       | Loose                          | Tighter (services know orchestrator)
Logic location | Spread across services         | Centralized
Observability  | Hard, needs an event tracker   | Easy, orchestrator state
Best for       | 2-3 steps, simple compensation | 5+ steps, complex rollback, audit
Common tools   | Kafka + handlers               | Temporal, Step Functions, Camunda

Rule of thumb: 2-3 steps → choreography is fine. 5+ steps or regulated workflows → orchestration. Anything finance- or compliance-adjacent — Stripe-style payments, a healthcare claim, a marketplace payout — wants explicit orchestration so auditors can read the workflow definition as a source of truth.

Eventual consistency and the outbox pattern

Saga is the canonical case of eventual consistency: the system is not consistent at every moment, but it converges. The window of inconsistency is real and visible to users; UX needs to know about it.

The mistake that breaks saga is the dual write problem. A service tries to (a) write to its own DB and (b) publish an event to Kafka in the same logical step. If the DB write succeeds but Kafka is unreachable, the event is lost and the saga stalls silently. Try/catch doesn't help — you've already committed locally.

The fix is the outbox pattern:

  1. In one local transaction, write the business data and an event row into an outbox table.
  2. A separate worker (or Debezium CDC) reads outbox, publishes to Kafka, marks the row sent.
  3. If the worker crashes, it resumes from unsent rows after restart.
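The three steps above can be sketched with SQLite standing in for the service database and a plain callback standing in for the Kafka producer. The table layout, create_order, and relay_once are assumptions made for the example; a production relay would more likely be a CDC connector or a polling worker with batching and backoff.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id: int) -> None:
    # Step 1: business row and event row commit (or roll back) together.
    with db:
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'pending')", (order_id,))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(publish) -> int:
    # Step 2: publish unsent rows in order; mark each sent only after publish returns.
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # may raise; the row stays unsent
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []
create_order(42)
relay_once(lambda event_type, payload: published.append((event_type, payload)))
print(published)
```

If the process dies between publish and the UPDATE, the row is still unsent and will be published again on the next pass. That is step 3 in action, and it is exactly why delivery is at-least-once rather than exactly-once.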

The guarantee is at-least-once delivery, so consumers must be idempotent — typically via an idempotency_key short-circuit. The symmetric pattern on the consumer side is the inbox: persist the incoming event before processing so retries don't double-apply.
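A minimal inbox sketch, again with SQLite as a stand-in. It uses one common variant in which the event ID and the business change are written in the same local transaction, so a duplicate delivery trips the primary key and changes nothing. The table names and handle_order_created are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE inbox (event_id TEXT PRIMARY KEY);
    CREATE TABLE stock (sku TEXT PRIMARY KEY, reserved INTEGER NOT NULL);
    INSERT INTO stock (sku, reserved) VALUES ('sku-1', 0);
""")


def handle_order_created(event_id: str, sku: str, qty: int) -> bool:
    """Apply the event once; return False if it was a duplicate delivery."""
    try:
        with db:  # inbox row and business update commit together
            db.execute("INSERT INTO inbox (event_id) VALUES (?)", (event_id,))
            db.execute(
                "UPDATE stock SET reserved = reserved + ? WHERE sku = ?", (qty, sku)
            )
    except sqlite3.IntegrityError:
        return False  # primary-key clash: this event was already processed
    return True


print(handle_order_created("evt-1", "sku-1", 2))  # first delivery
print(handle_order_created("evt-1", "sku-1", 2))  # redelivery of the same event
```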

Sanity check: If your design has any service calling db.commit() and then kafka.publish() in sequence, it's broken under partial failure. Outbox or transactional CDC are the only safe options.

When to pick what

A short decision tree:

  • Inside a single (even distributed) database: use DB transactions. XA or 2PC if the engine supports it. Don't reinvent saga where ACID exists.
  • 2-3 microservices, simple flow: choreography saga over Kafka or NATS.
  • 5+ services, complex compensations, audit needed: orchestration via Temporal, Step Functions, or Camunda.
  • Strong real-time consistency required: revisit your service boundaries — they're probably wrong.
  • Need guaranteed event delivery: outbox, at-least-once consumer, idempotency keys everywhere.

Common pitfalls

Using 2PC between REST microservices. Technically possible with XA proxies, practically a way to ship locks and timeouts to prod. The correct framing is that 2PC belongs inside a database boundary; cross-service workflows belong in saga.

Treating saga as a weaker transaction. It's a different consistency class, not a worse one. Calling saga "weaker" suggests you'll design the spec as if you could secretly buy back ACID with enough engineering. You can only buy back eventual atomicity — communicate that to product so UX accounts for in-flight states.

Forgetting to design compensations. A saga without compensations is just an ordered chain of calls that wedges itself the first time step three fails. Compensations are half the design. For every forward step, the spec should list the compensation, the trigger event, and the side effects that cannot be undone (emails sent, push notifications fired, partner APIs called).

Skipping idempotency. In eventual-consistency systems, every message gets delivered more than once eventually. Every step must be idempotent on its idempotency_key, or the second delivery charges twice, reserves twice, ships twice. Idempotency is not optional; it's the price of admission.
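As a rough illustration of the idempotency_key short-circuit, the sketch below keeps results in a dict. PaymentService and its fields are hypothetical; in a real service the lookup and the charge would share a durable store and a transaction.

```python
class PaymentService:
    """Hypothetical payment step; `charges` stands in for a durable table."""

    def __init__(self) -> None:
        self.charges: dict[str, dict] = {}
        self.total_charged = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # Short-circuit: a repeated key returns the stored result, no new charge.
        if idempotency_key in self.charges:
            return self.charges[idempotency_key]
        self.total_charged += amount_cents
        result = {"status": "charged", "amount_cents": amount_cents}
        self.charges[idempotency_key] = result
        return result


payments = PaymentService()
first = payments.charge("order-42-charge", 1999)
second = payments.charge("order-42-charge", 1999)  # duplicate delivery
print(first == second, payments.total_charged)
```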

Ignoring externally-visible side effects. Compensating a charge is fine — refunds are first-class. Compensating a confirmation email that already reached the customer is not. Mark steps with irreversible external effects and either delay them until the saga commits or design an apology flow.

Skipping the correlation ID. Without a saga_id (or correlation_id) on every event and log line, debugging a stuck saga is detective work in the dark. The first thing SRE asks is "give me the trace for order 42"; if your design doesn't propagate that ID end to end, the answer is "we can't".
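One possible way to propagate the ID in Python is a context variable plus a logging filter, sketched below. The saga_id_var, SagaIdFilter, and make_event names are assumptions for the example; the logging and contextvars calls are standard library.

```python
import contextvars
import io
import logging

saga_id_var = contextvars.ContextVar("saga_id", default="-")


class SagaIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.saga_id = saga_id_var.get()  # stamp every record with the current saga
        return True


stream = io.StringIO()  # captured in memory for the example only
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("saga_id=%(saga_id)s %(message)s"))
handler.addFilter(SagaIdFilter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def make_event(event_type: str, payload: dict) -> dict:
    # Every outgoing event carries the same ID so downstream services can continue the trace.
    return {"type": event_type, "saga_id": saga_id_var.get(), **payload}


saga_id_var.set("saga-42")
logger.info("stock reserved")
event = make_event("ItemReserved", {"order_id": 42})
print(stream.getvalue().strip(), event)
```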

Designing a 20-step saga. Long sagas are unmaintainable and impossible to test. Above seven or eight steps, decompose into sub-sagas with their own compensations and a parent saga that coordinates them.

Confusing orchestrator with business logic. The orchestrator coordinates the workflow; it does not own domain rules. Pricing logic stays in Payment, stock rules stay in Inventory. An orchestrator that contains business logic is a distributed monolith with extra steps.

FAQ

Does saga guarantee consistency?

It guarantees eventual consistency. At any moment, the databases behind your services can disagree — order says paid, payment says pending, inventory says reserved — but the saga drives the system toward a terminal state where either every step succeeded or every successful step was compensated. If the interviewer pushes, name the trade-off: you traded immediate atomicity for availability and scalability.

Can 2PC ever work in microservices?

Technically yes, via XA transaction managers or hand-rolled protocols over REST. In practice almost nobody does it in modern architectures: the lock-holding window across a network is unacceptable and the coordinator failure modes are brutal. Proposing 2PC for an Uber-style flow on a whiteboard usually tells the interviewer that you need course-correcting.

Why isn't Kafka enough on its own?

Kafka is a transport. It gives you at-least-once or, with idempotent producers and transactions, exactly-once delivery to a topic. It does not know that "place order" means four coordinated business steps with compensations. You still need a saga — choreographed or orchestrated — running on top. Treating Kafka itself as a saga is a common interview red flag.

What is a process manager?

A synonym for saga orchestrator, popularized by Gregor Hohpe and Bobby Woolf's Enterprise Integration Patterns. It refers to the central component that holds saga state and decides the next step. Naming EIP is a small but real signal of seniority.

How do you test a saga?

End-to-end tests cover the happy path and every compensation branch. Above that, chaos testing is mandatory for any saga that touches money: kill the payment service mid-saga, kill the orchestrator between steps, kill Kafka brokers, replay duplicate events, verify convergence.
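The "every compensation branch" part can be approximated with a loop that injects a failure at each step and checks convergence. The sketch below runs against a toy in-memory saga; run_checkout, its fail_at parameter, and converged are hypothetical, and real chaos tests would kill actual processes rather than flip a flag.

```python
from typing import Optional

STEPS = ("order", "stock", "charge", "shipment")


def run_checkout(fail_at: Optional[str]) -> dict:
    """Hypothetical four-step saga over in-memory state; `fail_at` injects a failure."""
    state: dict = {step: False for step in STEPS}
    done: list[str] = []
    for step in STEPS:
        if step == fail_at:
            for finished in reversed(done):
                state[finished] = False  # compensation undoes the local effect
            state["outcome"] = "compensated"
            return state
        state[step] = True
        done.append(step)
    state["outcome"] = "completed"
    return state


def converged(state: dict) -> bool:
    """Terminal states only: everything applied, or nothing left applied."""
    effects = [state[step] for step in STEPS]
    if state["outcome"] == "completed":
        return all(effects)
    return not any(effects)


# Inject a failure at every step, plus the happy path, and check convergence each time.
for fail_at in [None, *STEPS]:
    assert converged(run_checkout(fail_at)), f"did not converge when failing at {fail_at}"
print("all branches converge")
```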

Is this in any official spec?

No. Saga is an architecture pattern, not a standard; only 2PC has a formal interface specification, X/Open XA. References: Chris Richardson at microservices.io, Hohpe and Woolf's Enterprise Integration Patterns, Temporal and Camunda docs, and the original "Sagas" paper by Garcia-Molina and Salem (1987).