Kafka rebalance for Data Engineering interviews

Why rebalance is asked at every DE loop

When a Stripe or DoorDash interviewer asks how your Kafka consumer group reacts to a deploy, they are not testing whether you can recite the docs. They want to know if you understand that a single noisy consumer can stall an entire pipeline for tens of seconds, that Kafka 3.0 added cooperative-sticky to the default assignor list for a reason, and that max.poll.interval.ms is in the line of fire more often than session.timeout.ms. Rebalance is where Kafka stops being a queue and starts being a distributed coordination problem.

The reason this topic dominates senior loops is simple. Rebalance bugs are invisible in a dev environment with one consumer and visible only in production with dozens of partitions and a rolling deploy. The candidate who has felt the pain — watched lag climb during a Kubernetes restart, debugged a poison message that blew past max.poll.interval.ms — explains it differently than the candidate who has only read the page. Interviewers can tell within two questions.

This guide walks through the model the way a senior engineer at Airbnb or Snowflake would expect you to. You will get the trigger list, the eager vs cooperative-sticky timeline, an assignor comparison table, the static membership story, and the tuning knobs that actually matter. If you have already read the Kafka consumer groups deep dive, this post picks up exactly where that one stops.

What triggers a rebalance

A rebalance is initiated by the group coordinator — a designated broker — whenever the membership of the consumer group changes, or whenever the topic metadata changes in a way that affects partition assignment. The five canonical triggers are:

1. A consumer joins the group for the first time, typically because a new pod came up after a deploy or a horizontal scale-up.
2. A consumer leaves the group gracefully by calling close(), which sends a LeaveGroup request.
3. A consumer crashes or loses its network connection and misses heartbeats for longer than session.timeout.ms.
4. A subscribed topic gains partitions (or is deleted) through an admin operation.
5. The one that catches teams in production: a consumer exceeds max.poll.interval.ms because processing took too long between two poll() calls, at which point it is treated as failed and removed from the group.

The last trigger is the most operationally painful because the consumer looks alive: heartbeats are flowing and the process is healthy, but the main poll loop has been stuck inside user code for longer than the threshold. Nothing in the protocol can tell "stuck on a bad message" apart from "slow but about to finish," so once the deadline passes the Java client's heartbeat thread stops heartbeating and sends a LeaveGroup request, and the coordinator reassigns the partitions.

Load-bearing trick: Heartbeats run on a background thread. max.poll.interval.ms is the only thing that protects the group from a consumer that is healthy at the socket layer but frozen inside the application.
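As a rough illustration of the two independent deadlines, here is a minimal Python sketch. The function name and its inputs are hypothetical and only model the decision; they are not part of any Kafka client API. The defaults assume a 3.0+ Java client.

```python
from typing import Optional


def eviction_reason(ms_since_last_heartbeat: int,
                    ms_since_last_poll: int,
                    session_timeout_ms: int = 45_000,
                    max_poll_interval_ms: int = 300_000) -> Optional[str]:
    """Return the config whose deadline would remove this consumer, or None.

    Simplified model: heartbeats and poll() calls are tracked separately,
    so a consumer can pass one check and fail the other.
    """
    if ms_since_last_heartbeat > session_timeout_ms:
        return "session.timeout.ms"
    if ms_since_last_poll > max_poll_interval_ms:
        return "max.poll.interval.ms"
    return None


# Heartbeat thread is fine, but the poll loop has been stuck for 400 s.
print(eviction_reason(ms_since_last_heartbeat=1_000, ms_since_last_poll=400_000))
```

The frozen-handler case only shows up in the second check, which is why raising session.timeout.ms does nothing for it.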

The rebalance timeline, phase by phase

A rebalance is not a single event. It is a multi-phase protocol between every consumer in the group and the coordinator broker. Understanding the phases is what separates "I read about it" from "I have operated it." Here is the timeline as it actually runs.

| Phase | Who acts | What happens | Typical duration |
|---|---|---|---|
| 1. Trigger | Coordinator | Detects membership or metadata change, bumps generation ID | < 10 ms |
| 2. Join | All consumers | Each sends JoinGroup; coordinator picks a group leader | 100 ms – 2 s |
| 3. Assign | Group leader | Runs the assignor strategy locally, returns the plan | 10 – 200 ms |
| 4. Sync | All consumers | Receive their new partition list via SyncGroup | 50 – 500 ms |
| 5. Revoke / commit | All consumers | Commit offsets for revoked partitions (eager) or only changed ones (cooperative) | 100 ms – 30 s |
| 6. Fetch | All consumers | Resume poll() against the new assignment | resumes immediately |

The total wall-clock cost lives mostly in phase 5. Under the legacy eager protocol, every consumer in the group revokes every partition and commits offsets before anyone resumes (strictly speaking, an eager client does this just before it sends JoinGroup, but the cost is the same wherever you count it). This is the stop-the-world cost, which can range from a few seconds to a minute on a group of 50 consumers with slow offset commits. Under cooperative-sticky, only the partitions that actually change hands are revoked, and a short follow-up rebalance hands them to their new owners, so the majority of consumers keep processing without pause.

This is also why a single slow consumer can drag the whole group: phase 5 finishes when the last consumer reports back.
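To make the eager versus cooperative difference concrete, the sketch below counts revoked partitions for one scale-up event. It is a toy model in Python, not the assignor code: the consumer ids, partition numbers, and the "new" plan are made-up inputs.

```python
from typing import Dict, Set

Assignment = Dict[str, Set[int]]  # consumer id -> owned partition numbers


def revoked_eager(old: Assignment) -> Assignment:
    """Eager protocol: every consumer gives up everything it owned."""
    return {consumer: set(parts) for consumer, parts in old.items()}


def revoked_cooperative(old: Assignment, new: Assignment) -> Assignment:
    """Cooperative protocol: a consumer gives up only the partitions
    that are missing from its new assignment."""
    return {consumer: parts - new.get(consumer, set())
            for consumer, parts in old.items()}


# Hypothetical group: c3 joins and takes one partition from each peer.
old = {"c1": {0, 1, 2}, "c2": {3, 4, 5}}
new = {"c1": {0, 1}, "c2": {3, 4}, "c3": {2, 5}}

eager = revoked_eager(old)
coop = revoked_cooperative(old, new)

print(sum(len(p) for p in eager.values()))  # 6 partitions paused
print(sum(len(p) for p in coop.values()))   # 2 partitions paused
```

In this example four of the six partitions never stop being consumed under the cooperative protocol.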

Assignor and protocol comparison

Kafka ships several PartitionAssignor implementations, and they trade off uniformity, stability, and rebalance cost differently. The decision table below is what you should be ready to draw on a whiteboard.

| Assignor | Protocol | Stickiness | Balance | When to use |
|---|---|---|---|---|
| RangeAssignor | Eager | None | Uneven on multi-topic groups | Still first in the default list for compatibility; avoid for new work |
| RoundRobinAssignor | Eager | None | Even across partitions | Multi-topic groups before cooperative existed |
| StickyAssignor | Eager | Yes | Even, preserves prior assignment | Bridge to cooperative on older clients |
| CooperativeStickyAssignor | Cooperative | Yes | Even, preserves prior assignment | In the default list since Kafka 3.0; pick this |
| StreamsPartitionAssignor | Cooperative | Yes (task-aware) | Co-locates state stores | Only inside Kafka Streams apps |

The two-axis story is protocol (eager vs cooperative) and stickiness (does the assignor try to preserve the previous plan). Cooperative-sticky wins on both axes for almost every modern workload. The only reason to stay on eager today is a Java client older than 2.4 or a third-party client that has not implemented the cooperative handshake.

Gotcha: You cannot mix eager and cooperative consumers in the same group during a rolling upgrade. The migration path is a two-step deploy — first add CooperativeStickyAssignor alongside the existing eager assignor in the strategy list, deploy, then in a second deploy drop the eager one.
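The reason the two-step deploy works can be sketched with a simplified model of protocol selection: the group can only settle on an assignor that every member lists. The Python below is an approximation (the real coordinator uses a voting scheme), and the short names "range" and "cooperative-sticky" are labels chosen for this example.

```python
from typing import List, Optional


def common_assignor(members: List[List[str]]) -> Optional[str]:
    """Return the first assignor in the first member's preference list
    that every other member also lists, or None if there is none."""
    if not members:
        return None
    for name in members[0]:
        if all(name in strategies for strategies in members):
            return name
    return None


# One-step flip: old and new pods share no assignor during the roll.
direct_flip = [["range"], ["cooperative-sticky"]]

# Step 1 mid-roll: new pods list both, so the group stays on range.
step1_mid = [["range"], ["range", "cooperative-sticky"]]

# Step 2 mid-roll: every pod lists cooperative-sticky, so it can take over.
step2_mid = [["range", "cooperative-sticky"], ["cooperative-sticky"]]

for label, group in [("direct", direct_flip), ("step 1", step1_mid), ("step 2", step2_mid)]:
    print(label, "->", common_assignor(group))
```

At every point of the two-step roll the group has a shared assignor; the one-step flip is the only case where it does not.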

Static membership

Introduced in Kafka 2.3 (KIP-345), static membership lets a consumer keep its identity across restarts by setting group.instance.id. Without it, every restart looks like the old consumer leaving and a fresh one joining: two rebalances back-to-back, even though the cluster is functionally unchanged.

group.instance.id=consumer-payments-pod-3
session.timeout.ms=120000

When a static member disconnects, the coordinator does not immediately remove it from the group. It waits for the full session.timeout.ms window. If the same group.instance.id rejoins before the window closes, the coordinator recognises the instance and hands back its existing partition assignment without starting a rebalance for the rest of the group. The pod resumes consumption from its last committed offsets and the other consumers never notice.

The numbers that make this worth it on Kubernetes: a typical pod restart takes 20 – 90 seconds including image pull and readiness probes. A typical eager rebalance on a 30-consumer group costs 10 – 40 seconds of lag. Skipping that rebalance is straight throughput.

The trade-off is the inverse case: if a consumer dies and is never coming back, the group waits the full session.timeout.ms before redistributing its partitions. Tune that value to the longest restart you actually expect, not to your worst-case outage tolerance. Most teams land at 120 – 300 seconds for k8s workloads.
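The sizing rule above reduces to one comparison. The following Python sketch is a simplified model of that rule, assuming a restart is the only event in play; the function name and parameters are hypothetical.

```python
def restart_triggers_rebalance(restart_seconds: float,
                               session_timeout_ms: int,
                               static_member: bool) -> bool:
    """Model whether a pod restart causes a rebalance.

    Without group.instance.id the restarted pod is a new member, so the
    group rebalances regardless of how fast the restart was. With it, the
    group rebalances only if the pod stays away past session.timeout.ms.
    """
    if not static_member:
        return True
    return restart_seconds * 1000 >= session_timeout_ms


# 45 s restart against the 120 s window from the config above.
print(restart_triggers_rebalance(45, 120_000, static_member=True))
# Slow image pull: 150 s restart overshoots the same window.
print(restart_triggers_rebalance(150, 120_000, static_member=True))
```

The same comparison also shows the trade-off: a pod that never returns holds its partitions for the whole window before the group reacts.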

Tuning that actually moves the needle

Four configs matter, listed here in order of how often they bite in production. Defaults have changed between Kafka versions, so always check the effective values your consumer logs at startup.

max.poll.interval.ms (default 300 000 ms = 5 minutes). This is the deadline between two poll() calls, so it has to cover processing of the whole batch returned by the previous poll, not just one record. If your worst case takes 8 minutes to process (a wide Snowflake batch upsert, a model inference call, a large S3 PUT), bump this to 600 000 ms or higher. Most production incidents I have seen on rebalance loops trace back to this one config being left at default.

session.timeout.ms (default 45 000 ms since 3.0; was 10 000 ms before). This is how long the coordinator waits without receiving a heartbeat before declaring a consumer dead. Pair with heartbeat.interval.ms, which should be no more than one-third of the session timeout. With static membership, this also becomes the restart-tolerance window.

max.poll.records (default 500). The upper bound on records returned per poll(). Lower this when per-record processing is expensive — for example 50 for a consumer that does a downstream HTTP call per record. Keeps the loop responsive so max.poll.interval.ms is not the bottleneck.

partition.assignment.strategy (default [RangeAssignor, CooperativeStickyAssignor] in 3.0+ Java clients). For new applications, set this explicitly to CooperativeStickyAssignor only. The dual-default exists for migration; for greenfield, you do not need the eager fallback.
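Pulling the four knobs together, here is one way to derive a consistent set of overrides. This is a sketch in Python that only builds a dictionary and renders it as properties text; the helper names are hypothetical, the one-third heartbeat ratio is applied as an upper bound, and the values are placeholders to adapt, not recommendations.

```python
from typing import Dict, Union

ConfigValue = Union[str, int]


def consumer_overrides(session_timeout_ms: int = 45_000,
                       max_poll_interval_ms: int = 300_000,
                       max_poll_records: int = 500) -> Dict[str, ConfigValue]:
    """Build consumer overrides keyed by the Java client property names."""
    return {
        "partition.assignment.strategy":
            "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
        "session.timeout.ms": session_timeout_ms,
        # Upper bound from the one-third rule; smaller values are also valid.
        "heartbeat.interval.ms": session_timeout_ms // 3,
        "max.poll.interval.ms": max_poll_interval_ms,
        "max.poll.records": max_poll_records,
    }


def to_properties(config: Dict[str, ConfigValue]) -> str:
    """Render the overrides as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


cfg = consumer_overrides(session_timeout_ms=120_000, max_poll_records=50)
print(to_properties(cfg))
```

Deriving the heartbeat interval from the session timeout keeps the two from drifting apart when one is changed later.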

Common pitfalls

The pitfall that ships to production more than any other is mismatched assignors during a rolling deploy. A team flips one pod to cooperative-sticky, the group enters a degraded state because half the consumers are on a different protocol, and rebalance loops every few seconds. The fix is the two-step deploy described above — list both assignors in the strategy list for one release, then drop the eager one in the next.

A subtler trap is assuming session.timeout.ms protects you from stuck consumers. It does not. Heartbeats are sent from a dedicated background thread that keeps reporting even while the main thread is blocked inside user code. The only knob that catches a frozen poll loop is max.poll.interval.ms. Teams that increase session timeout to "fix" rebalance storms often find the storms continue because the real problem was a handler that blocks past max.poll.interval.ms.

Another costly mistake is doing slow synchronous work in the revocation callback. ConsumerRebalanceListener.onPartitionsRevoked runs before partitions move, so a slow commit or flush there extends phase 5 of the rebalance for the entire group. Commit asynchronously throughout the lifetime of the consumer and use the revocation callback only for one final synchronous commit of work that has already been processed, not for routine commit work.

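Teams also routinely forget that consumer-side processing time counts. If your handler enriches each record with a 200 ms database lookup and max.poll.records is 500, a single batch takes 100 seconds. That fits inside the default 5-minute max.poll.interval.ms, but the margin is only 3x: the timer resets on every poll(), so it is one slow batch, not an accumulation of batches, that evicts the consumer, and a lookup that degrades past 600 ms per record is enough to cross the line. Lower max.poll.records, or move the I/O off the poll thread.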
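The batch arithmetic is easy to check before a deploy. The Python helpers below are a back-of-the-envelope sketch: the names are hypothetical, and the 50% headroom factor is an assumption chosen for illustration, not a Kafka recommendation.

```python
def batch_seconds(max_poll_records: int, per_record_ms: float) -> float:
    """Worst-case time to process one full poll() batch, in seconds."""
    return max_poll_records * per_record_ms / 1000


def max_safe_records(per_record_ms: float,
                     max_poll_interval_ms: int = 300_000,
                     headroom: float = 0.5) -> int:
    """Largest max.poll.records that keeps a full batch inside a fraction
    (headroom) of max.poll.interval.ms."""
    return int(max_poll_interval_ms * headroom // per_record_ms)


print(batch_seconds(500, 200))   # 100.0 seconds for the example above
print(max_safe_records(200))     # 750 records at 200 ms each
print(max_safe_records(600))     # 250 records if lookups slow to 600 ms
```

Sizing against the slow-day latency rather than the average is what keeps a degraded downstream from turning into a rebalance loop.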

The last one is a classic: using a non-sticky assignor on a stateful consumer. If each consumer caches expensive state per partition — a feature lookup table, a windowed aggregate — then a non-sticky rebalance forces every consumer to rebuild its cache. Cooperative-sticky preserves assignments where possible, which means cache warmth survives most rebalances.

If you want to drill Kafka and streaming questions like this every day, NAILDD is launching with a structured Data Engineering track that walks through rebalance, partitioning, and exactly-once across hundreds of interview-style problems.

FAQ

Is cooperative-sticky always better than eager?

For almost every modern Kafka workload, yes. Cooperative-sticky removes the stop-the-world cost of eager by keeping unchanged assignments live during the rebalance. The narrow exception is a group with a very small number of consumers — say two or three — and very low partition counts where the coordination overhead of the cooperative protocol is roughly equal to the cost of a fast eager rebalance. For any group above ten consumers, cooperative-sticky is the answer.

What is the difference between sticky and cooperative-sticky?

StickyAssignor is an eager-protocol assignor that tries to preserve assignments across rebalances but still revokes everything during phase 5. CooperativeStickyAssignor uses the cooperative protocol, so the consumers whose assignments did not change keep processing throughout. Same stickiness goal, very different runtime cost. Use cooperative-sticky for any greenfield work.

How do I know if rebalance is hurting my pipeline?

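Watch two consumer metrics, exposed over JMX under kafka.consumer:type=consumer-coordinator-metrics: rebalance-rate-per-hour and rebalance-latency-avg (Prometheus exporters typically surface them with underscored names such as kafka_consumer_coordinator_rebalance_rate_per_hour and kafka_consumer_coordinator_rebalance_latency_avg). A healthy group should see fewer than one rebalance per hour outside of deploys, and average rebalance latency under two seconds. Anything above five rebalances per hour during steady state is a signal, usually of max.poll.interval.ms exhaustion or a flaky pod.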
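Those thresholds translate directly into an alerting rule. The Python sketch below encodes them as stated; the function name and the three labels are made up for this example, and the "watch" band for values between the thresholds is an assumption.

```python
def rebalance_health(rebalances_per_hour: float, avg_latency_ms: float) -> str:
    """Classify steady-state rebalance behaviour using the thresholds above."""
    if rebalances_per_hour > 5:
        return "investigate"  # likely poll-interval exhaustion or a flaky pod
    if rebalances_per_hour < 1 and avg_latency_ms < 2_000:
        return "healthy"
    return "watch"            # in between: not alarming yet, keep an eye on it


print(rebalance_health(0.2, 800))
print(rebalance_health(6, 800))
```

Evaluate it only on windows that exclude deploys, since a rolling restart will legitimately push the rate up.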

Should I increase session.timeout.ms to reduce rebalances?

Only if you are using static membership and want a larger restart window. Increasing it without static membership delays failure detection — a genuinely dead consumer keeps its partitions for the full timeout, growing lag. The right knob for "consumer is alive but slow" is max.poll.interval.ms, not session timeout.

Can I trigger a rebalance manually?

Yes — calling consumer.enforceRebalance() from the Java client schedules one for the next poll. This is occasionally useful in tests or to force a re-read of metadata. In production, do not script this; let membership changes drive rebalances naturally.

Does static membership work with cooperative-sticky?

Yes, and the combination is the recommended default for Kubernetes deployments. Static membership skips the rebalance during a fast restart; cooperative-sticky minimises the cost of the rebalances that do happen. Together they make rolling deploys feel like no-ops to downstream lag.