How multiple consumers coordinate to process partitions in parallel with fault tolerance and automatic rebalancing
- Appears in ~90% of messaging interviews
- Powers systems at LinkedIn, Uber, Netflix
- Billions of messages processed daily
- Hundreds of parallel workers
TL;DR
Consumer groups enable multiple consumer instances to work together to process partitions from a topic in parallel. Each partition is assigned to exactly one consumer within a group, providing parallel processing while maintaining ordering guarantees. Automatic rebalancing handles failures and scaling.
Visual Overview
CONSUMER GROUP ARCHITECTURE:
Topic: user-events (4 partitions)
┌─────────┬─────────┬─────────┬─────────┐
│ Part 0  │ Part 1  │ Part 2  │ Part 3  │
└─────────┴─────────┴─────────┴─────────┘
     │         │         │         │
     ▼         ▼         ▼         ▼
┌─────────┬─────────┬─────────┬─────────┐
│Consumer │Consumer │Consumer │Consumer │
│    A    │    B    │    C    │    D    │
└─────────┴─────────┴─────────┴─────────┘
Group: "analytics-processors"
KEY GUARANTEES:
├── Each partition assigned to exactly ONE consumer in the group
├── Each consumer can handle multiple partitions
├── Automatic rebalancing on membership changes
└── Fault tolerance through coordinator failover
REBALANCING SCENARIOS:
1. Consumer joins: 4 partitions, 5 consumers (rebalance)
2. Consumer crashes: 4 partitions, 3 consumers (rebalance)
3. Partition added: new partition needs assignment (rebalance)
Core Explanation
What is a Consumer Group?
A consumer group is a logical collection of consumer instances that work together to consume messages from a topic. The group provides:
- Load distribution: Partitions spread across consumers
- Fault tolerance: Failed consumers automatically replaced
- Scaling: Add/remove consumers dynamically
- Coordination: Group coordinator manages partition assignments
Partition Assignment Guarantee
The Golden Rule:
Each partition is assigned to exactly one consumer within a consumer group at any given time.
VALID ASSIGNMENT (4 partitions, 3 consumers):
Consumer A: [Partition 0, Partition 1]
Consumer B: [Partition 2]
Consumer C: [Partition 3]
✅ Each partition assigned exactly once
INVALID ASSIGNMENT:
Consumer A: [Partition 0]
Consumer B: [Partition 0] ❌ Partition 0 assigned twice!
This guarantee ensures:
- No duplicate processing within a group
- Ordering maintained per partition
- Clear ownership of each partition
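The golden rule above can be checked mechanically. Here is a minimal sketch in plain Java (the class and method names are illustrative, not part of any Kafka API): an assignment is valid only if every partition appears in exactly one consumer's list.

```java
import java.util.*;

public class AssignmentCheck {
    // Returns true iff every partition appears in exactly one consumer's list.
    static boolean isValid(Map<String, List<Integer>> assignment, int numPartitions) {
        Set<Integer> seen = new HashSet<>();
        for (List<Integer> partitions : assignment.values()) {
            for (int p : partitions) {
                if (!seen.add(p)) return false; // same partition owned twice
            }
        }
        return seen.size() == numPartitions;   // every partition has an owner
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> valid = Map.of(
                "A", List.of(0, 1), "B", List.of(2), "C", List.of(3));
        Map<String, List<Integer>> invalid = Map.of(
                "A", List.of(0), "B", List.of(0)); // partition 0 owned twice

        System.out.println(isValid(valid, 4));   // true
        System.out.println(isValid(invalid, 2)); // false
    }
}
```

The same check is effectively what the group coordinator enforces when it hands out assignments.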
How Partition Assignment Works
Assignment Strategies:
// 1. RANGE STRATEGY (default)
// Assigns consecutive partitions to consumers
Topic: user-events (6 partitions)
Consumer A: [0, 1]
Consumer B: [2, 3]
Consumer C: [4, 5]
// Pro: Simple, predictable
// Con: Uneven if partition count doesn't divide evenly
// 2. ROUND-ROBIN STRATEGY
// Distributes partitions one-by-one in round-robin
Topic: user-events (6 partitions)
Consumer A: [0, 3]
Consumer B: [1, 4]
Consumer C: [2, 5]
// Pro: Even distribution
// Con: Less predictable, more partition movement on rebalance
// 3. STICKY STRATEGY
// Minimizes partition movement during rebalance
// Keeps existing assignments when possible
// Pro: Reduces rebalancing overhead
// Con: Slightly more complex
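The range and round-robin distributions above can be reproduced in a few lines of plain Java. This is a toy sketch of the assignment math for a single topic, not Kafka's actual RangeAssignor/RoundRobinAssignor classes:

```java
import java.util.*;

public class Assignors {
    // Range: contiguous chunks; the first (partitions % consumers) members
    // get one extra partition when the count doesn't divide evenly.
    static Map<String, List<Integer>> range(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        int base = numPartitions / consumers.size();
        int extra = numPartitions % consumers.size();
        int next = 0;
        for (int i = 0; i < consumers.size(); i++) {
            List<Integer> owned = new ArrayList<>();
            int count = base + (i < extra ? 1 : 0);
            for (int j = 0; j < count; j++) owned.add(next++);
            out.put(consumers.get(i), owned);
        }
        return out;
    }

    // Round-robin: deal partitions out one at a time.
    static Map<String, List<Integer>> roundRobin(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        consumers.forEach(c -> out.put(c, new ArrayList<>()));
        for (int p = 0; p < numPartitions; p++) {
            out.get(consumers.get(p % consumers.size())).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> members = List.of("A", "B", "C");
        System.out.println(range(members, 6));      // {A=[0, 1], B=[2, 3], C=[4, 5]}
        System.out.println(roundRobin(members, 6)); // {A=[0, 3], B=[1, 4], C=[2, 5]}
    }
}
```

The outputs match the 6-partition examples above; the sticky strategy adds statefulness (remembering the previous assignment) on top of this kind of math.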
Configuration:
Properties props = new Properties();
props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-processors");
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
"org.apache.kafka.clients.consumer.RangeAssignor");
Group Coordinator and Rebalancing
Coordinator Selection:
Group ID: "analytics-processors"
        ↓
hash(groupId) % num_partitions(__consumer_offsets)
        ↓
Partition 23 in __consumer_offsets
        ↓
Broker 2 (leader of partition 23)
        ↓
Broker 2 becomes Group Coordinator
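The lookup above is just a stable hash. A sketch in plain Java, assuming the default of 50 partitions for __consumer_offsets (the bit-mask is my way of keeping the hash non-negative; it mirrors, but is not, Kafka's internal implementation):

```java
public class CoordinatorLookup {
    // Map a group id to a partition of __consumer_offsets. The broker that
    // leads this partition acts as the group's coordinator.
    static int offsetsPartitionFor(String groupId, int numOffsetsPartitions) {
        return (groupId.hashCode() & 0x7fffffff) % numOffsetsPartitions;
    }

    public static void main(String[] args) {
        int partition = offsetsPartitionFor("analytics-processors", 50);
        System.out.println("Group metadata lives in __consumer_offsets partition " + partition);
    }
}
```

Because the mapping is deterministic, every broker and every consumer can independently compute which broker to ask for coordination.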
Rebalancing Protocol (Simplified):
REBALANCING FLOW:
1. TRIGGER EVENT
   ├── Consumer joins group
   ├── Consumer leaves/crashes
   ├── Consumer heartbeat timeout
   └── Partition count changes
2. COORDINATOR INITIATES REBALANCE
   ├── Sends REBALANCE_IN_PROGRESS to all consumers
   └── Consumers stop processing, commit offsets
3. JOIN GROUP PHASE
   ├── All consumers re-join the group
   ├── Send their supported partition assignment strategies
   └── Coordinator collects member info
4. ASSIGNMENT PHASE
   ├── Coordinator runs the assignment strategy
   ├── Calculates new partition assignments
   └── Sends assignments to consumers
5. RESUME PROCESSING
   └── Consumers start consuming from their new assignments
TOTAL REBALANCE TIME: ~500ms to several seconds
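To see why the sticky strategy reduces this overhead, here is a toy reassignment in plain Java. It is an illustration of the idea, not Kafka's StickyAssignor: surviving consumers keep what they already own, and only orphaned partitions move.

```java
import java.util.*;

public class StickyRebalance {
    // Keep each surviving consumer's partitions, then hand any orphaned
    // partitions (owner left, or newly created) to the least-loaded member.
    static Map<String, List<Integer>> rebalance(Map<String, List<Integer>> previous,
                                                List<String> liveMembers,
                                                int numPartitions) {
        Map<String, List<Integer>> next = new LinkedHashMap<>();
        Set<Integer> owned = new HashSet<>();
        for (String m : liveMembers) {
            List<Integer> keep = new ArrayList<>(previous.getOrDefault(m, List.of()));
            owned.addAll(keep);
            next.put(m, keep);
        }
        for (int p = 0; p < numPartitions; p++) {
            if (!owned.contains(p)) {
                String least = liveMembers.get(0);
                for (String m : liveMembers)
                    if (next.get(m).size() < next.get(least).size()) least = m;
                next.get(least).add(p);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> before = Map.of(
                "A", List.of(0, 1), "B", List.of(2), "C", List.of(3));
        // Consumer C crashes: only partition 3 moves; A and B keep their work.
        System.out.println(rebalance(before, List.of("A", "B"), 4)); // {A=[0, 1], B=[2, 3]}
    }
}
```

With range or round-robin, the same crash could reshuffle every partition; here only partition 3 changes hands, so only one consumer has to rebuild state.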
Scaling Patterns
Under-Subscribed (Fewer Consumers than Partitions):
4 Partitions, 2 Consumers:
Consumer A: [P0, P1]
Consumer B: [P2, P3]
Throughput: 2x (2 parallel consumers)
Utilization: 100% (all consumers busy)
Fully-Subscribed (Equal Consumers and Partitions):
4 Partitions, 4 Consumers:
Consumer A: [P0]
Consumer B: [P1]
Consumer C: [P2]
Consumer D: [P3]
Throughput: 4x (4 parallel consumers)
Utilization: 100% (optimal)
Over-Subscribed (More Consumers than Partitions):
4 Partitions, 6 Consumers:
Consumer A: [P0]
Consumer B: [P1]
Consumer C: [P2]
Consumer D: [P3]
Consumer E: []  ⚠️ IDLE
Consumer F: []  ⚠️ IDLE
Throughput: 4x (limited by partitions)
Utilization: 67% (2 consumers wasted)
❌ Cannot scale beyond partition count!
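The scaling arithmetic in the three scenarios above reduces to a min() and a ratio; a small sketch:

```java
public class ScalingMath {
    // The number of active consumers can never exceed the partition count.
    static int activeConsumers(int partitions, int consumers) {
        return Math.min(partitions, consumers);
    }

    // Fraction of deployed consumers doing useful work.
    static double utilization(int partitions, int consumers) {
        return (double) activeConsumers(partitions, consumers) / consumers;
    }

    public static void main(String[] args) {
        System.out.printf("4 partitions, 2 consumers: utilization %.0f%%%n",
                utilization(4, 2) * 100);  // 100%
        System.out.printf("4 partitions, 6 consumers: utilization %.0f%%%n",
                utilization(4, 6) * 100);  // 67%
    }
}
```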
Multiple Consumer Groups
Independent Processing:
Topic: user-events (4 partitions)
    │
    ├──▶ Group: "analytics" (processes all events)
    │      Consumer A: [P0, P1]
    │      Consumer B: [P2, P3]
    │
    └──▶ Group: "fraud-detection" (also processes all events)
           Consumer X: [P0, P1, P2, P3]
Each group independently consumes ALL messages.
Groups do NOT affect each other.
Use Case - Multiple Processing Pipelines:
Topic: "user-actions"
Group 1: "real-time-analytics"
β Processes events for live dashboards
Group 2: "ml-feature-pipeline"
β Extracts features for ML models
Group 3: "audit-logger"
β Archives events for compliance
All three groups consume the SAME messages independently.
Tradeoffs
Advantages:
- ✅ Horizontal scalability (add more consumers)
- ✅ Automatic fault tolerance (consumer failures handled)
- ✅ Load balancing across consumers
- ✅ Multiple independent processing pipelines (multiple groups)
Disadvantages:
- ❌ Rebalancing causes a processing pause (stop-the-world)
- ❌ Cannot scale beyond partition count
- ❌ Partition assignment may be uneven
- ❌ Rebalancing overhead on frequent consumer changes
Real Systems Using This
Kafka (Apache)
- Implementation: Group coordinator determined by leadership of the group's __consumer_offsets partition
- Scale: Thousands of consumer groups processing trillions of messages
- Typical Setup: 10-50 consumers per group for high-throughput topics
Amazon Kinesis
- Implementation: Kinesis Client Library (KCL) provides similar consumer group semantics
- Scale: Auto-scaling consumer groups based on shard count
- Typical Setup: 1 worker per shard, auto-scaling with shard splits/merges
Apache Pulsar
- Implementation: Shared subscription model (similar to consumer groups)
- Scale: Automatic load rebalancing without stop-the-world pauses
- Typical Setup: Dynamic consumer scaling with minimal disruption
When to Use Consumer Groups
✅ Perfect Use Cases
High-Throughput Event Processing
Scenario: Processing 1M events/sec from user activity stream
Solution: Consumer group with 100 consumers (10K events/sec each)
Result: Linear scaling, automatic fault tolerance
Parallel Data Pipeline
Scenario: Real-time ETL from Kafka to data warehouse
Solution: Consumer group with partitions = number of available cores
Result: Maximize parallelism while maintaining ordering per partition
Multiple Processing Pipelines
Scenario: Same events need processing by analytics, ML, and audit systems
Solution: Three separate consumer groups on same topic
Result: Independent processing without interfering with each other
❌ When NOT to Use
Need Broadcast to All Consumers
Problem: Every consumer must receive ALL messages
Issue: Consumer groups distribute messages (each gets subset)
Alternative: Use separate consumer groups or pub-sub pattern
Very Low Latency Requirements
Problem: Sub-millisecond latency critical
Issue: Rebalancing causes temporary processing pause
Alternative: Single consumer or fixed partition assignment
More Consumers than Partitions Long-Term
Problem: Want to run 100 consumers with only 10 partitions
Issue: 90 consumers will be idle, wasting resources
Alternative: Increase partition count or reduce consumers
Interview Application
Common Interview Question 1
Q: "You have a topic with 10 partitions. If you deploy 15 consumers in the same consumer group, what happens?"
Strong Answer:
"Only 10 consumers will be active, one per partition. The remaining 5 consumers will sit idle, since each partition can only be assigned to one consumer in a group. This is inefficient. To utilize all 15 consumers, I'd either increase the partition count to 15+ or split the workload across multiple topics. If further scaling is anticipated, I'd over-provision partitions upfront: partitions can be added to a topic later but never removed, and adding them changes the key-to-partition mapping for keyed messages."
Why this is good:
- Shows understanding of partition assignment constraint
- Identifies the inefficiency
- Provides multiple solutions
- Considers future scaling
Common Interview Question 2
Q: "What happens during a consumer group rebalance? How does it affect processing?"
Strong Answer:
"Rebalancing occurs when consumers join, leave, or crash. The process:
- Coordinator detects the change (heartbeat timeout or explicit notification)
- Sends REBALANCE_IN_PROGRESS to all group members
- Consumers stop processing and commit their offsets
- All consumers re-join the group
- Coordinator calculates new partition assignments using the configured strategy
- Consumers receive new assignments and resume processing
Impact: Processing pauses for ~500ms to several seconds. In production, we minimize rebalances by:
- Using static membership (Kafka 2.3+) to avoid rebalances on restarts
- Tuning session.timeout.ms and heartbeat.interval.ms
- Using sticky assignor to minimize partition movement
- Graceful shutdowns with proper leave-group notifications"
Why this is good:
- Detailed step-by-step understanding
- Quantifies the impact
- Shows production awareness
- Provides optimization strategies
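The tuning knobs named in that answer can be collected into a config fragment. The property keys are real Kafka consumer configs; the values and the instance id below are illustrative, not recommendations:

```java
import java.util.Properties;

public class RebalanceTuning {
    static Properties tunedConsumerProps() {
        Properties props = new Properties();
        // Static membership (Kafka 2.3+): a stable instance id lets a
        // restarted consumer rejoin without triggering a full rebalance.
        props.put("group.instance.id", "analytics-worker-1");
        // Give members more slack before the coordinator declares them dead.
        props.put("session.timeout.ms", "45000");
        props.put("heartbeat.interval.ms", "15000");
        // Cooperative sticky assignor (Kafka 2.4+) moves only the partitions
        // that must move, avoiding a stop-the-world revocation.
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedConsumerProps());
    }
}
```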
Red Flags to Avoid
- ❌ Confusing consumer groups with partition replicas
- ❌ Claiming the same partition can be assigned to multiple consumers in one group
- ❌ Not knowing about rebalancing and its impact
- ❌ Forgetting that consumers beyond the partition count sit idle
Quick Self-Check
Before moving on, can you:
- Explain consumer groups in 60 seconds?
- Draw a diagram showing partition-to-consumer assignment?
- Explain what triggers a rebalance?
- Calculate optimal consumer count given partition count?
- Identify when to use multiple consumer groups?
- Explain the partition assignment guarantee?
Related Content
Prerequisites
- Topic Partitioning - Understanding partitions is essential for consumer groups
Related Concepts
- Offset Management - How consumers track their position
- Load Balancing - Distribution strategies
- Sharding - Similar concept for databases
Used In Systems
- Real-Time Analytics Pipeline - Consumer groups for parallel processing
- Event-Driven Microservices - Multiple consumer groups per service
Explained In Detail
- Kafka Architecture - Consumer Groups & Rebalancing section (30 minutes)
- Deep dive into rebalancing protocols, partition assignment strategies, and coordinator mechanics
Next Recommended: Offset Management - Learn how consumers track their position in partitions