Mechanisms by which message producers receive confirmation that their messages were successfully persisted, enabling reliability tradeoffs between latency and durability.
Producer acknowledgments (acks) control when a Kafka producer considers a message successfully written. The options are acks=0 (no confirmation), acks=1 (the leader confirms), and acks=all (all in-sync replicas confirm), trading latency for durability guarantees. The setting is critical for balancing performance against data safety in message brokers.
Visual Overview
Producer Acknowledgments Overview
ACKS = 0 (Fire and Forget)
  Producer → Message → Kafka Leader
      ↓ (don't wait)
  Immediate return ✓
  Latency: <1ms

  Risk: Message may be lost if:
    - Network failure before reaching leader
    - Leader crashes before writing to disk
    - Leader crashes before replication

  Use case: Metrics, logs (lossy OK)

ACKS = 1 (Leader Acknowledgment)
  Producer → Message → Kafka Leader
      ↓
  Write to log ✓
      ↓
  Send ACK → Producer
  Latency: 5-10ms

  Meanwhile (async):
    Leader → Replicate → Follower 1
    Leader → Replicate → Follower 2

  Risk: Message lost if leader crashes
        before replication completes

  Use case: Most production workloads (producer default before Kafka 3.0)

ACKS = ALL (Full Quorum)
  Producer → Message → Kafka Leader
      ↓
  Write to log
      ↓
  Replicate to all ISR replicas
      ↓
  Follower 1: Written ✓
  Follower 2: Written ✓
      ↓
  Send ACK → Producer
  Latency: 10-50ms (network + replication)

  Risk: Message lost only if all ISR replicas fail simultaneously

  Use case: Financial transactions, orders

TIMELINE COMPARISON:
  acks=0:
    T0: Send
    T1: Return (1ms) ✓

  acks=1:
    T0: Send
    T5: Leader writes
    T10: Return (10ms) ✓

  acks=all:
    T0: Send
    T5: Leader writes
    T15: Follower 1 writes
    T20: Follower 2 writes
    T25: Return (25ms) ✓
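The latency differences in the timeline can be approximated with a small model. This is a rough sketch, not a measurement: the function name `ackLatencyMs` and the timings are hypothetical, network round trips are ignored, and followers are assumed to replicate in parallel so that the slowest in-sync follower determines the acks=all latency.

```javascript
// Rough latency model for one produce request (all times in milliseconds).
// Assumption: followers replicate in parallel, so only the slowest one matters.
function ackLatencyMs(acks, leaderWriteMs, followerReplicationMs) {
  if (acks === 0) {
    return 0; // the producer does not wait for any broker response
  }
  if (acks === 1) {
    return leaderWriteMs; // wait for the leader's local append only
  }
  if (acks === "all" || acks === -1) {
    // wait for the leader append plus the slowest in-sync follower
    const slowestFollower = Math.max(0, ...followerReplicationMs);
    return leaderWriteMs + slowestFollower;
  }
  throw new Error(`Unsupported acks value: ${acks}`);
}

const leaderWriteMs = 5;
const followerTimes = [10, 15]; // hypothetical per-follower replication times

console.log(ackLatencyMs(0, leaderWriteMs, followerTimes));     // 0
console.log(ackLatencyMs(1, leaderWriteMs, followerTimes));     // 5
console.log(ackLatencyMs("all", leaderWriteMs, followerTimes)); // 20
```

Under this model, adding more in-sync followers raises acks=all latency only if a new follower is slower than the current slowest one.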
Core Explanation
What are Producer Acknowledgments?
Producer acknowledgments (acks) control when a Kafka producer considers a write operation successful. This determines:
When producer receives confirmation that message is safe
How many replicas must persist the message
Trade-off between latency and durability
Three Levels:
Three Acknowledgment Levels
acks=0: No acknowledgment (fire-and-forget)
acks=1: Leader acknowledgment (the Java producer default before Kafka 3.0; newer clients default to acks=all)
acks=all: Full ISR acknowledgment (safest)
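The three levels can be summarized as a small lookup. The sketch below is illustrative only: `normalizeAcks` and `describeAcks` are hypothetical helper names, not part of any Kafka client. It relies on the convention that clients accept either "all" or -1 for the strongest level.

```javascript
// Normalize the different spellings of an acks setting to 0, 1, or -1.
function normalizeAcks(value) {
  const text = String(value).toLowerCase();
  if (text === "0" || text === "1") return Number(text);
  if (text === "-1" || text === "all") return -1; // -1 and "all" are equivalent
  throw new RangeError(`Invalid acks value: ${value}`);
}

// Describe what must have happened before the producer treats a send as successful.
function describeAcks(value) {
  switch (normalizeAcks(value)) {
    case 0:
      return "nothing: the producer does not wait for a broker response";
    case 1:
      return "the leader has appended the record to its log";
    default:
      return "every in-sync replica has the record";
  }
}

for (const setting of [0, 1, "all"]) {
  console.log(`acks=${setting}: ${describeAcks(setting)}`);
}
```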
acks=0: No Acknowledgment
Behavior:
acks=0 Behavior
Producer sends message, immediately considers it sent
Leader receives message (maybe)
No confirmation sent back
Result:
• Highest throughput (no waiting)
• Lowest latency (<1ms)
• Zero durability guarantee
When Message Can Be Lost:
acks=0 Message Loss Scenarios
1. Network failure before reaching broker
Producer → [Network drops packet] → Leader (never arrives)
2. Leader crash before writing to disk
Producer → Leader (in memory) → [Crash] ✗
3. Leader crash before replication
Producer → Leader (written) → [Crash before replicating] ✗
Probability of loss: Relatively high (1-5%)
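The key property of these scenarios is that the producer never finds out about the loss. The toy simulation below illustrates this with an in-memory stand-in for the broker. All names are hypothetical, and the deterministic 2% drop rate is an arbitrary choice for the example, not a measured figure.

```javascript
// Deterministic toy network: silently drops every `dropEvery`-th message.
function createLossyBroker(dropEvery) {
  const log = [];
  let seen = 0;
  return {
    log,
    receive(message) {
      seen += 1;
      if (seen % dropEvery === 0) return; // message lost in transit
      log.push(message);
    },
  };
}

// acks=0: hand the message to the network and report success unconditionally.
function sendFireAndForget(broker, message) {
  broker.receive(message);
  return { acknowledged: false, assumedSent: true };
}

const broker = createLossyBroker(50); // hypothetical 2% loss
let assumedSent = 0;
for (let i = 0; i < 1000; i += 1) {
  if (sendFireAndForget(broker, `event-${i}`).assumedSent) assumedSent += 1;
}

console.log(assumedSent);       // 1000 (what the producer believes)
console.log(broker.log.length); // 980  (what the broker actually stored)
```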
Configuration:
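const { CompressionTypes } = require("kafkajs"); // assumes the KafkaJS client
const producer = kafka.producer();
// In KafkaJS, acks and compression are options of send(), not of producer()
await producer.send({
  topic: "metrics", messages: [{ value: "cpu=0.42" }], // placeholder topic and payload
  acks: 0, // No acknowledgment
  compression: CompressionTypes.GZIP, // Often used with acks=0 for max throughput
});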
Use Cases:
acks=0 Use Cases
✓ Log aggregation (OK to lose some logs)
✓ Metrics collection (OK to lose some data points)
✓ IoT sensor data (high volume, redundancy)
✓ Clickstream tracking (lossy acceptable)
✗ Financial transactions
✗ User-facing data (messages, posts)
✗ Critical business events
acks=1: Leader Acknowledgment
Behavior:
acks=1 Behavior
Producer sends message
Leader writes to local log (appended on the leader; not necessarily flushed to disk yet)
Leader sends ACK to producer
Producer considers message sent ✓
Meanwhile (asynchronous):
Leader replicates to followers (background)
Result:
• Good throughput
• Moderate latency (5-10ms)
• Durability: Survives producer/network failure
• Risk: Lost if leader fails before replication
When Message Can Be Lost:
acks=1 Message Loss Scenario
Scenario: Leader fails before replication
T0: Producer → Leader (message written to leader)
T1: Leader → ACK → Producer ✓
T2: Producer moves on
T3: Leader crashes ⚡ (before replicating)
T4: Follower promoted to new leader
T5: Message is GONE ✗ (was only on failed leader)
Probability: Low (1-2% during failures)
Window of vulnerability: ~500ms (replication lag)
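The failure timeline above can be reproduced with a toy in-memory model. This is a simplified sketch with hypothetical names (`produceAcks1`, `replicate`, `failOver`); real replication is continuous and follower-driven, whereas here it is triggered explicitly to make the ordering visible.

```javascript
// Toy partition: one leader log and one follower log.
function createPartition() {
  return { leader: [], follower: [] };
}

// acks=1: acknowledge as soon as the leader has appended the record.
function produceAcks1(partition, record) {
  partition.leader.push(record);
  return "ACK";
}

// Background replication: the follower copies the leader's current log.
function replicate(partition) {
  partition.follower = partition.leader.slice();
}

// Leader crash: the follower becomes leader with only what it had copied.
function failOver(partition) {
  partition.leader = partition.follower.slice();
  partition.follower = [];
}

const partition = createPartition();
produceAcks1(partition, "order-1");
replicate(partition); // order-1 reaches the follower
const ack = produceAcks1(partition, "order-2"); // acknowledged, not yet replicated
failOver(partition); // leader crashes before the next replication round

console.log(ack);              // "ACK"
console.log(partition.leader); // [ 'order-1' ] (order-2 was acknowledged but is gone)
```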
Use Cases:
✓ Most production workloads (a common choice, and the producer default before Kafka 3.0)
✓ High-throughput messaging
✓ Real-time analytics
✓ Event streaming
Balance between performance and safety
acks=all: Full ISR Acknowledgment
Behavior:
acks=all Behavior
Producer sends message
Leader writes to local log
Leader waits for ALL in-sync replicas (ISR) to acknowledge
All ISR replicas write to their logs
Leader sends ACK to producer
Producer considers message sent ✓
Result:
• Lower throughput
• Higher latency (10-50ms)
• Maximum durability
• Message replicated before acknowledgment
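A minimal in-memory sketch of this behavior follows. The names are hypothetical and the follower writes are modeled as synchronous array pushes; the point is only that the acknowledgment is issued after every in-sync follower holds the record, so a leader failure afterwards does not lose it.

```javascript
// Toy partition with a leader log and one log per in-sync follower.
function createReplicatedPartition(followerCount) {
  return {
    leader: [],
    followers: Array.from({ length: followerCount }, () => []),
  };
}

// acks=all: acknowledge only after every in-sync follower has the record.
function produceAcksAll(partition, record) {
  partition.leader.push(record);
  for (const followerLog of partition.followers) {
    followerLog.push(record); // stands in for the follower fetching the record
  }
  return "ACK";
}

// Leader crash: promote one of the in-sync followers.
function promoteFollower(partition, followerIndex) {
  const [promoted] = partition.followers.splice(followerIndex, 1);
  partition.leader = promoted;
}

const replicated = createReplicatedPartition(2);
const result = produceAcksAll(replicated, "order-1");
promoteFollower(replicated, 0); // leader fails right after acknowledging

console.log(result);            // "ACK"
console.log(replicated.leader); // [ 'order-1' ] (the record survived the failover)
```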
In-Sync Replicas (ISR):
In-Sync Replicas (ISR)
ISR = Set of replicas that are "caught up" with leader
Example:
• Leader: Broker 1
• Followers: Broker 2 (in sync), Broker 3 (lagging)
• ISR = {Broker 1, Broker 2}
acks=all waits for: Broker 1 + Broker 2
If Broker 2 falls behind (network issue):
ISR = {Broker 1} (just leader)
acks=all waits for: Broker 1 only (no followers!)
This is why min.insync.replicas is critical!
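The ISR example can be expressed as a filter over replica lag. This is a simplification under stated assumptions: `computeIsr`, `lagMs`, and `maxLagMs` are hypothetical names, and the single time-based threshold only loosely mirrors how a broker decides that a follower has fallen out of sync.

```javascript
// Replicas whose lag behind the leader is within maxLagMs count as in sync.
// The leader is always a member of its own ISR.
function computeIsr(replicas, maxLagMs) {
  return replicas
    .filter((replica) => replica.isLeader || replica.lagMs <= maxLagMs)
    .map((replica) => replica.broker);
}

const maxLagMs = 30000; // hypothetical threshold
const replicas = [
  { broker: 1, isLeader: true, lagMs: 0 },
  { broker: 2, isLeader: false, lagMs: 200 },
  { broker: 3, isLeader: false, lagMs: 45000 }, // lagging
];

console.log(computeIsr(replicas, maxLagMs)); // [ 1, 2 ]

// Broker 2 falls behind (network issue): only the leader remains in the ISR.
replicas[1].lagMs = 60000;
console.log(computeIsr(replicas, maxLagMs)); // [ 1 ]
```

With the ISR reduced to the leader alone, acks=all offers no more protection than acks=1 unless a minimum ISR size is enforced.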
min.insync.replicas:
min.insync.replicas Configuration
Configuration: Minimum ISR size required for writes
min.insync.replicas=2 (recommended for acks=all)
• Requires at least 2 replicas in ISR
• If ISR shrinks to 1, producer gets error
• Prevents data loss when only leader is alive
Example with 3 replicas:
  Normal: ISR = {Leader, Follower1, Follower2}
    acks=all waits for Leader + Follower1 + Follower2
    (every replica currently in the ISR, not just min.insync.replicas of them)

  Follower1 fails: ISR = {Leader, Follower2}
    acks=all waits for Leader + Follower2 ✓

  Follower2 also fails: ISR = {Leader}
    acks=all REJECTS writes ✗
    (ISR size 1 < min.insync.replicas 2)

Protection: Cannot lose data if leader fails,
because message is on at least 2 replicas
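The admission rule can be sketched as a pure function. The helper names are hypothetical; the rule itself (the minimum ISR size is enforced only for acks=all writes, and a too-small ISR yields a NOT_ENOUGH_REPLICAS error) follows Kafka's documented behavior.

```javascript
// Decide whether a leader accepts a write, given the current ISR size.
function checkWrite({ acks, isrSize, minInsyncReplicas }) {
  const requiresFullIsr = acks === -1 || acks === "all";
  // min.insync.replicas is only enforced for acks=all requests.
  if (requiresFullIsr && isrSize < minInsyncReplicas) {
    return { accepted: false, error: "NOT_ENOUGH_REPLICAS" };
  }
  return { accepted: true, error: null };
}

// Number of replicas that can be down while acks=all writes still succeed.
function tolerableReplicaFailures(replicationFactor, minInsyncReplicas) {
  return replicationFactor - minInsyncReplicas;
}

console.log(checkWrite({ acks: "all", isrSize: 3, minInsyncReplicas: 2 }).accepted); // true
console.log(checkWrite({ acks: "all", isrSize: 2, minInsyncReplicas: 2 }).accepted); // true
console.log(checkWrite({ acks: "all", isrSize: 1, minInsyncReplicas: 2 }).error);    // NOT_ENOUGH_REPLICAS
console.log(checkWrite({ acks: 1, isrSize: 1, minInsyncReplicas: 2 }).accepted);     // true
console.log(tolerableReplicaFailures(3, 2));                                         // 1
```

The fourth call shows why min.insync.replicas only helps in combination with acks=all: an acks=1 producer is still accepted when the leader is the only in-sync replica.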
Configuration:
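// Assuming KafkaJS: retry is a producer() option; acks and timeout are passed to send()
const producer = kafka.producer({
  retry: {
    retries: 5,
  },
});
await producer.send({
  topic: "orders", // placeholder topic name
  messages: [{ value: "order-created" }], // placeholder payload
  acks: -1, // -1 means "all" (acks=all)
  timeout: 30000, // ms to await a response
});

// Topic configuration (broker/topic settings, not JavaScript)
// min.insync.replicas=2   At least 2 in-sync replicas must have the write
// replication.factor=3    Total of 3 replicas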
Use Cases:
acks=all Use Cases
✓ Financial transactions
✓ E-commerce orders
✓ User-generated content (posts, messages)
✓ Critical business events
✓ Regulatory/compliance data
Anywhere data loss is unacceptable
Real Systems Using Producer Acks
System             | Default acks | Typical Config                  | Rationale
-------------------|--------------|---------------------------------|------------------------------------
Kafka Streams      | acks=all     | acks=all, min.insync.replicas=2 | State stores require durability
Netflix (Keystone) | acks=1       | acks=1, replication=3           | High throughput, tolerate rare loss
LinkedIn           | acks=all     | acks=all, min.insync.replicas=2 | Business-critical events
Uber               | acks=1       | acks=1 (logs), acks=all (trips) | Mixed based on data criticality
Confluent Cloud    | acks=all     | acks=all, min.insync.replicas=2 | Default for safety
Case Study: Kafka at LinkedIn
LinkedIn Kafka Acknowledgment Strategy
LinkedIn's Kafka usage (origin of Kafka):
• 100+ billion messages/day
• 1000s of topics
• Multi-datacenter deployment
Acknowledgment Strategy:
  Critical Data (jobs, connections):
    • acks=all
    • min.insync.replicas=2
    • replication.factor=3
    → Latency: 20-30ms
    → Zero data loss

  Metrics/Logs (high volume):
    • acks=1
    • replication.factor=2
    → Latency: 5-10ms
    → Acceptable loss rate: <0.1%

  Analytics Events (ultra-high volume):
    • acks=0
    • compression=gzip
    → Latency: 1-2ms
    → Loss rate: 1-2% (acceptable)

Lesson: Different acks for different data criticality
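One way to apply this lesson in application code is a table of per-class producer profiles. The sketch below reuses the settings listed above; the class names (`critical`, `metrics`, `analytics`) and the helper `profileFor` are hypothetical, and the structure is an illustration rather than a description of LinkedIn's actual tooling.

```javascript
// Producer and topic settings per data class, taken from the strategy above.
const PRODUCER_PROFILES = Object.freeze({
  critical: Object.freeze({ acks: -1, minInsyncReplicas: 2, replicationFactor: 3 }),
  metrics: Object.freeze({ acks: 1, replicationFactor: 2 }),
  analytics: Object.freeze({ acks: 0, compression: "gzip" }),
});

// Look up the profile for a data class, failing loudly on unknown classes
// so that new data never silently gets a weak durability setting.
function profileFor(dataClass) {
  const profile = PRODUCER_PROFILES[dataClass];
  if (!profile) {
    throw new Error(`No producer profile defined for data class "${dataClass}"`);
  }
  return profile;
}

console.log(profileFor("critical").acks);  // -1
console.log(profileFor("metrics").acks);   // 1
console.log(profileFor("analytics").acks); // 0
```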
When to Use Each Ack Level
acks=0: Fire and Forget
Use When:
acks=0 When to Use
✓ High throughput required (100k+ msg/sec)
✓ Data loss is acceptable (logs, metrics)
✓ Data has natural redundancy (sensor arrays)
✓ Ultra-low latency required (<1ms)
Example: IoT sensor network
• 1000 sensors sending data every second
• If 1% of readings lost, still have 99%
• Aggregate statistics still accurate
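The claim that aggregates stay accurate can be checked with a small worked example. The readings below are synthetic (values cycling from 20 to 29) and the loss pattern is deterministic, so the numbers only illustrate the arithmetic and say nothing about real sensor data.

```javascript
// 1000 synthetic sensor readings cycling through 20..29.
const readings = Array.from({ length: 1000 }, (_, i) => 20 + (i % 10));

// Simulate 1% loss by dropping every 100th reading.
const delivered = readings.filter((_, i) => i % 100 !== 0);

function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

console.log(delivered.length);           // 990
console.log(mean(readings).toFixed(2));  // "24.50"
console.log(mean(delivered).toFixed(2)); // "24.55"
```

Losing 1% of the readings shifts the mean by about 0.05 here. The shift would be larger if losses were correlated with the values being measured, for example if the sensors reporting extreme readings were also the ones dropping messages.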
acks=1: Leader Only
Use When:
acks=1 When to Use
✓ Good balance of performance and safety
✓ Occasional loss acceptable during failures
✓ High throughput with moderate durability
✓ Common general-purpose choice (the producer default before Kafka 3.0)
Example: User activity tracking
• Click events, page views, etc.
• Occasional loss during broker failure OK
• Still maintain 99%+ delivery
acks=all: Full Replication
Use When:
acks=all When to Use
✓ Zero data loss required
✓ Regulatory/compliance requirements
✓ Financial or critical business data
✓ Can tolerate higher latency (10-50ms)
Example: E-commerce order placement
• User places order (creates Kafka event)
• Order must not be lost
• OK to wait 20-30ms for full replication
• Worth latency cost for safety
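For the order-placement example, the producer side also has to react when no acknowledgment arrives. The sketch below assumes a hypothetical synchronous `send` function that returns true once the broker acknowledges the write; a real client is asynchronous and handles retries internally, so this only illustrates the control flow of confirming an order after a successful ack.

```javascript
// Retry a send until it is acknowledged or the retry budget is exhausted.
function sendWithRetries(send, record, maxRetries) {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    if (send(record)) {
      return { acknowledged: true, attempts: attempt + 1 };
    }
  }
  return { acknowledged: false, attempts: maxRetries + 1 };
}

// Confirm the order to the user only when the event was acknowledged.
function placeOrder(send, order) {
  const result = sendWithRetries(send, order, 5);
  return result.acknowledged ? "confirmed" : "rejected";
}

// Fake broker that fails a fixed number of times, e.g. while the ISR is too small.
function createFlakySend(failuresBeforeSuccess) {
  let calls = 0;
  return () => {
    calls += 1;
    return calls > failuresBeforeSuccess;
  };
}

console.log(sendWithRetries(createFlakySend(2), { id: 1 }, 5).attempts); // 3
console.log(placeOrder(createFlakySend(2), { id: 1 }));                  // "confirmed"
console.log(placeOrder(createFlakySend(10), { id: 2 }));                 // "rejected"
```

Retrying after a lost acknowledgment can write the same record twice, so a consumer of such a topic should either tolerate duplicates or the producer should enable idempotence.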