
Leader-Follower Replication

How distributed systems achieve fault tolerance and high availability by replicating data from a leader node to multiple follower nodes

TL;DR

Leader-follower replication is a pattern where one node (the leader) handles all writes and replicates data to multiple follower nodes that serve reads and provide fault tolerance. If the leader fails, a follower is promoted to become the new leader. This pattern achieves high availability, fault tolerance, and read scalability.

Visual Overview

[Diagram: Leader-Follower Architecture]

Core Explanation

What is Leader-Follower Replication?

Leader-follower replication (also called master-slave or primary-secondary) is a replication pattern where:

  • One Leader: Handles all writes and maintains the authoritative copy
  • Multiple Followers: Replicate the leader’s data and can serve reads
  • Automatic Failover: A follower is promoted to leader if the leader fails
  • Consistency: The leader ensures all replicas converge to the same state
[Diagram: Single Node vs Leader-Follower]
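The division of responsibilities above can be sketched in a few lines of Python. This is a toy model with hypothetical names, not any real system's API: the leader takes every write, pushes it to each follower, and followers serve reads from their replicated copy.

```python
# Minimal leader-follower sketch (illustrative only; all names are hypothetical).

class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []              # replicated copy of the leader's log

    def replicate(self, record):
        self.log.append(record)

    def read(self):
        return list(self.log)      # followers serve (possibly stale) reads

class Leader:
    def __init__(self, followers):
        self.log = []              # authoritative copy
        self.followers = followers

    def write(self, record):
        self.log.append(record)
        for f in self.followers:   # push each write to every follower
            f.replicate(record)

followers = [Follower("f1"), Follower("f2")]
leader = Leader(followers)
leader.write({"key": "a", "value": 1})

# Every replica converges to the same state
assert followers[0].read() == followers[1].read() == leader.log
```

A real implementation would ship log records over the network and handle retries and ordering, but the shape — one writer, N copies — is the same.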

Replication Modes: Synchronous vs Asynchronous

Synchronous Replication (Strong Consistency):

[Diagram: Synchronous Replication]

Asynchronous Replication (Eventual Consistency):

[Diagram: Asynchronous Replication]

Hybrid: In-Sync Replicas (ISR) - Kafka’s Approach:

[Diagram: In-Sync Replicas (ISR)]
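The ISR hybrid can be approximated with a toy quorum model (hypothetical names, loosely modeled on Kafka): the leader replicates to every follower but ACKs the client once `min_insync` replicas, counting itself, have the record. Setting `min_insync` to the full replica count gives fully synchronous behaviour; setting it to 1 approximates asynchronous replication.

```python
# ISR-style acknowledgment sketch (illustrative; names are hypothetical).

class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def append(self, record):
        if self.healthy:
            self.log.append(record)
            return True            # this replica confirmed the write
        return False

def replicated_write(leader, followers, record, min_insync=2):
    acks = 1 if leader.append(record) else 0   # leader counts toward the quorum
    for f in followers:
        if f.append(record):
            acks += 1
    return acks >= min_insync      # ACK the client only if the quorum is met

leader = Replica("leader")
f1, f2 = Replica("f1"), Replica("f2", healthy=False)   # one follower is down

# min_insync=2 still succeeds: leader + f1 confirm
assert replicated_write(leader, [f1, f2], "msg-1", min_insync=2)
# min_insync=3 fails while f2 is down (synchronous-to-all behaviour)
assert not replicated_write(leader, [f1, f2], "msg-2", min_insync=3)
```

This is why the hybrid stays available through a single follower failure while still guaranteeing the write exists on more than one node.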

Leader Election and Failover

When Does Failover Happen?

[Diagram: Failure Detection]
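Failure detection is typically heartbeat-based: each node periodically reports liveness, and a coordinator presumes any node silent for longer than a timeout has failed. A toy version (hypothetical names; real detectors also guard against clock skew and transient network blips):

```python
# Heartbeat-based failure detection sketch (illustrative names).

def suspected_failures(heartbeats, now, timeout=10.0):
    """Return nodes whose last heartbeat is older than `timeout` seconds.

    heartbeats: {node_name: last_heartbeat_timestamp}
    """
    return [node for node, ts in heartbeats.items() if now - ts > timeout]

hb = {"leader": 100.0, "f1": 104.0, "f2": 88.0}
# f2 has been silent for 17s > 10s timeout, so it is presumed failed
assert suspected_failures(hb, now=105.0) == ["f2"]
```

The timeout choice is a tradeoff: too short causes spurious failovers during GC pauses or network hiccups; too long delays recovery.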

Leader Election Process:

[Diagram: Leader Election Process]
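A common election rule, sketched below with hypothetical names, is to promote the in-sync replica with the highest log end offset, since it holds the most complete copy of the failed leader's log and therefore loses the least data:

```python
# Leader election from the ISR (illustrative sketch).

def elect_leader(replicas, isr):
    """replicas: {name: log_end_offset}; isr: set of names eligible for election."""
    candidates = {name: off for name, off in replicas.items() if name in isr}
    if not candidates:
        # No in-sync replica left: promoting an out-of-sync one would lose data
        raise RuntimeError("no in-sync replica available (unclean election needed)")
    return max(candidates, key=candidates.get)

replicas = {"f1": 120, "f2": 150, "f3": 150}
# f3 has the data but lagged out of the ISR, so f2 wins
assert elect_leader(replicas, isr={"f1", "f2"}) == "f2"
```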

Kafka’s Controller-Based Election:

[Diagram: Kafka Controller Architecture]

Replication Lag and ISR Management

What is Replication Lag?

[Diagram: Replication Lag]

ISR Dynamics:

[Diagram: ISR Dynamics]
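ISR maintenance can be sketched as a periodic lag check, loosely modeled on Kafka's `replica.lag.time.max.ms`: a follower stays in the ISR if it is fully caught up, or if it last caught up recently enough; otherwise it is evicted until it recovers (names below are hypothetical):

```python
# ISR membership sketch based on replication lag (illustrative names).

def update_isr(leader_offset, followers, now, max_lag_s=30.0):
    """followers: {name: (offset, last_caught_up_at)} -> new ISR set."""
    isr = set()
    for name, (offset, caught_up_at) in followers.items():
        if offset == leader_offset or now - caught_up_at <= max_lag_s:
            isr.add(name)   # fully caught up, or lagging only briefly
    return isr

followers = {
    "f1": (1000, 99.0),   # fully caught up
    "f2": (990, 80.0),    # 10 records behind, caught up 25s ago: still in ISR
    "f3": (700, 40.0),    # lagging for 65s: evicted
}
assert update_isr(1000, followers, now=105.0) == {"f1", "f2"}
```

Evicting slow followers keeps writes fast (the quorum only waits on healthy replicas) at the cost of temporarily fewer in-sync copies.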

Read Patterns

Read from Leader (Strong Consistency):

[Diagram: Read from Leader]

Read from Followers (Eventual Consistency):

[Diagram: Read from Followers]

Hybrid: Read-Your-Writes Consistency:

[Diagram: Read-Your-Writes Consistency]
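Read-your-writes routing can be sketched as follows (hypothetical names): each client remembers the offset of its last write, and a follower may serve the read only if it has replicated at least that far; otherwise the read falls back to the leader.

```python
# Read-your-writes routing sketch (illustrative names).

def choose_read_replica(client_last_write_offset, leader, followers):
    """followers: {name: replicated_offset} -> name of replica to read from."""
    for name, offset in sorted(followers.items()):
        if offset >= client_last_write_offset:
            return name          # this follower has seen the client's writes
    return leader                # fall back to the leader for fresh data

followers = {"f1": 480, "f2": 510}
assert choose_read_replica(500, "leader", followers) == "f2"   # f1 is too stale
assert choose_read_replica(520, "leader", followers) == "leader"
```

Other clients' reads can still go to any follower, so this preserves most of the read-scaling benefit while hiding a client's own replication lag from itself.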

Tradeoffs

Advantages:

  • ✓ Fault tolerance (survive N-1 failures with N replicas)
  • ✓ High availability (automatic failover)
  • ✓ Read scalability (distribute reads to followers)
  • ✓ Data durability (multiple copies)

Disadvantages:

  • ✕ Write latency (replication overhead)
  • ✕ Consistency complexity (sync vs async tradeoffs)
  • ✕ Failover time (10-30s downtime during leader election)
  • ✕ Split-brain risk (requires external coordinator)
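The split-brain risk is usually mitigated by fencing: the coordinator assigns each elected leader a monotonically increasing epoch number, and replicas reject writes stamped with a stale epoch from a deposed leader. A toy version (hypothetical names):

```python
# Epoch-based fencing sketch to guard against split-brain (illustrative names).

class ReplicaState:
    def __init__(self):
        self.current_epoch = 0
        self.log = []

    def accept_write(self, epoch, record):
        if epoch < self.current_epoch:
            return False                 # stale leader: reject (fenced off)
        self.current_epoch = epoch
        self.log.append(record)
        return True

r = ReplicaState()
assert r.accept_write(epoch=1, record="from-old-leader")
r.current_epoch = 2                      # coordinator promoted a new leader
assert not r.accept_write(epoch=1, record="late-write-from-old-leader")
assert r.accept_write(epoch=2, record="from-new-leader")
```

Even if the old leader is alive behind a network partition and keeps accepting client traffic, none of its writes after the election can corrupt the replicas.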

Real Systems Using This

Apache Kafka

  • Implementation: Leader per partition, ISR-based replication
  • Scale: 3-5 replicas typical, 7+ for critical data
  • Failover: Controller-based election, ~10s failover time
  • Typical Setup: replication.factor=3, min.insync.replicas=2
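For illustration, the typical setup above written out as a settings map. The keys follow Kafka's documented property names (`acks` is a producer setting; the others are topic/broker settings), but the dict itself is just a sketch, not something you'd pass to a client as-is:

```python
# Typical Kafka replication settings, as a plain map (illustrative).

kafka_replication_config = {
    "replication.factor": 3,        # 1 leader + 2 followers per partition
    "min.insync.replicas": 2,       # leader + at least 1 follower must ACK
    "acks": "all",                  # producer waits for the full ISR quorum
    "unclean.leader.election.enable": False,  # never promote an out-of-sync replica
}

# Sanity check: the quorum must fit inside the replica set
assert (kafka_replication_config["min.insync.replicas"]
        <= kafka_replication_config["replication.factor"])
```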

MongoDB

  • Implementation: Replica sets with primary and secondaries
  • Scale: 3-7 replicas per replica set
  • Failover: Raft-based election, ~10-40s failover
  • Typical Setup: 3 replicas, read preference “primaryPreferred”

PostgreSQL

  • Implementation: Streaming replication (WAL-based)
  • Scale: 1 primary + N standbys
  • Failover: Manual or automatic (with tools like Patroni)
  • Typical Setup: 1 primary + 2 standbys, async replication

Redis

  • Implementation: Master-replica replication (historically called “master-slave”)
  • Scale: 1 master + multiple replicas
  • Failover: Redis Sentinel for automatic failover
  • Typical Setup: 1 master + 2 replicas + 3 Sentinel nodes

When to Use Leader-Follower Replication

✓ Perfect Use Cases

High Availability Critical Systems

[Diagram: High Availability Critical Systems]

Read-Heavy Workloads

[Diagram: Read-Heavy Workloads]

Geo-Distributed Reads

[Diagram: Geo-Distributed Reads]

✕ When NOT to Use

Multi-Region Writes

[Diagram: Multi-Region Writes]

Need for Strong Consistency Reads

[Diagram: Need for Strong Consistency Reads]

Extremely High Write Throughput

[Diagram: Extremely High Write Throughput]

Interview Application

Common Interview Question 1

Q: “Design a highly available message queue. How would you handle broker failures?”

Strong Answer:

“I’d use leader-follower replication with in-sync replicas (Kafka’s model):

Architecture:

  • Each partition has replication.factor=3 (1 leader + 2 followers)
  • min.insync.replicas=2 (leader + at least 1 follower must ACK)
  • Controller broker manages leader elections

Normal Operation:

  • Producers write to partition leader
  • Leader replicates to followers in parallel
  • ACK to producer after min 2 replicas confirm
  • Consumers read from the leader (or from followers, where follower fetching is enabled)

Failure Handling:

  • Follower failure: Removed from ISR, writes continue with remaining ISR
  • Leader failure: Controller elects new leader from ISR within 10-30s
  • Network partition: Rely on ZooKeeper quorum to prevent split-brain

Trade-offs:

  • Synchronous to ISR = no data loss but slightly higher latency
  • Async to non-ISR replicas = fast writes but potential data loss on leader crash

This is exactly how Kafka achieves 99.99%+ availability at LinkedIn scale.”

Why this is good:

  • Specific configuration values
  • Handles multiple failure scenarios
  • Explains trade-offs clearly
  • References real-world implementation

Common Interview Question 2

Q: “What’s the difference between synchronous and asynchronous replication? When would you use each?”

Strong Answer:

“Synchronous Replication:

  • Leader waits for follower ACKs before responding to client
  • Guarantees: No data loss (all replicas have data)
  • Trade-off: Higher latency, lower availability (blocked if follower down)
  • Use case: Financial transactions, critical metadata

Asynchronous Replication:

  • Leader responds immediately, replicates in background
  • Guarantees: Low latency, high availability
  • Trade-off: Potential data loss if leader crashes before replication
  • Use case: Analytics logs, user activity streams

In Production: Most systems use a hybrid like Kafka’s ISR:

  • Synchronous to a quorum (e.g., 2 out of 3 replicas)
  • Asynchronous to remaining replicas
  • Dynamically remove slow replicas from ISR to maintain availability
  • Result: Balance between durability and performance

For example, at Uber, we’d use sync replication for payment events (can’t lose money) but async for GPS location updates (can tolerate occasional loss).”

Why this is good:

  • Clear comparison of both approaches
  • Explains when to use each
  • Mentions hybrid approach (real-world)
  • Concrete examples for each use case

Red Flags to Avoid

  • ✕ Not understanding the difference between sync and async replication
  • ✕ Ignoring split-brain scenarios and how to prevent them
  • ✕ Thinking failover is instant (it takes 10-30s typically)
  • ✕ Not considering replication lag impact on read consistency

Quick Self-Check

Before moving on, can you:

  • Explain leader-follower replication in 60 seconds?
  • Draw the write and read flow diagrams?
  • Compare synchronous vs asynchronous replication?
  • Explain how leader election works?
  • Describe ISR (in-sync replicas) concept?
  • Identify when to use vs NOT use this pattern?

See It In Action

Prerequisites

None - this is a foundational distributed systems pattern

Used In Systems

  • Distributed Databases - PostgreSQL replication
  • Message Queues - Kafka replication

Explained In Detail

  • Kafka Architecture - Replication and ISR (45 minutes)
  • Deep dive into partition leadership, ISR management, and controller election

Next Recommended: Consensus - Learn how distributed systems agree on a single leader

Interview Notes

  • ⭐ Must-Know
  • 💼 Interview relevance: ~85% of distributed systems interviews
  • 🏭 Production impact: powers Kafka, MongoDB, and PostgreSQL deployments
  • Performance: enables 99.99%+ uptime
  • 📈 Scalability: 7+ trillion messages/day (LinkedIn)