
Leader-Follower Replication

How distributed systems achieve fault tolerance and high availability by replicating data from a leader node to multiple follower nodes

TL;DR

Leader-follower replication is a pattern where one node (the leader) handles all writes and replicates data to multiple follower nodes that serve reads and provide fault tolerance. If the leader fails, a follower is promoted to become the new leader. This pattern achieves high availability, fault tolerance, and read scalability.

Visual Overview

[Diagram: Leader-Follower Architecture]

Core Explanation

What is Leader-Follower Replication?

Leader-follower replication (also called master-slave or primary-secondary) is a replication pattern where:

  • One Leader: Handles all writes and maintains the authoritative copy
  • Multiple Followers: Replicate the leader’s data and can serve reads
  • Automatic Failover: A follower is promoted to leader if the leader fails
  • Consistency: The leader ensures all replicas converge to the same state
[Diagram: Single Node vs Leader-Follower]
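The division of responsibilities above can be sketched in a few lines of Python. This is a toy model with hypothetical names, not any real system's API: the leader takes every write, pushes it to each follower, and followers serve reads from their replicated copy.

```python
# Minimal leader-follower sketch (illustrative only; all names are hypothetical).

class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []              # replicated copy of the leader's log

    def replicate(self, record):
        self.log.append(record)

    def read(self):
        return list(self.log)      # followers serve (possibly stale) reads

class Leader:
    def __init__(self, followers):
        self.log = []              # authoritative copy
        self.followers = followers

    def write(self, record):
        self.log.append(record)
        for f in self.followers:   # push each write to every follower
            f.replicate(record)

followers = [Follower("f1"), Follower("f2")]
leader = Leader(followers)
leader.write({"key": "a", "value": 1})

# Every replica converges to the same state
assert followers[0].read() == followers[1].read() == leader.log
```

A real implementation would ship log records over the network and handle retries and ordering, but the shape — one writer, N copies — is the same.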

Replication Modes: Synchronous vs Asynchronous

Synchronous Replication (Strong Consistency):

[Diagram: Synchronous Replication]

Asynchronous Replication (Eventual Consistency):

[Diagram: Asynchronous Replication]

Hybrid: In-Sync Replicas (ISR) - Kafka’s Approach:

[Diagram: In-Sync Replicas (ISR)]
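The ISR hybrid can be approximated with a toy quorum model (hypothetical names, loosely modeled on Kafka): the leader replicates to every follower but ACKs the client once `min_insync` replicas, counting itself, have the record. Setting `min_insync` to the full replica count gives fully synchronous behaviour; setting it to 1 approximates asynchronous replication.

```python
# ISR-style acknowledgment sketch (illustrative; names are hypothetical).

class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def append(self, record):
        if self.healthy:
            self.log.append(record)
            return True            # this replica confirmed the write
        return False

def replicated_write(leader, followers, record, min_insync=2):
    acks = 1 if leader.append(record) else 0   # leader counts toward the quorum
    for f in followers:
        if f.append(record):
            acks += 1
    return acks >= min_insync      # ACK the client only if the quorum is met

leader = Replica("leader")
f1, f2 = Replica("f1"), Replica("f2", healthy=False)   # one follower is down

# min_insync=2 still succeeds: leader + f1 confirm
assert replicated_write(leader, [f1, f2], "msg-1", min_insync=2)
# min_insync=3 fails while f2 is down (synchronous-to-all behaviour)
assert not replicated_write(leader, [f1, f2], "msg-2", min_insync=3)
```

This is why the hybrid stays available through a single follower failure while still guaranteeing the write exists on more than one node.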

Leader Election and Failover

When Does Failover Happen?

[Diagram: Failure Detection]
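Failure detection is typically heartbeat-based: each node periodically reports liveness, and a coordinator presumes any node silent for longer than a timeout has failed. A toy version (hypothetical names; real detectors also guard against clock skew and transient network blips):

```python
# Heartbeat-based failure detection sketch (illustrative names).

def suspected_failures(heartbeats, now, timeout=10.0):
    """Return nodes whose last heartbeat is older than `timeout` seconds.

    heartbeats: {node_name: last_heartbeat_timestamp}
    """
    return [node for node, ts in heartbeats.items() if now - ts > timeout]

hb = {"leader": 100.0, "f1": 104.0, "f2": 88.0}
# f2 has been silent for 17s > 10s timeout, so it is presumed failed
assert suspected_failures(hb, now=105.0) == ["f2"]
```

The timeout choice is a tradeoff: too short causes spurious failovers during GC pauses or network hiccups; too long delays recovery.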

Leader Election Process:

[Diagram: Leader Election Process]
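A common election rule, sketched below with hypothetical names, is to promote the in-sync replica with the highest log end offset, since it holds the most complete copy of the failed leader's log and therefore loses the least data:

```python
# Leader election from the ISR (illustrative sketch).

def elect_leader(replicas, isr):
    """replicas: {name: log_end_offset}; isr: set of names eligible for election."""
    candidates = {name: off for name, off in replicas.items() if name in isr}
    if not candidates:
        # No in-sync replica left: promoting an out-of-sync one would lose data
        raise RuntimeError("no in-sync replica available (unclean election needed)")
    return max(candidates, key=candidates.get)

replicas = {"f1": 120, "f2": 150, "f3": 150}
# f3 has the data but lagged out of the ISR, so f2 wins
assert elect_leader(replicas, isr={"f1", "f2"}) == "f2"
```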

Kafka’s Controller-Based Election:

[Diagram: Kafka Controller Architecture]

Replication Lag and ISR Management

What is Replication Lag?

[Diagram: Replication Lag]

ISR Dynamics:

[Diagram: ISR Dynamics]
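ISR maintenance can be sketched as a periodic lag check, loosely modeled on Kafka's `replica.lag.time.max.ms`: a follower stays in the ISR if it is fully caught up, or if it last caught up recently enough; otherwise it is evicted until it recovers (names below are hypothetical):

```python
# ISR membership sketch based on replication lag (illustrative names).

def update_isr(leader_offset, followers, now, max_lag_s=30.0):
    """followers: {name: (offset, last_caught_up_at)} -> new ISR set."""
    isr = set()
    for name, (offset, caught_up_at) in followers.items():
        if offset == leader_offset or now - caught_up_at <= max_lag_s:
            isr.add(name)   # fully caught up, or lagging only briefly
    return isr

followers = {
    "f1": (1000, 99.0),   # fully caught up
    "f2": (990, 80.0),    # 10 records behind, caught up 25s ago: still in ISR
    "f3": (700, 40.0),    # lagging for 65s: evicted
}
assert update_isr(1000, followers, now=105.0) == {"f1", "f2"}
```

Evicting slow followers keeps writes fast (the quorum only waits on healthy replicas) at the cost of temporarily fewer in-sync copies.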

Read Patterns

Read from Leader (Strong Consistency):

[Diagram: Read from Leader]

Read from Followers (Eventual Consistency):

[Diagram: Read from Followers]

Hybrid: Read-Your-Writes Consistency:

[Diagram: Read-Your-Writes Consistency]
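Read-your-writes routing can be sketched as follows (hypothetical names): each client remembers the offset of its last write, and a follower may serve the read only if it has replicated at least that far; otherwise the read falls back to the leader.

```python
# Read-your-writes routing sketch (illustrative names).

def choose_read_replica(client_last_write_offset, leader, followers):
    """followers: {name: replicated_offset} -> name of replica to read from."""
    for name, offset in sorted(followers.items()):
        if offset >= client_last_write_offset:
            return name          # this follower has seen the client's writes
    return leader                # fall back to the leader for fresh data

followers = {"f1": 480, "f2": 510}
assert choose_read_replica(500, "leader", followers) == "f2"   # f1 is too stale
assert choose_read_replica(520, "leader", followers) == "leader"
```

Other clients' reads can still go to any follower, so this preserves most of the read-scaling benefit while hiding a client's own replication lag from itself.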

Tradeoffs

Advantages:

  • ✓ Fault tolerance (survive N-1 failures with N replicas)
  • ✓ High availability (automatic failover)
  • ✓ Read scalability (distribute reads to followers)
  • ✓ Data durability (multiple copies)

Disadvantages:

  • ✕ Write latency (replication overhead)
  • ✕ Consistency complexity (sync vs async tradeoffs)
  • ✕ Failover time (10-30s downtime during leader election)
  • ✕ Split-brain risk (requires external coordinator)
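The split-brain risk is usually mitigated by fencing: the coordinator assigns each elected leader a monotonically increasing epoch number, and replicas reject writes stamped with a stale epoch from a deposed leader. A toy version (hypothetical names):

```python
# Epoch-based fencing sketch to guard against split-brain (illustrative names).

class ReplicaState:
    def __init__(self):
        self.current_epoch = 0
        self.log = []

    def accept_write(self, epoch, record):
        if epoch < self.current_epoch:
            return False                 # stale leader: reject (fenced off)
        self.current_epoch = epoch
        self.log.append(record)
        return True

r = ReplicaState()
assert r.accept_write(epoch=1, record="from-old-leader")
r.current_epoch = 2                      # coordinator promoted a new leader
assert not r.accept_write(epoch=1, record="late-write-from-old-leader")
assert r.accept_write(epoch=2, record="from-new-leader")
```

Even if the old leader is alive behind a network partition and keeps accepting client traffic, none of its writes after the election can corrupt the replicas.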

Real Systems Using This

Apache Kafka

  • Implementation: Leader per partition, ISR-based replication
  • Scale: 3-5 replicas typical, 7+ for critical data
  • Failover: Controller-based election, ~10s failover time
  • Typical Setup: replication.factor=3, min.insync.replicas=2
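For illustration, the typical setup above written out as a settings map. The keys follow Kafka's documented property names (`acks` is a producer setting; the others are topic/broker settings), but the dict itself is just a sketch, not something you'd pass to a client as-is:

```python
# Typical Kafka replication settings, as a plain map (illustrative).

kafka_replication_config = {
    "replication.factor": 3,        # 1 leader + 2 followers per partition
    "min.insync.replicas": 2,       # leader + at least 1 follower must ACK
    "acks": "all",                  # producer waits for the full ISR quorum
    "unclean.leader.election.enable": False,  # never promote an out-of-sync replica
}

# Sanity check: the quorum must fit inside the replica set
assert (kafka_replication_config["min.insync.replicas"]
        <= kafka_replication_config["replication.factor"])
```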

MongoDB

  • Implementation: Replica sets with primary and secondaries
  • Scale: 3-7 replicas per replica set
  • Failover: Raft-based election, ~10-40s failover
  • Typical Setup: 3 replicas, read preference “primaryPreferred”

PostgreSQL

  • Implementation: Streaming replication (WAL-based)
  • Scale: 1 primary + N standbys
  • Failover: Manual or automatic (with tools like Patroni)
  • Typical Setup: 1 primary + 2 standbys, async replication

Redis

  • Implementation: Master-replica replication (historically called “master-slave”)
  • Scale: 1 master + multiple replicas
  • Failover: Redis Sentinel for automatic failover
  • Typical Setup: 1 master + 2 replicas + 3 Sentinel nodes

When to Use Leader-Follower Replication

✓ Perfect Use Cases

High Availability Critical Systems

[Diagram: High Availability Critical Systems]

Read-Heavy Workloads

[Diagram: Read-Heavy Workloads]

Geo-Distributed Reads

[Diagram: Geo-Distributed Reads]

✕ When NOT to Use

Multi-Region Writes

[Diagram: Multi-Region Writes]

Need for Strong Consistency Reads

[Diagram: Need for Strong Consistency Reads]

Extremely High Write Throughput

[Diagram: Extremely High Write Throughput]

Interview Application

Common Interview Question 1

Q: “Design a highly available message queue. How would you handle broker failures?”

Strong Answer:

“I’d use leader-follower replication with in-sync replicas (Kafka’s model):

Architecture:

  • Each partition has replication.factor=3 (1 leader + 2 followers)
  • min.insync.replicas=2 (leader + at least 1 follower must ACK)
  • Controller broker manages leader elections

Normal Operation:

  • Producers write to partition leader
  • Leader replicates to followers in parallel
  • ACK to producer after min 2 replicas confirm
  • Consumers read from the leader (or from followers, where follower fetching is enabled)

Failure Handling:

  • Follower failure: Removed from ISR, writes continue with remaining ISR
  • Leader failure: Controller elects new leader from ISR within 10-30s
  • Network partition: Rely on ZooKeeper quorum to prevent split-brain

Trade-offs:

  • Synchronous to ISR = no data loss but slightly higher latency
  • Async to non-ISR replicas = fast writes but potential data loss on leader crash

This is exactly how Kafka achieves 99.99%+ availability at LinkedIn scale.”

Why this is good:

  • Specific configuration values
  • Handles multiple failure scenarios
  • Explains trade-offs clearly
  • References real-world implementation

Common Interview Question 2

Q: “What’s the difference between synchronous and asynchronous replication? When would you use each?”

Strong Answer:

“Synchronous Replication:

  • Leader waits for follower ACKs before responding to client
  • Guarantees: No data loss (all replicas have data)
  • Trade-off: Higher latency, lower availability (blocked if follower down)
  • Use case: Financial transactions, critical metadata

Asynchronous Replication:

  • Leader responds immediately, replicates in background
  • Guarantees: Low latency, high availability
  • Trade-off: Potential data loss if leader crashes before replication
  • Use case: Analytics logs, user activity streams

In Production: Most systems use a hybrid like Kafka’s ISR:

  • Synchronous to a quorum (e.g., 2 out of 3 replicas)
  • Asynchronous to remaining replicas
  • Dynamically remove slow replicas from ISR to maintain availability
  • Result: Balance between durability and performance

For example, at Uber, we’d use sync replication for payment events (can’t lose money) but async for GPS location updates (can tolerate occasional loss).”

Why this is good:

  • Clear comparison of both approaches
  • Explains when to use each
  • Mentions hybrid approach (real-world)
  • Concrete examples for each use case

Red Flags to Avoid

  • ✕ Not understanding the difference between sync and async replication
  • ✕ Ignoring split-brain scenarios and how to prevent them
  • ✕ Thinking failover is instant (it takes 10-30s typically)
  • ✕ Not considering replication lag impact on read consistency

Quick Self-Check

Before moving on, can you:

  • Explain leader-follower replication in 60 seconds?
  • Draw the write and read flow diagrams?
  • Compare synchronous vs asynchronous replication?
  • Explain how leader election works?
  • Describe ISR (in-sync replicas) concept?
  • Identify when to use vs NOT use this pattern?

See It In Action

Prerequisites

None - this is a foundational distributed systems pattern

Used In Systems

  • Distributed Databases - PostgreSQL replication
  • Message Queues - Kafka replication

Explained In Detail

  • Kafka Architecture - Replication and ISR (45 minutes)
  • Deep dive into partition leadership, ISR management, and controller election

Next Recommended: Consensus - Learn how distributed systems agree on a single leader

Interview Notes

  • ⭐ Must-Know
  • 💼 Interview relevance: ~85% of distributed systems interviews
  • 🏭 Production impact: powers Kafka, MongoDB, and PostgreSQL deployments
  • Performance: enables 99.99%+ uptime
  • 📈 Scalability: 7+ trillion messages/day (LinkedIn)