Automatic switching to a backup system or replica when the primary fails, ensuring service continuity with minimal downtime
- Appears in roughly 70% of HA design interviews
- Powers systems at Netflix, AWS, and Google
- Enables 99.99%+ uptime targets
- Central to RTO/RPO optimization
TL;DR
Failover is the process of automatically transferring operations from a failed primary system to a standby backup, minimizing downtime and ensuring service continuity. It's essential for high-availability systems, enabling recovery from hardware failures, software crashes, and network issues within seconds to minutes.
Visual Overview
NORMAL OPERATION (Leader-Follower)
  Clients ──▶ Primary/Leader
                  │
            [Replicating]
                  │
                  ▼
         Standby/Follower (Hot/Warm/Cold)

  Status: Primary is HEALTHY ✓
FAILURE DETECTION
  Heartbeat Monitor
         ↓
  Primary ✗ (No heartbeat for 10 seconds)
         ↓
  Failure Threshold Exceeded
         ↓
  TRIGGER FAILOVER!
FAILOVER PROCESS
  T0: Primary fails (crash, network partition)
  T1: Health checker detects failure (5-10s)
  T2: Standby promoted to new primary (2-5s)
  T3: Clients redirected to new primary (1-2s)
  T4: Service restored

  Total downtime: 8-17 seconds

  New topology:
  Clients ──▶ New Primary (old Standby) ✓
  Old Primary (offline) ✗
SPLIT-BRAIN PROBLEM (What Can Go Wrong)
  Network Partition:

  Region A              ║   Region B
  Primary (still alive) ║   Standby (promoted to Primary)
       ↑                ║        ↑
  Clients A             ║   Clients B
  write X=1             ║   write X=2

  Result: TWO PRIMARIES = DATA DIVERGENCE ✗

  Solution: Fencing (kill old primary with authority)
Core Explanation
What is Failover?
Failover is the automatic or manual process of switching from a failed primary system to a standby backup system to maintain service availability. It's a key component of high-availability (HA) architectures.
Key Components:
- Primary/Leader: Active system handling requests
- Standby/Follower: Backup system ready to take over
- Health Monitor: Detects primary failures via heartbeats
- Failover Orchestrator: Promotes standby to primary
- Fencing Mechanism: Prevents split-brain scenarios
Standby Types
1. Hot Standby (Active-Passive)
Setup:
Primary: Handles all traffic
Standby: Fully synced, ready to serve immediately
Replication: Continuous, synchronous or near-sync
Failover Time: Seconds (fastest)
Cost: High (standby hardware running idle)
Example:
Primary DB: PostgreSQL with streaming replication
Standby DB: Replica in sync, can become primary instantly
Use Case: Financial systems, databases (99.99% uptime)
2. Warm Standby
Setup:
Primary: Handles all traffic
Standby: Partially synced, needs brief preparation
Replication: Periodic or continuous async
Failover Time: Minutes
Cost: Medium (standby runs but lighter workload)
Example:
Primary: Application server with session state
Standby: Server running but not in load balancer
Use Case: Web applications, moderate SLA (99.9% uptime)
3. Cold Standby
Setup:
Primary: Handles all traffic
Standby: Offline, restored from backup
Replication: Periodic backups only
Failover Time: Hours (restore from backup + configure)
Cost: Low (standby hardware can be repurposed)
Example:
Primary: Production database
Standby: Backup snapshots on S3, restore to new instance
Use Case: Development systems, cost-sensitive applications
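To make the choice concrete, here is a tiny sketch that maps an RTO target to the cheapest standby tier that can still meet it; the failover-time figures are the illustrative ranges from above, not fixed constants.
// Pick the cheapest standby tier whose typical failover time fits the RTO.
// Numbers are illustrative, matching the ranges described above.
const STANDBY_TIERS = [
  { tier: "cold", typicalFailoverSeconds: 4 * 3600, relativeCost: "low" },
  { tier: "warm", typicalFailoverSeconds: 10 * 60, relativeCost: "medium" },
  { tier: "hot", typicalFailoverSeconds: 30, relativeCost: "high" },
];

function chooseStandbyTier(rtoSeconds) {
  // Tiers are ordered cheapest-first; take the first one that meets the RTO
  return STANDBY_TIERS.find((t) => t.typicalFailoverSeconds <= rtoSeconds) ?? null;
}

console.log(chooseStandbyTier(24 * 3600)?.tier); // "cold"  (RTO measured in hours)
console.log(chooseStandbyTier(15 * 60)?.tier);   // "warm"  (RTO measured in minutes)
console.log(chooseStandbyTier(60)?.tier);        // "hot"   (RTO measured in seconds)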
Failure Detection
Heartbeat Monitoring:
HEARTBEAT PROTOCOL:
  Every 1-5 seconds:

    Primary ──"I'm alive"──▶ Health Monitor

  If no heartbeat for N seconds:
    - N = 10s: Aggressive (false positives)
    - N = 30s: Conservative (slower failover)
    - N = 10s with 3 retries: Balanced ✓
What to Monitor:
✓ Process is running (liveness probe)
✓ Service is responsive (readiness probe)
✓ Database connections work
✓ Disk space available
✓ CPU/Memory not exhausted
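A minimal health endpoint exposing checks like these might look as follows; this is a sketch using only Node's built-in modules, and the individual checks (plus the role field that the monitor code later on this page reads) are illustrative.
// Sketch: /health endpoint combining liveness-style checks (Node built-ins only).
const http = require("node:http");
const { constants } = require("node:fs");
const fsp = require("node:fs/promises");

let currentRole = "primary"; // in practice this comes from the replication layer

async function runChecks() {
  const checks = {
    processAlive: true, // if we can respond at all, the process is up
    diskWritable: await fsp
      .access("/tmp", constants.W_OK)
      .then(() => true)
      .catch(() => false),
    memoryOk: process.memoryUsage().heapUsed < 1.5 * 1024 ** 3, // heap below ~1.5 GB
    // databaseOk: await db.query("SELECT 1")  // omitted in this sketch
  };
  return { healthy: Object.values(checks).every(Boolean), role: currentRole, checks };
}

http
  .createServer(async (req, res) => {
    if (req.url === "/health") {
      const status = await runChecks();
      res.writeHead(status.healthy ? 200 : 503, { "Content-Type": "application/json" });
      res.end(JSON.stringify(status));
    } else {
      res.writeHead(404);
      res.end();
    }
  })
  .listen(8080);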
False Positive Causes:
- Temporary network glitch
- GC pause (Java stop-the-world)
- High CPU causing timeout
- Switch/router failure (not server failure)
Mitigation:
- Multiple independent monitors
- Quorum-based decision (3/5 monitors agree)
- Grace period + retries
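The quorum rule itself is small enough to show inline; a sketch (the monitor verdicts are illustrative booleans):
// Only declare the primary dead when a strict majority of independent
// monitors agree, which filters out single-monitor false positives.
function shouldDeclareDown(monitorVerdicts) {
  // e.g. [true, true, false, true, true] where true = "primary looks down"
  const votesDown = monitorVerdicts.filter(Boolean).length;
  const quorum = Math.floor(monitorVerdicts.length / 2) + 1; // strict majority
  return votesDown >= quorum;
}

console.log(shouldDeclareDown([true, true, false]));               // true  (2/3)
console.log(shouldDeclareDown([true, false, false, false, true])); // false (2/5)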
Failover Process Steps
Step-by-Step Breakdown:
1. FAILURE DETECTION (5-30s)
- Health monitor misses N heartbeats
- Multiple monitors reach consensus
- Declare primary unhealthy
2. FENCING (1-5s)
- Prevent split-brain
- Kill old primary (STONITH: Shoot The Other Node In The Head)
- Revoke old primary's network access
- Acquire distributed lock/lease
3. PROMOTION (2-10s)
- Standby catches up replication lag (if any)
- Standby promoted to primary role
- Update metadata (e.g., Kafka controller registry)
4. DNS/ROUTING UPDATE (1-60s)
- Update DNS to point to new primary (see the DNS sketch after this list)
- Or update load balancer configuration
- Or use floating IP (instant failover)
5. CLIENT RECONNECTION (0-30s)
- Clients detect connection failure
- Retry with backoff
- Discover new primary endpoint
- Resume operations
Total Downtime: ~9-135 seconds (depends on config)
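For the routing update in step 4, one common option is a DNS upsert with a short TTL. A hedged sketch using the AWS SDK v3 Route 53 client follows; the hosted zone ID, record name, and IP are placeholders, and clients may still cache the old record for up to the TTL.
// Sketch: repoint the database DNS record at the newly promoted primary.
// Uses @aws-sdk/client-route-53; zone ID, record name, and IP are placeholders.
const {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} = require("@aws-sdk/client-route-53");

async function pointDnsAtNewPrimary(newPrimaryIp) {
  const route53 = new Route53Client({});
  await route53.send(
    new ChangeResourceRecordSetsCommand({
      HostedZoneId: "Z0EXAMPLE1234", // placeholder hosted zone
      ChangeBatch: {
        Comment: "Failover: repoint db endpoint at new primary",
        Changes: [
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: "db.example.com",
              Type: "A",
              TTL: 30, // keep the TTL short so clients pick up the change quickly
              ResourceRecords: [{ Value: newPrimaryIp }],
            },
          },
        ],
      },
    })
  );
}

// pointDnsAtNewPrimary("10.0.2.15");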
RTO vs RPO
Recovery Time Objective (RTO):
RTO = Maximum acceptable downtime
Examples:
- E-commerce site: RTO = 5 minutes (lose sales)
- Payment processing: RTO = 30 seconds (critical)
- Analytics dashboard: RTO = 30 minutes (non-critical)
Factors affecting RTO:
- Standby type (hot < warm < cold)
- Failure detection time
- Promotion process complexity
- Client reconnection speed
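Because the steps run sequentially, a quick budget of their worst cases is a useful sanity check against the RTO target; a small sketch using the illustrative numbers from the process above:
// Worst-case RTO is roughly the sum of the worst-case step durations.
const failoverBudgetSeconds = {
  detection: 30,       // heartbeat interval x retry threshold
  fencing: 5,          // isolate or power off the old primary
  promotion: 10,       // standby catches up and switches role
  routingUpdate: 60,   // DNS propagation (near 0 with a floating IP)
  clientReconnect: 30, // retries with backoff until the new endpoint is found
};

const worstCaseRto = Object.values(failoverBudgetSeconds).reduce((a, b) => a + b, 0);
console.log(`Worst-case RTO: ~${worstCaseRto}s`); // ~135s with these numbers

// To hit a 30s RTO you would need a hot standby, aggressive detection,
// and a floating IP instead of DNS propagation.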
Recovery Point Objective (RPO):
RPO = Maximum acceptable data loss
Examples:
- Bank transactions: RPO = 0 (no loss acceptable)
- User comments: RPO = 5 minutes (acceptable)
- Log aggregation: RPO = 1 hour (can lose some logs)
Factors affecting RPO:
- Replication strategy (sync vs async)
- Commit acknowledgment (quorum required?)
- Checkpoint frequency
- Backup frequency (for cold standby)
TRADEOFF:
Low RPO (sync replication) = Higher latency
High RPO (async replication) = Lower latency, more data loss
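A back-of-the-envelope sketch of that trade-off: under async replication the RPO is roughly the replication lag, so potential data loss scales with write throughput (the numbers are illustrative).
// Rough RPO estimate: writes inside the replication-lag window are at risk.
function estimatePotentialLoss(replicationLagSeconds, writesPerSecond) {
  return replicationLagSeconds * writesPerSecond; // writes that may be lost on failover
}

console.log(estimatePotentialLoss(5, 200)); // async, 5s lag, 200 writes/s → up to 1000 writes
console.log(estimatePotentialLoss(0, 200)); // sync replication → RPO = 0,
                                            // but every commit waits on the standby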
Split-Brain Problem
What is Split-Brain?
Network partition causes both nodes to think they're primary:
Before partition:
Primary A (serving traffic) ←─ Health Monitor ─→ Standby B
After partition:
Primary A (still serving)   ║   Standby B (promoted, now serving)
        ↑                   ║           ↑
Clients in Region A         ║   Clients in Region B
write X=1                   ║   write X=2
PROBLEM: Divergent data, corruption when partition heals
Prevention Techniques:
1. Fencing (STONITH)
Before promoting standby, KILL old primary
Methods:
- Power off via IPMI/iLO
- Network isolation (block at switch)
- Kernel panic trigger
- Forceful process termination
Use case: Shared storage systems (SAN)
2. Distributed Locks / Leases
Acquire lock before becoming primary
Example (etcd):
1. Standby tries to acquire lease: /leader
2. Only succeeds if old primary's lease expired
3. Old primary cannot write without valid lease
4. Result: At most one primary at a time ✓
Use case: Kafka controller election
3. Quorum / Witness Node
Use majority voting to determine primary
Setup: 3 nodes (A, B, C)
- A is primary (has 2/3 votes)
- Network partition: A isolated, B+C can see each other
- B+C have majority (2/3), can elect new primary
- A has minority (1/3), cannot remain primary
- Result: Only B or C can be new primary
Use case: Distributed databases (Cassandra, Riak)
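The majority test behind this setup, as a small sketch:
// A node may act as (or become) primary only if it can see a strict
// majority of the cluster, counting itself.
function hasQuorum(reachablePeers, clusterSize) {
  const visibleNodes = reachablePeers + 1; // include this node
  return visibleNodes > clusterSize / 2;
}

// 3-node cluster (A, B, C); a partition isolates A:
console.log(hasQuorum(0, 3)); // A sees only itself     → false, must step down
console.log(hasQuorum(1, 3)); // B sees C (plus itself) → true, B+C may elect a primary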
4. Generation Number / Epoch
Increment version number on each failover
Primary A has epoch=5
After failover, Primary B has epoch=6
If A comes back, it sees epoch=5 < 6
A demotes itself and syncs from B
Use case: Kafka, ZooKeeper
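A sketch of epoch fencing on the write path; the in-memory store and the way epochs are announced are illustrative stand-ins for a real replication protocol.
// Epoch (generation number) fencing: writes carry the sender's epoch,
// and anything older than the highest epoch seen so far is rejected.
class EpochFencedStore {
  constructor() {
    this.currentEpoch = 0;
    this.data = new Map();
  }

  observeNewEpoch(epoch) {
    // Announced by the failover orchestrator when a new primary is elected
    this.currentEpoch = Math.max(this.currentEpoch, epoch);
  }

  write(key, value, senderEpoch) {
    if (senderEpoch < this.currentEpoch) {
      // A demoted primary that missed the failover is still trying to write
      throw new Error(`stale epoch ${senderEpoch} < ${this.currentEpoch}, write rejected`);
    }
    this.currentEpoch = senderEpoch;
    this.data.set(key, value);
  }
}

const store = new EpochFencedStore();
store.write("x", 1, 5);     // primary A (epoch 5) → accepted
store.observeNewEpoch(6);   // failover: B becomes primary with epoch 6
store.write("x", 2, 6);     // primary B (epoch 6) → accepted
try {
  store.write("x", 3, 5);   // A comes back with epoch 5 → rejected
} catch (err) {
  console.log(err.message);
}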
Real Systems Using Failover
| System | Failover Type | Detection Time | RTO | RPO | Split-Brain Prevention |
|---|---|---|---|---|---|
| PostgreSQL | Hot Standby (streaming replication) | 10-30s | ~30s | 0-5s | Witness server, fencing |
| MySQL | Semi-sync replication | 10-30s | ~1min | 0 (sync ack) | GTID, virtual IP |
| Redis Sentinel | Hot Standby (Sentinel monitors) | 5-15s | ~10s | 0-1s | Quorum (majority of Sentinels) |
| Kafka Controller | Hot Standby (ZooKeeper election) | 10-30s | ~20s | 0 (committed log) | ZooKeeper leader election, epoch |
| AWS RDS | Multi-AZ (automated failover) | 30-60s | 60-120s | 0 | AWS orchestration |
| Cassandra | Leaderless (no failover needed) | N/A | N/A | N/A | Quorum reads/writes |
Case Study: PostgreSQL Failover
PostgreSQL Streaming Replication + Patroni
Architecture:
  Primary (Leader)
     │  (continuous WAL streaming)
     ├──▶ Standby 1 (sync replica, zero lag)
     └──▶ Standby 2 (async replica, small lag)

  Health Monitor: Patroni (uses etcd for DCS)
Failure Scenario:
1. Primary crashes (hardware failure)
2. Patroni on each node detects (10s heartbeat miss)
3. Standbys try to acquire leadership lock in etcd
4. Standby 1 (most up-to-date) acquires lock
5. Standby 1 runs pg_ctl promote
6. Standby 1 becomes new primary (~5s)
7. Patroni updates DNS or floating IP (2s)
8. Applications reconnect to new primary (~5s)
Total downtime: ~22 seconds
RPO: 0 (sync replication to Standby 1)
Configuration:
synchronous_standby_names = 'standby1'  # commits wait for this standby to confirm
synchronous_commit = on                 # do not ACK commits until the sync standby has flushed the WAL
wal_level = replica                     # WAL carries enough information for streaming replication
max_wal_senders = 5                     # allow up to 5 concurrent replication connections
Result: Zero data loss, ~20s downtime
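Patroni makes its promotion decision from replication state; a hedged sketch of inspecting that state yourself with the pg client and the pg_stat_replication view (connection settings are placeholders):
// Sketch: check which standby is synchronous and how far replicas lag,
// using the `pg` client. Connection settings are placeholders.
const { Client } = require("pg");

async function showReplicationState() {
  const client = new Client({ host: "primary.db.example.com", user: "postgres" });
  await client.connect();
  const { rows } = await client.query(`
    SELECT application_name, state, sync_state, replay_lag
    FROM pg_stat_replication
  `);
  for (const r of rows) {
    // sync_state = 'sync' marks the standby whose confirmation commits wait for (RPO = 0)
    console.log(`${r.application_name}: state=${r.state} sync=${r.sync_state} lag=${r.replay_lag}`);
  }
  await client.end();
}

showReplicationState().catch(console.error);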
Case Study: Kafka Controller Failover
Kafka Controller Election (via ZooKeeper)
Normal Operation:
- One broker is controller (manages partition leaders)
- Controller broker has ephemeral node in ZooKeeper: /controller
Failure:
1. Controller broker crashes
2. ZooKeeper detects session timeout (6-10s)
3. ZooKeeper deletes /controller ephemeral node
4. All brokers watch /controller for changes
5. Brokers race to create /controller node
6. First to create becomes new controller
7. New controller loads cluster metadata from ZK
8. New controller sends LeaderAndIsr requests to brokers
9. Partition leaders updated, cluster operational
Total downtime: ~10-20s (partition leadership updates)
RPO: 0 (committed messages replicated to ISR)
Split-Brain Prevention:
- ZooKeeper ensures only one /controller node
- Controller Epoch incremented on each election
- Brokers reject requests with old epoch
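A hedged sketch of the "race to create an ephemeral node" pattern using the node-zookeeper-client package; the connection string and payload are illustrative, Kafka does this internally, and newer KRaft-based clusters drop ZooKeeper entirely.
// Sketch: controller-style election by racing to create an ephemeral znode.
const zookeeper = require("node-zookeeper-client");

const client = zookeeper.createClient("zk1:2181,zk2:2181,zk3:2181");
const brokerId = process.env.BROKER_ID || "broker-1";

function tryBecomeController() {
  client.create(
    "/controller",
    Buffer.from(JSON.stringify({ brokerId })),
    zookeeper.CreateMode.EPHEMERAL, // znode vanishes when our session dies
    (error) => {
      if (!error) {
        console.log(`${brokerId} won the race and is now the controller`);
        return;
      }
      // Lost the race: watch the node and try again once it disappears
      client.exists(
        "/controller",
        (event) => {
          if (event.getName() === "NODE_DELETED") tryBecomeController();
        },
        () => {}
      );
    }
  );
}

client.once("connected", tryBecomeController);
client.connect();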
When to Use Failover
Use Failover When:
High Availability Required
Scenario: E-commerce checkout service
Requirement: 99.95% uptime (~4.4 hours downtime/year)
Solution: Hot standby with automatic failover
Trade-off: Cost of redundant infrastructure
Data Loss Unacceptable
Scenario: Payment transaction database
Requirement: Zero data loss (RPO = 0)
Solution: Synchronous replication to hot standby
Trade-off: Higher write latency
RTO Measured in Seconds/Minutes
Scenario: Live video streaming control plane
Requirement: Failover < 30 seconds
Solution: Hot standby with fast health checks
Trade-off: False positives from aggressive timeouts
When NOT to Use Failover:
Stateless Services
Problem: Failover is overkill
Solution: Use load balancer with multiple active instances
Example: Stateless REST APIs, web servers
Benefit: Simpler, no failover orchestration needed
Eventually Consistent Systems
Problem: Failover adds complexity without benefit
Solution: Multi-master or leaderless replication
Example: Cassandra, DynamoDB (quorum writes)
Benefit: No single point of failure, continuous availability
Cost-Sensitive Non-Critical Systems
Problem: Hot standby doubles infrastructure cost
Solution: Use cold standby (restore from backup)
Example: Development databases, analytics pipelines
Benefit: Save money, accept longer downtime
Interview Application
Common Interview Question
Q: "Design a highly available database for a payment system. How would you handle primary database failure?"
Strong Answer:
"For a payment system where data loss is unacceptable, I'd design a hot standby failover system with these characteristics:
Architecture:
- Primary database with synchronous replication to hot standby
- Hot standby in different availability zone (same region for low latency)
- Health monitoring via Patroni or similar tool (10-second heartbeat)
- Distributed coordination using etcd or ZooKeeper for split-brain prevention
Failure Handling:
- Detection (10s): Patroni detects primary unresponsive after 2 missed heartbeats
- Fencing (2s): Revoke old primary's write access via network isolation
- Promotion (5s): Standby promoted to primary, acquires leadership lock in etcd
- Routing (5s): Update floating IP or DNS to point to new primary
- Reconnection (5s): Applications retry connections, resume transactions
Guarantees:
- RTO: ~27 seconds (acceptable for payment system)
- RPO: 0 seconds (synchronous replication means zero data loss)
- Consistency: Transactions committed to both primary and standby before ACK
Trade-offs:
- Latency: Synchronous replication adds ~5-10ms to write latency
- Cost: Hot standby doubles database infrastructure cost
- Complexity: Patroni/etcd adds operational complexity
Split-Brain Prevention:
- Use distributed lock in etcd (only one primary can hold lock)
- Primary must renew lease every 5 seconds or lose write access
- Generation numbers (epochs) to detect stale primaries
Real-World Example:
- Similar to AWS RDS Multi-AZ: synchronous replication, automatic failover
- Or PostgreSQL + Patroni (used by Zalando, widely adopted)"
Why This Answer Works:
- Identifies appropriate failover type (hot standby) for use case
- Explains step-by-step process with timing
- Discusses RTO/RPO trade-offs explicitly
- Addresses split-brain problem proactively
- References real implementations
Code Example
Implementing Simple Failover with Health Checks
// Health Monitor for Failover Detection
class HealthMonitor {
constructor(primaryUrl, standbyUrl, config = {}) {
this.primaryUrl = primaryUrl;
this.standbyUrl = standbyUrl;
this.config = {
heartbeatInterval: config.heartbeatInterval || 5000, // 5s
failureThreshold: config.failureThreshold || 3, // 3 misses
...config,
};
this.consecutiveFailures = 0;
this.currentPrimary = primaryUrl;
this.isFailedOver = false;
}
async start() {
setInterval(() => this.checkHealth(), this.config.heartbeatInterval);
}
async checkHealth() {
try {
const response = await fetch(`${this.currentPrimary}/health`, {
signal: AbortSignal.timeout(3000), // abort the health check after 3s
});
if (response.ok) {
// Primary is healthy
this.consecutiveFailures = 0;
console.log(
`[${new Date().toISOString()}] Primary healthy: ${this.currentPrimary}`
);
} else {
this.handleFailure();
}
} catch (error) {
// Network error, timeout, or server down
this.handleFailure();
}
}
handleFailure() {
this.consecutiveFailures++;
console.log(
`[${new Date().toISOString()}] Primary unhealthy (${this.consecutiveFailures}/${this.config.failureThreshold})`
);
if (this.consecutiveFailures >= this.config.failureThreshold) {
this.triggerFailover();
}
}
async triggerFailover() {
if (this.isFailedOver) {
console.log("Already failed over, skipping");
return;
}
console.log(`[${new Date().toISOString()}] TRIGGERING FAILOVER!`);
try {
// Step 1: Fence old primary (prevent split-brain)
await this.fenceOldPrimary();
// Step 2: Promote standby to primary
await this.promoteStandby();
// Step 3: Update routing
this.currentPrimary = this.standbyUrl;
this.isFailedOver = true;
this.consecutiveFailures = 0;
console.log(
`[${new Date().toISOString()}] Failover complete. New primary: ${this.currentPrimary}`
);
// Notify operators
await this.sendAlert("Failover completed", {
oldPrimary: this.primaryUrl,
newPrimary: this.standbyUrl,
});
} catch (error) {
console.error("Failover failed:", error);
await this.sendAlert("Failover FAILED", { error: error.message });
}
}
async fenceOldPrimary() {
// In production: disable old primary at network/firewall level
// Or send STONITH command to power management
console.log("Fencing old primary (preventing writes)...");
try {
await fetch(`${this.primaryUrl}/admin/disable`, {
method: "POST",
signal: AbortSignal.timeout(2000), // don't hang on a dead primary
});
} catch (error) {
// Old primary might be completely down, that's OK
console.log("Could not fence old primary (might be completely down)");
}
}
async promoteStandby() {
console.log("Promoting standby to primary...");
const response = await fetch(`${this.standbyUrl}/admin/promote`, {
method: "POST",
signal: AbortSignal.timeout(10000), // give it time to catch up replication
});
if (!response.ok) {
throw new Error("Failed to promote standby");
}
// Wait for promotion to complete
await this.waitForStandbyReady();
}
async waitForStandbyReady() {
const maxWait = 30000; // 30s max
const startTime = Date.now();
while (Date.now() - startTime < maxWait) {
try {
const response = await fetch(`${this.standbyUrl}/health`);
if (response.ok) {
const data = await response.json();
if (data.role === "primary") {
console.log("Standby successfully promoted to primary");
return;
}
}
} catch (error) {
// Still promoting, retry
}
await new Promise(resolve => setTimeout(resolve, 1000)); // Wait 1s
}
throw new Error("Standby promotion timeout");
}
async sendAlert(message, details) {
// In production: Send to PagerDuty, Slack, email, etc.
console.error(`ALERT: ${message}`, details);
}
}
// Usage
const monitor = new HealthMonitor(
"http://primary.db.example.com:5432",
"http://standby.db.example.com:5432",
{
heartbeatInterval: 5000, // Check every 5 seconds
failureThreshold: 3, // Failover after 3 consecutive failures
}
);
monitor.start();
// Expected timeline on failure:
// T0: Primary crashes
// T5: First health check fails (1/3)
// T10: Second health check fails (2/3)
// T15: Third health check fails (3/3) β Trigger failover
// T16: Fence old primary (1s)
// T21: Promote standby (5s)
// T22: Update routing
// Total downtime: ~22 seconds
Split-Brain Prevention with Distributed Lock
// Using etcd for distributed locking to prevent split-brain
const { Etcd3 } = require("etcd3");
class FailoverCoordinator {
constructor(etcdHosts, nodeId) {
this.etcd = new Etcd3({ hosts: etcdHosts });
this.nodeId = nodeId;
this.leaderKey = "/cluster/leader";
this.lease = null;
this.isLeader = false;
}
async tryBecomeLeader() {
try {
// Create a lease (TTL = 10 seconds)
this.lease = this.etcd.lease(10);
// Try to acquire the leader lock, attaching the lease ID so the key
// is deleted automatically if this node stops renewing the lease
const leaseId = await this.lease.grant();
const result = await this.etcd
.if(this.leaderKey, "Create", "==", 0) // Only if key doesn't exist
.then(
this.etcd.put(this.leaderKey).value(this.nodeId).lease(leaseId)
)
.commit();
if (result.succeeded) {
this.isLeader = true;
console.log(`[${this.nodeId}] Became leader!`);
// Keep renewing lease to maintain leadership
this.startLeaseRenewal();
return true;
} else {
// Another node holds the lock; release our unused lease and report the holder
await this.lease.revoke();
const currentLeader = await this.etcd.get(this.leaderKey).string();
console.log(
`[${this.nodeId}] Failed to become leader. Current leader: ${currentLeader}`
);
return false;
}
} catch (error) {
console.error(`[${this.nodeId}] Error acquiring leadership:`, error);
return false;
}
}
async startLeaseRenewal() {
// Renew lease every 5 seconds (TTL is 10s, so we have buffer)
this.renewalInterval = setInterval(async () => {
try {
await this.lease.keepaliveOnce();
console.log(`[${this.nodeId}] Lease renewed`);
} catch (error) {
console.error(
`[${this.nodeId}] Failed to renew lease, losing leadership`
);
this.isLeader = false;
clearInterval(this.renewalInterval);
// Try to become leader again
setTimeout(() => this.tryBecomeLeader(), 1000);
}
}, 5000);
}
async stepDown() {
if (this.lease) {
await this.lease.revoke();
clearInterval(this.renewalInterval);
}
this.isLeader = false;
console.log(`[${this.nodeId}] Stepped down from leadership`);
}
canWrite() {
// Only leader can write (prevents split-brain)
return this.isLeader;
}
}
// Usage on Primary and Standby nodes:
// Primary node
const primary = new FailoverCoordinator(["localhost:2379"], "node-primary");
await primary.tryBecomeLeader(); // Acquires lock
// Standby node
const standby = new FailoverCoordinator(["localhost:2379"], "node-standby");
await standby.tryBecomeLeader(); // Fails (primary holds lock)
// Simulate primary failure (lease expires after 10s without renewal)
// ... network partition or crash ...
// After 10s, standby can acquire lock
await standby.tryBecomeLeader(); // Succeeds! Becomes new leader
// If old primary comes back:
await primary.tryBecomeLeader(); // Fails! Standby is now leader
// Old primary cannot write without leadership lock ✓
Related Content
Prerequisites:
- Leader-Follower Replication - Understanding replication for standby
- Distributed Systems Basics - Foundation concepts
Related Concepts:
- Consensus - Leader election algorithms
- Quorum - Majority-based decisions for failover
- Health Checks - Failure detection mechanisms
Used In Systems:
- High-availability databases (PostgreSQL, MySQL, Redis)
- Distributed coordination (ZooKeeper, etcd)
- Message brokers (Kafka controller election)
Explained In Detail:
- Distributed Systems Deep Dive - Failover patterns in depth
Quick Self-Check
- Can you explain failover in 60 seconds?
- Do you understand the difference between hot, warm, and cold standby?
- Can you explain the RTO vs RPO trade-offs?
- Do you understand the split-brain problem and its prevention techniques?
- Do you know how failure detection works with heartbeats and thresholds?
- Can you design failover for a production database?