Automatic switching to a backup system or replica when the primary fails, ensuring service continuity with minimal downtime
- Appears in roughly 70% of HA design interviews
- Powers systems at Netflix, AWS, and Google
- Enables 99.99%+ uptime targets
- Central to RTO/RPO optimization
TL;DR
Failover is the process of automatically transferring operations from a failed primary system to a standby backup, minimizing downtime and ensuring service continuity. It's essential for high-availability systems, enabling recovery from hardware failures, software crashes, and network issues within seconds to minutes.
Visual Overview
NORMAL OPERATION (Leader-Follower)
  Clients ──▶ Primary/Leader
                  │
            [Replicating]
                  │
                  ▼
         Standby/Follower (Hot/Warm/Cold)

  Status: Primary is HEALTHY ✓
FAILURE DETECTION
  Heartbeat Monitor
         ↓
  Primary ✗ (No heartbeat for 10 seconds)
         ↓
  Failure Threshold Exceeded
         ↓
  TRIGGER FAILOVER!
FAILOVER PROCESS
  T0: Primary fails (crash, network partition)
  T1: Health checker detects failure (5-10s)
  T2: Standby promoted to new primary (2-5s)
  T3: Clients redirected to new primary (1-2s)
  T4: Service restored

  Total downtime: 8-17 seconds

  New topology:
  Clients ──▶ New Primary (old Standby) ✓
  Old Primary (offline) ✗
SPLIT-BRAIN PROBLEM (What Can Go Wrong)
  Network Partition:

  Region A              ║   Region B
  Primary (still alive) ║   Standby (promoted to Primary)
       ↑                ║        ↑
  Clients A             ║   Clients B
  write X=1             ║   write X=2

  Result: TWO PRIMARIES = DATA DIVERGENCE ✗

  Solution: Fencing (kill old primary with authority)
Core Explanation
What is Failover?
Failover is the automatic or manual process of switching from a failed primary system to a standby backup system to maintain service availability. It's a key component of high-availability (HA) architectures.
Key Components:
- Primary/Leader: Active system handling requests
- Standby/Follower: Backup system ready to take over
- Health Monitor: Detects primary failures via heartbeats
- Failover Orchestrator: Promotes standby to primary
- Fencing Mechanism: Prevents split-brain scenarios
Standby Types
1. Hot Standby (Active-Passive)
Setup:
Primary: Handles all traffic
Standby: Fully synced, ready to serve immediately
Replication: Continuous, synchronous or near-sync
Failover Time: Seconds (fastest)
Cost: High (standby hardware running idle)
Example:
Primary DB: PostgreSQL with streaming replication
Standby DB: Replica in sync, can become primary instantly
Use Case: Financial systems, databases (99.99% uptime)
2. Warm Standby
Setup:
Primary: Handles all traffic
Standby: Partially synced, needs brief preparation
Replication: Periodic or continuous async
Failover Time: Minutes
Cost: Medium (standby runs but lighter workload)
Example:
Primary: Application server with session state
Standby: Server running but not in load balancer
Use Case: Web applications, moderate SLA (99.9% uptime)
3. Cold Standby
Setup:
Primary: Handles all traffic
Standby: Offline, restored from backup
Replication: Periodic backups only
Failover Time: Hours (restore from backup + configure)
Cost: Low (standby hardware can be repurposed)
Example:
Primary: Production database
Standby: Backup snapshots on S3, restore to new instance
Use Case: Development systems, cost-sensitive applications
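To make the choice concrete, here is a tiny sketch that maps an RTO target to the cheapest standby tier that can still meet it; the failover-time figures are the illustrative ranges from above, not fixed constants.
// Pick the cheapest standby tier whose typical failover time fits the RTO.
// Numbers are illustrative, matching the ranges described above.
const STANDBY_TIERS = [
  { tier: "cold", typicalFailoverSeconds: 4 * 3600, relativeCost: "low" },
  { tier: "warm", typicalFailoverSeconds: 10 * 60, relativeCost: "medium" },
  { tier: "hot", typicalFailoverSeconds: 30, relativeCost: "high" },
];

function chooseStandbyTier(rtoSeconds) {
  // Tiers are ordered cheapest-first; take the first one that meets the RTO
  return STANDBY_TIERS.find((t) => t.typicalFailoverSeconds <= rtoSeconds) ?? null;
}

console.log(chooseStandbyTier(24 * 3600)?.tier); // "cold"  (RTO measured in hours)
console.log(chooseStandbyTier(15 * 60)?.tier);   // "warm"  (RTO measured in minutes)
console.log(chooseStandbyTier(60)?.tier);        // "hot"   (RTO measured in seconds)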
Failure Detection
Heartbeat Monitoring:
HEARTBEAT PROTOCOL:
  Every 1-5 seconds:

    Primary ──"I'm alive"──▶ Health Monitor

  If no heartbeat for N seconds:
    - N = 10s: Aggressive (false positives)
    - N = 30s: Conservative (slower failover)
    - N = 10s with 3 retries: Balanced ✓
What to Monitor:
✓ Process is running (liveness probe)
✓ Service is responsive (readiness probe)
✓ Database connections work
✓ Disk space available
✓ CPU/Memory not exhausted
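A minimal health endpoint exposing checks like these might look as follows; this is a sketch using only Node's built-in modules, and the individual checks (plus the role field that the monitor code later on this page reads) are illustrative.
// Sketch: /health endpoint combining liveness-style checks (Node built-ins only).
const http = require("node:http");
const { constants } = require("node:fs");
const fsp = require("node:fs/promises");

let currentRole = "primary"; // in practice this comes from the replication layer

async function runChecks() {
  const checks = {
    processAlive: true, // if we can respond at all, the process is up
    diskWritable: await fsp
      .access("/tmp", constants.W_OK)
      .then(() => true)
      .catch(() => false),
    memoryOk: process.memoryUsage().heapUsed < 1.5 * 1024 ** 3, // heap below ~1.5 GB
    // databaseOk: await db.query("SELECT 1")  // omitted in this sketch
  };
  return { healthy: Object.values(checks).every(Boolean), role: currentRole, checks };
}

http
  .createServer(async (req, res) => {
    if (req.url === "/health") {
      const status = await runChecks();
      res.writeHead(status.healthy ? 200 : 503, { "Content-Type": "application/json" });
      res.end(JSON.stringify(status));
    } else {
      res.writeHead(404);
      res.end();
    }
  })
  .listen(8080);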
False Positive Causes:
- Temporary network glitch
- GC pause (Java stop-the-world)
- High CPU causing timeout
- Switch/router failure (not server failure)
Mitigation:
- Multiple independent monitors
- Quorum-based decision (3/5 monitors agree)
- Grace period + retries
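The quorum rule itself is small enough to show inline; a sketch (the monitor verdicts are illustrative booleans):
// Only declare the primary dead when a strict majority of independent
// monitors agree, which filters out single-monitor false positives.
function shouldDeclareDown(monitorVerdicts) {
  // e.g. [true, true, false, true, true] where true = "primary looks down"
  const votesDown = monitorVerdicts.filter(Boolean).length;
  const quorum = Math.floor(monitorVerdicts.length / 2) + 1; // strict majority
  return votesDown >= quorum;
}

console.log(shouldDeclareDown([true, true, false]));               // true  (2/3)
console.log(shouldDeclareDown([true, false, false, false, true])); // false (2/5)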
Failover Process Steps
Step-by-Step Breakdown:
1. FAILURE DETECTION (5-30s)
- Health monitor misses N heartbeats
- Multiple monitors reach consensus
- Declare primary unhealthy
2. FENCING (1-5s)
- Prevent split-brain
- Kill old primary (STONITH: Shoot The Other Node In The Head)
- Revoke old primary's network access
- Acquire distributed lock/lease
3. PROMOTION (2-10s)
- Standby catches up replication lag (if any)
- Standby promoted to primary role
- Update metadata (e.g., Kafka controller registry)
4. DNS/ROUTING UPDATE (1-60s)
- Update DNS to point to new primary (see the DNS sketch after this list)
- Or update load balancer configuration
- Or use floating IP (instant failover)
5. CLIENT RECONNECTION (0-30s)
- Clients detect connection failure
- Retry with backoff
- Discover new primary endpoint
- Resume operations
Total Downtime: ~9-135 seconds (depends on config)
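For the routing update in step 4, one common option is a DNS upsert with a short TTL. A hedged sketch using the AWS SDK v3 Route 53 client follows; the hosted zone ID, record name, and IP are placeholders, and clients may still cache the old record for up to the TTL.
// Sketch: repoint the database DNS record at the newly promoted primary.
// Uses @aws-sdk/client-route-53; zone ID, record name, and IP are placeholders.
const {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} = require("@aws-sdk/client-route-53");

async function pointDnsAtNewPrimary(newPrimaryIp) {
  const route53 = new Route53Client({});
  await route53.send(
    new ChangeResourceRecordSetsCommand({
      HostedZoneId: "Z0EXAMPLE1234", // placeholder hosted zone
      ChangeBatch: {
        Comment: "Failover: repoint db endpoint at new primary",
        Changes: [
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: "db.example.com",
              Type: "A",
              TTL: 30, // keep the TTL short so clients pick up the change quickly
              ResourceRecords: [{ Value: newPrimaryIp }],
            },
          },
        ],
      },
    })
  );
}

// pointDnsAtNewPrimary("10.0.2.15");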
RTO vs RPO
Recovery Time Objective (RTO):
RTO = Maximum acceptable downtime
Examples:
- E-commerce site: RTO = 5 minutes (lose sales)
- Payment processing: RTO = 30 seconds (critical)
- Analytics dashboard: RTO = 30 minutes (non-critical)
Factors affecting RTO:
- Standby type (hot < warm < cold)
- Failure detection time
- Promotion process complexity
- Client reconnection speed
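Because the steps run sequentially, a quick budget of their worst cases is a useful sanity check against the RTO target; a small sketch using the illustrative numbers from the process above:
// Worst-case RTO is roughly the sum of the worst-case step durations.
const failoverBudgetSeconds = {
  detection: 30,       // heartbeat interval x retry threshold
  fencing: 5,          // isolate or power off the old primary
  promotion: 10,       // standby catches up and switches role
  routingUpdate: 60,   // DNS propagation (near 0 with a floating IP)
  clientReconnect: 30, // retries with backoff until the new endpoint is found
};

const worstCaseRto = Object.values(failoverBudgetSeconds).reduce((a, b) => a + b, 0);
console.log(`Worst-case RTO: ~${worstCaseRto}s`); // ~135s with these numbers

// To hit a 30s RTO you would need a hot standby, aggressive detection,
// and a floating IP instead of DNS propagation.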
Recovery Point Objective (RPO):
RPO = Maximum acceptable data loss
Examples:
- Bank transactions: RPO = 0 (no loss acceptable)
- User comments: RPO = 5 minutes (acceptable)
- Log aggregation: RPO = 1 hour (can lose some logs)
Factors affecting RPO:
- Replication strategy (sync vs async)
- Commit acknowledgment (quorum required?)
- Checkpoint frequency
- Backup frequency (for cold standby)
TRADEOFF:
Low RPO (sync replication) = Higher latency
High RPO (async replication) = Lower latency, more data loss
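A back-of-the-envelope sketch of that trade-off: under async replication the RPO is roughly the replication lag, so potential data loss scales with write throughput (the numbers are illustrative).
// Rough RPO estimate: writes inside the replication-lag window are at risk.
function estimatePotentialLoss(replicationLagSeconds, writesPerSecond) {
  return replicationLagSeconds * writesPerSecond; // writes that may be lost on failover
}

console.log(estimatePotentialLoss(5, 200)); // async, 5s lag, 200 writes/s → up to 1000 writes
console.log(estimatePotentialLoss(0, 200)); // sync replication → RPO = 0,
                                            // but every commit waits on the standby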
Split-Brain Problem
What is Split-Brain?
Network partition causes both nodes to think they're primary:
Before partition:
Primary A (serving traffic) ←─ Health Monitor ─→ Standby B
After partition:
Primary A (still serving)   ║   Standby B (promoted, now serving)
        ↑                   ║           ↑
Clients in Region A         ║   Clients in Region B
write X=1                   ║   write X=2
PROBLEM: Divergent data, corruption when partition heals
Prevention Techniques:
1. Fencing (STONITH)
Before promoting standby, KILL old primary
Methods:
- Power off via IPMI/iLO
- Network isolation (block at switch)
- Kernel panic trigger
- Forceful process termination
Use case: Shared storage systems (SAN)
2. Distributed Locks / Leases
Acquire lock before becoming primary
Example (etcd):
1. Standby tries to acquire lease: /leader
2. Only succeeds if old primary's lease expired
3. Old primary cannot write without valid lease
4. Result: At most one primary at a time ✓
Use case: Kafka controller election
3. Quorum / Witness Node
Use majority voting to determine primary
Setup: 3 nodes (A, B, C)
- A is primary (has 2/3 votes)
- Network partition: A isolated, B+C can see each other
- B+C have majority (2/3), can elect new primary
- A has minority (1/3), cannot remain primary
- Result: Only B or C can be new primary
Use case: Distributed databases (Cassandra, Riak)
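The majority test behind this setup, as a small sketch:
// A node may act as (or become) primary only if it can see a strict
// majority of the cluster, counting itself.
function hasQuorum(reachablePeers, clusterSize) {
  const visibleNodes = reachablePeers + 1; // include this node
  return visibleNodes > clusterSize / 2;
}

// 3-node cluster (A, B, C); a partition isolates A:
console.log(hasQuorum(0, 3)); // A sees only itself     → false, must step down
console.log(hasQuorum(1, 3)); // B sees C (plus itself) → true, B+C may elect a primary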
4. Generation Number / Epoch
Increment version number on each failover
Primary A has epoch=5
After failover, Primary B has epoch=6
If A comes back, it sees epoch=5 < 6
A demotes itself and syncs from B
Use case: Kafka, ZooKeeper
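A sketch of epoch fencing on the write path; the in-memory store and the way epochs are announced are illustrative stand-ins for a real replication protocol.
// Epoch (generation number) fencing: writes carry the sender's epoch,
// and anything older than the highest epoch seen so far is rejected.
class EpochFencedStore {
  constructor() {
    this.currentEpoch = 0;
    this.data = new Map();
  }

  observeNewEpoch(epoch) {
    // Announced by the failover orchestrator when a new primary is elected
    this.currentEpoch = Math.max(this.currentEpoch, epoch);
  }

  write(key, value, senderEpoch) {
    if (senderEpoch < this.currentEpoch) {
      // A demoted primary that missed the failover is still trying to write
      throw new Error(`stale epoch ${senderEpoch} < ${this.currentEpoch}, write rejected`);
    }
    this.currentEpoch = senderEpoch;
    this.data.set(key, value);
  }
}

const store = new EpochFencedStore();
store.write("x", 1, 5);     // primary A (epoch 5) → accepted
store.observeNewEpoch(6);   // failover: B becomes primary with epoch 6
store.write("x", 2, 6);     // primary B (epoch 6) → accepted
try {
  store.write("x", 3, 5);   // A comes back with epoch 5 → rejected
} catch (err) {
  console.log(err.message);
}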
Real Systems Using Failover
| System | Failover Type | Detection Time | RTO | RPO | Split-Brain Prevention |
|---|---|---|---|---|---|
| PostgreSQL | Hot Standby (streaming replication) | 10-30s | ~30s | 0-5s | Witness server, fencing |
| MySQL | Semi-sync replication | 10-30s | ~1min | 0 (sync ack) | GTID, virtual IP |
| Redis Sentinel | Hot Standby (Sentinel monitors) | 5-15s | ~10s | 0-1s | Quorum (majority of Sentinels) |
| Kafka Controller | Hot Standby (ZooKeeper election) | 10-30s | ~20s | 0 (committed log) | ZooKeeper leader election, epoch |
| AWS RDS | Multi-AZ (automated failover) | 30-60s | 60-120s | 0 | AWS orchestration |
| Cassandra | Leaderless (no failover needed) | N/A | N/A | N/A | Quorum reads/writes |
Case Study: PostgreSQL Failover
PostgreSQL Streaming Replication + Patroni
Architecture:
  Primary (Leader)
     │  (continuous WAL streaming)
     ├──▶ Standby 1 (sync replica, zero lag)
     └──▶ Standby 2 (async replica, small lag)

  Health Monitor: Patroni (uses etcd for DCS)
Failure Scenario:
1. Primary crashes (hardware failure)
2. Patroni on each node detects (10s heartbeat miss)
3. Standbys try to acquire leadership lock in etcd
4. Standby 1 (most up-to-date) acquires lock
5. Standby 1 runs pg_ctl promote
6. Standby 1 becomes new primary (~5s)
7. Patroni updates DNS or floating IP (2s)
8. Applications reconnect to new primary (~5s)
Total downtime: ~22 seconds
RPO: 0 (sync replication to Standby 1)
Configuration:
synchronous_standby_names = 'standby1'  # commits wait for this standby to confirm
synchronous_commit = on                 # do not ACK commits until the sync standby has flushed the WAL
wal_level = replica                     # WAL carries enough information for streaming replication
max_wal_senders = 5                     # allow up to 5 concurrent replication connections
Result: Zero data loss, ~20s downtime
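Patroni makes its promotion decision from replication state; a hedged sketch of inspecting that state yourself with the pg client and the pg_stat_replication view (connection settings are placeholders):
// Sketch: check which standby is synchronous and how far replicas lag,
// using the `pg` client. Connection settings are placeholders.
const { Client } = require("pg");

async function showReplicationState() {
  const client = new Client({ host: "primary.db.example.com", user: "postgres" });
  await client.connect();
  const { rows } = await client.query(`
    SELECT application_name, state, sync_state, replay_lag
    FROM pg_stat_replication
  `);
  for (const r of rows) {
    // sync_state = 'sync' marks the standby whose confirmation commits wait for (RPO = 0)
    console.log(`${r.application_name}: state=${r.state} sync=${r.sync_state} lag=${r.replay_lag}`);
  }
  await client.end();
}

showReplicationState().catch(console.error);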
Case Study: Kafka Controller Failover
Kafka Controller Election (via ZooKeeper)
Normal Operation:
- One broker is controller (manages partition leaders)
- Controller broker has ephemeral node in ZooKeeper: /controller
Failure:
1. Controller broker crashes
2. ZooKeeper detects session timeout (6-10s)
3. ZooKeeper deletes /controller ephemeral node
4. All brokers watch /controller for changes
5. Brokers race to create /controller node
6. First to create becomes new controller
7. New controller loads cluster metadata from ZK
8. New controller sends LeaderAndIsr requests to brokers
9. Partition leaders updated, cluster operational
Total downtime: ~10-20s (partition leadership updates)
RPO: 0 (committed messages replicated to ISR)
Split-Brain Prevention:
- ZooKeeper ensures only one /controller node
- Controller Epoch incremented on each election
- Brokers reject requests with old epoch
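A hedged sketch of the "race to create an ephemeral node" pattern using the node-zookeeper-client package; the connection string and payload are illustrative, Kafka does this internally, and newer KRaft-based clusters drop ZooKeeper entirely.
// Sketch: controller-style election by racing to create an ephemeral znode.
const zookeeper = require("node-zookeeper-client");

const client = zookeeper.createClient("zk1:2181,zk2:2181,zk3:2181");
const brokerId = process.env.BROKER_ID || "broker-1";

function tryBecomeController() {
  client.create(
    "/controller",
    Buffer.from(JSON.stringify({ brokerId })),
    zookeeper.CreateMode.EPHEMERAL, // znode vanishes when our session dies
    (error) => {
      if (!error) {
        console.log(`${brokerId} won the race and is now the controller`);
        return;
      }
      // Lost the race: watch the node and try again once it disappears
      client.exists(
        "/controller",
        (event) => {
          if (event.getName() === "NODE_DELETED") tryBecomeController();
        },
        () => {}
      );
    }
  );
}

client.once("connected", tryBecomeController);
client.connect();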
When to Use Failover
Use Failover When:
High Availability Required
Scenario: E-commerce checkout service
Requirement: 99.95% uptime (~4.4 hours downtime/year)
Solution: Hot standby with automatic failover
Trade-off: Cost of redundant infrastructure
Data Loss Unacceptable
Scenario: Payment transaction database
Requirement: Zero data loss (RPO = 0)
Solution: Synchronous replication to hot standby
Trade-off: Higher write latency
RTO Measured in Seconds/Minutes
Scenario: Live video streaming control plane
Requirement: Failover < 30 seconds
Solution: Hot standby with fast health checks
Trade-off: False positives from aggressive timeouts
When NOT to Use Failover:
Stateless Services
Problem: Failover is overkill
Solution: Use load balancer with multiple active instances
Example: Stateless REST APIs, web servers
Benefit: Simpler, no failover orchestration needed
Eventually Consistent Systems
Problem: Failover adds complexity without benefit
Solution: Multi-master or leaderless replication
Example: Cassandra, DynamoDB (quorum writes)
Benefit: No single point of failure, continuous availability
Cost-Sensitive Non-Critical Systems
Problem: Hot standby doubles infrastructure cost
Solution: Use cold standby (restore from backup)
Example: Development databases, analytics pipelines
Benefit: Save money, accept longer downtime
Interview Application
Common Interview Question
Q: "Design a highly available database for a payment system. How would you handle primary database failure?"
Strong Answer:
"For a payment system where data loss is unacceptable, I'd design a hot standby failover system with these characteristics:
Architecture:
- Primary database with synchronous replication to hot standby
- Hot standby in different availability zone (same region for low latency)
- Health monitoring via Patroni or similar tool (10-second heartbeat)
- Distributed coordination using etcd or ZooKeeper for split-brain prevention
Failure Handling:
- Detection (10s): Patroni detects primary unresponsive after 2 missed heartbeats
- Fencing (2s): Revoke old primary's write access via network isolation
- Promotion (5s): Standby promoted to primary, acquires leadership lock in etcd
- Routing (5s): Update floating IP or DNS to point to new primary
- Reconnection (5s): Applications retry connections, resume transactions
Guarantees:
- RTO: ~27 seconds (acceptable for payment system)
- RPO: 0 seconds (synchronous replication means zero data loss)
- Consistency: Transactions committed to both primary and standby before ACK
Trade-offs:
- Latency: Synchronous replication adds ~5-10ms to write latency
- Cost: Hot standby doubles database infrastructure cost
- Complexity: Patroni/etcd adds operational complexity
Split-Brain Prevention:
- Use distributed lock in etcd (only one primary can hold lock)
- Primary must renew lease every 5 seconds or lose write access
- Generation numbers (epochs) to detect stale primaries
Real-World Example:
- Similar to AWS RDS Multi-AZ: synchronous replication, automatic failover
- Or PostgreSQL + Patroni (used by Zalando, widely adopted)"
Why This Answer Works:
- Identifies appropriate failover type (hot standby) for use case
- Explains step-by-step process with timing
- Discusses RTO/RPO trade-offs explicitly
- Addresses split-brain problem proactively
- References real implementations
Code Example
Implementing Simple Failover with Health Checks
// Health Monitor for Failover Detection
class HealthMonitor {
constructor(primaryUrl, standbyUrl, config = {}) {
this.primaryUrl = primaryUrl;
this.standbyUrl = standbyUrl;
this.config = {
heartbeatInterval: config.heartbeatInterval || 5000, // 5s
failureThreshold: config.failureThreshold || 3, // 3 misses
...config,
};
this.consecutiveFailures = 0;
this.currentPrimary = primaryUrl;
this.isFailedOver = false;
}
async start() {
setInterval(() => this.checkHealth(), this.config.heartbeatInterval);
}
async checkHealth() {
try {
const response = await fetch(`${this.currentPrimary}/health`, {
signal: AbortSignal.timeout(3000), // abort the health check after 3s
});
if (response.ok) {
// Primary is healthy
this.consecutiveFailures = 0;
console.log(
`[${new Date().toISOString()}] Primary healthy: ${this.currentPrimary}`
);
} else {
this.handleFailure();
}
} catch (error) {
// Network error, timeout, or server down
this.handleFailure();
}
}
handleFailure() {
this.consecutiveFailures++;
console.log(
`[${new Date().toISOString()}] Primary unhealthy (${this.consecutiveFailures}/${this.config.failureThreshold})`
);
if (this.consecutiveFailures >= this.config.failureThreshold) {
this.triggerFailover();
}
}
async triggerFailover() {
if (this.isFailedOver) {
console.log("Already failed over, skipping");
return;
}
console.log(`[${new Date().toISOString()}] TRIGGERING FAILOVER!`);
try {
// Step 1: Fence old primary (prevent split-brain)
await this.fenceOldPrimary();
// Step 2: Promote standby to primary
await this.promoteStandby();
// Step 3: Update routing
this.currentPrimary = this.standbyUrl;
this.isFailedOver = true;
this.consecutiveFailures = 0;
console.log(
`[${new Date().toISOString()}] Failover complete. New primary: ${this.currentPrimary}`
);
// Notify operators
await this.sendAlert("Failover completed", {
oldPrimary: this.primaryUrl,
newPrimary: this.standbyUrl,
});
} catch (error) {
console.error("Failover failed:", error);
await this.sendAlert("Failover FAILED", { error: error.message });
}
}
async fenceOldPrimary() {
// In production: disable old primary at network/firewall level
// Or send STONITH command to power management
console.log("Fencing old primary (preventing writes)...");
try {
await fetch(`${this.primaryUrl}/admin/disable`, {
method: "POST",
signal: AbortSignal.timeout(2000), // don't hang on a dead primary
});
} catch (error) {
// Old primary might be completely down, that's OK
console.log("Could not fence old primary (might be completely down)");
}
}
async promoteStandby() {
console.log("Promoting standby to primary...");
const response = await fetch(`${this.standbyUrl}/admin/promote`, {
method: "POST",
signal: AbortSignal.timeout(10000), // give it time to catch up replication
});
if (!response.ok) {
throw new Error("Failed to promote standby");
}
// Wait for promotion to complete
await this.waitForStandbyReady();
}
async waitForStandbyReady() {
const maxWait = 30000; // 30s max
const startTime = Date.now();
while (Date.now() - startTime < maxWait) {
try {
const response = await fetch(`${this.standbyUrl}/health`);
if (response.ok) {
const data = await response.json();
if (data.role === "primary") {
console.log("Standby successfully promoted to primary");
return;
}
}
} catch (error) {
// Still promoting, retry
}
await new Promise(resolve => setTimeout(resolve, 1000)); // Wait 1s
}
throw new Error("Standby promotion timeout");
}
async sendAlert(message, details) {
// In production: Send to PagerDuty, Slack, email, etc.
console.error(`ALERT: ${message}`, details);
}
}
// Usage
const monitor = new HealthMonitor(
"http://primary.db.example.com:5432",
"http://standby.db.example.com:5432",
{
heartbeatInterval: 5000, // Check every 5 seconds
failureThreshold: 3, // Failover after 3 consecutive failures
}
);
monitor.start();
// Expected timeline on failure:
// T0: Primary crashes
// T5: First health check fails (1/3)
// T10: Second health check fails (2/3)
// T15: Third health check fails (3/3) β Trigger failover
// T16: Fence old primary (1s)
// T21: Promote standby (5s)
// T22: Update routing
// Total downtime: ~22 seconds
Split-Brain Prevention with Distributed Lock
// Using etcd for distributed locking to prevent split-brain
const { Etcd3 } = require("etcd3");
class FailoverCoordinator {
constructor(etcdHosts, nodeId) {
this.etcd = new Etcd3({ hosts: etcdHosts });
this.nodeId = nodeId;
this.leaderKey = "/cluster/leader";
this.lease = null;
this.isLeader = false;
}
async tryBecomeLeader() {
try {
// Create a lease (TTL = 10 seconds)
this.lease = this.etcd.lease(10);
// Try to acquire the leader lock, attaching the lease ID so the key
// is deleted automatically if this node stops renewing the lease
const leaseId = await this.lease.grant();
const result = await this.etcd
.if(this.leaderKey, "Create", "==", 0) // Only if key doesn't exist
.then(
this.etcd.put(this.leaderKey).value(this.nodeId).lease(leaseId)
)
.commit();
if (result.succeeded) {
this.isLeader = true;
console.log(`[${this.nodeId}] Became leader!`);
// Keep renewing lease to maintain leadership
this.startLeaseRenewal();
return true;
} else {
// Another node holds the lock; release our unused lease and report the holder
await this.lease.revoke();
const currentLeader = await this.etcd.get(this.leaderKey).string();
console.log(
`[${this.nodeId}] Failed to become leader. Current leader: ${currentLeader}`
);
return false;
}
} catch (error) {
console.error(`[${this.nodeId}] Error acquiring leadership:`, error);
return false;
}
}
async startLeaseRenewal() {
// Renew lease every 5 seconds (TTL is 10s, so we have buffer)
this.renewalInterval = setInterval(async () => {
try {
await this.lease.keepaliveOnce();
console.log(`[${this.nodeId}] Lease renewed`);
} catch (error) {
console.error(
`[${this.nodeId}] Failed to renew lease, losing leadership`
);
this.isLeader = false;
clearInterval(this.renewalInterval);
// Try to become leader again
setTimeout(() => this.tryBecomeLeader(), 1000);
}
}, 5000);
}
async stepDown() {
if (this.lease) {
await this.lease.revoke();
clearInterval(this.renewalInterval);
}
this.isLeader = false;
console.log(`[${this.nodeId}] Stepped down from leadership`);
}
canWrite() {
// Only leader can write (prevents split-brain)
return this.isLeader;
}
}
// Usage on Primary and Standby nodes:
// Primary node
const primary = new FailoverCoordinator(["localhost:2379"], "node-primary");
await primary.tryBecomeLeader(); // Acquires lock
// Standby node
const standby = new FailoverCoordinator(["localhost:2379"], "node-standby");
await standby.tryBecomeLeader(); // Fails (primary holds lock)
// Simulate primary failure (lease expires after 10s without renewal)
// ... network partition or crash ...
// After 10s, standby can acquire lock
await standby.tryBecomeLeader(); // Succeeds! Becomes new leader
// If old primary comes back:
await primary.tryBecomeLeader(); // Fails! Standby is now leader
// Old primary cannot write without leadership lock ✓
Related Content
Prerequisites:
- Leader-Follower Replication - Understanding replication for standby
- Distributed Systems Basics - Foundation concepts
Related Concepts:
- Consensus - Leader election algorithms
- Quorum - Majority-based decisions for failover
- Health Checks - Failure detection mechanisms
Used In Systems:
- High-availability databases (PostgreSQL, MySQL, Redis)
- Distributed coordination (ZooKeeper, etcd)
- Message brokers (Kafka controller election)
Explained In Detail:
- Distributed Systems Deep Dive - Failover patterns in depth
Quick Self-Check
- Can you explain failover in 60 seconds?
- Do you understand the difference between hot, warm, and cold standby?
- Can you explain the RTO vs RPO trade-offs?
- Do you understand the split-brain problem and its prevention techniques?
- Do you know how failure detection works with heartbeats and thresholds?
- Can you design failover for a production database?