Health Checks

Failure detection mechanisms in distributed systems: how to determine if a node is alive, dead, or just slow, enabling automatic failover and self-healing systems

TL;DR

Health checks are probes that determine if a service is alive and functioning correctly. They enable load balancers to route traffic away from failing nodes, orchestrators to restart unhealthy containers, and distributed systems to trigger failover. The key trade-off: faster detection means more false positives.

Why Health Checks Matter

The Fundamental Problem

How do you know if a remote node is dead or just slow?

| Scenario | Network Response | Reality |
|---|---|---|
| Node crashed | Timeout | Dead |
| Node overloaded | Timeout | Alive but struggling |
| Network partition | Timeout | Alive but unreachable |
| GC pause | Timeout, then responds | Alive |

All of these look the same to the caller: no response within the timeout.
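
A minimal sketch of what the caller actually observes, using the requests library with an illustrative endpoint and timeout:

# A probe only ever sees "responded in time" or "did not respond in time";
# it cannot tell a crash from overload, a partition, or a GC pause.
import requests

def probe(url: str, timeout_s: float = 2.0) -> bool:
    try:
        resp = requests.get(url, timeout=timeout_s)
        return resp.status_code == 200
    except requests.exceptions.RequestException:
        # Timeout, connection refused, DNS failure: all collapse into "unhealthy".
        return False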

Impact of Getting It Wrong

| Detection | False Positive | False Negative |
|---|---|---|
| Too aggressive | Healthy nodes marked dead, cascading restarts | - |
| Too conservative | - | Dead nodes continue receiving traffic |

Health Check Patterns

1. HTTP Health Endpoints

# Simple liveness check: the process is up and able to answer requests
from flask import Flask

app = Flask(__name__)

@app.route('/health/live')
def liveness():
    return {'status': 'alive'}, 200

# Readiness check with dependencies: verify critical backends before accepting
# traffic (db and cache stand in for the service's actual client objects)
@app.route('/health/ready')
def readiness():
    if not db.is_connected():
        return {'status': 'not ready', 'reason': 'DB unavailable'}, 503
    if not cache.is_connected():
        return {'status': 'not ready', 'reason': 'Cache unavailable'}, 503
    return {'status': 'ready'}, 200

2. TCP Health Checks

For non-HTTP services:

  • Open TCP connection to port
  • Success = port responding
  • Used by: AWS ELB, HAProxy
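
A TCP check can be sketched with the standard library alone; the host, port, and timeout below are illustrative:

# Minimal TCP health check: "will the port accept a connection?"
import socket

def tcp_check(host: str, port: int, timeout_s: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True   # connection accepted
    except OSError:
        return False      # refused, timed out, or unreachable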

3. gRPC Health Checking Protocol

service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
  rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
  }
  ServingStatus status = 1;
}
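
Using the Python bindings (the grpcio-health-checking package provides generated stubs for this protocol), a client-side check looks roughly like the sketch below; the target address is illustrative:

# Query a server that implements the gRPC Health Checking Protocol
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

def grpc_health_check(target: str, service: str = "") -> bool:
    with grpc.insecure_channel(target) as channel:
        stub = health_pb2_grpc.HealthStub(channel)
        response = stub.Check(
            health_pb2.HealthCheckRequest(service=service), timeout=3.0)
        return response.status == health_pb2.HealthCheckResponse.SERVING

An empty service name asks about the server as a whole; a named service reports its own status.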

Heartbeat Protocols

Push-Based (Heartbeats)

Node periodically sends “I’m alive” to monitor.

Pros: Lower monitor load
Cons: Dead node = silence (must distinguish from network issues)
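
A minimal sketch of the monitor side, assuming nodes report in every few seconds; the interval and miss threshold are illustrative:

# Push-based detection: nodes report in, the monitor notices silence
import time

HEARTBEAT_INTERVAL_S = 5
MISSED_BEFORE_DEAD = 3

last_seen: dict[str, float] = {}   # node_id -> time of last heartbeat

def record_heartbeat(node_id: str) -> None:
    last_seen[node_id] = time.monotonic()

def suspected_dead() -> list[str]:
    deadline = HEARTBEAT_INTERVAL_S * MISSED_BEFORE_DEAD
    now = time.monotonic()
    return [node for node, ts in last_seen.items() if now - ts > deadline]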

Pull-Based (Polling)

Monitor periodically checks each node.

Pros: Centralized view
Cons: Monitor overload at scale
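
The monitor side is a polling loop; the node list, endpoint, and interval are illustrative:

# Pull-based detection: a central monitor polls every node it knows about
import time
import requests

NODES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]

def poll_all() -> dict[str, bool]:
    results = {}
    for node in NODES:
        try:
            r = requests.get(f"{node}/health/ready", timeout=3.0)
            results[node] = (r.status_code == 200)
        except requests.exceptions.RequestException:
            results[node] = False
    return results

while True:
    print(poll_all())   # O(N) checks per interval: monitor load grows with N
    time.sleep(5)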

Gossip-Based

Nodes share health info peer-to-peer.

Pros: Scalable, no single point of failure
Cons: Eventually consistent detection
Used by: Cassandra, Consul

Phi Accrual Failure Detector

Instead of a binary alive/dead decision, compute a suspicion level from the observed heartbeat history:

φ = -log10( P_later(t_now - t_last) )

where P_later(Δ) is the probability, given past inter-arrival times, that a heartbeat would take longer than Δ to arrive. The longer the silence relative to what is normal, the higher φ climbs:

φ = 1  →  10% chance the delay is normal (node may still be alive)
φ = 2  →  1% chance
φ = 8  →  0.000001% chance

Threshold: Mark dead when φ > 8

Advantage: Adapts to network conditions automatically.
Used by: Cassandra, Akka
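
A simplified sketch of the idea, assuming heartbeat inter-arrival times are roughly normally distributed (real implementations such as Cassandra's use a windowed, more robust estimate):

# Phi accrual (simplified): estimate the inter-arrival distribution, then turn
# "how late is the current heartbeat?" into a suspicion level phi.
import math
import time
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window: int = 100, threshold: float = 8.0):
        self.intervals = deque(maxlen=window)   # recent inter-arrival times
        self.last_heartbeat = None
        self.threshold = threshold

    def heartbeat(self) -> None:
        now = time.monotonic()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self) -> float:
        if self.last_heartbeat is None or len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)
        elapsed = time.monotonic() - self.last_heartbeat
        # P_later: probability of a gap at least this long if the node is alive
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))

    def is_available(self) -> bool:
        return self.phi() < self.threshold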

Kubernetes Health Probes

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10    # Wait after start
      periodSeconds: 5           # Check every 5s
      timeoutSeconds: 3          # Timeout per check
      failureThreshold: 3        # Failures before restart

    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 1        # Remove from service immediately

Configuration Trade-offs

| Parameter | Lower Value | Higher Value |
|---|---|---|
| Check interval | Faster detection, more load | Slower detection, less load |
| Timeout | More false positives | Misses slow failures |
| Failure threshold | Quick failover | Tolerates transient issues |

Production Recommendations

# Typical production settings
liveness:
  interval: 10s
  timeout: 5s
  failureThreshold: 3      # 30s to restart

readiness:
  interval: 5s
  timeout: 3s
  failureThreshold: 1      # Immediate traffic removal

Anti-Patterns

1. Health Check Does Too Much

# BAD: Health check that takes 30 seconds
@app.route('/health')
def health():
    run_full_database_integrity_check()  # Takes 30s!
    return {'status': 'healthy'}

Health checks should be fast (under 100ms).

2. No Dependency Isolation

# BAD: every dependency, critical or not, gates readiness
@app.route('/health/ready')
def ready():
    if not check_database():       # critical for serving traffic
        return {'status': 'not ready'}, 503
    if not check_analytics_db():   # NOT critical, yet still fails readiness
        return {'status': 'not ready'}, 503
    return {'status': 'ready'}, 200

Only check critical dependencies in readiness.
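
A sketch of the isolated version, assuming the same illustrative check_* helpers return booleans; non-critical dependencies are reported but do not gate readiness:

# BETTER: only critical dependencies gate readiness
@app.route('/health/ready')
def ready_isolated():
    if not check_database():              # critical: fail readiness
        return {'status': 'not ready', 'reason': 'DB unavailable'}, 503
    analytics_ok = check_analytics_db()   # non-critical: report, don't fail
    return {'status': 'ready',
            'analytics': 'ok' if analytics_ok else 'degraded'}, 200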

3. Cascading Failures

If health check fails under load → more load on remaining nodes → they fail too.

Solution: Circuit breakers, gradual rollout, load shedding.
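
One way to break the loop is to shed excess requests while keeping health endpoints green, so an overloaded node is neither restarted nor drained. A rough Flask sketch with an illustrative in-flight limit (a production version would use a lock or semaphore rather than a bare counter):

# Load shedding: reject excess work with 503 instead of failing health checks
from flask import Flask, g, request

app = Flask(__name__)
MAX_IN_FLIGHT = 100
in_flight = 0          # not thread-safe; illustrative only

@app.before_request
def shed_if_overloaded():
    global in_flight
    g.counted = False
    if request.path.startswith('/health'):
        return None                      # never shed health checks
    if in_flight >= MAX_IN_FLIGHT:
        return {'error': 'overloaded, retry later'}, 503
    in_flight += 1
    g.counted = True

@app.after_request
def release(response):
    global in_flight
    if getattr(g, 'counted', False):
        in_flight -= 1
    return response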

Used In Systems:

  • Kubernetes (liveness/readiness probes)
  • AWS ELB/ALB (target health checks)
  • Consul (service health checks)
  • Every HA deployment

Next Recommended: Failover - Learn what happens after detecting a failure

Interview Notes
  • Interview relevance: ~60% of production-focused interviews
  • Production impact: every HA system depends on failure detection
  • Performance: detection latency vs false-positive rate trade-off
  • Scalability: O(N) or O(N²) depending on the protocol