A technique where changes are written to a durable log before being applied to the database, enabling crash recovery and replication in database systems.
Appears in: ~65% of database interviews
Used by: PostgreSQL, MySQL, Redis
Key ideas: ACID durability, crash recovery
TL;DR
Write-Ahead Logging (WAL) is a database technique where all changes are first written to a durable append-only log before modifying the actual database. This enables crash recovery (replay the log), ACID guarantees (durability), and replication (ship the log to replicas). Used by PostgreSQL, MySQL, Redis, and virtually all production databases.
Visual Overview
WITHOUT WAL (Naive Approach - BROKEN)
┌──────────────────────────────────────────────────┐
│ Transaction: Transfer $100                       │
│                                                  │
│ 1. UPDATE accounts SET balance=900 WHERE id=1    │
│         ↓ (write to disk)                        │
│ 2. CRASH! ⚡                                     │
│ 3. UPDATE accounts SET balance=1100 WHERE id=2   │
│    (never executed)                              │
│                                                  │
│ Result: Money disappears! Data corrupted ❌      │
└──────────────────────────────────────────────────┘
WITH WAL (Write-Ahead Logging)
┌──────────────────────────────────────────────────┐
│ Transaction: Transfer $100                       │
│                                                  │
│ STEP 1: Write to WAL first (sequential write)    │
│   ┌────────────────────────────────────────┐     │
│   │ WAL: [UPDATE id=1 balance=900]         │     │
│   │ WAL: [UPDATE id=2 balance=1100]        │     │
│   │ WAL: [COMMIT]                          │     │
│   └────────────────────────────────────────┘     │
│         ↓ (fsync - durable to disk)              │
│ STEP 2: Acknowledge transaction ✓                │
│                                                  │
│ STEP 3: Apply to database (async, can crash)     │
│   - UPDATE id=1 balance=900                      │
│   - UPDATE id=2 balance=1100                     │
│                                                  │
│ If crash after STEP 2:                           │
│   → Replay WAL on restart ✓                      │
│   → Transaction completes successfully           │
└──────────────────────────────────────────────────┘
WAL RECOVERY AFTER CRASH:
┌──────────────────────────────────────────────────┐
│ Database crashes during transaction              │
│         ↓                                        │
│ On restart:                                      │
│ 1. Read WAL from last checkpoint                 │
│ 2. Replay all committed transactions             │
│ 3. Roll back uncommitted transactions            │
│ 4. Database state restored ✓                     │
│                                                  │
│ Timeline:                                        │
│ T0: Begin transaction                            │
│ T1: Write to WAL                                 │
│ T2: fsync (durable)                              │
│ T3: CRASH ⚡                                     │
│ T4: Restart, replay WAL                          │
│ T5: Database consistent again                    │
└──────────────────────────────────────────────────┘
Core Explanation
What is Write-Ahead Logging?
Write-Ahead Logging (WAL) is a technique where all modifications to the database are first written to a durable log before being applied to the actual data pages. This ensures that:
- Durability: Changes survive crashes (written to log = durable)
- Atomicity: All-or-nothing transactions (log has full transaction)
- Crash Recovery: Replay log to restore consistent state
- Replication: Ship log to replicas for consistency
Key Principle: "Log First, Then Apply"
RULE: No data page can be written to disk until its log record is on disk
Why?
- Log record = intent to change
- Data page = actual change
- If the data page is written first and a crash occurs:
  → we don't know whether the change was committed or not
  → we cannot undo or redo it
With WAL:
- Log written first = complete transaction history
- Can replay (redo) committed transactions
- Can roll back (undo) uncommitted transactions
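To make the rule concrete, here is a minimal, hypothetical sketch of how a buffer manager can enforce it: each page remembers the LSN of the last record that touched it, and the page may only reach disk once the WAL is durable up to that LSN. All names here (FakeWAL, flush_to) are illustrative, not any real engine's API.

```python
# Minimal sketch of the WAL rule: a data page may only be flushed after
# the log is durable up to that page's last LSN.

class FakeWAL:
    def __init__(self):
        self.appended_lsn = 0   # highest LSN written to the in-memory buffer
        self.flushed_lsn = 0    # highest LSN known durable on disk

    def append(self):
        self.appended_lsn += 1
        return self.appended_lsn

    def flush_to(self, lsn):
        # stand-in for fsync() of the log up to `lsn`
        self.flushed_lsn = max(self.flushed_lsn, min(lsn, self.appended_lsn))

class Page:
    def __init__(self):
        self.lsn = 0  # LSN of the last record that modified this page

def flush_page(page, wal):
    if page.lsn > wal.flushed_lsn:
        wal.flush_to(page.lsn)          # log first...
    assert page.lsn <= wal.flushed_lsn  # ...then the page may hit disk
    print(f"page flushed (page.lsn={page.lsn} <= flushed_lsn={wal.flushed_lsn})")

wal = FakeWAL()
page = Page()
page.lsn = wal.append()   # a change is logged; the page remembers the LSN
flush_page(page, wal)     # forces the WAL flush before the page write
```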
WAL Components
1. Log Records
Log Record Structure:
┌──────────────────────────────────────────┐
│ LSN: Log Sequence Number (unique ID)     │
│ Transaction ID: Which transaction?       │
│ Operation: INSERT/UPDATE/DELETE          │
│ Before Image: Old value (for undo)       │
│ After Image: New value (for redo)        │
│ Prev LSN: Previous record in this txn    │
└──────────────────────────────────────────┘
Example Log Records:
LSN=100: TXN=42 BEGIN
LSN=101: TXN=42 UPDATE accounts id=1 OLD=1000 NEW=900
LSN=102: TXN=42 UPDATE accounts id=2 OLD=1000 NEW=1100
LSN=103: TXN=42 COMMIT
Each record contains enough info to:
- REDO: Apply change (using NEW value)
- UNDO: Rollback change (using OLD value)
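As a sketch, the record layout above maps naturally onto a small dataclass; the field names are illustrative rather than any specific engine's format.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class LogRecord:
    lsn: int                        # Log Sequence Number (unique, increasing)
    txn_id: int                     # owning transaction
    op: str                         # 'BEGIN' | 'UPDATE' | 'COMMIT' | ...
    before: Optional[Any] = None    # old value -- enables UNDO
    after: Optional[Any] = None     # new value -- enables REDO
    prev_lsn: Optional[int] = None  # previous record of this txn (undo chain)

# The example records above, expressed as objects:
records = [
    LogRecord(100, 42, 'BEGIN'),
    LogRecord(101, 42, 'UPDATE', before=1000, after=900, prev_lsn=100),
    LogRecord(102, 42, 'UPDATE', before=1000, after=1100, prev_lsn=101),
    LogRecord(103, 42, 'COMMIT', prev_lsn=102),
]
```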
2. WAL Buffer
In-Memory Buffer for Log Records:
┌───────────────────────────────────────┐
│ Application writes to database        │
│         ↓                             │
│ Generate log records                  │
│         ↓                             │
│ Append to WAL Buffer (in memory)      │
│         ↓                             │
│ On commit: fsync() to disk            │
│         ↓                             │
│ WAL File on disk (durable)            │
└───────────────────────────────────────┘
Flush Triggers:
- Transaction commits (fsync immediately)
- WAL buffer fills up
- Checkpoint operation
- Periodic flush (e.g., every 1 second)
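A minimal sketch of such a buffer, wiring up the flush triggers listed above (the 16 MB cap and 1-second interval are assumptions for illustration, not any engine's defaults):

```python
import json
import os
import time

class WALBuffer:
    def __init__(self, path, max_bytes=16 * 1024 * 1024, flush_interval=1.0):
        self.f = open(path, 'a')
        self.pending = 0                 # bytes appended since last fsync
        self.max_bytes = max_bytes
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()

    def append(self, record: dict, commit: bool = False):
        line = json.dumps(record) + '\n'
        self.f.write(line)               # buffered in memory
        self.pending += len(line)
        if (commit                                       # trigger 1: commit
                or self.pending >= self.max_bytes        # trigger 2: buffer full
                or time.monotonic() - self.last_flush
                    >= self.flush_interval):             # trigger 4: periodic
            self.flush()

    def flush(self):                     # also called by checkpoints (trigger 3)
        self.f.flush()
        os.fsync(self.f.fileno())
        self.pending = 0
        self.last_flush = time.monotonic()
```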
3. Data Pages (Database Files)
Actual database storage (modified later):
┌──────────────────────────────────────┐
│ WAL written to disk (durable)        │
│         ↓                            │
│ Background process applies changes  │
│ to data pages (async)                │
│         ↓                            │
│ Data pages written to disk           │
│ (can be delayed for performance)     │
└──────────────────────────────────────┘
Why delay?
- Sequential WAL appends are fast (illustrative: ~600 MB/s)
- Random data page writes are slow (illustrative: ~100 MB/s)
- So: write to the WAL immediately, update data pages later
4. Checkpoints
Checkpoint = Flush all dirty pages to disk
┌──────────────────────────────────────────┐
│ Checkpoint Process:                      │
│ 1. Mark checkpoint start in WAL          │
│ 2. Write all modified data pages         │
│    to disk (can take seconds/minutes)    │
│ 3. Mark checkpoint complete in WAL       │
│ 4. Record checkpoint LSN                 │
│                                          │
│ Purpose:                                 │
│ - Limit recovery time (don't replay      │
│   entire WAL, just since checkpoint)     │
│ - Allow WAL truncation (delete old       │
│   log files before checkpoint)           │
│                                          │
│ Frequency: Every 5-15 minutes            │
└──────────────────────────────────────────┘
Recovery after crash:
- Find last checkpoint LSN
- Replay WAL from that point forward
- Much faster than replaying full history
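A toy illustration of why checkpoints bound recovery work: recovery starts at the last checkpoint LSN and skips everything before it.

```python
# Toy example: only records at or after the checkpoint LSN are replayed.
def records_to_replay(wal_records, last_checkpoint_lsn):
    """wal_records: dicts with an 'lsn' key, in LSN order."""
    return [r for r in wal_records if r['lsn'] >= last_checkpoint_lsn]

wal = [{'lsn': 90,  'op': 'UPDATE'},      # already safely in the data files
       {'lsn': 100, 'op': 'CHECKPOINT'},
       {'lsn': 101, 'op': 'UPDATE'},
       {'lsn': 102, 'op': 'COMMIT'}]
print(records_to_replay(wal, last_checkpoint_lsn=100))  # skips LSN 90
```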
How WAL Enables Crash Recovery
Recovery Algorithm (ARIES-style):
Phase 1: ANALYSIS
- Scan WAL from last checkpoint
- Identify committed transactions
- Identify uncommitted transactions
- Build transaction table
Phase 2: REDO
- Replay all operations (committed + uncommitted)
- Restore database to state at crash time
- "Redo everything"
Phase 3: UNDO
- Roll back uncommitted transactions
- Use "before images" from log records
- Leave only committed transactions
Result: Database in consistent state ✓
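The three phases fit in a few lines of toy Python; this is a deliberately simplified sketch (real ARIES also tracks page LSNs, compensation records, and fuzzy checkpoints) and it mirrors the worked example that follows.

```python
# Toy three-phase recovery over a list of log records.
def recover(records):
    # Phase 1: ANALYSIS -- find which transactions committed
    committed = {r['txn'] for r in records if r['op'] == 'COMMIT'}

    # Phase 2: REDO -- replay every change, committed or not
    db = {}
    for r in records:
        if r['op'] == 'UPDATE':
            db[r['key']] = r['new']

    # Phase 3: UNDO -- roll back uncommitted changes, newest first,
    # using the before images
    for r in reversed(records):
        if r['op'] == 'UPDATE' and r['txn'] not in committed:
            db[r['key']] = r['old']
    return db

log = [
    {'op': 'BEGIN',  'txn': 1},
    {'op': 'UPDATE', 'txn': 1, 'key': 'acct1', 'old': 1000, 'new': 900},
    {'op': 'COMMIT', 'txn': 1},
    {'op': 'BEGIN',  'txn': 2},
    {'op': 'UPDATE', 'txn': 2, 'key': 'acct2', 'old': 1000, 'new': 800},
    # crash here: TXN=2 never committed
]
print(recover(log))  # {'acct1': 900, 'acct2': 1000}
```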
Example Recovery:
WAL Contents:
LSN=100: TXN=1 BEGIN
LSN=101: TXN=1 UPDATE accounts id=1 OLD=1000 NEW=900
LSN=102: TXN=1 COMMIT ✓
LSN=103: TXN=2 BEGIN
LSN=104: TXN=2 UPDATE accounts id=2 OLD=1000 NEW=800
LSN=105: CRASH ⚡ (TXN=2 not committed)
Recovery Process:
1. ANALYSIS:
   - TXN=1 is committed
   - TXN=2 is not committed
2. REDO:
   - Apply LSN=101: accounts id=1 → 900
   - Apply LSN=104: accounts id=2 → 800
3. UNDO:
   - Roll back LSN=104: accounts id=2 → 1000 (use OLD value)
Final State:
- accounts id=1 = 900 (TXN=1 committed) ✓
- accounts id=2 = 1000 (TXN=2 rolled back) ✓
WAL for Replication
Streaming Replication:
Primary → Replica Replication via WAL
┌──────────────────────────────────────────┐
│ Primary Database:                        │
│ 1. Client writes data                    │
│ 2. Generate WAL records                  │
│ 3. Write WAL to local disk               │
│ 4. Stream WAL to replicas (async)        │
│ 5. Acknowledge client                    │
│                                          │
│ Replica Database:                        │
│ 1. Receive WAL stream from primary       │
│ 2. Apply WAL records (same as recovery)  │
│ 3. Replay operations → same state        │
│                                          │
│ Result: Replicas eventually consistent   │
│ with primary (usually <1s lag)           │
└──────────────────────────────────────────┘
Synchronous Replication:
- Wait for replica to acknowledge WAL write
- Guarantees zero data loss (RPO=0)
- Higher latency (wait for network + replica)
Asynchronous Replication:
- Don't wait for replica
- Lower latency
- Possible data loss if primary fails
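A hedged sketch of the replica side: applying a streamed record reuses the same logic as crash recovery, with a queue standing in for the network connection.

```python
import queue
import threading

stream = queue.Queue()          # stands in for the replication connection
replica_state = {}

def replica_loop():
    while True:
        record = stream.get()   # 1. receive a WAL record from the primary
        if record is None:
            break
        if record['op'] == 'SET':
            replica_state[record['key']] = record['value']  # 2-3. replay it

t = threading.Thread(target=replica_loop)
t.start()
stream.put({'op': 'SET', 'key': 'account_1', 'value': 900})  # primary ships WAL
stream.put(None)                # end of stream for this demo
t.join()
print(replica_state)            # {'account_1': 900} -- same state as primary
```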
WAL Performance Optimization
1. Group Commit
Batch multiple transactions into single fsync():
┌──────────────────────────────────────┐
│ Transaction 1: Write to WAL buffer   │
│ Transaction 2: Write to WAL buffer   │
│ Transaction 3: Write to WAL buffer   │
│         ↓                            │
│ Single fsync() for all three         │
│         ↓                            │
│ All three transactions durable       │
└──────────────────────────────────────┘
Benefit:
- fsync() is expensive (~5-10 ms on spinning disks; less on SSDs)
- Amortize the cost across multiple transactions
- Throughput: thousands of transactions/sec become possible
2. Sequential Writes
WAL is append-only (sequential writes):
- Sequential writes: ~600 MB/s (illustrative figure)
- Random writes: ~100 MB/s (roughly 6x slower on the same device)
WAL exploits sequential write performance:
- Append records to the end of the log
- No random seeks
- Result: very high write throughput (a rough micro-benchmark sketch follows)
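A rough micro-benchmark sketch of the access-pattern difference (POSIX-only via os.pwrite; absolute numbers vary wildly with hardware and OS caching, so treat the output as directional only):

```python
import os
import random
import time

BLOCK, COUNT = 4096, 2000
buf = b'x' * BLOCK

def timed_writes(path, offsets):
    """Write COUNT blocks at the given offsets, fsync once, return seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    start = time.perf_counter()
    for off in offsets:
        os.pwrite(fd, buf, off)   # positional write (POSIX)
    os.fsync(fd)
    os.close(fd)
    return time.perf_counter() - start

seq = timed_writes('seq.bin', [i * BLOCK for i in range(COUNT)])
rnd = timed_writes('rnd.bin',
                   [random.randrange(COUNT) * BLOCK for _ in range(COUNT)])
print(f"sequential: {seq:.3f}s   random: {rnd:.3f}s")
```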
3. Write Caching
Use battery-backed write cache (NVRAM):
- fsync() writes to NVRAM (fast)
- NVRAM persists to disk later
- Survives power failure
- Latency: <1ms (vs 5-10ms for disk)
Used in enterprise storage systems
Real Systems Using WAL
| System | WAL Name | Log Format | Use Case | Replication |
|---|---|---|---|---|
| PostgreSQL | WAL (Write-Ahead Log) | Binary, page-based | OLTP database | Streaming replication (WAL shipping) |
| MySQL | InnoDB redo log (engine WAL) + binlog | Row-based or statement | OLTP database | Primary-replica replication (binlog) |
| Redis | AOF (Append-Only File) | Text commands | In-memory cache | AOF rewrite, RDB snapshots |
| MongoDB | Oplog (Operations Log) | BSON documents | Document database | Replica set synchronization |
| SQLite | WAL Mode | Binary | Embedded database | No replication |
| Kafka | Partition log | Message-based | Message broker | Topic replication |
Case Study: PostgreSQL WAL
PostgreSQL WAL Architecture:
┌────────────────────────────────────────────────┐
│ 1. Client executes: UPDATE users SET ...       │
│         ↓                                      │
│ 2. Generate WAL record (XLogInsert)            │
│         ↓                                      │
│ 3. Write to WAL buffer in memory               │
│         ↓                                      │
│ 4. On COMMIT: fsync WAL to disk                │
│         ↓                                      │
│ 5. Acknowledge transaction to client ✓         │
│         ↓                                      │
│ 6. Background writer applies to data pages     │
│         ↓                                      │
│ 7. Checkpoint flushes dirty pages              │
└────────────────────────────────────────────────┘
WAL File Structure:
- Files: 000000010000000000000001 (24-hex-digit names, 16 MB segments by default)
- Location: pg_wal/ directory
- Retention: Keep files needed for recovery + replication
- Archiving: Copy to backup storage for point-in-time recovery
Configuration:
wal_level = replica # Enable replication
fsync = on # Ensure durability
synchronous_commit = on # Wait for fsync
wal_buffers = 16MB # WAL buffer size
checkpoint_timeout = 5min # Checkpoint frequency
max_wal_size = 1GB # Trigger checkpoint
Case Study: Redis AOF
Redis Append-Only File (AOF):
┌──────────────────────────────────────────┐
│ Every write command logged to AOF:       │
│                                          │
│ SET key1 "value1"                        │
│ INCR counter                             │
│ LPUSH list "item"                        │
│                                          │
│ On restart: Replay AOF commands          │
└──────────────────────────────────────────┘
AOF Fsync Policies:
1. always: fsync after every command (safest, slowest)
2. everysec: fsync every 1 second (default, good balance)
3. no: Let OS decide (fastest, least safe)
AOF Rewrite:
- Problem: the AOF grows forever (e.g., the same key SET 100 times)
- Solution: rewrite the AOF with the current state only
- A background process creates the new AOF
- Atomic switch to the new file
Example:
Before rewrite (100 commands):
SET key1 "a"
SET key1 "b"
SET key1 "c"
... (97 more)
After rewrite (1 command):
SET key1 "final_value"
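The rewrite idea in a few lines of toy Python (not Redis's actual implementation): replay the old log into current state, then emit one command per live key.

```python
# Toy AOF rewrite: compact a command log down to the current state.
def rewrite_aof(commands):
    state = {}
    for cmd, key, value in commands:   # replay the old log
        if cmd == 'SET':
            state[key] = value
    return [('SET', k, v) for k, v in state.items()]  # one command per key

old_aof = [('SET', 'key1', 'a'), ('SET', 'key1', 'b'), ('SET', 'key1', 'c')]
print(rewrite_aof(old_aof))  # [('SET', 'key1', 'c')] -- 3 commands become 1
```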
Case Study: MySQL Binary Log
MySQL Binlog:
┌──────────────────────────────────────────┐
│ Logs all database changes                │
│ Format: Row-based (default) or           │
│         Statement-based                  │
│                                          │
│ Uses:                                    │
│ - Replication (primary → replica)        │
│ - Point-in-time recovery                 │
│ - Audit trail                            │
└──────────────────────────────────────────┘
Note: strictly speaking, InnoDB's redo log is MySQL's engine-level WAL; the binlog is a server-level change log used for replication and point-in-time recovery.
Row-Based Replication:
- Log actual row changes (binary format)
- Example: "Change row id=42 col1=100"
- Benefit: Safe, deterministic
Statement-Based Replication:
- Log SQL statements
- Example: "UPDATE users SET ... WHERE ..."
- Benefit: Compact log size
- Problem: Non-deterministic functions (NOW(), RAND())
Configuration:
binlog_format = ROW
sync_binlog = 1 # Sync to disk on commit
binlog_expire_logs_seconds = 604800 # 7 days retention
When to Use WAL
Use WAL When:
Durability Required (ACID Compliance)
Scenario: Financial transaction database
Requirement: No data loss after commit
Solution: WAL with synchronous fsync
Trade-off: Higher write latency (~5-10ms)
Crash Recovery Needed
Scenario: Database server can crash unexpectedly
Requirement: Automatic recovery to consistent state
Solution: WAL enables automatic replay on restart
Trade-off: Recovery time proportional to log size (use checkpoints)
Replication Required
Scenario: Multi-datacenter database deployment
Requirement: Keep replicas in sync with primary
Solution: Stream WAL to replicas
Trade-off: Network bandwidth for WAL shipping
Alternatives to WAL:
Shadow Paging
Instead of WAL, use copy-on-write:
- Never modify pages in place
- Create new versions on update
- Atomic switch to new version
Example: LMDB database
Benefit: No separate log
Trade-off: More complex, fragmentation
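A toy sketch of the copy-on-write idea, with Python dicts standing in for page trees: readers keep seeing the old version until a single pointer swap publishes the new one.

```python
# Toy shadow paging: never mutate the current version in place.
current = {'key1': 'a', 'key2': 'b'}   # the "root" version readers see

def update(root, key, value):
    new_root = dict(root)   # copy-on-write (real systems copy only touched pages)
    new_root[key] = value
    return new_root

new_version = update(current, 'key1', 'z')
current = new_version   # single atomic pointer swap commits the change
print(current)          # {'key1': 'z', 'key2': 'b'}
```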
In-Memory Only (No Durability)
Don't persist to disk at all:
- Example: Pure in-memory cache (Memcached)
- Benefit: Maximum performance
- Trade-off: Data lost on crash
Interview Application
Common Interview Question
Q: "How does a database ensure durability while maintaining good performance?"
Strong Answer:
"Databases use Write-Ahead Logging (WAL) to guarantee durability without sacrificing performance:
WAL Mechanism:
- Before modifying data pages, write a log record describing the change
- Append to sequential WAL file (fast: sequential writes ~600 MB/s)
- fsync() WAL to disk before acknowledging transaction (durable)
- Modify data pages later in background (can be delayed)
Why This Works:
- Sequential writes are 6x faster than random writes (WAL vs data pages)
- Group commit: Batch multiple transactions into single fsync() call
- Crash recovery: Replay WAL to reconstruct lost data page changes
- Separation of concerns: Durability (WAL) vs performance (delayed page writes)
Performance Optimizations:
- Group commit: fsync() every 10-100ms for batch of transactions
- WAL buffering: Accumulate records in memory before disk write
- Checkpoints: Periodically flush data pages, truncate old WAL
- Write caching: Use battery-backed cache for sub-millisecond fsync
Trade-offs:
- Latency: Synchronous fsync adds ~5-10ms per transaction
- Throughput: Group commit amortizes cost β 1000s txn/sec possible
- Storage: WAL requires additional disk space (usually <10% overhead)
- Recovery time: Must replay WAL from last checkpoint (use frequent checkpoints)
Real Example: PostgreSQL achieves 10,000+ TPS with WAL by:
- Group commit: backends waiting on the WAL flush piggyback on a single fsync (commit_delay, default 0, can add an explicit wait to widen the batch)
- 16 MB WAL segments with efficient buffering
- Background writer applies changes to data pages
- 5-minute checkpoint intervals balance recovery time vs overhead"
Code Example
Simple WAL Implementation
import os
import json
from typing import Dict, Any

class WriteAheadLog:
    """
    Simplified WAL implementation for educational purposes.
    """

    def __init__(self, wal_path: str, data_path: str):
        self.wal_path = wal_path
        self.data_path = data_path
        self.data: Dict[str, Any] = {}
        self.wal_file = None
        self.lsn = 0  # Log Sequence Number

        # Recovery on initialization
        self.recover()

    def recover(self):
        """Load the last checkpoint, then replay WAL to restore state."""
        # Load the checkpointed data file first (if one exists)
        if os.path.exists(self.data_path):
            with open(self.data_path, 'r') as f:
                self.data = json.load(f)

        if not os.path.exists(self.wal_path):
            return
        print("Performing crash recovery...")
        replayed = 0
        with open(self.wal_path, 'r') as f:
            for line in f:
                record = json.loads(line.strip())
                self._replay_record(record)
                replayed += 1
        print(f"Recovery complete. Replayed {replayed} log records")

    def _replay_record(self, record):
        """Apply a single log record.

        Simplification: uncommitted records are replayed too (no UNDO phase).
        """
        self.lsn = record['lsn']
        if record['op'] == 'BEGIN':
            pass  # Transaction start
        elif record['op'] == 'SET':
            self.data[record['key']] = record['value']
        elif record['op'] == 'DELETE':
            self.data.pop(record['key'], None)
        elif record['op'] == 'COMMIT':
            pass  # Transaction end

    def begin_transaction(self) -> int:
        """Start a new transaction."""
        txn_id = self.lsn + 1
        self._write_log({
            'lsn': self.lsn + 1,
            'txn_id': txn_id,
            'op': 'BEGIN'
        })
        return txn_id

    def set(self, key: str, value: Any, txn_id: int):
        """Set a key-value pair (write to WAL first)."""
        # STEP 1: Write to WAL (buffered, not yet durable)
        self._write_log({
            'lsn': self.lsn + 1,
            'txn_id': txn_id,
            'op': 'SET',
            'key': key,
            'value': value,
            'old_value': self.data.get(key)  # For undo
        })
        # STEP 2: Apply to in-memory data
        self.data[key] = value

    def delete(self, key: str, txn_id: int):
        """Delete a key (write to WAL first)."""
        self._write_log({
            'lsn': self.lsn + 1,
            'txn_id': txn_id,
            'op': 'DELETE',
            'key': key,
            'old_value': self.data.get(key)
        })
        self.data.pop(key, None)

    def commit(self, txn_id: int):
        """Commit transaction (make durable)."""
        # Write COMMIT record
        self._write_log({
            'lsn': self.lsn + 1,
            'txn_id': txn_id,
            'op': 'COMMIT'
        })
        # CRITICAL: fsync() to ensure durability
        self._sync_wal()
        print(f"Transaction {txn_id} committed (durable)")

    def _write_log(self, record):
        """Append record to WAL (buffered; synced on commit)."""
        if self.wal_file is None:
            self.wal_file = open(self.wal_path, 'a')
        self.lsn = record['lsn']
        self.wal_file.write(json.dumps(record) + '\n')
        # Note: not yet synced to disk (buffered)

    def _sync_wal(self):
        """Force WAL to disk (fsync)."""
        if self.wal_file:
            self.wal_file.flush()
            os.fsync(self.wal_file.fileno())
            # Now durable! Survives crash

    def checkpoint(self):
        """Write data pages to disk, truncate WAL."""
        print("Starting checkpoint...")
        # Write current state to the data file
        with open(self.data_path, 'w') as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        # Truncate WAL (all data now in the data file)
        if self.wal_file:
            self.wal_file.close()
            self.wal_file = None
        with open(self.wal_path, 'w') as f:
            f.truncate()
        print("Checkpoint complete. WAL truncated")

    def get(self, key: str) -> Any:
        """Read a key (from in-memory data)."""
        return self.data.get(key)

    def close(self):
        """Clean shutdown."""
        if self.wal_file:
            self.wal_file.close()


# Usage Example
if __name__ == '__main__':
    db = WriteAheadLog('db.wal', 'db.data')

    # Transaction 1: Transfer money
    txn1 = db.begin_transaction()
    db.set('account_1', 900, txn1)
    db.set('account_2', 1100, txn1)
    db.commit(txn1)  # fsync() here - durable!

    print(f"Account 1: {db.get('account_1')}")
    print(f"Account 2: {db.get('account_2')}")

    # Simulate crash (don't checkpoint):
    # if you kill the process here and restart, WAL replay restores state

    # Checkpoint (write to disk, truncate WAL)
    db.checkpoint()
    db.close()

# Simulating crash recovery:
# 1. Run the script once (creates the WAL)
# 2. Kill the process before the checkpoint
# 3. Restart the script
# 4. The WAL is automatically replayed, data restored ✓
WAL with Group Commit
import concurrent.futures
import json
import os
import time
import threading
from queue import Queue
from typing import List

class WALWithGroupCommit:
    """
    WAL with group commit optimization.
    """

    def __init__(self, wal_path: str, commit_delay_ms: int = 10):
        self.wal_path = wal_path
        self.commit_delay_ms = commit_delay_ms / 1000.0
        self.wal_file = open(wal_path, 'a')
        self.write_lock = threading.Lock()  # serialize buffered appends

        # Pending commits waiting for group commit
        self.pending_commits: Queue = Queue()

        # Start group commit background thread
        self.group_commit_thread = threading.Thread(
            target=self._group_commit_worker,
            daemon=True
        )
        self.group_commit_thread.start()

    def write_record(self, record: dict):
        """Write log record (buffered, not yet durable)."""
        with self.write_lock:
            self.wal_file.write(json.dumps(record) + '\n')

    def commit_async(self) -> threading.Event:
        """
        Commit transaction asynchronously.
        Returns an event that will be set when durable.
        """
        commit_event = threading.Event()
        self.pending_commits.put(commit_event)
        return commit_event

    def _group_commit_worker(self):
        """Background thread that performs group commits."""
        while True:
            time.sleep(self.commit_delay_ms)

            # Collect all pending commits
            commits_to_sync: List[threading.Event] = []
            while not self.pending_commits.empty():
                commits_to_sync.append(self.pending_commits.get())

            if commits_to_sync:
                # Single fsync for all pending commits!
                with self.write_lock:
                    self.wal_file.flush()
                    os.fsync(self.wal_file.fileno())
                print(f"Group commit: {len(commits_to_sync)} transactions synced")

                # Notify all waiting transactions
                for event in commits_to_sync:
                    event.set()

    def commit(self):
        """Synchronous commit (waits for fsync)."""
        event = self.commit_async()
        event.wait()  # Block until group commit completes


# Usage
wal = WALWithGroupCommit('db.wal', commit_delay_ms=10)

# Multiple transactions can commit concurrently;
# all of them will be synced in a single fsync()
def transaction(txn_id):
    wal.write_record({'txn_id': txn_id, 'op': 'UPDATE'})
    wal.commit()  # Waits for group commit
    print(f"Transaction {txn_id} durable")

# Run 100 concurrent transactions
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    futures = [executor.submit(transaction, i) for i in range(100)]
    concurrent.futures.wait(futures)

# Output: group commit batches transactions together --
# instead of 100 fsync() calls, maybe only 10-20 are needed
Related Content
Prerequisites:
- ACID Transactions - Understanding durability
- Distributed Systems Basics - Foundation concepts
Related Concepts:
- Log-Based Storage - Append-only storage systems
- Immutability - the WAL is an immutable, append-only log
- Checkpointing - Periodic state snapshots
- Replication - WAL shipping for replicas
Used In Systems:
- PostgreSQL, MySQL, SQLite (database WAL)
- Redis (AOF - Append-Only File)
- Kafka (partition log is essentially WAL)
Explained In Detail:
- Database Internals Deep Dive - WAL implementation details
Quick Self-Check
- Can you explain WAL in 60 seconds?
- Do you understand why the log is written before data pages?
- Do you know how WAL enables crash recovery (REDO/UNDO)?
- Can you explain checkpointing and WAL truncation?
- Do you understand the group commit optimization?
- Do you know how WAL enables replication?