
Log-Based Storage

10 min · Intermediate · Storage

How distributed systems use append-only logs for durable, ordered, and high-throughput data storage with time-travel and replay capabilities

💼 Interview Relevance: 75% of database interviews
🏭 Production Impact: Powers Kafka, database write-ahead logs, and consensus systems
Performance: Sequential writes up to 1000x faster than random writes
📈 Scalability: Trillions of messages per day

TL;DR

Log-based storage uses an append-only, immutable sequence of records (a log) as the primary data structure. Records are written sequentially to the end of the log and never modified in place. This design enables high write throughput, simple crash recovery, and built-in time-travel, and it underpins Kafka, database write-ahead logs, and distributed consensus protocols.

Visual Overview

LOG-BASED STORAGE STRUCTURE:

Append-Only Log (Partition 0)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Offset 0│ Offset 1│ Offset 2│ Offset 3│ Offset 4│ Offset 5│
│ Record A│ Record B│ Record C│ Record D│ Record E│ Record F│
│ Time: T0│ Time: T1│ Time: T2│ Time: T3│ Time: T4│ Time: T5│
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

                                            New writes append here

KEY PROPERTIES:
├── Append-only: New records added to end, never modified
├── Immutable: Records never changed once written
├── Ordered: Each record has monotonically increasing offset
├── Sequential: Physical writes are sequential on disk
└── Time-indexed: Records naturally ordered by time

DISK LAYOUT (Segment-Based):

Partition Directory:
├── 00000000000000000000.log  (Segment 1: offsets 0-999)
├── 00000000000000000000.index (Index for segment 1)
├── 00000000000001000000.log  (Segment 2: offsets 1000-1999)
├── 00000000000001000000.index (Index for segment 2)
└── 00000000000002000000.log  (Active segment, offsets 2000+)

Each segment file:
- Fixed max size (e.g., 1 GB)
- Sequential writes (O(1) append)
- Independent deletion (retention)

Core Explanation

What is Log-Based Storage?

A log in distributed systems is an append-only, totally-ordered sequence of records. Think of it as an immutable array where:

  • Append-only: Records are added to the end, never inserted in the middle
  • Immutable: Once written, records never change
  • Ordered: Each record has a unique sequential offset (0, 1, 2, …)
  • Durable: Persisted to disk before acknowledging write
Traditional Database (In-Place Updates):
┌──────────────┐
│ User Table   │
├──────────────┤
│ id: 123      │  ← UPDATE changes this record in place
│ name: "John" │  ← Old value lost forever
│ age: 30      │  ← No history preserved
└──────────────┘

Log-Based Storage (Append-Only):
┌────────────────────────────────────────────────┐
│ Log: User Events                               │
├────────────────────────────────────────────────┤
│ [0] UserCreated(id=123, name="John", age=25)  │
│ [1] UserUpdated(id=123, age=26)               │
│ [2] UserUpdated(id=123, age=30)               │  ← New event appended
└────────────────────────────────────────────────┘

     Complete history preserved, can replay to any point
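
To make the append/replay contract concrete, here is a minimal in-memory sketch in Java (class and method names are illustrative, not taken from any particular system): appends always go to the end and return the next offset, and reads replay from an offset without ever mutating the log.

import java.util.ArrayList;
import java.util.List;

// Minimal in-memory sketch of an append-only log. A real implementation
// persists records to segment files on disk (see below).
class EventLog {
    private final List<String> records = new ArrayList<>();

    // Append-only: the new record always goes to the end and receives
    // the next monotonically increasing offset.
    public synchronized long append(String record) {
        records.add(record);
        return records.size() - 1;  // offset of the record just written
    }

    // Reads never mutate the log: replay everything from a given offset.
    public synchronized List<String> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }
}

// Usage: replaying from offset 0 reconstructs the full history.
// EventLog log = new EventLog();
// log.append("UserCreated(id=123, name=John, age=25)");  // offset 0
// log.append("UserUpdated(id=123, age=26)");             // offset 1
// log.readFrom(0);  // complete, ordered history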

Why Sequential Writes Are Fast

Disk Performance Characteristics:

Operation Type          Throughput       Latency
─────────────────────────────────────────────────
Random Writes (HDD)     ~100 writes/sec  10ms
Sequential Writes (HDD) ~100,000/sec     0.01ms
Sequential Writes (SSD) ~500,000/sec     0.002ms

Sequential writes are 1000x faster! 🚀

How Log Storage Achieves Sequential Writes:

Traditional B-Tree Index (Random Writes):
Write "user_456" →
  1. Seek to index page (random disk seek)
  2. Read page into memory
  3. Modify page
  4. Write page back (random write)
  Result: 10ms per write

Log-Based Storage (Sequential Writes):
Write "user_456" →
  1. Append to end of current segment file
  2. Flush to disk sequentially
  Result: 0.01ms per write

100x-1000x faster!
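
A minimal sketch of that write path, assuming a single active segment file (file name and record framing are illustrative): opening the file in append mode means every write lands at the end, so the operating system issues purely sequential I/O, and an explicit flush provides durability before the write is acknowledged.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the log write path: every record is appended to the end of the
// active segment file, so the disk only ever sees sequential writes.
public class SegmentWriter {
    private final FileChannel channel;

    public SegmentWriter(Path segmentFile) throws IOException {
        this.channel = FileChannel.open(segmentFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);  // APPEND: writes always go to the end
    }

    public void append(byte[] record) throws IOException {
        channel.write(ByteBuffer.wrap(record));  // sequential write, no seek
    }

    // Durability: force the data to disk before acknowledging the write.
    public void flush() throws IOException {
        channel.force(false);
    }
}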

Log Segments and Retention

Why Segments?

Instead of one giant log file, logs are split into segments:

WHY SEGMENTS?

Problem with Single Log File:
┌─────────────────────────────────────┐
│ single-log-file.log (100 GB)       │
│                                     │
│ Can't delete old data without      │
│ rewriting entire file!             │
└─────────────────────────────────────┘

Solution with Segments:
├── segment-0.log (1 GB) ← Delete this entire file (old data)
├── segment-1.log (1 GB) ← Keep
├── segment-2.log (1 GB) ← Keep
└── segment-3.log (500 MB, active) ← Currently writing

Deletion = O(1) file delete, no rewriting!

Segment Management:

// Segment configuration (Kafka-style settings)
segment.bytes = 1073741824  // 1 GB per segment
segment.ms = 604800000      // Roll a new segment after 7 days or 1 GB, whichever comes first

// Retention strategies
log.retention.bytes = 107374182400  // Keep at most 100 GB total
log.retention.ms = 604800000        // OR keep 7 days (whichever limit is hit first)

// Deletion process (sketch, runs periodically): walk closed segments
// oldest-first; the active segment is never deleted
for (Segment segment : oldestFirst(closedSegments)) {
  if (segment.isExpired() || totalSize > retentionBytes) {
    totalSize -= segment.size();
    segment.delete();  // simple file deletion, O(1)
  }
}

Indexing for Fast Reads

The Challenge:

Problem: How to find offset 12,345,678 quickly in a 100 GB log?

Naive approach:
  - Read sequentially from beginning
  - Time: O(n), could take minutes!

The Solution: Sparse Index

SPARSE INDEX STRUCTURE:

Segment: 00000000000010000000.log (offsets 10M-11M)

Index File: 00000000000010000000.index
┌──────────────────────────────────────┐
│ Offset → File Position               │
├──────────────────────────────────────┤
│ 10,000,000 → byte 0                 │  ← Index every N records
│ 10,010,000 → byte 52,428,800        │  ← (e.g., every 10K)
│ 10,020,000 → byte 104,857,600       │
│ 10,030,000 → byte 157,286,400       │
│ ...                                  │
└──────────────────────────────────────┘

To find offset 10,025,000:
1. Binary search index → Find 10,020,000 at byte 104,857,600
2. Seek to byte 104,857,600
3. Scan sequentially 5,000 records (fast, in memory)

Result: O(log n) index lookup + small sequential scan
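
A small sketch of that lookup, assuming the sparse index is held in a sorted in-memory map (real systems such as Kafka memory-map compact binary index files, so this is only illustrative):

import java.util.Map;
import java.util.TreeMap;

// Sparse index sketch: only every N-th offset is indexed, mapping a logical
// offset to a byte position inside the segment file.
public class SparseIndex {
    private final TreeMap<Long, Long> offsetToPosition = new TreeMap<>();

    public void addEntry(long offset, long bytePosition) {
        offsetToPosition.put(offset, bytePosition);
    }

    // O(log n): find the greatest indexed offset <= target, then the caller
    // scans forward sequentially from the returned byte position.
    public long lookup(long targetOffset) {
        Map.Entry<Long, Long> entry = offsetToPosition.floorEntry(targetOffset);
        return entry == null ? 0L : entry.getValue();
    }
}

// Matching the example above:
// index.addEntry(10_020_000L, 104_857_600L);
// long startPos = index.lookup(10_025_000L);  // → byte 104,857,600, then scan ~5,000 records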

Time-Travel and Replay

Core Feature of Log Storage:

LOG WITH TIME-INDEXED DATA:

[0] UserCreated(id=123) at 2025-01-01 10:00
[1] UserUpdated(id=123, age=26) at 2025-01-02 11:00
[2] UserUpdated(id=123, age=30) at 2025-01-03 12:00
[3] UserDeleted(id=123) at 2025-01-04 13:00

QUERIES:
- "What was user 123's state on 2025-01-02?"
  → Replay log up to offset 1 → {id=123, name="John", age=26}

- "Rebuild entire database from scratch"
  → Replay log from offset 0 → Full reconstruction

- "Reprocess last 7 days with new business logic"
  → seekToTimestamp(7 days ago) → Replay with new code
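
With Kafka's Java consumer, "replay the last 7 days" maps to offsetsForTimes() followed by seek(). A sketch of that flow (topic/partition setup, deserialization config, and error handling are assumed and simplified):

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

// Sketch of time-travel with the Kafka consumer API: translate a timestamp
// into an offset, seek there, and replay records in order.
public class ReplayFromTimestamp {
    public static void replay(KafkaConsumer<String, String> consumer,
                              TopicPartition partition, Instant from) {
        consumer.assign(List.of(partition));

        // Ask the broker for the earliest offset whose timestamp >= 'from'.
        Map<TopicPartition, Long> query = new HashMap<>();
        query.put(partition, from.toEpochMilli());
        Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);

        OffsetAndTimestamp target = result.get(partition);
        if (target != null) {
            consumer.seek(partition, target.offset());  // rewind to that point in time
        }

        // From here, poll() replays historical records in offset order.
        consumer.poll(Duration.ofSeconds(1)).forEach(r ->
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
    }
}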

Production Use Case:

Bug discovered in analytics pipeline:
1. Pipeline processed 1 million events with incorrect logic
2. Traditional system: Data corrupted, hard to fix
3. Log-based system:
   - Keep log intact
   - seekToTimestamp(bug_introduced_time)
   - Replay events with corrected code
   - Output to new table
   - Validate and switch
   Result: Zero data loss, safe recovery

Log Cleanup: Retention and Compaction

Time-Based Retention:

Delete segments older than N days:

Day 1:  [Seg 0][Seg 1][Seg 2][Seg 3]
Day 8:  [Seg 1][Seg 2][Seg 3][Seg 4]  ← Seg 0 deleted (> 7 days old)
Day 15: [Seg 2][Seg 3][Seg 4][Seg 5]  ← Seg 1 deleted

Size-Based Retention:

Keep only last 100 GB:

[Seg 0: 20GB][Seg 1: 20GB][Seg 2: 20GB][Seg 3: 20GB][Seg 4: 30GB]
Total: 110 GB → Delete Seg 0 → 90 GB ✓

Log Compaction (Key-Based):

BEFORE COMPACTION (All events preserved):
[0] user_123: {name: "John", age: 25}
[1] user_456: {name: "Jane", age: 30}
[2] user_123: {name: "John", age: 26}  ← Update
[3] user_123: {name: "John", age: 30}  ← Update
[4] user_456: {name: "Jane", age: 31}  ← Update

AFTER COMPACTION (Only latest per key):
[3] user_123: {name: "John", age: 30}  ← Latest for 123
[4] user_456: {name: "Jane", age: 31}  ← Latest for 456

Result: Same final state, 60% smaller log
Use case: Changelog topics, database snapshots
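
A single-pass sketch of key-based compaction (Kafka's log cleaner does this incrementally, segment by segment, in background threads; this only shows the core idea of keeping the latest record per key):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Key-based compaction sketch: keep only the most recent record per key,
// ordered by where that latest record appeared in the log.
public class LogCompactor {

    public record LogRecord(String key, String value) {}

    public static List<LogRecord> compact(List<LogRecord> log) {
        Map<String, LogRecord> latestPerKey = new LinkedHashMap<>();
        for (LogRecord record : log) {
            latestPerKey.remove(record.key());      // drop the stale position...
            latestPerKey.put(record.key(), record); // ...so the key re-enters at its latest offset
        }
        return new ArrayList<>(latestPerKey.values());
    }
}

// Applied to the example above, the 5-record log collapses to 2 records:
// the age-30 update for user_123 and the age-31 update for user_456.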

Tradeoffs

Advantages:

  • ✓ Extremely high write throughput (sequential I/O)
  • ✓ Simple crash recovery (just find last valid offset)
  • ✓ Built-in audit trail and time-travel
  • ✓ Easy replication (just copy log segments)
  • ✓ Immutability eliminates update anomalies

Disadvantages:

  • ✕ Slow point reads without good indexing
  • ✕ Space amplification (old versions kept until deletion)
  • ✕ Range queries require scanning
  • ✕ Compaction overhead for key-based retention

Real Systems Using This

Apache Kafka

  • Implementation: Partitioned, replicated logs as primary abstraction
  • Scale: 7+ trillion messages/day at LinkedIn
  • Segments: 1 GB segments, time or size-based retention
  • Typical Setup: 7 day retention, 100+ partitions

Database Write-Ahead Logs (WAL)

  • PostgreSQL WAL: All changes written to log before data files
  • MySQL Binlog: Replication and point-in-time recovery
  • Redis AOF: Append-only file for durability
  • Purpose: Crash recovery, replication, backups

Distributed Consensus (Raft, Paxos)

  • Implementation: Replicated log of commands
  • Purpose: Ensure all nodes apply same operations in same order
  • Examples: etcd, Consul, ZooKeeper

Event Sourcing Systems

  • EventStore: Specialized event sourcing database
  • Axon Framework: CQRS/ES on top of logs
  • Purpose: Complete audit trail, temporal queries

When to Use Log-Based Storage

✓ Perfect Use Cases

High-Throughput Writes

Scenario: Ingesting millions of events per second
Why logs: Sequential writes max out disk bandwidth
Example: IoT sensor data, clickstream analytics

Audit and Compliance

Scenario: Financial transactions requiring complete audit trail
Why logs: Immutable, ordered history of all changes
Example: Banking transactions, healthcare records

Event Sourcing / CQRS

Scenario: Need to rebuild state from historical events
Why logs: Natural event stream with replay capability
Example: E-commerce order processing, booking systems

Stream Processing

Scenario: Real-time data pipelines and transformations
Why logs: Natural fit for streaming frameworks
Example: Fraud detection, real-time recommendations

✕ When NOT to Use

Point Queries Without Indexing

Problem: "Find user by email address"
Issue: Must scan entire log or build secondary index
Alternative: Use indexed database (B-tree, hash index)

Frequent Updates to Same Key

Problem: Updating same sensor reading every 10ms
Issue: Millions of log entries for same key
Alternative: In-memory cache + periodic snapshots

Need to Delete Individual Records (GDPR)

Problem: "Delete all data for user_123"
Issue: Logs are append-only, can't delete from middle
Alternative: Tombstone records + compaction (see the sketch below), or use a mutable database
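
The tombstone alternative, sketched against a compacted Kafka topic (topic name and producer configuration are illustrative): a record with a null value marks the key as deleted, and the log cleaner eventually removes every earlier record for that key.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// GDPR-style deletion on a topic with cleanup.policy=compact:
// send a tombstone (null value) for the key to be erased.
public class GdprDelete {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Null value = tombstone: after compaction, user_123's data is gone.
            producer.send(new ProducerRecord<>("user-profiles", "user_123", null));
        }
    }
}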

Interview Application

Common Interview Question 1

Q: “Why does Kafka achieve such high throughput compared to traditional message queues?”

Strong Answer:

“Kafka’s high throughput comes from its log-based storage design that exploits sequential I/O:

  1. Sequential writes: All writes append to the end of a log segment file, achieving 100K+ writes/sec on HDDs vs ~100/sec for random writes in traditional MQs
  2. Zero-copy transfers: Kafka uses sendfile() to transfer data from disk → OS page cache → network socket without copying to application memory
  3. Batching: Producers batch multiple messages, consumers fetch in batches, amortizing overhead
  4. No per-message disk seeks: Traditional MQs update indices and metadata per message; Kafka just appends

LinkedIn achieves 7+ trillion messages/day using this design - sequential I/O is the key.”

Why this is good:

  • Specific technical reasons
  • Quantifies performance difference
  • Compares to alternatives
  • Cites real-world scale

Common Interview Question 2

Q: “How would you design a system that can replay the last 30 days of events after discovering a bug in processing logic?”

Strong Answer:

“I’d use log-based storage with time-based retention:

Architecture:

  • Events written to Kafka topic with 30-day retention
  • Processing pipeline consumes from topic
  • Outputs to versioned tables (e.g., analytics_v1, analytics_v2)

Bug recovery process:

  1. Keep old pipeline running (serves current traffic)
  2. Deploy fixed pipeline as NEW consumer group
  3. Use consumer.offsetsForTimes() to seek to 30 days ago
  4. Replay events with corrected logic to analytics_v2 table
  5. Validate results, then switch traffic to v2
  6. Delete old consumer group and v1 table

Key decisions:

  • Separate consumer groups = independent offset tracking
  • Versioned outputs = safe validation before cutover
  • Log retention = enables replay without impacting production

This is exactly how teams at Uber and Netflix do data pipeline repairs.”

Why this is good:

  • Complete architecture design
  • Step-by-step process
  • Explains key design decisions
  • Zero-downtime approach
  • Real-world examples

Red Flags to Avoid

  • ✕ Confusing logs with log files (text files)
  • ✕ Not understanding sequential vs random I/O performance difference
  • ✕ Thinking logs are just for debugging
  • ✕ Not knowing about segments and retention strategies

Quick Self-Check

Before moving on, can you:

  • Explain log-based storage in 60 seconds?
  • Draw the structure of a log with segments?
  • Explain why sequential writes are 1000x faster?
  • Describe how to find a specific offset quickly?
  • Identify when to use vs NOT use log-based storage?
  • Explain time-travel and replay capabilities?

Prerequisites

None - this is a foundational storage concept

Next Recommended: Event Sourcing - See how to build applications using log-based storage