
Sharding

10 min · Intermediate · Storage · Interview: 80%

How databases horizontally partition data across multiple servers for scalability, using partition keys to distribute and route data efficiently

⭐ Must-Know
πŸ’Ό Interview Relevance: 80% of system design interviews
🏭 Production Impact: Powers systems at MongoDB, Cassandra, Instagram
⚑ Performance: High-throughput query improvement
πŸ“ˆ Scalability: Thousands to billions of records

TL;DR

Sharding is a database architecture pattern that horizontally partitions data across multiple servers (shards), where each shard holds a subset of the total data. A shard key determines which shard stores each record, enabling systems to scale beyond single-server capacity while maintaining reasonable query performance.

Visual Overview

WITHOUT SHARDING (Single Database)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚        Single Database Server       β”‚
β”‚  Users 1-1,000,000                 β”‚
β”‚  Storage: 500GB                    β”‚
β”‚  CPU: 100% (bottleneck)            β”‚
β”‚  Throughput: 5K queries/sec MAX    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   ↓
        Limited by single server capacity


WITH SHARDING (4 Shards)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    APPLICATION LAYER                        β”‚
β”‚            (Routes queries based on shard key)              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚         β”‚                 β”‚          β”‚
        β–Ό         β–Ό                 β–Ό          β–Ό
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ Shard 1 β”‚ β”‚ Shard 2 β”‚   β”‚ Shard 3 β”‚ β”‚ Shard 4 β”‚
   β”‚Users    β”‚ β”‚Users    β”‚   β”‚Users    β”‚ β”‚Users    β”‚
   β”‚1-250K   β”‚ β”‚250K-500Kβ”‚   β”‚500K-750Kβ”‚ β”‚750K-1M  β”‚
   β”‚         β”‚ β”‚         β”‚   β”‚         β”‚ β”‚         β”‚
   β”‚125GB    β”‚ β”‚125GB    β”‚   β”‚125GB    β”‚ β”‚125GB    β”‚
   β”‚CPU: 25% β”‚ β”‚CPU: 25% β”‚   β”‚CPU: 25% β”‚ β”‚CPU: 25% β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Total Throughput: 20K queries/sec (4x improvement)
Total Storage: 500GB (distributed)
Horizontal Scalability: Add more shards to scale

Core Explanation

What is Sharding?

Sharding is a database partitioning technique that splits a large dataset horizontally across multiple independent databases (shards). Each shard contains a unique subset of the data, determined by a shard key (also called partition key).

Think of sharding like dividing a massive library across multiple buildings:

  • Single library (no sharding): All 10 million books in one building, limited by building capacity
  • Sharded library (4 buildings): Books divided alphabetically - Building A holds A-F, Building B holds G-L, etc. Each building handles fewer books and concurrent visitors

Key Characteristics:

  • Horizontal partitioning: Splits data by rows, not columns
  • Shard key: Determines which shard stores each record
  • Independent shards: Each shard is a separate database server
  • Distributed queries: Queries may hit one shard (ideal) or multiple shards

Shard Key Selection

The shard key is the most critical design decision in sharding. A good shard key:

  1. Distributes data evenly: Avoids β€œhot shards” with disproportionate load
  2. Aligns with query patterns: Minimizes cross-shard queries
  3. Has high cardinality: Many unique values to distribute across shards
  4. Is immutable: Changing a shard key requires data migration

Example: E-commerce Database

OPTION 1: Shard by user_id (GOOD for user queries)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Shard Key: user_id                                         β”‚
β”‚ hash(user_id) % 4 = shard_number                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Shard 0: hash % 4 = 0  β†’ ~250K users, incl. user 12345     β”‚
β”‚ Shard 1: hash % 4 = 1  β†’ ~250K users, incl. user 340987    β”‚
β”‚ Shard 2: hash % 4 = 2  β†’ ~250K users, incl. user 655432    β”‚
β”‚ Shard 3: hash % 4 = 3  β†’ ~250K users, incl. user 923456    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

βœ“ User profile query: Hits 1 shard (fast)
βœ“ User orders query: Hits 1 shard (all user's orders on same shard)
βœ• Global analytics: Must query all 4 shards (slow)


OPTION 2: Shard by product_category (BAD - uneven distribution)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Shard 1: Electronics (70% of traffic, HOT SHARD)          β”‚
β”‚ Shard 2: Books (15% of traffic)                           β”‚
β”‚ Shard 3: Clothing (10% of traffic)                        β”‚
β”‚ Shard 4: Home (5% of traffic)                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

βœ• Uneven load: Shard 1 is overloaded
βœ• Hot shard becomes bottleneck
βœ• Scales poorly (can't split popular category easily)

Shard Key Strategies

1. Hash-Based Sharding (Most Common)

# Distribute data evenly using a hash function
shard_number = hash(user_id) % number_of_shards

# Illustrative examples (actual hash values will differ):
# hash("user_12345") % 4 == 1  β†’ store on Shard 1
# hash("user_67890") % 4 == 3  β†’ store on Shard 3

Pros:

  • βœ“ Even distribution (hash function randomizes)
  • βœ“ Simple to implement
  • βœ“ Predictable shard lookup

Cons:

  • βœ• Difficult to add/remove shards (rehashing required)
  • βœ• Range queries on the shard key (e.g., user_id > 50000) must hit every shard
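
A minimal routing sketch in Python, assuming a stable hash (MD5 here) so every application instance maps the same key to the same shard; Python's built-in hash() is randomized per process for strings, so it is unsuitable for routing. The shard_pools object in the usage comment is hypothetical.

import hashlib

NUM_SHARDS = 4  # assumed shard count

def shard_for(user_id: str) -> int:
    """Map a shard key to a shard number using a stable hash."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Usage (shard_pools is a hypothetical list of per-shard connection pools):
# shard = shard_for("user_12345")
# rows = shard_pools[shard].query("SELECT * FROM users WHERE user_id = %s", ["user_12345"])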

2. Range-Based Sharding

Shard 1: user_id 1-250,000
Shard 2: user_id 250,001-500,000
Shard 3: user_id 500,001-750,000
Shard 4: user_id 750,001-1,000,000

Pros:

  • βœ“ Supports range queries (all users with id > 500,000)
  • βœ“ Easy to add shards (just split ranges)

Cons:

  • βœ• Risk of uneven distribution (new users may cluster in recent ranges)
  • βœ• Hot shards if data is time-based (newest users get all traffic)
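
As a sketch, range-based routing can be a binary search over sorted upper bounds; the boundaries below mirror the ranges above, and using bisect is an implementation choice, not something the text prescribes.

import bisect

# Upper bound of each shard's user_id range (from the layout above)
RANGE_UPPER_BOUNDS = [250_000, 500_000, 750_000, 1_000_000]

def shard_for(user_id: int) -> int:
    """Return the index of the first range whose upper bound covers user_id."""
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if idx >= len(RANGE_UPPER_BOUNDS):
        raise ValueError(f"user_id {user_id} is outside all configured ranges")
    return idx

# shard_for(12_345)   β†’ 0  (Shard 1 above)
# shard_for(750_001)  β†’ 3  (Shard 4 above)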

3. Geographic Sharding

Shard 1: Users in North America
Shard 2: Users in Europe
Shard 3: Users in Asia
Shard 4: Users in South America

Pros:

  • βœ“ Reduces latency (data close to users)
  • βœ“ Regulatory compliance (data residency requirements)
  • βœ“ Simple routing (IP-based or user-selected)

Cons:

  • βœ• Uneven distribution (more users in some regions)
  • βœ• Cross-region queries are slow
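
A minimal region-to-shard mapping sketch; the region identifiers mirror the list above, and the fallback default is an assumption.

# Region β†’ shard mapping (regions from the list above)
REGION_TO_SHARD = {
    "north_america": 1,
    "europe": 2,
    "asia": 3,
    "south_america": 4,
}

def shard_for(region: str, default: int = 1) -> int:
    """Route to the regional shard; fall back to a default for unmapped regions."""
    return REGION_TO_SHARD.get(region.lower(), default)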

Query Routing

When a query arrives, the application layer must determine which shard(s) to query:

Single-Shard Query (Ideal - Fast)

-- Query includes shard key
SELECT * FROM orders WHERE user_id = 12345;

Routing:
1. Extract shard key: user_id = 12345
2. Calculate shard: hash(12345) % 4 = 1
3. Query only Shard 1
4. Return results

Performance: O(1) shard lookup + single database query

Multi-Shard Query (Scatter-Gather - Slow)

-- Query does NOT include shard key
SELECT * FROM orders WHERE product_id = 'XYZ123';

Routing:
1. No shard key β†’ Must query all shards
2. Send query to Shards 1, 2, 3, 4 in parallel
3. Merge results from all shards
4. Sort/paginate combined results
5. Return to client

Performance: O(N) where N = number of shards
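
A scatter-gather sketch, assuming a hypothetical query_shard(shard_id, sql, params) helper that runs the statement against one shard and returns its rows; sorting by created_at is an illustrative choice.

from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4

def query_shard(shard_id, sql, params):
    """Hypothetical helper: execute sql against one shard and return rows as dicts."""
    raise NotImplementedError  # wire up to your database driver / connection pool

def scatter_gather(sql, params):
    """Fan the query out to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        futures = [pool.submit(query_shard, s, sql, params) for s in range(NUM_SHARDS)]
        rows = [row for f in futures for row in f.result()]
    # Sorting / pagination can only happen after the merge
    return sorted(rows, key=lambda r: r["created_at"], reverse=True)

# rows = scatter_gather("SELECT * FROM orders WHERE product_id = %s", ("XYZ123",))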

Query Pattern Optimization:

GOOD: Include shard key in WHERE clause
SELECT * FROM users WHERE user_id = 123;  β†’ Hits 1 shard

BAD: Query without shard key
SELECT * FROM users WHERE email = 'user@example.com';  β†’ Hits all shards

SOLUTION: Secondary index or denormalization
- Option 1: Maintain email β†’ user_id mapping in separate lookup table
- Option 2: Duplicate user data in email-sharded collection
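
A sketch of Option 1: keep an email β†’ user_id lookup table so a query by email becomes two targeted lookups instead of a scatter-gather. lookup_user_id is an assumed helper, and the hash-based shard_for and query_shard helpers sketched earlier are reused.

def lookup_user_id(email):
    """Assumed helper: read the email β†’ user_id mapping from its own table/store."""
    ...

def get_user_by_email(email):
    """Two-step lookup: resolve the shard key first, then hit a single shard."""
    user_id = lookup_user_id(email)
    if user_id is None:
        return None
    shard = shard_for(user_id)  # same routing as any user_id-based query
    rows = query_shard(shard, "SELECT * FROM users WHERE user_id = %s", (user_id,))
    return rows[0] if rows else None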

Rebalancing and Resharding

As data grows, you may need to add more shards. This is challenging:

The Resharding Problem:

BEFORE: 4 shards
Shard 0: hash(key) % 4 = 0
Shard 1: hash(key) % 4 = 1
Shard 2: hash(key) % 4 = 2
Shard 3: hash(key) % 4 = 3

AFTER: 8 shards (doubled)
Shard 0: hash(key) % 8 = 0  ← Different assignment!
Shard 1: hash(key) % 8 = 1
...
Shard 7: hash(key) % 8 = 7

Problem: A key's shard assignment depends on the total shard count
Result: Doubling (4 β†’ 8) remaps ~50% of keys; a non-doubling change (e.g., 4 β†’ 5) remaps ~80%
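
A quick empirical check of how much data naive modulo resharding moves; the key space and MD5 hash here are illustrative.

import hashlib

def bucket(key, n):
    """Stable hash bucket for a key with n shards."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

def fraction_moved(old_n, new_n, sample=100_000):
    """Fraction of sampled keys whose shard changes when going from old_n to new_n shards."""
    moved = sum(1 for i in range(sample)
                if bucket(f"user_{i}", old_n) != bucket(f"user_{i}", new_n))
    return moved / sample

# fraction_moved(4, 8)  β‰ˆ 0.50  (doubling: about half the keys move)
# fraction_moved(4, 5)  β‰ˆ 0.80  (non-doubling: most keys move)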

Solutions:

  1. Consistent Hashing: Minimizes data movement when adding shards (see the sketch after this list)
  2. Virtual Shards: More shards than physical servers, easier rebalancing
  3. Pre-sharding: Start with more shards than needed (e.g., 256 shards on 4 servers)
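
A minimal consistent-hashing ring sketch with virtual nodes, assuming MD5 ring positions and illustrative shard names; production systems (Cassandra, for example) use more elaborate token management.

import bisect
import hashlib

class ConsistentHashRing:
    """Minimal hash ring: each physical shard owns many virtual points on the ring."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []
        for node in nodes:
            for v in range(vnodes):
                self._ring.append((self._hash(f"{node}#{v}"), node))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's position to the next virtual node."""
        idx = bisect.bisect(self._positions, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

# ring = ConsistentHashRing(["shard-1", "shard-2", "shard-3", "shard-4"])
# ring.node_for("user_12345")   # adding "shard-5" later remaps only ~1/5 of keys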

Tradeoffs

Advantages:

  • βœ“ Horizontal scalability: Add more servers to handle more data
  • βœ“ Improved throughput: Queries distributed across multiple databases
  • βœ“ Fault isolation: One shard failure doesn’t affect others
  • βœ“ Reduced latency: Smaller datasets per shard = faster queries

Disadvantages:

  • βœ• Increased complexity: Application must handle routing logic
  • βœ• Cross-shard queries are expensive: scatter-gather operations are slow
  • βœ• Transactions across shards: Difficult or impossible (need distributed transactions)
  • βœ• Rebalancing is hard: Adding/removing shards requires data migration
  • βœ• Hot shards: Poor shard key choice leads to uneven load

Real Systems Using Sharding

MongoDB (Auto-Sharding)

  • Implementation: Chunk-based sharding with automatic balancing
  • Shard Key: Chosen by user (e.g., user_id, timestamp)
  • Scale: Supports thousands of shards
  • Typical Setup: Start with 3 shards, auto-split and rebalance as data grows

MongoDB Sharding Architecture:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           mongos (Router)                β”‚
β”‚     Routes queries based on shard key    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚         β”‚         β”‚
        β–Ό         β–Ό         β–Ό
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚Shard 1 β”‚ β”‚Shard 2 β”‚ β”‚Shard 3 β”‚
   β”‚Replica β”‚ β”‚Replica β”‚ β”‚Replica β”‚
   β”‚ Set    β”‚ β”‚ Set    β”‚ β”‚ Set    β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Cassandra (Hash Partitioning)

  • Implementation: Consistent hashing with virtual nodes
  • Partition Key: First part of primary key
  • Scale: Designed for massive scale (Instagram uses 1000+ nodes)
  • Typical Setup: 256 virtual nodes per physical server

DynamoDB (Managed Sharding)

  • Implementation: Automatic partitioning by AWS
  • Partition Key: Required in table schema
  • Scale: Auto-scales partitions based on throughput
  • Typical Setup: Transparent to user (AWS manages shards)

Instagram (Custom Sharding)

  • Implementation: PostgreSQL with application-level sharding
  • Shard Key: user_id
  • Scale: Thousands of database servers
  • Strategy: Store all user data (photos, likes, followers) on same shard for single-shard queries

When to Use Sharding

βœ“ Perfect Use Cases

High Write Throughput

Scenario: Social media platform with 10M+ writes/second
Challenge: Single database can't handle write load
Solution: Shard by user_id, distribute writes across 100+ shards
Result: Each shard handles 100K writes/second (manageable)

Large Dataset That Doesn’t Fit on One Server

Scenario: E-commerce with 5TB product catalog + order history
Challenge: Single database server has 1TB storage
Solution: Shard by user_id or timestamp
Result: Each shard stores 500GB (5TB / 10 shards)

Read-Heavy Workload with Query Patterns

Scenario: User-facing app where 95% of queries are per-user
Solution: Shard by user_id, most queries hit single shard
Result: Linear scalability (add shards = add capacity)

βœ• When NOT to Use Sharding

Small Dataset (<100GB)

Problem: Sharding adds complexity
Alternative: Vertical scaling (bigger server) or read replicas
Threshold: Consider sharding when >500GB or >10K writes/sec

Frequent Cross-Shard Queries

Problem: Analytics queries require aggregating across all shards
Example: "Total revenue for product category X across all users"
Alternative: Separate OLAP database (e.g., data warehouse)

Need for ACID Transactions Across Entities

Problem: Sharding breaks cross-shard transactions
Example: Money transfer between users on different shards
Alternative: Keep transactional data unsharded, shard read-heavy data

Interview Application

Common Interview Question 1

Q: β€œDesign a database for Twitter. How would you shard the data?”

Strong Answer:

β€œI’d shard by user_id using hash-based partitioning. Here’s why:

Rationale:

  • Most queries are user-centric: get user’s tweets, timeline, followers
  • Sharding by user_id means all user data lives on one shard
  • Single-shard queries are fast and don’t require cross-shard operations

Shard Key: hash(user_id) % number_of_shards

Data Co-location:

  • User profile β†’ Shard X
  • User’s tweets β†’ Shard X
  • User’s followers β†’ Shard X
  • User’s timeline cache β†’ Shard X

Query Patterns:

  • Get user profile: Single-shard query βœ“
  • Get user’s tweets: Single-shard query βœ“
  • Post new tweet: Single-shard write βœ“

Cross-Shard Challenge:

  • Building home timeline (tweets from followed users) requires cross-shard queries
  • Solution: Pre-compute timelines using fan-out on write (write tweets to follower timelines)

Scaling Strategy:

  • Start with 16 shards (over-provision)
  • As users grow, add more shards using consistent hashing
  • Use virtual shards (256 virtual shards, 16 physical servers initially)”

Why This Answer Works:

  • Identifies appropriate shard key with reasoning
  • Explains query pattern optimization
  • Addresses cross-shard challenge with solution
  • Discusses scaling strategy

Common Interview Question 2

Q: β€œYour sharded database has a β€˜hot shard’ that’s getting 10x more traffic than others. How do you fix it?”

Strong Answer:

β€œHot shard indicates poor shard key distribution. Here’s how I’d address it:

Immediate Fix (Short-term):

  1. Vertical scaling: Upgrade the hot shard’s hardware temporarily
  2. Read replicas: Add read replicas for hot shard to distribute read load
  3. Caching: Cache frequently accessed data from hot shard

Root Cause Analysis:

  • Is it a specific celebrity user? (data skew)
  • Is it timestamp-based clustering? (recent data hotspot)
  • Is it a geographic region? (regional load)

Long-term Fix (Depends on cause):

If celebrity users:

  • Give top 1% users dedicated shards
  • Use composite shard key: (is_celebrity, user_id)
  • Celebrities distributed separately

If timestamp clustering:

  • Switch from range-based to hash-based sharding
  • Use: hash(user_id) instead of timestamp ranges

If geographic:

  • Further subdivide hot region
  • E.g., Split β€˜North America’ into US-East, US-West, Canada

Rebalancing Strategy:

  • Use consistent hashing to minimize data movement
  • Perform migration during low-traffic hours
  • Keep old shard online during migration (dual writes)
  • Cutover once new shard is caught up

Prevention:

  • Monitor shard metrics (CPU, throughput, latency)
  • Alert when shard imbalance >20%
  • Choose shard keys with high cardinality and even distribution”

Why This Answer Works:

  • Immediate actions + root cause analysis
  • Multiple solutions depending on scenario
  • Rebalancing strategy with minimal downtime
  • Preventive measures

Red Flags to Avoid

  • βœ• Suggesting sharding for small datasets (<100GB)
  • βœ• Not considering query patterns when choosing shard key
  • βœ• Ignoring cross-shard query challenges
  • βœ• Not explaining how to handle hot shards
  • βœ• Forgetting about rebalancing complexity

Quick Self-Check

Before moving on, can you:

  • Explain sharding in 60 seconds?
  • Draw a diagram showing data distributed across shards?
  • Explain 3 shard key strategies (hash, range, geographic)?
  • Describe the difference between single-shard and cross-shard queries?
  • Identify when to use vs NOT use sharding?
  • Explain how to handle a hot shard?

Prerequisites

None - foundational database scaling concept

Used In Systems

  • Twitter: User-sharded database
  • Instagram: Photo and user data sharded by user_id
  • Uber: Trips sharded by geohash

Next Recommended: Consensus - Learn how shards coordinate in distributed systems