In the previous article, you learned about consistent hashing. Now let us dive deep into data partitioning — how to split your database across multiple machines when one is not enough.
Why Partition Data?
A single database server has limits. It can only store so much data and handle so many queries per second. When you hit those limits, you have two choices:
- Vertical scaling — buy a bigger machine (expensive, has a ceiling)
- Horizontal scaling — split data across multiple machines (partitioning)
Single Database (No Partitioning):
[All 500M users] --> [One Database Server]
|
|--> 10TB of data
|--> 50,000 queries/sec
|--> Single point of failure
|--> $$$$ for a huge machine
Partitioned Database:
[Users A-M] --> [Database Shard 1] (5TB, 25K qps)
[Users N-Z] --> [Database Shard 2] (5TB, 25K qps)
Each shard handles half the data and half the traffic.
If one shard goes down, only half the users are affected.
Horizontal vs Vertical Partitioning
There are two ways to split data.
Horizontal Partitioning (Sharding)
Split rows across multiple databases. Each shard has the same schema but different rows.
Horizontal Partitioning (Sharding):
Original table: users (500M rows)
Shard 1: users where user_id 1 to 100M
Shard 2: users where user_id 100M to 200M
Shard 3: users where user_id 200M to 300M
Shard 4: users where user_id 300M to 400M
Shard 5: users where user_id 400M to 500M
Each shard has the same columns (id, name, email, ...)
but different rows.
This is the most common form of partitioning. When people say “sharding,” they usually mean horizontal partitioning.
Vertical Partitioning
Split columns across multiple databases. Each partition has different columns for the same rows.
Vertical Partitioning:
Original table: users
(id, name, email, bio, avatar_url, preferences, login_history)
Partition 1 (core data):
users_core (id, name, email)
--> Accessed on every request, kept in fast storage
Partition 2 (profile data):
users_profile (id, bio, avatar_url, preferences)
--> Accessed less often, can be on slower storage
Partition 3 (activity data):
users_activity (id, login_history)
--> Write-heavy, separate to avoid locking core data
Vertical partitioning is useful when some columns are accessed much more often than others. But it is limited — you eventually need horizontal partitioning for scale.
Sharding Strategies
1. Range-Based Sharding
Split data by ranges of the shard key.
Range-Based Sharding:
Shard key: user_id
Shard 1: user_id 1 to 10,000,000
Shard 2: user_id 10,000,001 to 20,000,000
Shard 3: user_id 20,000,001 to 30,000,000
Shard 4: user_id 30,000,001 to 40,000,000
To find user 15,000,003:
--> Falls in range 10M-20M
--> Go to Shard 2
Pros:
- Simple to understand and implement
- Range queries are efficient (e.g., “get all users with ID 100-200” hits one shard)
- Easy to add new shards (just extend the range)
Cons:
- Can create hot spots. If new users get sequential IDs, Shard 4 (the newest shard) gets all the write traffic while older shards sit idle
- Uneven data distribution if the key is not uniformly distributed
Hot Spot Problem with Range Sharding:
Shard 1: user_id 1-10M (old users, low activity) --> 100 qps
Shard 2: user_id 10M-20M (medium users, some activity) --> 500 qps
Shard 3: user_id 20M-30M (newer users, active) --> 2,000 qps
Shard 4: user_id 30M-40M (newest users, all new writes) --> 10,000 qps
Shard 4 is overwhelmed while Shard 1 is nearly idle.
2. Hash-Based Sharding
Hash the shard key and use the hash to determine the shard.
Hash-Based Sharding:
shard_number = hash(user_id) % number_of_shards
hash(user_id: 1) = 89234 % 4 = 2 --> Shard 2
hash(user_id: 1000001) = 45612 % 4 = 0 --> Shard 0
hash(user_id: 2000001) = 73281 % 4 = 1 --> Shard 1
hash(user_id: 3999999) = 12890 % 4 = 2 --> Shard 2
Data is evenly distributed regardless of the key pattern.
Pros:
- Even data distribution (no hot spots from sequential keys)
- Works with any key type
Cons:
- Range queries are expensive (must query all shards)
- Adding or removing shards requires rehashing (use consistent hashing to minimize this)
For hash-based sharding, you should use consistent hashing (covered in the previous article) to minimize data movement when the number of shards changes.
3. Directory-Based Sharding
Use a lookup table that maps each key to its shard.
Directory-Based Sharding:
Lookup Table:
user_id 1-500 --> Shard A
user_id 501-1200 --> Shard B
user_id 1201-1500 --> Shard A (note: non-contiguous!)
user_id 1501-3000 --> Shard C
user_id "premium" --> Shard D (special rule)
The directory service stores these mappings.
Every query first checks the directory to find the right shard.
Pros:
- Maximum flexibility — any mapping rule you want
- Easy to move specific ranges between shards
- Can apply custom logic (e.g., VIP users on dedicated shard)
Cons:
- The directory is a single point of failure
- Extra network hop for every query (to check the directory)
- Directory becomes a bottleneck at high traffic
4. Geography-Based Sharding
Split data based on the user’s geographic location.
Geography-Based Sharding:
Shard EU: users in Europe --> Frankfurt data center
Shard US: users in Americas --> Virginia data center
Shard APAC: users in Asia-Pacific --> Singapore data center
User in Berlin: --> Shard EU (low latency: ~5ms)
User in New York: --> Shard US (low latency: ~5ms)
User in Tokyo: --> Shard APAC (low latency: ~5ms)
Without geo-sharding:
User in Tokyo --> US data center (high latency: ~200ms)
Pros:
- Low latency — data is physically close to the user
- Compliance — EU data stays in EU (GDPR)
- Natural traffic separation (different time zones peak at different times)
Cons:
- Uneven distribution (more users in some regions)
- Cross-region queries are slow
- Complex data migration when users move regions
Choosing a Shard Key
The shard key is the most important decision in any sharding strategy. A bad shard key leads to hot spots, uneven data distribution, and complex queries.
Good Shard Key Properties
Good Shard Key Checklist:
1. High cardinality
--> Many unique values (user_id: millions of values)
--> NOT status ("active"/"inactive" — only 2 values)
2. Even distribution
--> Values are spread uniformly
--> NOT created_date (recent dates get all traffic)
3. Query alignment
--> Most queries include the shard key
--> If 90% of queries filter by user_id, use user_id
4. Low cross-shard operations
--> Queries rarely need data from multiple shards
--> A query like "get user's orders" should hit ONE shard
Shard Key Examples
E-commerce Application:
Good shard key: user_id
--> Each user's data (orders, cart, preferences) on one shard
--> "Get user's orders" hits one shard
--> Even distribution (millions of users)
Bad shard key: order_status
--> Only a few values: "pending", "shipped", "delivered"
--> "pending" shard gets all new orders (hot spot)
Bad shard key: created_date
--> Today's shard gets all new data
--> Old shards sit idle
Chat Application:
Good shard key: conversation_id
--> All messages in a conversation on one shard
--> "Get messages for conversation" hits one shard
Bad shard key: sender_id
--> A group message requires reading from many shards
(one per group member)
Problems with Sharding
1. Cross-Shard Queries
When a query needs data from multiple shards, it must be sent to all shards and the results merged.
Cross-Shard Query Problem:
Sharded by user_id.
Query: "Find all orders over $100 in the last month"
--> This query does not filter by user_id
--> Must send to ALL shards
--> Each shard runs the query locally
--> Results are merged by the application
This is called "scatter-gather" and it is slow.
2. Cross-Shard Joins
Joining data across shards is very expensive because the data lives on different machines.
Cross-Shard Join Problem:
Shard 1 has user 1001 (in Berlin)
Shard 3 has user 5003 (in Tokyo)
Query: "Find all users who ordered the same product as user 1001"
--> Need to check orders on Shard 1
--> Then check all other shards for matching products
--> Very expensive network operation
Solution: Denormalize data. Store product information
alongside orders so you do not need cross-shard joins.
3. Rebalancing Shards
When one shard gets too large or too hot, you need to split it or move data to other shards.
Rebalancing Approaches:
1. Fixed Number of Partitions:
Create more partitions than servers from the start.
10 servers, 1000 partitions
Each server handles 100 partitions.
Add 1 server (11 total):
Move some partitions from existing servers to the new one.
Each server now handles ~91 partitions.
Used by: Elasticsearch, Riak, Couchbase
2. Dynamic Partitioning:
Split a partition when it gets too large.
Merge partitions when they get too small.
Partition A grows beyond 10GB --> split into A1 and A2
Partition A1 shrinks below 1GB --> merge back
Used by: HBase, MongoDB
3. Proportional to Nodes:
Fixed number of partitions per node.
When a new node joins, it splits existing partitions.
Used by: Cassandra (256 virtual nodes per node)
4. Hotspots
Even with a good shard key, some data is accessed more than others.
Celebrity Problem:
Shard key: user_id
Hash-based sharding (even distribution)
User "celebrity_123" has 50M followers.
Every post by celebrity_123 generates 50M fan-out operations.
The shard holding celebrity_123 is overwhelmed.
Solutions:
1. Add a random suffix: celebrity_123_01, celebrity_123_02
Spread the celebrity's data across multiple shards.
Application must query all suffixed keys and merge results.
2. Dedicated shard for hot keys
Monitor traffic and move hot keys to dedicated resources.
3. Cache layer in front of the shard
Cache frequently accessed data to reduce shard load.
Real-World Examples
How Instagram Shards PostgreSQL
Instagram uses PostgreSQL with hash-based sharding. Their shard key is the user ID embedded in every row’s primary key. They use a custom ID generation scheme:
Instagram's ID Format:
ID = timestamp (41 bits) + shard_id (13 bits) + sequence (10 bits)
This encodes the shard directly in the ID.
No directory lookup needed — the shard is in the ID itself.
41 bits for timestamp: ~69 years of unique timestamps
13 bits for shard_id: 8,192 possible shards
10 bits for sequence: 1,024 IDs per millisecond per shard
How Discord Shards Cassandra
Discord stores billions of messages in Cassandra, sharded by channel_id and a time bucket.
Discord's Sharding Strategy:
Partition key: (channel_id, bucket)
bucket = message_timestamp / 10_days
Each partition holds ~10 days of messages for one channel.
This keeps partition sizes manageable.
Reading recent messages: hit 1 partition
Reading old messages: hit a different partition (cold storage)
This avoids the "unbounded partition" problem where
a busy channel's partition grows forever.
When NOT to Shard
Most applications do not need sharding. Before you shard, try these simpler approaches first:
Scaling Checklist (before sharding):
1. Optimize queries (add indexes, rewrite slow queries)
2. Add read replicas (for read-heavy workloads)
3. Add a cache layer (Redis for hot data)
4. Vertical partitioning (separate hot/cold columns)
5. Archive old data (move old rows to cold storage)
6. Vertical scaling (bigger machine)
If NONE of these work --> consider sharding.
Rules of thumb:
< 100GB of data: probably do not need sharding
< 10K queries/sec: probably do not need sharding
Single PostgreSQL can handle 500K+ rows/sec writes
and serve 100K+ queries/sec with proper indexes.
Sharding adds significant complexity. You lose cross-shard joins, transactions become harder, and your application needs to know about the sharding scheme. Only shard when the benefits outweigh the costs.
Interview Tips
When discussing data partitioning in interviews:
Start with why. “The data is too large for a single server, so I will shard the database by user_id.”
Choose the right strategy. “I will use hash-based sharding with consistent hashing to ensure even distribution and minimal data movement when adding shards.”
Justify your shard key. “I chose user_id because most queries filter by user. This means most queries hit a single shard.”
Address cross-shard queries. “For queries that span multiple shards, like ’trending products,’ I will use a separate denormalized table or a search index.”
Mention rebalancing. “I will use fixed partitioning with more partitions than servers, so adding a server just moves some partitions.”
Know when NOT to shard. “For our estimated data size of 50GB, a single PostgreSQL instance with read replicas and caching should be sufficient. Sharding would add unnecessary complexity.”
Related Articles
- System Design #11: Consistent Hashing — Hash rings and virtual nodes
- System Design #5: Databases — SQL vs NoSQL, replication, indexing
- System Design #4: Caching — Redis, Memcached, CDN
What’s Next?
In the next article, System Design #13: Design a URL Shortener, you will learn:
- How to design a URL shortener like bit.ly from scratch
- Base62 encoding for short codes
- Back-of-the-envelope estimation for 100M URLs per day
- Caching, sharding, and analytics
This is part 12 of the System Design Tutorial series. Follow along to learn system design from scratch.