Sharding vs Replication
Concept. Sharding splits one logical dataset across many machines so the cluster can hold and serve more data than any single node can; replication keeps multiple copies of each shard on different machines so a single failure doesn't drop a slice of the data offline. Real systems do both.
Intuition. When Spotify's 1 TB Listens table grows past one machine's disk, sharding by user_id spreads it over 10 machines (each holds ~100 GB), and replication keeps 3 copies of every shard on 3 different machines, so any one disk failure is invisible, and any one user's data can be served from any of 3 places.
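The routing step in that intuition can be sketched in a few lines. This is a minimal illustration, not any real Spotify code: `NUM_SHARDS` and the `user_id` keys are assumptions, and the hash is a stable digest so the same user always lands on the same shard.

```python
import hashlib

NUM_SHARDS = 10  # hypothetical cluster size from the example above

def shard_for(user_id: str) -> int:
    """Map a user_id to one of NUM_SHARDS shards via a stable hash."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All of one user's listens land on the same shard,
# so a single-user query touches exactly one machine.
print(shard_for("user-42"))  # deterministic, somewhere in 0..9
```

Note that Python's built-in `hash()` would not work here: it is randomized per process, so routing would change on every restart. A stable digest (or, later, consistent hashing) is what keeps the mapping durable.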
Production Synthesis
Sharding divides the massive data volume (handling capacity), while replication ensures uptime and reliability by keeping redundant copies of that data (handling availability).
In the real world, you don't choose between them; you use both.
Sharding hurdles: cross-shard queries • rebalancing • hotspots
Replication limits: does not increase write capacity or storage scale
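To make "use both" concrete, here is a minimal sketch of a placement map that combines the two: each shard is assigned to a replication-factor's worth of distinct nodes, wrapping around the node list. The node names, shard counts, and round-robin placement rule are all illustrative assumptions, not a production placement algorithm.

```python
NUM_SHARDS = 4
REPLICATION_FACTOR = 3
NODES = ["node-0", "node-1", "node-2", "node-3", "node-4", "node-5"]

def placement(num_shards: int, rf: int, nodes: list[str]) -> dict[int, list[str]]:
    """Assign each shard to `rf` distinct nodes (sharding + replication)."""
    assignment = {}
    for shard in range(num_shards):
        # Place replicas on consecutive nodes, wrapping around the list.
        assignment[shard] = [nodes[(shard + r) % len(nodes)] for r in range(rf)]
    return assignment

for shard, replicas in placement(NUM_SHARDS, REPLICATION_FACTOR, NODES).items():
    print(f"shard {shard} -> {replicas}")
```

With three distinct nodes per shard, losing any single node still leaves two live copies of every shard, which is exactly the availability story from the intuition above.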
Implementation Reality
"Sharding" is the generic term for slicing data, but how do you actually place those slices across a cluster, and keep finding them, without losing track of your data? In the next section, we explore Hash Partitioning and Consistent Hashing, the mathematical core that makes modern sharding practical.
Common Confusions
Q: Can replication help with scale?
Replication boosts read scale, not write capacity or data size.
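The asymmetry in that answer can be shown with a toy in-memory store (a sketch under simplifying assumptions; the class name and round-robin read policy are invented for illustration): every write must land on every replica, so adding copies never adds write capacity or unique storage, while reads can fan out across the copies.

```python
import itertools

class ReplicatedStore:
    """Toy fully-replicated store: writes hit every copy, reads rotate."""

    def __init__(self, num_replicas: int):
        self.replicas = [{} for _ in range(num_replicas)]
        self._reader = itertools.cycle(range(num_replicas))

    def write(self, key, value):
        # Every replica applies every write: N copies means N times the
        # write work and N times the storage for the *same* data.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads spread across replicas: read throughput scales with copies.
        return self.replicas[next(self._reader)].get(key)
```

Real systems usually route writes through a primary and stream changes to followers, but the arithmetic is the same: more replicas multiply read capacity and fault tolerance, not write capacity or dataset size.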
Q: Why not just replicate instead of sharding?
Replication doesn't solve physical limits. A 10TB dataset won't fit on a 1TB drive, no matter how many copies you make. Sharding is essential for overcoming capacity constraints.
Q: How to choose?
Data too big? → Shard. Need high availability? → Replicate. Both? → Implement both.