Distributed Sort & Hash
Scaling from 1 to N Machines
Hash Partitioning: Foundation of Distribution
The Core Problem
When your dataset outgrows a single machine's RAM, you need a strategy to split it across multiple machines. Enter hash partitioning, the unsung hero that keeps related data together. It prevents the chaos of scattering a user's information between Ohio and Tokyo.
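The rule itself is one line: hash the key, take it modulo the machine count. A minimal sketch (the user/listen records below are made up for illustration):

```python
import hashlib

def partition(key: str, num_machines: int) -> int:
    # Hash the key, then map the digest onto one of the machines.
    # All records sharing a key land on the same machine.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_machines

# Every record for the same (hypothetical) user routes identically:
listens = [("alice", "song_1"), ("bob", "song_2"), ("alice", "song_3")]
homes = {user: partition(user, 4) for user, _ in listens}
```

Because the digest depends only on the key, both of alice's records end up on the same machine, no coordination needed.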
Distributed Hash Join with Spotify Data
Joining Users and Listens Tables
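The join follows directly from partitioning: route both tables by the join key, join locally on each machine, then union. A single-process toy sketch of that three-step pattern (the Spotify-style table contents and machine count are invented for illustration):

```python
import hashlib
from collections import defaultdict

def h(key: str, n: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

# Hypothetical tables: users(user_id, name), listens(user_id, song).
users = [("u1", "Ada"), ("u2", "Grace")]
listens = [("u1", "song_a"), ("u2", "song_b"), ("u1", "song_c")]

N = 3  # pretend cluster size
# Step 1: route both tables by the join key (user_id).
users_part = defaultdict(list)
listens_part = defaultdict(list)
for uid, name in users:
    users_part[h(uid, N)].append((uid, name))
for uid, song in listens:
    listens_part[h(uid, N)].append((uid, song))

# Step 2: each machine joins only its local partitions.
def local_join(u_rows, l_rows):
    names = {uid: name for uid, name in u_rows}
    return [(uid, names[uid], song) for uid, song in l_rows if uid in names]

# Step 3: the union of the local results is the full join.
result = []
for m in range(N):
    result.extend(local_join(users_part[m], listens_part[m]))
```

Because matching rows from both tables hash to the same machine, no machine ever needs a row it doesn't hold.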
Consistent Hashing: The Elastic Solution
The Problem: Adding/Removing Machines
Standard hashing (machine = hash(key) % N) is a logistical nightmare when scaling. Change the number of machines, and you're looking at a massive reshuffle:
- From 3 to 4 machines: 75% of data ends up misplaced.
- From 100 to 101 machines: a staggering 99% of data is uprooted.
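Both percentages are easy to verify by brute force, since a key moves exactly when its modulus changes (here k stands in for the key's hash, assumed uniform):

```python
def fraction_moved(num_keys: int, old_n: int, new_n: int) -> float:
    # Treat k as the key's hash; a key moves when k % N changes.
    moved = sum(1 for k in range(num_keys) if k % old_n != k % new_n)
    return moved / num_keys

# num_keys chosen as a multiple of lcm(old_n, new_n) for exact fractions.
print(round(fraction_moved(120_000, 3, 4), 2))      # 0.75
print(round(fraction_moved(121_200, 100, 101), 2))  # 0.99
```

A key stays put only when k % old_n == k % new_n, which happens for just a sliver of the range, hence the near-total reshuffle.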
The Solution: Hash Ring
Forget modulo. Think in circles:
- Create a ring: numbers 0 to 2³²-1 form a circle.
- Place machines on the ring: hash(machine_id) determines their spot.
- Place keys on the ring: hash(key) finds their position.
- Key ownership rule: move clockwise from a key until you hit a machine; that's its home.
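The four rules above fit in a few lines. A minimal sketch without virtual nodes (real rings like Cassandra's add many virtual points per machine to even out segment sizes; the machine names are arbitrary):

```python
import bisect
import hashlib

RING_SIZE = 2**32

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING_SIZE

class HashRing:
    def __init__(self, machines):
        # Place each machine on the ring at hash(machine_id).
        self.points = sorted((ring_hash(m), m) for m in machines)

    def owner(self, key: str) -> str:
        # Move clockwise from the key's position to the first machine.
        pos = ring_hash(key)
        idx = bisect.bisect_left(self.points, (pos, ""))
        return self.points[idx % len(self.points)][1]  # wrap past the top

ring = HashRing(["m1", "m2", "m3"])
home = ring.owner("user:42")
```

The modulo only appears to wrap around the top of the circle; ownership is defined purely by position, so adding a machine never disturbs the other machines' points.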
Why Only Neighbors Are Affected
Introduce Machine X:
- X takes over a segment from its clockwise neighbor.
- Only keys between X and its predecessor shift to X.
- The rest remain undisturbed.
Example: Add a machine to a 3-machine setup:
- Standard hashing: 75% of keys shift.
- Consistent hashing: only about 25% move, affecting just a quarter of the ring.
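The two schemes can be compared head to head by simulation. A sketch using virtual nodes so segment sizes even out (key and machine names are arbitrary):

```python
import bisect
import hashlib

R = 2**32

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % R

def ring_points(machines, vnodes=100):
    # Each machine owns many small segments (virtual nodes).
    return sorted((h(f"{m}#{v}"), m) for m in machines for v in range(vnodes))

def owner(key, pts):
    i = bisect.bisect_left(pts, (h(key), ""))
    return pts[i % len(pts)][1]  # wrap around the ring

keys = [f"key:{i}" for i in range(10_000)]
old, new = ["m1", "m2", "m3"], ["m1", "m2", "m3", "m4"]
old_pts, new_pts = ring_points(old), ring_points(new)

# Standard hashing: a key moves whenever its modulus changes (3 -> 4).
std = sum(h(k) % 3 != h(k) % 4 for k in keys) / len(keys)
# Consistent hashing: a key moves only if m4's new segments capture it.
cons = sum(owner(k, old_pts) != owner(k, new_pts) for k in keys) / len(keys)
print(f"standard: {std:.0%}  consistent: {cons:.0%}")
```

On the ring, a key's owner can only change to the newcomer m4, which is why the moved fraction hovers around a quarter instead of three quarters.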
Distributed Sort (BigSort)
3 Phases to Global Order
Key Takeaways
1. Hash Partitioning
The Backbone of Distributed Processing
- Keeps all data with the same key on the same machine.
- Enables parallel processing without chaos.
- Cost: 2N IOs (one read, one write).
- Scales effortlessly to hundreds of machines.
2. Distributed Hash Joins
Scaling Joins
- Route tables by join key.
- Each machine handles its local join.
- Union results for a complete join.
- More machines mean linear speedup.
3. Consistent Hashing
Seamless Scaling
- Maps keys and nodes to a ring, sidestepping modulo.
- Adding a node only affects its immediate neighbors.
- On average, only K/N keys move (versus K keys with standard hashing).
- Utilized by Cassandra, DynamoDB, Memcached.
4. Distributed Sort (BigSort)
Achieving Global Order
- Phase 1: Local sort on each machine.
- Phase 2: Sample and determine global split points.
- Phase 3: Redistribute and finalize sorting.
- Ensures total order across the cluster.
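A single-process sketch of the three phases (the per-partition sample size and the in-memory "machines" are simplifications for illustration; real systems sample and spill to disk far more carefully):

```python
import random

def big_sort(partitions):
    n = len(partitions)

    # Phase 1: each machine sorts its local data.
    local = [sorted(p) for p in partitions]

    # Phase 2: sample each partition, then pick n-1 global split points.
    sample = sorted(x for p in local
                    for x in random.sample(p, min(10, len(p))))
    step = len(sample) // n
    splits = [sample[(i + 1) * step] for i in range(n - 1)]

    # Phase 3: redistribute by range, then sort each range locally.
    ranges = [[] for _ in range(n)]
    for p in local:
        for x in p:
            dest = sum(x >= s for s in splits)  # index of x's range bucket
            ranges[dest].append(x)

    # Concatenating the machines' outputs yields a total order.
    return [x for r in (sorted(r) for r in ranges) for x in r]

random.seed(0)
data = [random.randint(0, 1000) for _ in range(300)]
parts = [data[0:100], data[100:200], data[200:300]]
out = big_sort(parts)
```

Machine i holds exactly the keys between split points i-1 and i, so simply reading the machines in order gives the globally sorted result.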