Distributed Sort & Hash
Concept. Distributed sort and hash are the two foundational data-movement primitives in a cluster: hash partitioning routes every row with the same key to the same machine (so joins and group-bys can run locally), and distributed sort orders the rows on each machine so per-key reductions can stream.
Intuition. When Spotify wants to count "lifetime listens per user" across a year of play logs, hash-partitioning by user_id sends every event for Mickey to the same node, every event for Minnie to another, so each node can sort its own subset and emit (user_id, count) pairs without ever talking to peers about other users.
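The routing-then-local-counting idea above can be sketched in a few lines. This is a minimal single-process simulation (the node list, event tuples, and user names are illustrative, not Spotify's actual schema): partitions are plain lists standing in for machines, and each "node" counts only the users routed to it.

```python
from collections import Counter, defaultdict

# Hypothetical play-log events: (user_id, track_id) pairs.
events = [("mickey", "t1"), ("minnie", "t2"), ("mickey", "t3"),
          ("goofy", "t1"), ("minnie", "t4"), ("mickey", "t5")]

NUM_NODES = 3

# Shuffle phase: route each event to hash(user_id) % NUM_NODES,
# so every event for a given user lands on the same node.
nodes = defaultdict(list)
for user_id, track_id in events:
    nodes[hash(user_id) % NUM_NODES].append((user_id, track_id))

# Local reduce: each node counts only the users it owns --
# no node ever needs to ask a peer about another user's events.
listens = Counter()
for partition in nodes.values():
    listens.update(user_id for user_id, _ in partition)

print(dict(listens))  # counts: mickey=3, minnie=2, goofy=1
```

Because the shuffle guarantees co-location, the reduce step is embarrassingly parallel: each partition's counts are complete on their own.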
Hash Partitioning: Foundation of Distribution
The Core Problem
When your dataset outgrows your RAM, you need a strategy to split it across multiple machines. Enter hash partitioning, the unsung hero that keeps related data together. It prevents the chaos of scattering a user's information between Ohio and Tokyo.
Consistent Hashing: The Elastic Solution
The Problem: Adding/Removing Machines
Standard hashing (machine = hash(key) % N) is a logistical nightmare when scaling. Change the number of machines, and you're looking at a massive reshuffle:
- From 3 to 4 machines: 75% of data ends up misplaced.
- From 100 to 101 machines: a staggering 99% of data is uprooted.
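You can verify those churn percentages with a quick simulation, a sketch that treats the key's integer value as its hash and counts how many keys change owner when `% N` becomes `% (N+1)`:

```python
# Fraction of keys whose owner changes when N machines become M
# under the rule machine = hash(key) % N.
def churn(n_before: int, n_after: int, n_keys: int = 100_000) -> float:
    moved = sum(1 for k in range(n_keys)
                if k % n_before != k % n_after)
    return moved / n_keys

print(f"3 -> 4 machines:     {churn(3, 4):.0%} of keys move")
print(f"100 -> 101 machines: {churn(100, 101):.0%} of keys move")
```

The result matches the figures above: roughly 75% and 99% of keys are reassigned, which is why modulo hashing makes elastic scaling so painful.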
The Solution: Hash Ring
Forget modulo. Think in circles:

- Create a ring: the numbers 0 to 2³²-1 form a circle.
- Place machines on the ring: hash(machine_id) determines each machine's spot.
- Place keys on the ring: hash(key) finds each key's position.
- Key ownership rule: move clockwise from a key until you hit a machine; that machine is the key's home.
Key Takeaways
1. Hash Partitioning
The Backbone of Distributed Processing
- Keeps all data with the same key on the same machine.
- Enables parallel processing without chaos.
- Cost: 2N IOs, since each of the N pages is read once and written once.
- Scales effortlessly to hundreds of machines.
2. Distributed Hash Joins
Scaling Joins
- Route both tables by the join key.
- Each machine handles its local join.
- Union the results for the complete join.
- More machines mean linear speedup.
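The three steps above can be sketched as a single-process simulation (the `users` and `plays` tables and the modulo routing are illustrative stand-ins for a real shuffle):

```python
from collections import defaultdict

# Hypothetical tables: users(user_id, name) and plays(user_id, track).
users = [(1, "mickey"), (2, "minnie"), (3, "goofy")]
plays = [(1, "t1"), (2, "t2"), (1, "t3")]

NUM_NODES = 2

# Step 1: route BOTH tables by the join key, so matching rows co-locate.
user_parts = defaultdict(list)
play_parts = defaultdict(list)
for row in users:
    user_parts[row[0] % NUM_NODES].append(row)
for row in plays:
    play_parts[row[0] % NUM_NODES].append(row)

# Step 2: each node runs a classic hash join on its local partitions.
def local_join(u_rows, p_rows):
    by_id = {uid: name for uid, name in u_rows}
    return [(uid, by_id[uid], track)
            for uid, track in p_rows if uid in by_id]

# Step 3: union the per-node results -- together they form the full join.
result = []
for node in range(NUM_NODES):
    result.extend(local_join(user_parts[node], play_parts[node]))

print(sorted(result))
# [(1, 'mickey', 't1'), (1, 'mickey', 't3'), (2, 'minnie', 't2')]
```

Because step 1 guarantees that rows sharing a join key land on the same node, no node ever needs a row held elsewhere, which is what makes the speedup scale with machine count.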
3. Consistent Hashing
Seamless Scaling
- Maps keys and nodes to a ring, sidestepping modulo arithmetic.
- Adding a node only affects its immediate neighbors.
- On average, only K/N keys move (versus K keys with standard hashing).
- Utilized by Cassandra, DynamoDB, and Memcached.
4. Distributed Sort (BigSort)
Achieving Global Order
- Phase 1: local sort on each machine.
- Phase 2: sample the data and determine global split points.
- Phase 3: redistribute by range and finalize the sort.
- Ensures total order across the cluster.
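The three phases can be sketched as a small simulation. This is a minimal single-process version under simplifying assumptions (lists stand in for machines, every 10th element of each sorted shard serves as the sample, and split points are equally spaced quantiles of that sample):

```python
import random

# Hypothetical setup: each "machine" holds an unsorted shard of keys.
random.seed(0)
shards = [[random.randint(0, 999) for _ in range(100)] for _ in range(4)]

# Phase 1: local sort on each machine.
shards = [sorted(s) for s in shards]

# Phase 2: sample each shard, then pick global split points
# (one fewer than the number of machines) from the pooled sample.
sample = sorted(x for s in shards for x in s[::10])
n = len(shards)
splits = [sample[len(sample) * i // n] for i in range(1, n)]

# Phase 3: redistribute by range, then sort each range locally.
ranges = [[] for _ in range(n)]
for s in shards:
    for x in s:
        dest = sum(x >= sp for sp in splits)  # index of the owning range
        ranges[dest].append(x)
ranges = [sorted(r) for r in ranges]

# Concatenating the ranges in order now yields a total order
# across the whole "cluster" -- no final merge step is needed.
flat = [x for r in ranges for x in r]
assert flat == sorted(flat)
```

Sampling is what keeps the ranges balanced: without it, a skewed key distribution could dump most rows onto one machine in phase 3.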