Distributed Sort & Hash
Concept. Distributed sort and hash are the two foundational data-movement primitives in a cluster: hash partitioning routes every row with the same key to the same machine (so joins and group-bys can run locally), and distributed sort orders the rows on each machine so per-key reductions can stream.
Intuition. When Spotify wants to count "lifetime listens per user" across a year of play logs, hash-partitioning by user_id sends every event for Mickey to the same node, every event for Minnie to another, so each node can sort its own subset and emit (user_id, count) pairs without ever talking to peers about other users.
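The routing-then-local-counting idea above can be sketched in a few lines. This is a minimal single-process simulation (the node list, event tuples, and user names are illustrative, not Spotify's actual schema): partitions are plain lists standing in for machines, and each "node" counts only the users routed to it.

```python
from collections import Counter, defaultdict

# Hypothetical play-log events: (user_id, track_id) pairs.
events = [("mickey", "t1"), ("minnie", "t2"), ("mickey", "t3"),
          ("goofy", "t1"), ("minnie", "t4"), ("mickey", "t5")]

NUM_NODES = 3

# Shuffle phase: route each event to hash(user_id) % NUM_NODES,
# so every event for a given user lands on the same node.
nodes = defaultdict(list)
for user_id, track_id in events:
    nodes[hash(user_id) % NUM_NODES].append((user_id, track_id))

# Local reduce: each node counts only the users it owns --
# no node ever needs to ask a peer about another user's events.
listens = Counter()
for partition in nodes.values():
    listens.update(user_id for user_id, _ in partition)

print(dict(listens))  # counts: mickey=3, minnie=2, goofy=1
```

Because the shuffle guarantees co-location, the reduce step is embarrassingly parallel: each partition's counts are complete on their own.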
Hash Partitioning: Foundation of Distribution
The Core Problem
When your dataset outgrows your RAM, you need a strategy to split it across multiple machines. Enter hash partitioning, the unsung hero that keeps related data together. It prevents the chaos of scattering a user's information between Ohio and Tokyo.
Consistent Hashing: The Elastic Solution
The Problem: Adding/Removing Machines
Standard hashing (machine = hash(key) % N) is a logistical nightmare when scaling. Change the number of machines, and you're looking at a massive reshuffle:
- From 3 to 4 machines: 75% of data ends up misplaced.
- From 100 to 101 machines: a staggering 99% of data is uprooted.
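You can verify those churn percentages with a quick simulation, a sketch that treats the key's integer value as its hash and counts how many keys change owner when `% N` becomes `% (N+1)`:

```python
# Fraction of keys whose owner changes when N machines become M
# under the rule machine = hash(key) % N.
def churn(n_before: int, n_after: int, n_keys: int = 100_000) -> float:
    moved = sum(1 for k in range(n_keys)
                if k % n_before != k % n_after)
    return moved / n_keys

print(f"3 -> 4 machines:     {churn(3, 4):.0%} of keys move")
print(f"100 -> 101 machines: {churn(100, 101):.0%} of keys move")
```

The result matches the figures above: roughly 75% and 99% of keys are reassigned, which is why modulo hashing makes elastic scaling so painful.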
The Solution: Hash Ring
Forget modulo. Think in circles:

- Create a ring: the numbers 0 to 2³²-1 form a circle.
- Place machines on the ring: hash(machine_id) determines each machine's spot.
- Place keys on the ring: hash(key) finds each key's position.
- Key ownership rule: move clockwise from a key until you hit a machine; that machine is the key's home.
Key Takeaways
1. Hash Partitioning
The Backbone of Distributed Processing
- Keeps all data with the same key on the same machine.
- Enables parallel processing without chaos.
- Cost: 2N IOs, since each of the N pages is read once and written once.
- Scales effortlessly to hundreds of machines.
2. Distributed Hash Joins
Scaling Joins
- Route both tables by the join key.
- Each machine handles its local join.
- Union the results for the complete join.
- More machines mean linear speedup.
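The three steps above can be sketched as a single-process simulation (the `users` and `plays` tables and the modulo routing are illustrative stand-ins for a real shuffle):

```python
from collections import defaultdict

# Hypothetical tables: users(user_id, name) and plays(user_id, track).
users = [(1, "mickey"), (2, "minnie"), (3, "goofy")]
plays = [(1, "t1"), (2, "t2"), (1, "t3")]

NUM_NODES = 2

# Step 1: route BOTH tables by the join key, so matching rows co-locate.
user_parts = defaultdict(list)
play_parts = defaultdict(list)
for row in users:
    user_parts[row[0] % NUM_NODES].append(row)
for row in plays:
    play_parts[row[0] % NUM_NODES].append(row)

# Step 2: each node runs a classic hash join on its local partitions.
def local_join(u_rows, p_rows):
    by_id = {uid: name for uid, name in u_rows}
    return [(uid, by_id[uid], track)
            for uid, track in p_rows if uid in by_id]

# Step 3: union the per-node results -- together they form the full join.
result = []
for node in range(NUM_NODES):
    result.extend(local_join(user_parts[node], play_parts[node]))

print(sorted(result))
# [(1, 'mickey', 't1'), (1, 'mickey', 't3'), (2, 'minnie', 't2')]
```

Because step 1 guarantees that rows sharing a join key land on the same node, no node ever needs a row held elsewhere, which is what makes the speedup scale with machine count.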
3. Consistent Hashing
Seamless Scaling
- Maps keys and nodes to a ring, sidestepping modulo arithmetic.
- Adding a node only affects its immediate neighbors.
- On average, only K/N keys move (versus K keys with standard hashing).
- Utilized by Cassandra, DynamoDB, and Memcached.
4. Distributed Sort (BigSort)
Achieving Global Order
- Phase 1: local sort on each machine.
- Phase 2: sample the data and determine global split points.
- Phase 3: redistribute by range and finalize the sort.
- Ensures total order across the cluster.
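The three phases can be sketched as a small simulation. This is a minimal single-process version under simplifying assumptions (lists stand in for machines, every 10th element of each sorted shard serves as the sample, and split points are equally spaced quantiles of that sample):

```python
import random

# Hypothetical setup: each "machine" holds an unsorted shard of keys.
random.seed(0)
shards = [[random.randint(0, 999) for _ in range(100)] for _ in range(4)]

# Phase 1: local sort on each machine.
shards = [sorted(s) for s in shards]

# Phase 2: sample each shard, then pick global split points
# (one fewer than the number of machines) from the pooled sample.
sample = sorted(x for s in shards for x in s[::10])
n = len(shards)
splits = [sample[len(sample) * i // n] for i in range(1, n)]

# Phase 3: redistribute by range, then sort each range locally.
ranges = [[] for _ in range(n)]
for s in shards:
    for x in s:
        dest = sum(x >= sp for sp in splits)  # index of the owning range
        ranges[dest].append(x)
ranges = [sorted(r) for r in ranges]

# Concatenating the ranges in order now yields a total order
# across the whole "cluster" -- no final merge step is needed.
flat = [x for r in ranges for x in r]
assert flat == sorted(flat)
```

Sampling is what keeps the ranges balanced: without it, a skewed key distribution could dump most rows onto one machine in phase 3.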