Distributed Sort & Hash

Concept. Distributed sort and hash are the two foundational data-movement primitives in a cluster: hash partitioning routes every row with the same key to the same machine (so joins and group-bys can run locally), and distributed sort orders rows both within and across machines, so per-key reductions can stream and a global order can be assembled.

Intuition. When Spotify wants to count "lifetime listens per user" across a year of play logs, hash-partitioning by user_id sends every event for Mickey to the same node, every event for Minnie to another, so each node can sort its own subset and emit (user_id, count) pairs without ever talking to peers about other users.

Hash Partitioning: Foundation of Distribution

The Core Problem

When your dataset outgrows your RAM, you need a strategy to split it across multiple machines. Enter hash partitioning, the unsung hero that keeps related data together: hash each row's key to pick its machine, so all of a user's data lands on one node instead of being scattered between Ohio and Tokyo.
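
To make that concrete, here is a minimal single-process sketch of the idea. The function names and the three-"machine" setup are illustrative, not a real cluster API, and MD5 stands in for whatever hash a real system uses: hashing each key picks a partition, so every event for a given user lands in the same bucket and can be counted locally.

```python
import hashlib
from collections import Counter, defaultdict

def pick_machine(key: str, num_machines: int) -> int:
    """Hash the key and map it to one of num_machines partitions."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_machines

def shuffle_and_count(events, num_machines=3):
    """Route each user_id to its partition, then count plays locally per partition."""
    partitions = defaultdict(list)
    for user_id in events:
        partitions[pick_machine(user_id, num_machines)].append(user_id)
    # Each "machine" counts only its own users; no cross-partition traffic is needed.
    return {p: Counter(ids) for p, ids in partitions.items()}

print(shuffle_and_count(["mickey", "minnie", "mickey", "goofy", "mickey"]))
# Every "mickey" event lands in exactly one partition, so its lifetime count is complete there.
```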


Consistent Hashing: The Elastic Solution

The Problem: Adding/Removing Machines

Standard hashing (machine = hash(key) % N) is a logistical nightmare when scaling. Change the number of machines and you're looking at a massive reshuffle, as the quick simulation after this list shows:

  • From 3 to 4 machines: about 75% of keys end up misplaced (a key stays put only if hash(key) % 3 equals hash(key) % 4).

  • From 100 to 101 machines: A staggering 99% of data is uprooted.
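
A quick simulation reproduces those fractions. The key names are hypothetical and MD5 is just a convenient stand-in hash; the point is that a key stays put only when its hash gives the same remainder under both machine counts.

```python
import hashlib

def bucket(key: str, num_machines: int) -> int:
    """Classic modulo placement: machine = hash(key) % num_machines."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_machines

def moved_fraction(num_keys: int, before: int, after: int) -> float:
    """Fraction of keys whose machine changes when the cluster grows from `before` to `after`."""
    keys = [f"key-{i}" for i in range(num_keys)]
    moved = sum(bucket(k, before) != bucket(k, after) for k in keys)
    return moved / num_keys

print(moved_fraction(100_000, 3, 4))      # roughly 0.75
print(moved_fraction(100_000, 100, 101))  # roughly 0.99
```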

The Solution: Hash Ring

Forget modulo. Think in circles:

  1. Create a ring: Numbers 0 to 2³²-1 form a circle.

  2. Place machines on the ring: hash(machine_id) determines their spot.

  3. Place keys on the ring: hash(key) finds their position.

  4. Key ownership rule: Move clockwise from a key until you hit a machine; that machine is its home (see the sketch below).

[Figure: consistent hashing ring (0 to 2³² − 1) with keys and nodes mapped onto the circular hash space]
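
A minimal sketch of the ring, assuming MD5 for the hash and a plain sorted list for the node positions (real systems typically add virtual nodes per machine for better balance):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: nodes and keys map to points on [0, 2**32)."""

    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

    def owner(self, key: str) -> str:
        """Walk clockwise from the key's position to the first node, wrapping past the top."""
        idx = bisect.bisect_right(self._positions, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("mickey"), ring.owner("minnie"))
# Adding "node-d" would only claim the keys between node-d and its counter-clockwise neighbor.
```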


Key Takeaways

1. Hash Partitioning

The Backbone of Distributed Processing

  • Keeps all data with the same key on the same machine.

  • Enables parallel processing without chaos.

  • Cost: roughly 2N I/Os for N records (each record is read once and written once during the shuffle).

  • Scales effortlessly to hundreds of machines.

2. Distributed Hash Joins

Scaling Joins

  • Route tables by join key.

  • Each machine handles its local join.

  • Union results for a complete join.

  • More machines mean near-linear speedup, as long as no single key dominates (a minimal sketch follows).
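
That takeaway in miniature, as a single-process sketch: both tables are routed by the join key, each partition runs an ordinary local hash join, and the union of the partition outputs is the full join. The table contents and the three-partition setup are made up for illustration.

```python
from collections import defaultdict

def route(rows, num_machines):
    """Send each (key, value) row to partition hash(key) % num_machines."""
    parts = defaultdict(list)
    for key, value in rows:
        parts[hash(key) % num_machines].append((key, value))
    return parts

def distributed_hash_join(left, right, num_machines=3):
    """Route both tables by join key, join locally on each partition, union the outputs."""
    left_parts, right_parts = route(left, num_machines), route(right, num_machines)
    output = []
    for m in range(num_machines):
        build = defaultdict(list)                 # local hash join: build on the left table,
        for key, lval in left_parts.get(m, []):
            build[key].append(lval)
        for key, rval in right_parts.get(m, []):  # then probe with the right table
            output.extend((key, lval, rval) for lval in build[key])
    return output

users = [("u1", "Mickey"), ("u2", "Minnie")]
plays = [("u1", "song-a"), ("u1", "song-b"), ("u2", "song-c")]
print(distributed_hash_join(users, plays))
# Three joined rows: each play is matched with its user, whichever partition it landed on.
```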

3. Consistent Hashing

Seamless Scaling

  • Maps keys and nodes to a ring, sidestepping modulo.

  • Adding a node only affects its immediate neighbors.

  • On average, only K/N keys move when a node joins or leaves (versus nearly all K keys with modulo hashing).

  • Used by Cassandra, DynamoDB, and Memcached client libraries.

4. Distributed Sort (BigSort)

Achieving Global Order

  • Phase 1: Local sort on each machine.

  • Phase 2: Sample and determine global split points.

  • Phase 3: Redistribute and finalize sorting.

  • Ensures a total order across the cluster (a sketch of the three phases follows).
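
A compact sketch of those three phases, assuming each machine's data is represented as one Python list and using a small random sample to pick split points (the sample size of 10 per machine and the function names are arbitrary choices for illustration):

```python
import bisect
import random

def distributed_sort(machine_data, num_machines):
    """Phase 1: local sort. Phase 2: sample split points. Phase 3: range-redistribute and sort."""
    # Phase 1: every machine sorts its own partition.
    locally_sorted = [sorted(part) for part in machine_data]

    # Phase 2: sample a few values per machine, then pick num_machines - 1 global split points.
    sample = sorted(x for part in locally_sorted for x in random.sample(part, min(len(part), 10)))
    step = len(sample) // num_machines
    splits = [sample[(i + 1) * step] for i in range(num_machines - 1)]

    # Phase 3: route each value to the machine that owns its key range, then sort each range.
    ranges = [[] for _ in range(num_machines)]
    for part in locally_sorted:
        for x in part:
            ranges[bisect.bisect_right(splits, x)].append(x)
    return [sorted(r) for r in ranges]

data = [[random.randint(0, 1_000) for _ in range(100)] for _ in range(3)]
ranges = distributed_sort(data, 3)
print([len(r) for r in ranges])
# Concatenating ranges[0] + ranges[1] + ranges[2] yields one globally sorted sequence.
```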