Distributed Sort & Hash

Scaling from 1 to N Machines


Hash Partitioning: Foundation of Distribution

The Core Problem

When your dataset outgrows one machine's RAM, you need a strategy to split it across multiple machines. Enter hash partitioning, the unsung hero that keeps related data together: hash each record's key, and the hash value decides which machine the record lives on. Since the same key always hashes to the same machine, it prevents the chaos of scattering a user's information between Ohio and Tokyo.
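The idea above can be sketched in a few lines. This is a minimal illustration, not a production scheme; the `partition` helper and the key names are hypothetical:

```python
import hashlib

def partition(key: str, num_machines: int) -> int:
    """Hash the key and map it to one of num_machines machines."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_machines

# The same key always lands on the same machine, so every record
# for user_42 ends up together.
print(partition("user_42", 4))
```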


Distributed Hash Join with Spotify Data

Joining Users and Listens Tables
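A single-process sketch of the two-phase distributed hash join, with "machines" simulated as lists. The table layouts (`user_id`, `country`, `track`) are made-up sample data, not Spotify's actual schema:

```python
import hashlib
from collections import defaultdict

def machine_for(key: str, n: int) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

def distributed_hash_join(users, listens, n_machines=3):
    # Phase 1: partition both tables by the join key (user_id), so
    # matching rows from each table land on the same machine.
    user_parts = defaultdict(list)
    listen_parts = defaultdict(list)
    for row in users:
        user_parts[machine_for(row["user_id"], n_machines)].append(row)
    for row in listens:
        listen_parts[machine_for(row["user_id"], n_machines)].append(row)

    # Phase 2: each machine builds a local hash table over its users
    # and joins its listens against it, with no cross-machine traffic.
    joined = []
    for m in range(n_machines):
        local_users = {u["user_id"]: u for u in user_parts[m]}
        for l in listen_parts[m]:
            u = local_users.get(l["user_id"])
            if u is not None:
                joined.append({**u, **l})
    return joined

users = [{"user_id": "u1", "country": "SE"}, {"user_id": "u2", "country": "US"}]
listens = [{"user_id": "u1", "track": "Song A"}, {"user_id": "u1", "track": "Song B"}]
print(distributed_hash_join(users, listens))
```

The key point is Phase 1: because both tables are partitioned with the same hash function on the same key, every join can complete locally.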


Consistent Hashing: The Elastic Solution

The Problem: Adding/Removing Machines

Standard hashing (machine = hash(key) % N) is a logistical nightmare when scaling. Change the number of machines and nearly every key's modulus changes, so almost the entire dataset has to move.
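You can measure the damage directly. A quick sketch (hypothetical key names) counting how many keys change owners when a cluster grows from 3 to 4 machines:

```python
import hashlib

def machine_for(key: str, n: int) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

keys = [f"user_{i}" for i in range(10_000)]
moved = sum(machine_for(k, 3) != machine_for(k, 4) for k in keys)
# Roughly (N-1)/N of the keys move -- here about 3/4 of the dataset.
print(f"{moved / len(keys):.0%} of keys changed machines")
```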

The Solution: Hash Ring

Forget modulo. Think in circles:

  1. Create a ring: Numbers 0 to 2³²-1 form a circle.

  2. Place machines on the ring: hash(machine_id) determines their spot.

  3. Place keys on the ring: hash(key) finds their position.

  4. Key ownership rule: Move clockwise from a key until you hit a machine—that's its home.
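The four steps above can be sketched with a sorted list and binary search. This is a bare-bones ring (no virtual nodes or replication), with hypothetical machine IDs:

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    # A position on the ring 0 .. 2**32 - 1.
    return int(hashlib.sha256(s.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, machines):
        # Step 2: place machines on the ring at hash(machine_id).
        self.ring = sorted((ring_hash(m), m) for m in machines)

    def owner(self, key: str) -> str:
        # Steps 3-4: hash the key, then walk clockwise to the first
        # machine at or after that position, wrapping past the top.
        pos = ring_hash(key)
        idx = bisect.bisect_left(self.ring, (pos, ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["machine-A", "machine-B", "machine-C"])
print(ring.owner("user_42"))
```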

Why Only Neighbors Are Affected

Introduce a new machine X at some point on the ring: only the keys lying between X and its counterclockwise neighbor change owners, moving from X's clockwise successor to X. Every other machine's assignments are untouched.

Example: add a fourth machine to a 3-machine ring, and on average only about a quarter of the keys move, all of them onto the new machine.
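This neighbors-only property can be verified empirically. A sketch (hypothetical machine and key names) that adds a machine X to a 3-machine ring and checks that no key ever shuffles between the old machines:

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16) % (2**32)

def owner(key: str, machines) -> str:
    ring = sorted((ring_hash(m), m) for m in machines)
    idx = bisect.bisect_left(ring, (ring_hash(key), ""))
    return ring[idx % len(ring)][1]

old = ["A", "B", "C"]
new = old + ["X"]
keys = [f"user_{i}" for i in range(5_000)]
for k in keys:
    before, after = owner(k, old), owner(k, new)
    # Every key either stays put or moves to the new machine X;
    # none is reshuffled between A, B, and C.
    assert before == after or after == "X"
```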


Distributed Sort (BigSort)

3 Phases to Global Order
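Assuming the three phases follow the classic TeraSort-style recipe, namely (1) sample keys to pick range boundaries, (2) range-partition records to machines, (3) sort locally on each machine, a single-process sketch with machines simulated as lists looks like this:

```python
import random

def big_sort(records, n_machines=3):
    # Phase 1: sample the data to choose range boundaries (splitters)
    # so each machine receives a roughly equal slice of the key space.
    sample = sorted(random.sample(records, min(100, len(records))))
    splitters = [sample[i * len(sample) // n_machines]
                 for i in range(1, n_machines)]

    # Phase 2: shuffle each record to the machine owning its range.
    machines = [[] for _ in range(n_machines)]
    for r in records:
        m = sum(r >= s for s in splitters)  # count splitters it passes
        machines[m].append(r)

    # Phase 3: each machine sorts locally; concatenating machines in
    # range order yields a globally sorted result with no merge step.
    return [r for part in machines for r in sorted(part)]

data = random.sample(range(100_000), 10_000)
assert big_sort(data) == sorted(data)
```

The sampling in Phase 1 is what keeps the load balanced: without it, skewed data could dump almost everything onto one machine.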


Key Takeaways

1. Hash Partitioning

The Backbone of Distributed Processing

2. Distributed Hash Joins

Scaling Joins

3. Consistent Hashing

Seamless Scaling

4. Distributed Sort (BigSort)

Achieving Global Order