Hash Partition Join: Joining Tables Bigger Than RAM

Concept. Hash partition join handles the case where neither table fits in RAM by hashing both tables on the join key into matching partitions, then joining each partition pair independently.

Intuition. When both Users and Listens are too big for RAM, hash each row on user_id % 16 to split each table into 16 partitions. Mickey's Users row and Mickey's Listens rows land in the same partition, so join partition 1 with partition 1 in RAM, then 2 with 2, and so on. Cost: roughly 3(N+M) I/Os, where N and M are the page counts of the two tables.
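A tiny sketch makes the co-partitioning intuition concrete. The table contents below are made up for illustration; the point is that the same user_id maps to the same partition index on both sides, so matches can only occur within a partition pair:

```python
# Illustrative only: same join key -> same partition index on both sides.
NUM_PARTITIONS = 16

users = [(1, "Mickey"), (2, "Minnie"), (17, "Goofy")]               # (user_id, name)
listens = [(1, "song_a", 5), (17, "song_b", 3), (1, "song_c", 4)]   # (user_id, song_id, rating)

def partition_of(user_id):
    return user_id % NUM_PARTITIONS

# Mickey (user_id=1) and all of his listens land in partition 1.
# Goofy (user_id=17) also lands in partition 1 (17 % 16 == 1) -- that's fine:
# a partition may hold many keys, it just never splits one key across partitions.
user_parts = {u[0]: partition_of(u[0]) for u in users}
listen_parts = {l[0]: partition_of(l[0]) for l in listens}
assert all(user_parts[uid] == listen_parts[uid] for uid in listen_parts)
```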

When Neither Table Fits in Memory

The query we want to run:

SELECT u.name, l.song_id, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id;

If Users is 100 GB and Listens is 800 GB but RAM is only 64 GB, neither table fits in memory. Block nested loop join (BNLJ) would re-scan the larger table once per block of the smaller one, which is too slow. Sort-merge join requires sorting both tables first, which is also expensive. Hash partition join is the answer.


Hash Partition Join: The Solution

Core Insight

Partition both tables using the same hash function on the join key. Rows from each side with the same key land in the same partition number, so matching can happen one partition at a time, in RAM.

Hash partition join works in two phases: in Phase 1, both tables are hashed on the join key into matching partitions (the same key always lands in the same partition number on both sides). In Phase 2, each partition pair is joined independently in RAM.

HashPartitionJoin(T1, T2, key):    // Assume |T1| < |T2|
    B = ceil(|T1| / RAM)    // partition count: each partition of the smaller
                            // (build) table must fit in RAM

    // Phase 1: Partition both tables (uses B output buffers + 1 input buffer)
    for row in T1: write to T1_p[h(row.key) % B]
    for row in T2: write to T2_p[h(row.key) % B]    // same hash function!

    // Phase 2: Join matching partitions
    for i in 0..B-1:
        HT = hash_table(T1_p[i])    // build on the smaller side
        for row in T2_p[i]:         // probe with the larger side
            if HT[row.key]: output matches

Approximate Cost: 3(N+M) + OUT I/Os, where N and M are the page counts of T1 and T2 and OUT is the size of the result: each table is read once, written out as partitions once, and read once more during the join.
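The pseudocode above can be sketched as runnable Python. This is a minimal in-memory illustration, not a real implementation: lists of lists stand in for partition files on disk, and all names are our own.

```python
# In-memory sketch of hash partition join (Phase 1 + Phase 2).
# Real systems spill partitions to disk; here lists simulate partition files.
def hash_partition_join(t1, t2, key1, key2, num_partitions=4):
    """t1, t2: lists of dicts. Assumes t1 is the smaller (build) side."""
    # Phase 1: partition both inputs with the SAME hash function.
    p1 = [[] for _ in range(num_partitions)]
    p2 = [[] for _ in range(num_partitions)]
    for row in t1:
        p1[hash(row[key1]) % num_partitions].append(row)
    for row in t2:
        p2[hash(row[key2]) % num_partitions].append(row)

    # Phase 2: join each partition pair independently.
    out = []
    for i in range(num_partitions):
        ht = {}
        for row in p1[i]:                        # build on the smaller side
            ht.setdefault(row[key1], []).append(row)
        for row in p2[i]:                        # probe with the larger side
            for match in ht.get(row[key2], []):
                out.append({**match, **row})
    return out

users = [{"user_id": 1, "name": "Mickey"}, {"user_id": 2, "name": "Minnie"}]
listens = [{"user_id": 1, "song_id": "s1", "rating": 5},
           {"user_id": 1, "song_id": "s2", "rating": 4},
           {"user_id": 3, "song_id": "s3", "rating": 2}]
rows = hash_partition_join(users, listens, "user_id", "user_id")
# Only Mickey's two listens have a matching Users row.
```

Note that the listen with user_id = 3 produces no output: its partition's hash table has no entry for that key, which is exactly the inner-join semantics of the SQL query above.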

Multi-Table Joins

Joining Three Tables: Users, Listens, Songs

ThreeWayJoin(U, L, S):    // Assume |U| < |L| < |S|
    temp = HPJ(U, L, "user_id")    // U ⋈ L
    result = HPJ(temp, S, "song_id")    // temp ⋈ S
    return result

Approximate Cost: 3(|U|+|L|) + |temp| + 3(|temp|+|S|) + OUT (the extra |temp| term is the cost of writing the intermediate result to disk; reading it back is already counted in the second join's 3(|temp|+|S|))
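Plugging in hypothetical sizes gives a feel for where the I/O goes. The 100 GB and 800 GB figures come from the example above; |S| = 300 GB and |temp| = 150 GB are assumptions made up for illustration:

```python
# Hypothetical table sizes in GB; S and TEMP are assumed for illustration.
U, L, S = 100, 800, 300
TEMP = 150  # assumed size of the intermediate result U JOIN L

join1 = 3 * (U + L)         # partition + join U and L
write_temp = TEMP           # write the intermediate result to disk
join2 = 3 * (TEMP + S)      # partition + join temp and S
total = join1 + write_temp + join2   # plus OUT for the final result
# 2700 + 150 + 1350 = 4200 GB of I/O, plus OUT
```

The size of the intermediate result dominates the second join's cost, which is why real query optimizers work hard to pick a join order that keeps intermediates small.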

Handling Skew in HPJ

The cost analysis above assumes partitions are roughly balanced. In practice, hot keys break that assumption. If user_id = 1 (a system account) has 10M listens, its partition won't fit in RAM even though the others do. The fix is to detect oversized partitions during Phase 1 and recursively re-partition them with a different hash function, until each piece fits.
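Recursive re-partitioning can be sketched as follows. This is our own illustrative code, not a production implementation: a "salt" mixed into the hash makes each recursion level spread the keys differently.

```python
# Sketch of recursive re-partitioning for skewed partitions.
# An oversized partition is re-hashed with a different salt and split again.
def partition(rows, key, num_parts, budget, salt=0, depth=0, max_depth=3):
    parts = [[] for _ in range(num_parts)]
    for row in rows:
        parts[hash((row[key], salt)) % num_parts].append(row)
    result = []
    for p in parts:
        if len(p) > budget and depth < max_depth:
            # Oversized partition: re-partition it with a different salt.
            result.extend(partition(p, key, num_parts, budget,
                                    salt + 1, depth + 1, max_depth))
        else:
            result.append(p)
    return result  # flat list of partitions, each intended to fit the budget
```

One caveat: re-hashing can only separate *different* keys that collided. A single key whose rows alone exceed the memory budget (like the 10M-listen system account) can never be split this way, so real systems fall back to another strategy for that partition, such as a block nested loop join or streaming the build side.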


Key Takeaways

  1. HPJ enables joining tables much larger than RAM. The two-phase trick reduces the problem to many small in-RAM joins.

  2. Skew is the main failure mode. Recursive re-partitioning, or a fallback to sort-merge join, handles it.

  3. Foundation of distributed joins in Spark and Hadoop. Same shuffle pattern.


Next

Sort-Merge Join → When the data is already sorted, or the sorted output is useful for later operations, sort-merge join beats hash partition join.