Hash Partition Join: Joining Tables Bigger Than RAM
Concept. Hash partition join handles the case where neither table fits in RAM by hashing both tables on the join key into matching partitions, then joining each partition pair independently.
Intuition. When both Users and Listens are too big for RAM, hash user_id % 16 to split each into 16 partitions. Mickey's Users row and Mickey's Listens rows land in the same partition, so join partition 1 with partition 1 in RAM, then 2 with 2, etc. Cost: 3(N+M) IOs.
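The invariant this intuition relies on can be checked in a few lines of Python. The 16-way split and the table rows are from the example above; the row values are illustrative:

```python
NUM_PARTITIONS = 16

def partition_of(user_id: int) -> int:
    # The same function is applied to both tables, so rows with
    # matching user_ids always land in the same partition number.
    return hash(user_id) % NUM_PARTITIONS

users_row = {"user_id": 42, "name": "Mickey"}
listens_row = {"user_id": 42, "song_id": 7, "rating": 5}

# Both rows for user 42 map to the same partition.
assert partition_of(users_row["user_id"]) == partition_of(listens_row["user_id"])
```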
When Both Tables Don't Fit in Memory
The query we want to run:
SELECT u.name, l.song_id, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id;
If Users is 100 GB and Listens is 800 GB but RAM is only 64 GB, neither side fits in memory. Block nested loop join (BNLJ) would re-scan the larger side once per block of the outer, which is too slow. Sort-merge join requires sorting both inputs first, which is also expensive. Hash partition join is the answer.
Hash Partition Join: The Solution
Core Insight
Partition both tables using the same hash function on the join key. Rows from each side with the same key land in the same partition number, so matching can happen one partition at a time, in RAM.
HashPartitionJoin(T1, T2, key):        // Assume |T1| < |T2|
    B = ceil(|T1| / RAM)               // partition count: each T1 partition fits in RAM
    // Phase 1: Partition both tables (uses B-1 output buffers + 1 input buffer)
    for row in T1: write row to T1_p[h(row.key) % B]
    for row in T2: write row to T2_p[h(row.key) % B]   // same hash!
    // Phase 2: Join matching partitions
    for i in 0..B-1:
        HT = hash_table(T1_p[i])       // build on smaller side
        for row in T2_p[i]:            // probe with larger side
            if row.key in HT: output matches
Approximate Cost: 3(N+M) + OUT IOs, where N and M are the sizes of T1 and T2 in pages (each is read, written as partitions, and read again) and OUT is the size of the output.
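The pseudocode can be made concrete as a minimal in-memory Python sketch. Here Python lists stand in for the on-disk partition files, and rows are dicts; the function name and row format are illustrative, not from any particular engine. A real system would spill partitions to disk in Phase 1 and read back one pair at a time in Phase 2:

```python
from collections import defaultdict

def hash_partition_join(t1, t2, key, num_partitions=16):
    """Join two lists of dict rows on `key` in two phases.

    In-memory sketch: each partition list stands in for an
    on-disk partition file.
    """
    # Phase 1: partition both inputs with the same hash function.
    t1_parts = defaultdict(list)
    t2_parts = defaultdict(list)
    for row in t1:
        t1_parts[hash(row[key]) % num_partitions].append(row)
    for row in t2:
        t2_parts[hash(row[key]) % num_partitions].append(row)

    # Phase 2: join each matching partition pair independently.
    out = []
    for i in range(num_partitions):
        # Build a hash table on the (assumed smaller) t1 partition...
        ht = defaultdict(list)
        for row in t1_parts[i]:
            ht[row[key]].append(row)
        # ...then probe it with the larger t2 partition.
        for row in t2_parts[i]:
            for match in ht.get(row[key], []):
                out.append({**match, **row})
    return out

users = [{"user_id": 1, "name": "Mickey"}, {"user_id": 2, "name": "Minnie"}]
listens = [{"user_id": 1, "song_id": 10, "rating": 4},
           {"user_id": 1, "song_id": 11, "rating": 5},
           {"user_id": 3, "song_id": 12, "rating": 2}]
rows = hash_partition_join(users, listens, "user_id")
# Mickey's two listens match; user 3 has no Users row, so it is dropped.
```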
Multi-Table Joins
Joining Three Tables: Users, Listens, Songs
ThreeWayJoin(U, L, S):                 // Assume |U| < |L| < |S|
    temp = HPJ(U, L, "user_id")        // U ⋈ L
    result = HPJ(temp, S, "song_id")   // temp ⋈ S
    return result
Approximate Cost: 3(|U|+|L|) + |temp| + 3(|temp|+|S|) + OUT
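Plugging the sizes from the opening example into that formula makes the arithmetic concrete. The 50 GB intermediate result and 200 GB Songs table below are assumed figures for illustration only:

```python
def hpj_io_cost(n, m):
    # 3(N+M): each input is read, written as partitions in Phase 1,
    # and read back in Phase 2. OUT is excluded here.
    return 3 * (n + m)

U, L = 100, 800   # GB, from the running example
TEMP = 50         # GB, assumed size of U join L (illustrative)
S = 200           # GB, assumed size of Songs (illustrative)

# 3(|U|+|L|) + |temp| (writing the intermediate) + 3(|temp|+|S|)
total = hpj_io_cost(U, L) + TEMP + hpj_io_cost(TEMP, S)
print(total)  # 2700 + 50 + 750 = 3500 GB of IO, plus OUT
```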
Handling Skew in HPJ
The cost analysis assumes partitions are roughly balanced. In practice, hot keys break that assumption. If user_id = 1 (a system account) has 10M listens, partition 1 won't fit in RAM even though the others do. The fix is to detect oversized partitions during Phase 1 and recursively re-partition them with a different hash function, until each piece fits.
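That detect-and-re-partition loop can be sketched as follows. The function name, the `capacity` threshold (standing in for "fits in RAM"), and the salted hash are assumptions of this sketch; a real system must also apply the same salts to the other table's matching partitions so the two sides stay aligned, and must fall back to another method for a single key too hot to ever split:

```python
def partition_with_skew_handling(rows, key, num_partitions, capacity, salt=0):
    """Partition dict rows by a salted hash of `key`, recursively
    re-partitioning any partition that exceeds `capacity`.

    In-memory sketch: partitions are lists rather than files. A
    partition holding only one distinct key cannot be split by
    re-hashing, so it is returned as-is.
    """
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        # Salting the hash gives a *different* partitioning function
        # at each recursion level.
        parts[hash((row[key], salt)) % num_partitions].append(row)

    result = []
    for p in parts:
        if len(p) > capacity and len({r[key] for r in p}) > 1:
            # Oversized and splittable: re-partition with a new salt.
            result.extend(partition_with_skew_handling(
                p, key, num_partitions, capacity, salt + 1))
        else:
            result.append(p)
    return result

rows = [{"k": i % 5} for i in range(100)]   # 5 keys, 20 rows each
parts = partition_with_skew_handling(rows, "k", 4, 30)
# Every multi-key partition now fits within capacity.
```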
Key Takeaways
- HPJ enables joining tables much larger than RAM. The two-phase trick reduces the problem to many small in-RAM joins.
- Skew is the only real failure mode. Recursive re-partitioning or fallback to sort-merge handles it.
- Foundation of distributed joins in Spark and Hadoop. Same shuffle pattern.
Next
Sort-Merge Join → When the data is already sorted (or when sorted output is useful for later operations), sort-merge beats hash partition join.