Hash Partition Join: Joining Tables Bigger Than RAM
When Both Tables Don't Fit in Memory
The Join Problem: Tables Too Big for RAM
The Query We Want
-- Get user details with their listening history
SELECT u.name, u.country, l.song_id, l.plays
FROM Users u
JOIN Listens l ON u.user_id = l.user_id;
-- Problem: Can't load both tables in RAM!
-- Users: 6.4GB + Listens: 64GB = 70.4GB > 6.4GB RAM
Hash Partition Join: The Solution
Core Insight
If we partition both tables using the same hash function on the join key, then:
- All rows from Table1 with key K go to partition h(K)
- All rows from Table2 with key K go to partition h(K)
- Therefore: matching rows are always in the same partition number!
📋 PSEUDOCODE: Hash Partition Join
HashPartitionJoin(T1, T2, key):        // assume |T1| < |T2|
    choose B ≥ |T1| / M                // so each T1 partition fits in RAM (M = memory pages)
    // Phase 1: partition both tables (B output buffers + 1 input buffer)
    for row in T1: write to T1_p[h(row.key) % B]
    for row in T2: write to T2_p[h(row.key) % B]   // same hash!
    // Phase 2: join matching partitions
    for i in 0..B-1:
        HT = hash_table(T1_p[i])       // build on the smaller side
        for row in T2_p[i]:            // probe with the larger side
            if row.key in HT: output matches
Approximate Cost: 3(N+M) + OUT I/Os, where N and M are the page counts of T1 and T2
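The two phases above can be sketched as runnable Python. This is a minimal in-memory sketch: lists stand in for the on-disk partition files, and `hash_partition_join` plus the sample rows below are illustrative names, not part of any real engine.

```python
def hash_partition_join(t1, t2, key, num_partitions):
    """Join two lists of dicts on `key` via hash partitioning.
    Lists stand in for partition files; a real system would
    write each partition to disk and read it back in Phase 2."""
    # Phase 1: partition both inputs with the SAME hash function
    t1_parts = [[] for _ in range(num_partitions)]
    t2_parts = [[] for _ in range(num_partitions)]
    for row in t1:
        t1_parts[hash(row[key]) % num_partitions].append(row)
    for row in t2:
        t2_parts[hash(row[key]) % num_partitions].append(row)

    # Phase 2: join matching partitions pairwise
    out = []
    for p in range(num_partitions):
        # build a hash table on the smaller side's partition
        ht = {}
        for row in t1_parts[p]:
            ht.setdefault(row[key], []).append(row)
        # probe with the larger side's partition
        for row in t2_parts[p]:
            for match in ht.get(row[key], []):
                out.append({**match, **row})
    return out
```

Joining two Users rows against three Listens rows on `user_id` produces exactly the rows a nested loop join would, but each partition pair is touched only once.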
Two-Table Join Example
Joining Users and Listens
SELECT u.*, l.*
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE u.user_id = 42;
Step-by-step execution:
Given:
- Users: 100 pages (6.4GB)
- Listens: 1000 pages (64GB)
- RAM: 100 pages (6.4GB)
- B = 10 partitions
Phase 1 - Partition:
1. Read each Users page
2. Hash each user_id: h(42)%10=2, h(89)%10=9, etc.
3. Write to appropriate partition file
4. Repeat for Listens table with SAME hash function
Result after Phase 1:
- 10 Users partition files (~640MB each)
- 10 Listens partition files (~6.4GB each)
- User 42 is in Users_partition_2
- ALL of User 42's listens are in Listens_partition_2
Phase 2 - Join:
For partition 0: Join(Users_p0, Listens_p0)
For partition 1: Join(Users_p1, Listens_p1)
...
For partition 9: Join(Users_p9, Listens_p9)
Each Users partition (~640MB) fits in RAM as the build-side hash table; the matching Listens partition is simply streamed through it.
Key Point: user_id=42 will be in partition 2 in BOTH tables!
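Both facts in this example are easy to check in CPython, where `hash(n) == n` for small non-negative ints, so the trace numbers and the 3(N+M) cost formula can be verified directly (a sanity check, not production code):

```python
B = 10
# Same hash on both sides => same partition number
assert hash(42) % B == 2      # user 42 -> partition 2 in BOTH tables
assert hash(89) % B == 9      # user 89 -> partition 9

# I/O cost with the example's sizes, in pages
N, M = 100, 1000              # Users, Listens
phase1 = 2 * (N + M)          # read + write both tables once
phase2 = N + M                # read every partition back once
assert phase1 + phase2 == 3 * (N + M) == 3300   # + OUT for the result
```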
Multi-Table Joins
Joining Three Tables: Users, Listens, Songs
SELECT u.name, s.title, l.plays
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id;
📋 PSEUDOCODE: Multi-Way Join
ThreeWayJoin(U, L, S):                 // assume |U| < |L| < |S|
    temp   = HPJ(U, L, "user_id")      // U ⋈ L, materialized to disk
    result = HPJ(temp, S, "song_id")   // temp ⋈ S
    return result
Approximate Cost: 3(|U|+|L|) + |temp| + 3(|temp|+|S|) + OUT (the extra |temp| term is the cost of writing the intermediate result)
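Plugging hypothetical page counts into this formula makes the terms concrete; the sizes below, including |temp|, are assumed for illustration only and are not taken from the earlier example:

```python
# Assumed page counts -- illustrative only
U, L, S = 100, 1000, 50
temp = 800    # assumed size of the U ⋈ L intermediate, in pages

cost = 3 * (U + L) + temp + 3 * (temp + S)   # + OUT for the final result
print(cost)   # 3*1100 + 800 + 3*850 = 6650
```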
Handling Skew in HPJ
Problem: Hot Keys
What if user_id=1 (system account) has 10M listens?
# Partition sizes become uneven:
# Partition 0: 100MB (normal)
# Partition 1: 10GB (hot key!) <-- Won't fit in RAM
# Partition 2: 120MB (normal)
Solution: Handle Skew
When a partition is too large to fit in memory:
- Detect: monitor partition sizes during Phase 1
- Split: re-partition the oversized partition with a different hash function
- Recurse: apply HPJ recursively to the sub-partitions
- Caveat: a single hot key cannot be split this way (all of its rows hash identically under any function), so its partition must instead be joined by streaming, e.g. with a block nested loop join
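The detect/split/recurse loop can be sketched as follows; `partition`, `max_rows`, and the seed-as-hash-function trick are all illustrative choices, and the depth cap reflects the hot-key caveat:

```python
def partition(rows, key, num_parts, seed=0):
    """Partition rows by a salted hash of the join key;
    changing `seed` effectively changes the hash function."""
    parts = [[] for _ in range(num_parts)]
    for row in rows:
        parts[hash((seed, row[key])) % num_parts].append(row)
    return parts

def partition_with_skew_handling(rows, key, num_parts, max_rows, seed=0):
    """Detect oversized partitions and re-split them with a new seed.
    If a partition is big because of a single hot key, re-hashing
    cannot split it (every copy of the key hashes the same way),
    so after a few levels we give up and return the oversized
    partition as-is, to be streamed rather than rebuilt in RAM."""
    result = []
    for part in partition(rows, key, num_parts, seed):
        if len(part) > max_rows and seed < 3:            # detect
            result.extend(partition_with_skew_handling(  # split + recurse
                part, key, num_parts, max_rows, seed + 1))
        else:
            result.append(part)
    return result
```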
Why Hash Partition Join Works
Correctness Guarantee
- Same hash function on both tables
- Same join key for partitioning
- Result: every matching pair ends up in the same partition
Example Trace
User Alice (id=42) in Users table:
hash(42) % 10 = 2 → Goes to Users_partition_2
Alice's listens in Listens table:
Row 1: (user_id=42, song_id=101)
hash(42) % 10 = 2 → Goes to Listens_partition_2
Row 2: (user_id=42, song_id=303)
hash(42) % 10 = 2 → Goes to Listens_partition_2
When we join partition 2:
- Alice is in Users_partition_2
- ALL her listens are in Listens_partition_2
- They meet during the join!
Real-World Usage
PostgreSQL
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM huge1 JOIN huge2 USING (id);
-- Look for a "Hash Join" node in the plan. With ANALYZE, the
-- Hash node reports "Buckets: ... Batches: N ...": Batches > 1
-- means the join spilled to disk and was processed as N
-- hash partitions, exactly the scheme described above.
Apache Spark
# Spark shuffles both sides by the join key -- the same partitioning
# idea; the per-partition join is then a shuffle hash join or (the
# default for large tables) a sort-merge join
df1 = spark.read.parquet("users")
df2 = spark.read.parquet("listens")
result = df1.join(df2, "user_id")

# You can control the partition count explicitly
result = df1.repartition(200, "user_id") \
            .join(df2.repartition(200, "user_id"), "user_id")
Optimization Tips
- Partition count: choose B so each partition of the smaller table fits in RAM (B ≥ smaller_table_size/RAM, with B+1 ≤ available buffer pages)
- Build on smaller: always build the hash table on the smaller relation
- Broadcast small tables: if one table fits in RAM, skip partitioning and broadcast it
- Monitor skew: track partition sizes and handle hot keys
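The "broadcast small tables" tip amounts to skipping Phase 1 entirely. A minimal sketch (`broadcast_join` is an illustrative name):

```python
def broadcast_join(small, large, key):
    """When the smaller table fits in RAM, build one hash table on it
    and stream the larger table past it -- no partitioning needed.
    In a distributed engine the small table is copied ("broadcast")
    to every worker instead."""
    ht = {}
    for row in small:
        ht.setdefault(row[key], []).append(row)
    return [{**m, **r} for r in large for m in ht.get(r[key], [])]
```

This costs only one read of each table (plus OUT), versus 3(N+M) for full hash partition join.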
Key Takeaways
- HPJ enables joining tables >> RAM
- Same hash → same partition ensures correctness
- Two phases: partition, then join
- Extends to multiple tables via sequential joins
- Handle skew with recursive partitioning
- Foundation of distributed joins in Spark/Hadoop
The Punchline: Hash once, join anywhere! Partitioning turns big joins into parallel local joins.