Sort-Merge Join: When Order Matters
Joining Sorted Tables Efficiently
The Sort-Merge Join Algorithm
Overview
-
Sort Phase: Use BigSort to sort both tables by join key to order them
-
Merge Phase: Linear scan to find matches
PSEUDOCODE: Sort-Merge Join
SortMergeJoin(T1, T2, key): // Assume |T1| < |T2|
// Phase 1: Sort both tables
S1 = BigSort(T1, key)
S2 = BigSort(T2, key)
// Phase 2: Merge (uses 2 input buffers + 1 output)
i = 0, j = 0
while i < |S1| and j < |S2|:
if S1[i].key < S2[j].key: i++
elif S1[i].key > S2[j].key: j++
else: // Match found
output all pairs (S1[i..i'], S2[j..j'])
where S1[i..i'].key = S2[j..j'].key
i = i'+1, j = j'+1
Approximate Cost:
Sort: BigSort(T1) +BigSort(T2)
Merge: N + M + OUT
Two-Table Join Example
Joining Users and Listens
SELECT u.name, l.song_id, l.timestamp
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
ORDER BY u.user_id;
The Merge Process
After the Phase1 sorting, we have two sorted tables. Let's walk through the merge process to produce the output.
Case 1: One-to-Many
-
User 42 appears once in Users
-
User 42 appears 100 times in Listens
-
Result: 100 output rows
Case 2: No Match
-
User 99 in Users but not in Listens
-
Result: 0 output rows (inner join)
Case 3: Many-to-Many
-
User 42 appears 2 times in Users
-
User 42 appears 3 times in Listens
-
Result: 2 × 3 = 6 output rows (cross product)
Multi-Table Sort-Merge Joins
Three-Way Join: Users, Listens, Songs
SELECT u.name, s.title, l.play_count
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
ORDER BY u.user_id, l.song_id;
Sequential Approach
PSEUDOCODE: Three-Way Join
ThreeWayJoin(U, L, S): // Assume |U| < |L| < |S|
temp = SMJ(U, L, 'user_id') // U ⋈ L
result = SMJ(temp, S, 'song_id') // temp ⋈ S
Approximate Cost: BigSort(U) + BigSort(L) +BigSort(S) + |U|+|L|+|temp|+|S|+OUT
Key Takeaways
-
Sort-Merge = BigSort + Linear Merge
-
Efficient for sorted output - Result comes out ordered
-
Handles duplicates correctly - Cross product of matching groups
-
Single pass merge - After sorting, only O(N+M) to join