Sort-Merge Join: When Order Matters
Joining Sorted Tables Efficiently
The Sort-Merge Join Algorithm
Overview
-
Sort Phase: Use BigSort to arrange both tables by their join keys.
-
Merge Phase: Perform a linear scan to identify matches.
PSEUDOCODE: Sort-Merge Join
SortMergeJoin(T1, T2, key): // Assume |T1| < |T2|
// Phase 1: Sort both tables
S1 = BigSort(T1, key)
S2 = BigSort(T2, key)
// Phase 2: Merge (uses 2 input buffers + 1 output)
i = 0, j = 0
while i < |S1| and j < |S2|:
if S1[i].key < S2[j].key: i++
elif S1[i].key > S2[j].key: j++
else: // Match found
output all pairs (S1[i..i'], S2[j..j'])
where S1[i..i'].key = S2[j..j'].key
i = i'+1, j = j'+1
Approximate Cost:
Sort: BigSort(T1) + BigSort(T2)
Merge: N + M + OUT
Two-Table Join Example
Joining Users and Listens
The Merge Process
Once sorted, the tables are ready for merging. Here's how it unfolds:
Case 1: One-to-Many
-
User 42 appears once in Users and 100 times in Listens.
-
Result: 100 output rows.
Case 2: No Match
-
User 99 is in Users but absent in Listens.
-
Result: 0 output rows (inner join).
Case 3: Many-to-Many
-
User 42 appears twice in Users and thrice in Listens.
-
Result: 6 output rows (cross product).
Multi-Table Sort-Merge Joins
Three-Way Join: Users, Listens, Songs
Sequential Approach
PSEUDOCODE: Three-Way Join
ThreeWayJoin(U, L, S): // Assume |U| < |L| < |S|
temp = SMJ(U, L, 'user_id') // U ⋈ L
result = SMJ(temp, S, 'song_id') // temp ⋈ S
Approximate Cost: BigSort(U) + BigSort(L) + BigSort(S) + |U| + |L| + |temp| + |S| + OUT
Key Takeaways
-
Sort-Merge = BigSort + Linear Merge
-
Efficient for sorted output - Result is ordered.
-
Handles duplicates correctly - Cross product of matching groups.
-
Single pass merge - After sorting, only O(N+M) to join.