Sort-Merge Join: When Order Matters
Concept. Sort-merge join sorts both tables on the join key, then walks them in lockstep, emitting matching rows wherever the keys agree.
Intuition. When you need to join Users to Listens by user_id and one or both is already sorted (or you have to sort anyway), sort both, then walk them like merging two sorted lists. Mickey's Users row meets Mickey's Listens rows at the same point in the walk.
The Sort-Merge Join Algorithm
Sort both tables on the join key with BigSort, then walk them in lockstep with two pointers, matching as you go.
PSEUDOCODE: Sort-Merge Join
SortMergeJoin(T1, T2, key): // Assume |T1| < |T2|
// Phase 1: Sort both tables
S1 = BigSort(T1, key)
S2 = BigSort(T2, key)
// Phase 2: Merge (uses 2 input buffers + 1 output)
i = 0, j = 0
while i < |S1| and j < |S2|:
if S1[i].key < S2[j].key: i++
elif S1[i].key > S2[j].key: j++
else: // Match found
output all pairs (S1[i..i'], S2[j..j'])
where S1[i..i'].key = S2[j..j'].key
i = i'+1, j = j'+1
Approximate Cost:
Sort: BigSort(T1) + BigSort(T2)
Merge: N + M + OUT
Two-Table Join Example
Joining Users and Listens
The Merge Process
Once sorted, the tables are ready for merging. Here's how it unfolds:
Case 1: One-to-Many
-
User 42 appears once in Users and 100 times in Listens.
-
Result: 100 output rows.
Case 2: No Match
-
User 99 is in Users but absent in Listens.
-
Result: 0 output rows (inner join).
Case 3: Many-to-Many
-
User 42 appears twice in Users and thrice in Listens.
-
Result: 6 output rows (cross product).
Multi-Table Sort-Merge Joins
Three-Way Join: Users, Listens, Songs
Sequential Approach
PSEUDOCODE: Three-Way Join
ThreeWayJoin(U, L, S): // Assume |U| < |L| < |S|
temp = SMJ(U, L, 'user_id') // U ⋈ L
result = SMJ(temp, S, 'song_id') // temp ⋈ S
Approximate Cost: BigSort(U) + BigSort(L) + BigSort(S) + |U| + |L| + |temp| + |S| + OUT
Key Takeaways
-
Result is already sorted on the join key. No extra sort if the next operator wants order.
-
Worst case is quadratic when one key dominates. The merge step turns into a cross-product within that group.
-
Always include
+ OUT. Output IO can dominate when match rates are high.
Next
IO Cost Summary → The reference equations for every algorithm in this module, applied to concrete Spotify-scale scenarios.