Sort-Merge Join: When Order Matters
Joining Sorted Tables Efficiently
The Sorted Join Advantage
Why Sort for Joins?
-- Once sorted, finding matches is a linear scan!
SELECT u.*, l.*
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
ORDER BY u.user_id; -- Output already sorted!
Key Insight: When both tables are sorted on the join key, we can find all matches in a single synchronized scan.
The Sort-Merge Join Algorithm
Overview
-
Sort Phase: Use BigSort to sort both tables by join key
-
Merge Phase: Linear scan to find matches
📋 PSEUDOCODE: Sort-Merge Join
SortMergeJoin(T1, T2, key): // Assume |T1| < |T2|
// Phase 1: Sort both tables
S1 = BigSort(T1, key)
S2 = BigSort(T2, key)
// Phase 2: Merge (uses 2 input buffers + 1 output)
i = 0, j = 0
while i < |S1| and j < |S2|:
if S1[i].key < S2[j].key: i++
elif S1[i].key > S2[j].key: j++
else: // Match found
output all pairs (S1[i..i'], S2[j..j'])
where S1[i..i'].key = S2[j..j'].key
i = i'+1, j = j'+1
Approximate Cost:
Sort: 2N⌈logB(N/B)⌉ + 2M⌈logB(M/B)⌉
Merge: N + M + OUT
Two-Table Join Example
Joining Users and Listens
SELECT u.name, l.song_id, l.timestamp
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
ORDER BY u.user_id;
Step-by-Step Execution
Given:
- Users: 100 pages (6.4GB)
- Listens: 1000 pages (64GB)
- RAM: 100 pages (6.4GB)
Phase 1: Sort both tables using BigSort
Phase 2: Merge sorted tables
Merge Example:
Users: [1, 2, 3, 5, 7, 9, 11, ...]
Listens: [1, 1, 2, 3, 3, 3, 7, 8, ...]
Step 1: Compare Users[1] with Listens[1] → Match!
Step 2: Check for more 1s in Listens → Another match!
Step 3: Advance to Users[2] and Listens[2] → Match!
Step 4: Continue linear scan...
The Merge Process in Detail
Handling Different Cases
Case 1: One-to-Many
-
User 42 appears once in Users
-
User 42 appears 100 times in Listens
-
Result: 100 output rows
Case 2: No Match
-
User 99 in Users but not in Listens
-
Result: 0 output rows (inner join)
Case 3: Many-to-Many
-
User 42 appears 2 times in Users
-
User 42 appears 3 times in Listens
-
Result: 2 × 3 = 6 output rows (cross product)
Handling Inequality Joins
Range Joins with Sort-Merge
-- Find all listens within a time range
SELECT u.name, l.song_id
FROM Users u
JOIN Listens l
ON u.signup_date <= l.timestamp
AND l.timestamp <= u.signup_date + INTERVAL '7 days';
📋 PSEUDOCODE: Range Join
RangeJoin(Users, Listens, days):
U = BigSort(Users, 'signup_date')
L = BigSort(Listens, 'timestamp')
for user in U:
start = user.signup_date
end = start + days
pos = binary_search(L, start)
while L[pos].timestamp ≤ end:
if L[pos].timestamp ≥ start:
output(user, L[pos])
pos++
Multi-Table Sort-Merge Joins
Three-Way Join: Users, Listens, Songs
SELECT u.name, s.title, l.play_count
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
ORDER BY u.user_id, l.song_id;
Sequential Approach
📋 PSEUDOCODE: Three-Way Join
ThreeWayJoin(U, L, S): // Assume |U| < |L| < |S|
temp = SMJ(U, L, 'user_id') // U ⋈ L
result = SMJ(temp, S, 'song_id') // temp ⋈ S
Approximate Cost: SortCost(U,L,S) + |U|+|L|+|temp|+|S|+OUT
Interesting Property: Maintain Sort Order
Key Insight: Sort-Merge Join preserves the sort order of the outer table
-
If Users sorted by (user_id, signup_date)
-
And Listens sorted by user_id
-
Result is sorted by (user_id, signup_date)!
-
This enables efficient multi-way joins without re-sorting
Special Cases and Optimizations
Early Termination
-
Stop merge when limit reached
-
Useful for
LIMITclauses -
No need to process entire tables
Zig-Zag Merge for Multiple Conditions
-
Sort by multiple columns: (user_id, timestamp)
-
Enables efficient compound key joins
-
Maintains secondary sort order during merge
Real-World Example: Spotify Analytics
Finding User Listening Patterns
-- Get each user's complete listening history
-- with user details, ordered by user and time
SELECT
u.user_id,
u.name,
u.country,
l.song_id,
l.timestamp,
s.artist,
s.title
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
ORDER BY u.user_id, l.timestamp;
Execution Plan:
-
Sort Users by user_id (6.4GB)
-
Sort Listens by user_id (64GB)
-
Sort Songs by song_id (1GB)
-
Merge Users ⋈ Listens on user_id
-
Sort intermediate result by song_id
-
Merge with Songs on song_id
-
Final sort by user_id, timestamp for output
Key Takeaways
-
Sort-Merge = BigSort + Linear Merge
-
Efficient for sorted output - Result comes out ordered
-
Handles duplicates correctly - Cross product of matching groups
-
Works for inequality joins - Range queries, temporal joins
-
Preserves sort order - Enables efficient multi-way joins
-
Single pass merge - After sorting, only O(N+M) to join
The Punchline: "Sort once, merge forever! When you need ordered results, Sort-Merge Join delivers both the join and the sort."