Sort-Merge Join: When Order Matters

Joining Sorted Tables Efficiently

The Sort-Merge Join Algorithm

Overview

Sort Phase: Use BigSort to sort both tables by join key to order them
Merge Phase: Linear scan to find matches

PSEUDOCODE: Sort-Merge Join

SortMergeJoin(T1, T2, key):    // Assume |T1| < |T2|
    // Phase 1: Sort both tables
    S1 = BigSort(T1, key)
    S2 = BigSort(T2, key)

    // Phase 2: Merge (uses 2 input buffers + 1 output)
    i = 0, j = 0
    while i < |S1| and j < |S2|:
        if S1[i].key < S2[j].key: i++
        elif S1[i].key > S2[j].key: j++
        else:    // Match found
            output all pairs (S1[i..i'], S2[j..j'])
            where S1[i..i'].key = S2[j..j'].key
            i = i'+1, j = j'+1

    Approximate Cost: 
        Sort: BigSort(T1) +BigSort(T2)
        Merge: N + M + OUT

Two-Table Join Example

Joining Users and Listens

SELECT u.name, l.song_id, l.timestamp
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
ORDER BY u.user_id;

The Merge Process

After the Phase1 sorting, we have two sorted tables. Let's walk through the merge process to produce the output.

Case 1: One-to-Many

User 42 appears once in Users
User 42 appears 100 times in Listens
Result: 100 output rows

Case 2: No Match

User 99 in Users but not in Listens
Result: 0 output rows (inner join)

Case 3: Many-to-Many

User 42 appears 2 times in Users
User 42 appears 3 times in Listens
Result: 2 × 3 = 6 output rows (cross product)

Multi-Table Sort-Merge Joins

Three-Way Join: Users, Listens, Songs

SELECT u.name, s.title, l.play_count
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
ORDER BY u.user_id, l.song_id;

Sequential Approach

PSEUDOCODE: Three-Way Join

ThreeWayJoin(U, L, S):    // Assume |U| < |L| < |S|
    temp = SMJ(U, L, 'user_id')    // U ⋈ L
    result = SMJ(temp, S, 'song_id')   // temp ⋈ S

    Approximate Cost: BigSort(U) + BigSort(L) +BigSort(S) + |U|+|L|+|temp|+|S|+OUT

Key Takeaways

Sort-Merge = BigSort + Linear Merge
Efficient for sorted output - Result comes out ordered
Handles duplicates correctly - Cross product of matching groups
Single pass merge - After sorting, only O(N+M) to join