Sort-Merge Join: When Order Matters

Joining Sorted Tables Efficiently

Sort-Merge Join: Sort → Merge → Join Phase 1: Unsorted Tables Users id | name 89 | Bob 42 | Alice 17 | Charlie 56 | David Listens uid | song 42 | S1 89 | S2 17 | S3 42 | S4 Phase 2: BigSort Both Tables Users (sorted) id | name 17 | Charlie 42 | Alice 56 | David 89 | Bob Listens (sorted) uid | song 17 | S3 42 | S1 42 | S4 89 | S2 Phase 3: Merge Join Two-Pointer Merge Merge Process: U: 17 (Charlie) ↓ L: 17 (S3) → Match! 1. Start with both pointers at beginning 2. Compare: U[17] = L[17] → Match found! 3. Output: (17, Charlie, S3) 4. Advance L pointer to next value U: 17 (Charlie) L: 42 (S1) → L > U 5. Compare: U[17] < L[42] → Advance U 6. Continue until both tables processed Single pass through both sorted tables: O(N + M) Join Result Final Output id | name | song ----|---------|----- 17 | Charlie | S3 42 | Alice | S1 42 | Alice | S4 89 | Bob | S2 All matches found efficiently! Duplicates handled correctly Sort-Merge Join Algorithm Phase 1: Sort (using BigSort) 1. Sort Table1 by join_key → sorted_table1 2. Sort Table2 by join_key → sorted_table2 Phase 2: Merge 3. ptr1 = 0, ptr2 = 0 4. While both pointers valid: Compare and advance pointers Output matches

The Sort-Merge Join Algorithm

Overview

  1. Sort Phase: Use BigSort to sort both tables by join key to order them

  2. Merge Phase: Linear scan to find matches

PSEUDOCODE: Sort-Merge Join

SortMergeJoin(T1, T2, key):    // Assume |T1| < |T2|
    // Phase 1: Sort both tables
    S1 = BigSort(T1, key)
    S2 = BigSort(T2, key)

    // Phase 2: Merge (uses 2 input buffers + 1 output)
    i = 0, j = 0
    while i < |S1| and j < |S2|:
        if S1[i].key < S2[j].key: i++
        elif S1[i].key > S2[j].key: j++
        else:    // Match found
            output all pairs (S1[i..i'], S2[j..j'])
            where S1[i..i'].key = S2[j..j'].key
            i = i'+1, j = j'+1

    Approximate Cost: 
        Sort: BigSort(T1) +BigSort(T2)
        Merge: N + M + OUT

Two-Table Join Example

Joining Users and Listens

SELECT u.name, l.song_id, l.timestamp
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
ORDER BY u.user_id;

The Merge Process

After the Phase1 sorting, we have two sorted tables. Let's walk through the merge process to produce the output.

Case 1: One-to-Many

Case 2: No Match

Case 3: Many-to-Many


Multi-Table Sort-Merge Joins

Three-Way Join: Users, Listens, Songs

SELECT u.name, s.title, l.play_count
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
ORDER BY u.user_id, l.song_id;

Sequential Approach

PSEUDOCODE: Three-Way Join

ThreeWayJoin(U, L, S):    // Assume |U| < |L| < |S|
    temp = SMJ(U, L, 'user_id')    // U ⋈ L
    result = SMJ(temp, S, 'song_id')   // temp ⋈ S

    Approximate Cost: BigSort(U) + BigSort(L) +BigSort(S) + |U|+|L|+|temp|+|S|+OUT

Key Takeaways

  1. Sort-Merge = BigSort + Linear Merge

  2. Efficient for sorted output - Result comes out ordered

  3. Handles duplicates correctly - Cross product of matching groups

  4. Single pass merge - After sorting, only O(N+M) to join