Sort-Merge Join: When Order Matters

Joining Sorted Tables Efficiently

Sort-Merge Join: Sort → Merge → Join Phase 1: Unsorted Tables Users id | name 89 | Bob 42 | Alice 17 | Charlie 56 | David Listens uid | song 42 | S1 89 | S2 17 | S3 42 | S4 Phase 2: BigSort Both Tables Users (sorted) id | name 17 | Charlie 42 | Alice 56 | David 89 | Bob Listens (sorted) uid | song 17 | S3 42 | S1 42 | S4 89 | S2 Phase 3: Merge Join Two-Pointer Merge Merge Process: U: 17 (Charlie) ↓ L: 17 (S3) → Match! 1. Start with both pointers at beginning 2. Compare: U[17] = L[17] → Match found! 3. Output: (17, Charlie, S3) 4. Advance L pointer to next value U: 17 (Charlie) L: 42 (S1) → L > U 5. Compare: U[17] < L[42] → Advance U 6. Continue until both tables processed Single pass through both sorted tables: O(N + M) Join Result Final Output id | name | song ----|---------|----- 17 | Charlie | S3 42 | Alice | S1 42 | Alice | S4 89 | Bob | S2 All matches found efficiently! Duplicates handled correctly Sort-Merge Join Algorithm Phase 1: Sort (using BigSort) 1. Sort Table1 by join_key → sorted_table1 2. Sort Table2 by join_key → sorted_table2 Phase 2: Merge 3. ptr1 = 0, ptr2 = 0 4. While both pointers valid: Compare and advance pointers Output matches

The Sorted Join Advantage

Sorted
Both tables ordered
Linear
Single pass merge
Efficient
O(N+M) merge time
Versatile
Handles all join types

Why Sort for Joins?

-- Once sorted, finding matches is a linear scan!
SELECT u.*, l.*
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
ORDER BY u.user_id;  -- Output already sorted!

Key Insight: When both tables are sorted on the join key, we can find all matches in a single synchronized scan.


The Sort-Merge Join Algorithm

Overview

  1. Sort Phase: Use BigSort to sort both tables by join key

  2. Merge Phase: Linear scan to find matches

📋 PSEUDOCODE: Sort-Merge Join

SortMergeJoin(T1, T2, key):    // Assume |T1| < |T2|
    // Phase 1: Sort both tables
    S1 = BigSort(T1, key)
    S2 = BigSort(T2, key)

    // Phase 2: Merge (uses 2 input buffers + 1 output)
    i = 0, j = 0
    while i < |S1| and j < |S2|:
        if S1[i].key < S2[j].key: i++
        elif S1[i].key > S2[j].key: j++
        else:    // Match found
            output all pairs (S1[i..i'], S2[j..j'])
            where S1[i..i'].key = S2[j..j'].key
            i = i'+1, j = j'+1

    Approximate Cost: 
        Sort: 2N⌈logB(N/B)⌉ + 2M⌈logB(M/B)⌉
        Merge: N + M + OUT

Two-Table Join Example

Joining Users and Listens

SELECT u.name, l.song_id, l.timestamp
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
ORDER BY u.user_id;

Step-by-Step Execution

Given:

- Users: 100 pages (6.4GB)

- Listens: 1000 pages (64GB)

- RAM: 100 pages (6.4GB)

Phase 1: Sort both tables using BigSort
Phase 2: Merge sorted tables

Merge Example:
Users:    [1, 2, 3, 5, 7, 9, 11, ...]
Listens:  [1, 1, 2, 3, 3, 3, 7, 8, ...]

Step 1: Compare Users[1] with Listens[1] → Match!
Step 2: Check for more 1s in Listens → Another match!
Step 3: Advance to Users[2] and Listens[2] → Match!
Step 4: Continue linear scan...

The Merge Process in Detail

Handling Different Cases

Case 1: One-to-Many

Case 2: No Match

Case 3: Many-to-Many


Handling Inequality Joins

Range Joins with Sort-Merge

-- Find all listens within a time range
SELECT u.name, l.song_id
FROM Users u
JOIN Listens l 
  ON u.signup_date <= l.timestamp 
  AND l.timestamp <= u.signup_date + INTERVAL '7 days';

📋 PSEUDOCODE: Range Join

RangeJoin(Users, Listens, days):
    U = BigSort(Users, 'signup_date')
    L = BigSort(Listens, 'timestamp')

    for user in U:
        start = user.signup_date
        end = start + days
        pos = binary_search(L, start)

        while L[pos].timestamp ≤ end:
            if L[pos].timestamp ≥ start:
                output(user, L[pos])
            pos++

Multi-Table Sort-Merge Joins

Three-Way Join: Users, Listens, Songs

SELECT u.name, s.title, l.play_count
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
ORDER BY u.user_id, l.song_id;

Sequential Approach

📋 PSEUDOCODE: Three-Way Join

ThreeWayJoin(U, L, S):    // Assume |U| < |L| < |S|
    temp = SMJ(U, L, 'user_id')    // U ⋈ L
    result = SMJ(temp, S, 'song_id')   // temp ⋈ S

    Approximate Cost: SortCost(U,L,S) + |U|+|L|+|temp|+|S|+OUT

Interesting Property: Maintain Sort Order

Key Insight: Sort-Merge Join preserves the sort order of the outer table


Special Cases and Optimizations

Early Termination

Zig-Zag Merge for Multiple Conditions


Real-World Example: Spotify Analytics

Finding User Listening Patterns

-- Get each user's complete listening history
-- with user details, ordered by user and time
SELECT 
    u.user_id,
    u.name,
    u.country,
    l.song_id,
    l.timestamp,
    s.artist,
    s.title
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
ORDER BY u.user_id, l.timestamp;

Execution Plan:

  1. Sort Users by user_id (6.4GB)

  2. Sort Listens by user_id (64GB)

  3. Sort Songs by song_id (1GB)

  4. Merge Users ⋈ Listens on user_id

  5. Sort intermediate result by song_id

  6. Merge with Songs on song_id

  7. Final sort by user_id, timestamp for output


Key Takeaways

  1. Sort-Merge = BigSort + Linear Merge

  2. Efficient for sorted output - Result comes out ordered

  3. Handles duplicates correctly - Cross product of matching groups

  4. Works for inequality joins - Range queries, temporal joins

  5. Preserves sort order - Enables efficient multi-way joins

  6. Single pass merge - After sorting, only O(N+M) to join

The Punchline: "Sort once, merge forever! When you need ordered results, Sort-Merge Join delivers both the join and the sort."