Sort-Merge Join: When Order Matters

Concept. Sort-merge join sorts both tables on the join key, then walks them in lockstep, emitting matching rows wherever the keys agree.

Intuition. When you need to join Users to Listens by user_id and one or both is already sorted (or you have to sort anyway), sort both, then walk them like merging two sorted lists. Mickey's Users row meets Mickey's Listens rows at the same point in the walk.

The Sort-Merge Join Algorithm

Sort both tables on the join key with BigSort, then walk them in lockstep with two pointers, matching as you go.

Sort-Merge Join: after Phase 1 BigSort, both R and S are sorted on the join key. Phase 2 walks both tables in one pass with two pointers, emitting matched groups as it goes.

Figure

Sort-Merge Join, R ⋈ S on user_id (Phase 2 shown).

Phase 1 (not shown): BigSort both R and S on the join key. See the big sort page for the algorithm.
Phase 2 (shown): two pointers walk the sorted runs in lockstep. When the keys match, emit every pair from that key group; when they differ, advance the smaller side.
Match groups in this example: user_id 7 → 2 outputs, 13 → 2, 42 → 3, 99 → 1. Total 8 joined rows. Each match group is a small cross-product within that key.
Cost. Plus the two BigSort costs from Phase 1.
- Best case: P(R) + P(S) + OUT. Single linear pass through each sorted run.
- Worst case: O(P(R) × P(S)) + OUT when one key dominates (e.g. a mega-popular song with millions of plays on both sides). The merge step turns into a per-key cross-product, and OUT blows up with it.

PSEUDOCODE: Sort-Merge Join

SortMergeJoin(T1, T2, key):    // Assume |T1| < |T2|
    // Phase 1: Sort both tables
    S1 = BigSort(T1, key)
    S2 = BigSort(T2, key)

    // Phase 2: Merge (uses 2 input buffers + 1 output)
    i = 0, j = 0
    while i < |S1| and j < |S2|:
        if S1[i].key < S2[j].key: i++
        elif S1[i].key > S2[j].key: j++
        else:    // Match found
            output all pairs (S1[i..i'], S2[j..j'])
            where S1[i..i'].key = S2[j..j'].key
            i = i'+1, j = j'+1

    Approximate Cost: 
        Sort: BigSort(T1) + BigSort(T2)
        Merge: N + M + OUT

Two-Table Join Example

Joining Users and Listens

The Merge Process

Once sorted, the tables are ready for merging. Here's how it unfolds:

Case 1: One-to-Many

User 42 appears once in Users and 100 times in Listens.
Result: 100 output rows.

Case 2: No Match

User 99 is in Users but absent in Listens.
Result: 0 output rows (inner join).

Case 3: Many-to-Many

User 42 appears twice in Users and thrice in Listens.
Result: 6 output rows (cross product).

Multi-Table Sort-Merge Joins

Three-Way Join: Users, Listens, Songs

Sequential Approach

PSEUDOCODE: Three-Way Join

ThreeWayJoin(U, L, S):    // Assume |U| < |L| < |S|
    temp = SMJ(U, L, 'user_id')    // U ⋈ L
    result = SMJ(temp, S, 'song_id')   // temp ⋈ S

    Approximate Cost: BigSort(U) + BigSort(L) + BigSort(S) + |U| + |L| + |temp| + |S| + OUT

Key Takeaways

Result is already sorted on the join key. No extra sort if the next operator wants order.
Worst case is quadratic when one key dominates. The merge step turns into a cross-product within that group.
Always include + OUT. Output IO can dominate when match rates are high.

IO Cost Summary → The reference equations for every algorithm in this module, applied to concrete Spotify-scale scenarios.