Sort-Merge Join: When Order Matters

Concept. Sort-merge join sorts both tables on the join key, then walks them in lockstep, emitting matching rows wherever the keys agree.

Intuition. When you need to join Users to Listens by user_id and one or both is already sorted (or you have to sort anyway), sort both, then walk them like merging two sorted lists. Mickey's Users row meets Mickey's Listens rows at the same point in the walk.

The Sort-Merge Join Algorithm

Sort both tables on the join key with BigSort, then walk them in lockstep with two pointers, matching as you go.

Sort-Merge Join: after Phase 1 BigSort, both R and S are sorted on the join key. Phase 2 walks both tables in one pass with two pointers, emitting matched groups as it goes.

PSEUDOCODE: Sort-Merge Join

SortMergeJoin(T1, T2, key):    // Assume |T1| < |T2|
    // Phase 1: Sort both tables
    S1 = BigSort(T1, key)
    S2 = BigSort(T2, key)

    // Phase 2: Merge (uses 2 input buffers + 1 output)
    i = 0, j = 0
    while i < |S1| and j < |S2|:
        if S1[i].key < S2[j].key: i++
        elif S1[i].key > S2[j].key: j++
        else:    // Match found
            output all pairs (S1[i..i'], S2[j..j'])
            where S1[i..i'].key = S2[j..j'].key
            i = i'+1, j = j'+1

    Approximate Cost: 
        Sort: BigSort(T1) + BigSort(T2)
        Merge: N + M + OUT

Two-Table Join Example

Joining Users and Listens

The Merge Process

Once sorted, the tables are ready for merging. Here's how it unfolds:

Case 1: One-to-Many

  • User 42 appears once in Users and 100 times in Listens.

  • Result: 100 output rows.

Case 2: No Match

  • User 99 is in Users but absent in Listens.

  • Result: 0 output rows (inner join).

Case 3: Many-to-Many

  • User 42 appears twice in Users and thrice in Listens.

  • Result: 6 output rows (cross product).


Multi-Table Sort-Merge Joins

Three-Way Join: Users, Listens, Songs

Sequential Approach

PSEUDOCODE: Three-Way Join

ThreeWayJoin(U, L, S):    // Assume |U| < |L| < |S|
    temp = SMJ(U, L, 'user_id')    // U ⋈ L
    result = SMJ(temp, S, 'song_id')   // temp ⋈ S

    Approximate Cost: BigSort(U) + BigSort(L) + BigSort(S) + |U| + |L| + |temp| + |S| + OUT

Key Takeaways

  1. Result is already sorted on the join key. No extra sort if the next operator wants order.

  2. Worst case is quadratic when one key dominates. The merge step turns into a cross-product within that group.

  3. Always include + OUT. Output IO can dominate when match rates are high.


Next

IO Cost Summary → The reference equations for every algorithm in this module, applied to concrete Spotify-scale scenarios.