Sort-Merge Join: When Order Matters

Joining Sorted Tables Efficiently


The Sort-Merge Join Algorithm

Overview

  1. Sort Phase: Use BigSort to arrange both tables by their join keys.

  2. Merge Phase: Perform a linear scan to identify matches.

PSEUDOCODE: Sort-Merge Join

SortMergeJoin(T1, T2, key):    // Assume |T1| < |T2|
    // Phase 1: Sort both tables
    S1 = BigSort(T1, key)
    S2 = BigSort(T2, key)

    // Phase 2: Merge (uses 2 input buffers + 1 output)
    i = 0, j = 0
    while i < |S1| and j < |S2|:
        if S1[i].key < S2[j].key: i++
        elif S1[i].key > S2[j].key: j++
        else:    // Match found
            output all pairs (S1[i..i'], S2[j..j'])
            where S1[i..i'].key = S2[j..j'].key
            i = i'+1, j = j'+1

    Approximate Cost: 
        Sort: BigSort(T1) + BigSort(T2)
        Merge: N + M + OUT

Two-Table Join Example

Joining Users and Listens

The Merge Process

Once sorted, the tables are ready for merging. Here's how it unfolds:

Case 1: One-to-Many

Case 2: No Match

Case 3: Many-to-Many


Multi-Table Sort-Merge Joins

Three-Way Join: Users, Listens, Songs

Sequential Approach

PSEUDOCODE: Three-Way Join

ThreeWayJoin(U, L, S):    // Assume |U| < |L| < |S|
    temp = SMJ(U, L, 'user_id')    // U ⋈ L
    result = SMJ(temp, S, 'song_id')   // temp ⋈ S

    Approximate Cost: BigSort(U) + BigSort(L) + BigSort(S) + |U| + |L| + |temp| + |S| + OUT

Key Takeaways

  1. Sort-Merge = BigSort + Linear Merge

  2. Efficient for sorted output - Result is ordered.

  3. Handles duplicates correctly - Cross product of matching groups.

  4. Single pass merge - After sorting, only O(N+M) to join.