BigSort: Sorting TBs of Data with GBs of RAM

BigSort for Massive Data

Remember Merge Sort?

Small data (fits in RAM): Split array → Sort pieces → Merge sorted pieces
Big data (doesn't fit): Same idea! Split file → Sort runs → Merge sorted runs

BigSort (also called External Sort) scales merge sort to handle data that's way bigger than RAM.

BigSort: N Pages → 10 Sorted Runs → 1 Sorted File

Input: Unsorted Listens, N = 1000 pages (64GB)
  Page 1: user_id [89, 23, 67, 12, 91, ...] (completely unsorted)
  Page 2: user_id [56, 78, 34, 90, 11, ...] (random order)
  ... 998 more pages ...
Problem: 64GB of unsorted data, only 6.4GB of RAM available.

Phase 1: Split into Sorted Runs (B = 100 pages, the 6.4GB RAM buffer)

  1. Load 100 pages into RAM
  2. Quicksort them in RAM
  3. Write the sorted run to disk

Repeating this gives N/B = 1000/100 = 10 sorted runs on disk: R0 through R9, each internally sorted.

Phase 2: Merge Sorted Runs

10 input buffers (one per run) + 1 output buffer, each buffer 10 pages. A priority queue repeatedly moves the smallest key from the input buffers into the output buffer.

Output: user_id [1, 1, 2, 2, 3, 4, 5, ...], all 1000 pages (64GB) globally sorted.

Scaling to Bigger Files

Current example: N = 1000 pages, B = 100 → 10 sorted runs → a single merge pass.
Bigger example: N = 1M pages, B = 100 → 10,000 sorted runs!
  • Phase 1: Create 10,000 sorted runs of 100 pages each
  • Phase 2: Multiple merge passes needed (only about 100 runs can be merged at a time)
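To check these run and pass counts yourself, here is a minimal Python sketch; the function name bigsort_passes is made up, and it assumes a (B-1)-way merge (one buffer reserved for output, as in the pseudocode later in this section):

import math

def bigsort_passes(n_pages, ram_pages):
    # Phase 1 sorts ram_pages (B) pages at a time, producing ceil(N / B) runs.
    # Phase 2 merges up to B - 1 runs per pass, so the pass count grows
    # logarithmically in the number of runs.
    initial_runs = math.ceil(n_pages / ram_pages)
    runs = initial_runs
    merge_passes = 0
    while runs > 1:
        runs = math.ceil(runs / (ram_pages - 1))   # one (B-1)-way merge pass
        merge_passes += 1
    return initial_runs, merge_passes

# Current example: N = 1000 pages, B = 100
print(bigsort_passes(1000, 100))        # (10, 1): 10 runs, 1 merge pass

# Bigger example: N = 1,000,000 pages, B = 100
print(bigsort_passes(1_000_000, 100))   # (10000, 3): 10,000 runs, 3 merge passes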

Interactive BigSort Animation

Interactive animation showing the split and merge phases with actual data movement.


The BigSort Algorithm (aka External Sort)

Core Idea

  1. Divide: Create sorted runs that fit in memory

  2. Conquer: Merge sorted runs into final output

ALGORITHM: BigSort

BigSort(file, B):    // B = total RAM pages available

    // Phase 1: Create sorted runs
    runs = []
    while not EOF:
        chunk = read(B pages)
        sort(chunk)           // quicksort in RAM
        run_i = write(chunk)  // write the sorted run to disk
        runs.add(run_i)

    // Phase 2: Merge runs
    while |runs| > 1:
        next_runs = []
        for i = 0 to |runs| step B-1:
            batch = runs[i : i+B-1]   // B-1 input buffers + 1 output buffer
            merged = KWayMerge(batch)
            next_runs.add(merged)
        runs = next_runs
    return runs[0]

Result: Fully sorted file
Cost: C_r×N + C_w×N per pass (every pass reads all N pages and writes all N pages)
Key Property: Handles any file size with limited RAM
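To make the pseudocode concrete, here is a runnable Python sketch of the same two phases over a text file containing one integer per line. The function name sort_big_file and the lines-per-run budget standing in for B pages are illustrative assumptions, not the document's API.

import heapq
import os
import tempfile

def sort_big_file(in_path, out_path, max_lines_in_ram=100_000):
    """External sort of a text file containing one integer per line."""
    # Phase 1: read chunks that fit in RAM, sort each, write a sorted run.
    run_paths = []
    with open(in_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(max_lines_in_ram), f)]
            if not chunk:
                break
            chunk.sort()                                    # in-RAM sort (Timsort here)
            fd, run_path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{x}\n" for x in chunk)
            run_paths.append(run_path)

    # Phase 2: k-way merge of all sorted runs with a priority queue.
    run_files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in rf) for rf in run_files]
        with open(out_path, "w") as out:
            out.writelines(f"{x}\n" for x in heapq.merge(*streams))
    finally:
        for rf in run_files:
            rf.close()
        for p in run_paths:
            os.remove(p)

With max_lines_in_ram playing the role of B pages, you would call something like sort_big_file("listens_user_ids.txt", "sorted_user_ids.txt") (file names are hypothetical). Note that this sketch merges every run in a single pass, whereas the pseudocode above caps each merge at B-1 runs.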

Why Sort?

-- These queries need sorted data:
SELECT * FROM Users ORDER BY last_active DESC;
SELECT DISTINCT user_id FROM Listens;
SELECT * FROM Songs ORDER BY play_count DESC LIMIT 100;

-- Sorting enables:
-- 1. ORDER BY clauses
-- 2. Efficient DISTINCT
-- 3. Sort-merge joins
-- 4. Rank/percentile calculations
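To see why sorted order helps, here is a small illustrative sketch (the data is made up) showing that DISTINCT over a sorted column becomes a single streaming pass, because equal keys sit next to each other:

from itertools import groupby

# Output of BigSort on Listens.user_id: equal keys are now adjacent.
sorted_user_ids = [1, 1, 2, 2, 3, 4, 5, 5, 5, 8]

# SELECT DISTINCT user_id: one streaming pass, constant extra memory.
distinct_user_ids = [uid for uid, _ in groupby(sorted_user_ids)]
print(distinct_user_ids)   # [1, 2, 3, 4, 5, 8]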

Example Problem: Break Big Sort Into Small Sorts

  • 64GB: unsorted table
  • 6.4GB: available RAM
  • 10×: size mismatch
  • BigSort: the solution

SOLUTION: Two-Phase Approach

Phase 1: Split → Create sorted runs

Phase 2: Merge → Combine using priority queue
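Here is a minimal sketch of that Phase 2 priority-queue merge, with runs simplified to in-memory lists (a real implementation streams page-sized input buffers from disk and refills them as they drain); only the standard heapq module is assumed.

import heapq

def k_way_merge(runs):
    """Merge already-sorted runs using a priority queue.

    Each heap entry is (current value, run index, position within run),
    mirroring 'one input buffer per run feeding one output buffer'.
    """
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)

    output = []                                    # stands in for the output buffer
    while heap:
        value, run_idx, pos = heapq.heappop(heap)  # smallest key across all runs
        output.append(value)
        if pos + 1 < len(runs[run_idx]):           # pull the next key from that run
            heapq.heappush(heap, (runs[run_idx][pos + 1], run_idx, pos + 1))
    return output

print(k_way_merge([[1, 4, 9], [2, 3, 8], [5, 6, 7]]))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]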


Cost Analysis

IO Cost Reference

For detailed IO cost formulas, see IO Cost Reference.

BigSort vs Hash Partitioning: Cost Intuition

Hash Partitioning: always 2N IOs (a single pass that reads every page once and writes every page once)

BigSort: 2N IOs per pass, and the number of passes grows logarithmically with file size: roughly 1 + ⌈log_{B-1}(N/B)⌉ passes in total

Why the difference? Hash Partitioning distributes the data in one shot, but BigSort must keep merging runs until a single sorted file remains.
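Plugging in the scaled-up example (N = 1,000,000 pages, B = 100, and the 3 merge passes computed by the earlier sketch under its (B-1)-way merge assumption), the gap looks like this:

N = 1_000_000                      # pages, from the scaled-up example
hash_partition_ios = 2 * N         # one distribution pass: read N + write N
bigsort_passes = 1 + 3             # 1 run-creation pass + 3 merge passes (assumed (B-1)-way merge)
bigsort_ios = 2 * N * bigsort_passes

print(f"{hash_partition_ios:,}")   # 2,000,000 IOs
print(f"{bigsort_ios:,}")          # 8,000,000 IOs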


Key Takeaways

  1. Divide and Conquer: Split a large sort into small in-memory sorts, then merge the results; this is how you sort TBs of data with GBs of RAM

  2. Scalable Cost: Works for any data size with a fixed amount of RAM, and the number of passes grows only logarithmically

  3. Foundation of Big Data: This is how PostgreSQL, BigQuery, and Spark handle ORDER BY when the data doesn't fit in memory