BigSort: Sorting When Nothing Fits

BigSort for Massive Data

Remember Merge Sort?

Small data (fits in RAM): Split array → Sort pieces → Merge sorted pieces
Big data (doesn't fit): Same idea! Split file → Sort runs → Merge sorted runs

BigSort (also called External Sort) scales merge sort to handle data that's way bigger than RAM.

BigSort: N pages → 10 sorted runs → 1 sorted file

Input: the unsorted Listens table, N = 1000 pages (64GB). Page 1 holds user_id [89, 23, 67, 12, 91, ...], Page 2 holds user_id [56, 78, 34, 90, 11, ...], and so on for 998 more pages, all in random order.

Problem: 64GB of unsorted data, but only 6.4GB of RAM available.

Phase 1: Split into sorted runs (B = 100 pages ≈ 6.4GB RAM buffer)

  1. Load 100 pages into the RAM buffer
  2. Quicksort them in RAM
  3. Write the sorted run to disk

This produces N/B = 1000/100 = 10 sorted runs on disk (R0 through R9, each individually sorted).

Phase 2: Merge the sorted runs

Use 10 input buffers (one per run) plus 1 output buffer, each about 10 pages. A priority queue repeatedly moves the smallest key from the input buffers into the output buffer.

Output: all 1000 pages (64GB) globally sorted, e.g. user_id [1, 1, 2, 2, 3, 4, 5, ...].

Scaling to bigger files

Current example: N = 1000 pages, B = 100 → 10 sorted runs → a single merge pass.

Bigger example: N = 1M pages, B = 100 → 10,000 sorted runs!

  • Phase 1: create 10,000 sorted runs of 100 pages each
  • Phase 2: multiple merge passes are needed (only about 100 runs can be merged at a time)
  • Total passes: 1 (for Phase 1) + ⌈log₁₀₀(10,000)⌉ (for Phase 2) = 1 + 2 = 3 passes to get the final sorted file
  • See the main equation sheet for optimized costs with repacking techniques.
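The pass counts in the scaling example can be sanity-checked with a few lines of Python. This is a minimal sketch (the function name and fan-in parameter are illustrative, not from the notes), using the same simplification as the example above: a merge pass can combine up to B runs at a time.

import math

def bigsort_passes(n_pages, b_pages, fan_in):
    """Total passes over the data: one split pass plus however many merge passes."""
    runs = math.ceil(n_pages / b_pages)   # sorted runs produced by Phase 1
    merge_passes = 0
    while runs > 1:                       # each merge pass combines fan_in runs into one
        runs = math.ceil(runs / fan_in)
        merge_passes += 1
    return 1 + merge_passes

# A fan-in of 100 matches the "merge 100 runs at a time" simplification above;
# the algorithm box below reserves one buffer for output, i.e. a fan-in of B-1.
print(bigsort_passes(1_000, 100, fan_in=100))      # 2: split + a single merge pass
print(bigsort_passes(1_000_000, 100, fan_in=100))  # 3: split + two merge passes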

Interactive BigSort Animation

[Animation: BigSort's split and merge phases with actual data movement]


The BigSort Algorithm

Core Idea

  1. Divide: Create sorted runs that fit in memory

  2. Conquer: Merge sorted runs into final output

📋 ALGORITHM: BigSort (External Sort)

BigSort(file, B):    // B = total RAM pages available

    // Phase 1: Create sorted runs
    runs = []
    while not EOF:
        chunk = read(B pages)
        sort(chunk)           // quicksort in RAM
        run_i = write_new_run(chunk)   // spill the sorted chunk to disk as run i
        runs.add(run_i)

    // Phase 2: Merge runs
    while |runs| > 1:
        next_runs = []
        for i = 0 to |runs| step B-1:
            batch = runs[i : i+B-1]   // B-1 input buffers + 1 output buffer
            merged = KWayMerge(batch)    // priority queue over the current head of each run
            next_runs.add(merged)
        runs = next_runs
    return runs[0]

Result: Fully sorted file
Cost: C_r×N + C_w×N per pass
Key Property: Handles any file size with limited RAM
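For concreteness, here is a small runnable Python sketch of the same two-phase idea (an illustration, not the notes' reference implementation). It sorts a stream of integers, using temporary files as the on-disk runs and heapq.merge for the k-way merge; the chunk size, the one-value-per-line run format, and the single merge pass are simplifying assumptions.

import heapq
import random
import tempfile

def big_sort(values, chunk_size=1_000):
    """External sort sketch: Phase 1 writes sorted runs to temp files,
    Phase 2 streams a k-way merge over all runs (single merge pass)."""
    runs = []

    # Phase 1: fill a chunk, sort it in memory, spill it to disk as a run
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            runs.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_write_run(sorted(chunk)))

    # Phase 2: priority-queue merge of all sorted runs
    yield from heapq.merge(*(_read_run(f) for f in runs))

def _write_run(sorted_chunk):
    f = tempfile.TemporaryFile(mode="w+t")          # stands in for a sorted run file
    f.writelines(f"{v}\n" for v in sorted_chunk)
    f.seek(0)
    return f

def _read_run(f):
    return (int(line) for line in f)

# Usage: sort 10,000 pseudo-random integers with 1,000-value runs
data = (random.randrange(1_000_000) for _ in range(10_000))
result = list(big_sort(data))
assert result == sorted(result)

When the number of runs exceeds the available buffers, a real implementation replaces the single heapq.merge call with repeated (B-1)-way merge passes, exactly as in the pseudocode's Phase 2 loop.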

Example Problem: Break Big Sort Into Small Sorts

Strategy: When you need to sort data that's too big to fit in memory, break it into small pieces you can sort, then merge the sorted pieces back together. This divide-and-conquer approach makes the impossible possible.

The setup: a 64GB unsorted table and only 6.4GB of available RAM, a 10× size mismatch. BigSort is the solution.

Why Sort?

-- These queries need sorted data:
SELECT * FROM Users ORDER BY last_active DESC;
SELECT DISTINCT user_id FROM Listens;
SELECT * FROM Songs ORDER BY play_count DESC LIMIT 100;

-- Sorting enables:
-- 1. ORDER BY clauses
-- 2. Efficient DISTINCT
-- 3. Sort-merge joins
-- 4. Rank/percentile calculations
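To illustrate the "efficient DISTINCT" point: once the input is sorted, duplicates sit next to each other, so DISTINCT becomes a single streaming pass with no hash table. A tiny Python sketch (the data is made up):

from itertools import groupby

# Sorted stream of user_ids, e.g. the output of BigSort on Listens
sorted_user_ids = [1, 1, 2, 2, 3, 4, 5, 5, 5, 7]

# DISTINCT over sorted input: equal values are adjacent, so one pass suffices
distinct_ids = [uid for uid, _ in groupby(sorted_user_ids)]
print(distinct_ids)   # [1, 2, 3, 4, 5, 7]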

Two-Phase Execution: Applying the Algorithm

-- PROBLEM: Need to sort this massive table
SELECT * FROM Listens ORDER BY user_id;
-- But 64GB data won't fit in 6.4GB RAM for sorting!

SOLUTION: Two-Phase Approach

Phase 1: Split → Create sorted runs

Phase 2: Merge → Combine using priority queue
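Here is a minimal sketch of that Phase 2 merge with an explicit priority queue (Python's heapq); the run contents are illustrative, and values are assumed never to be None. Each sorted run contributes its current head to the heap; popping the minimum and refilling from that run streams out globally sorted output.

import heapq

def k_way_merge(runs):
    """Merge already-sorted runs using a priority queue of (value, run_index)."""
    iters = [iter(r) for r in runs]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))

    while heap:
        value, i = heapq.heappop(heap)   # smallest head across all runs
        yield value
        nxt = next(iters[i], None)       # refill from the run we just took from
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

# Usage: three sorted runs, as produced by Phase 1
runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(list(k_way_merge(runs)))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]

The standard library's heapq.merge does exactly this bookkeeping, which is why the full sketch earlier can call it directly.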


Cost Analysis

IO Cost Reference

For detailed IO cost formulas, see IO Cost Reference.

BigSort vs Hash Partitioning: Cost Intuition

Hash Partitioning: Always 2N IOs

BigSort: 2N IOs per pass, with the number of passes growing logarithmically in the number of initial runs. With B = 100, sorting N = 1,000 pages takes 2 passes (about 4,000 IOs), while N = 1,000,000 pages takes 3 passes (about 6,000,000 IOs).

Why the difference? Hash Partitioning distributes data in one shot, but BigSort must repeatedly merge until everything is sorted.


Use Cases & Real-World Systems

ORDER BY Queries

-- BigQuery uses BigSort for massive ORDER BY
SELECT * FROM billion_row_table ORDER BY timestamp;
-- Automatically partitions and sorts across many machines

PostgreSQL

-- PostgreSQL work_mem setting controls when to use BigSort
SET work_mem = '1GB';
SELECT * FROM large_table ORDER BY created_at;
-- Uses BigSort if data > work_mem

Distributed Systems

Distributed engines such as Spark and BigQuery apply the same split-then-merge idea across many machines: each worker sorts its share of the data into runs, and the sorted results are merged to produce globally ordered output.

Key Takeaways

  1. Divide and Conquer: Split large sorts into small sorts, then merge

  2. Two-Phase Process: Split creates sorted runs, merge combines them

  3. Scalable Cost: 2N × passes - works for any data size with fixed RAM

  4. Foundation of Big Data: This is how PostgreSQL, BigQuery, Spark handle ORDER BY

  5. Memory Efficient: Sort TB of data with GB of RAM

The Punchline: When in doubt, sort it out! Sorted data enables countless optimizations downstream.