BigSort: Sorting TBs of Data with GBs of RAM

Merge Sort Revisited

When your data's small enough to fit in RAM, merge sort's a neat trick: divide the array, sort the pieces, and merge them back together. But when you're dealing with data that dwarfs your memory, you need a bigger plan. Enter BigSort, or External Sort, which scales this idea to handle data that laughs at your RAM limits.



The BigSort Algorithm (aka External Sort)

Core Idea

  1. Divide: Create sorted runs that fit in memory.

  2. Conquer: Merge those sorted runs into your final output.

ALGORITHM: BigSort

BigSort(file, B):    // B = total RAM pages available

    // Phase 1: create sorted runs, each up to B pages long
    runs = []
    while not EOF(file):
        chunk = read(next B pages)
        sort(chunk)               // any in-RAM sort, e.g. quicksort
        run = write(chunk)        // write sorted chunk to disk as a new run
        runs.add(run)

    // Phase 2: repeatedly merge up to B-1 runs at a time
    while |runs| > 1:
        next_runs = []
        for i = 0 to |runs| step B-1:
            batch = runs[i : i+B-1]   // B-1 input buffers + 1 output buffer
            merged = KWayMerge(batch)
            next_runs.add(merged)
        runs = next_runs
    return runs[0]

Result: Fully sorted file
Cost: C_r×N + C_w×N per pass (N = pages in the file)
Key Property: Handles any file size with limited RAM
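The pseudocode above can be sketched concretely in Python. This is a minimal illustration, not the production technique used by any particular database: it treats "a page" as one line of the input file, uses a hypothetical `chunk_size` in place of B, and merges all runs in a single pass with `heapq.merge` (which streams one line per run at a time, so RAM usage stays proportional to the number of runs, not the file size). All function and variable names are invented for this sketch.

```python
import heapq
import os
import tempfile

def create_runs(input_path, chunk_size):
    """Phase 1: read chunk_size lines at a time, sort each chunk in RAM,
    and write it to disk as a sorted run file."""
    runs = []
    with open(input_path) as f:
        while True:
            # Pull at most chunk_size lines; an empty chunk means EOF.
            chunk = [line for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort(key=int)  # in-RAM sort of one chunk
            run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            run.writelines(chunk)
            run.close()
            runs.append(run.name)
    return runs

def merge_runs(run_paths, output_path):
    """Phase 2: stream every run through a single k-way merge.
    heapq.merge holds only one buffered line per run in memory."""
    files = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*files, key=int))
    for f in files:
        f.close()
    for p in run_paths:
        os.remove(p)  # clean up the temporary run files
```

Note the design choice: because `heapq.merge` can take any number of inputs, this sketch merges all runs in one pass; a real external sort with only B buffers would cap each merge at B-1 runs and repeat, as in the pseudocode.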

Why Sort?

Sorting isn't just an academic exercise. It's the backbone of operations in databases and big data systems. Whether you're running a query in PostgreSQL, BigQuery, or Spark, sorting is how you get ordered results, and it's crucial for performance and efficiency.


Example Problem: Break Big Sort Into Small Sorts

  - Unsorted table: 64 GB
  - Available RAM: 6.4 GB
  - Size mismatch: 10×
  - The solution: BigSort

SOLUTION: Two-Phase Approach

Phase 1: Split → Create sorted runs

Phase 2: Merge → Combine using priority queue
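To make Phase 2 concrete, here is a small sketch of a priority-queue k-way merge over in-memory lists (standing in for sorted runs on disk). The heap holds one `(value, run_index, position)` entry per run, so only k items live in the heap at once regardless of run length. The function name and tuple layout are illustrative, not from the original text.

```python
import heapq

def k_way_merge(runs):
    """Merge k sorted lists using a min-heap that tracks, for each run,
    the smallest element not yet consumed."""
    # Seed the heap with the first element of every non-empty run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, pos = heapq.heappop(heap)  # overall smallest remaining
        out.append(value)
        if pos + 1 < len(runs[i]):
            # Refill from the same run the popped element came from.
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out
```

Each pop-and-refill step costs O(log k), so merging n total elements from k runs costs O(n log k), which is why a priority queue beats repeatedly scanning all k run heads.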


Cost Analysis

IO Cost Reference

For detailed IO cost formulas, see IO Cost Reference.

BigSort vs Hash Partitioning: Cost Intuition

Hash Partitioning: always 2N IOs, one read and one write of every page.

BigSort: variable cost, 2N IOs per pass, with the number of merge passes growing logarithmically in the number of initial runs.

Why the difference? Hash Partitioning distributes data in one shot, but BigSort must keep merging runs until only one sorted run remains.
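The pass count can be worked out directly from the formulas above. This is a small sketch under the standard model: N pages, B buffer pages, one run-creation pass producing ⌈N/B⌉ runs, then merge passes with fan-in B-1; the function names are made up for illustration.

```python
import math

def num_passes(n_pages, b_pages):
    """1 run-creation pass + ceil(log_{B-1}(#runs)) merge passes."""
    runs = math.ceil(n_pages / b_pages)
    merge_passes = math.ceil(math.log(runs, b_pages - 1)) if runs > 1 else 0
    return 1 + merge_passes

def total_io(n_pages, b_pages):
    """Every pass reads and writes each page once: 2*N IOs per pass."""
    return 2 * n_pages * num_passes(n_pages, b_pages)
```

For the 64 GB table with 6.4 GB of RAM from the example (say 64,000 pages against 6,400 buffer pages), Phase 1 produces only 10 runs, and a fan-in of 6,399 merges them in a single pass: 2 passes total, 4N IOs, only twice the cost of hash partitioning.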


Key Takeaways

  1. Divide and Conquer: split a large sort into small sorts, then merge the results. This is how you sort TBs of data with GBs of RAM.

  2. Scalable Cost: Works for any big data size with fixed RAM.

  3. Foundation of Big Data: This is how PostgreSQL, BigQuery, Spark handle ORDER BY.