Hash Partitioning: Divide and Conquer
When Your Table is 10× Bigger Than RAM
The Hash Partitioning Algorithm
Core Idea
Split the data into B smaller chunks where each chunk contains all rows for a subset of keys.
📋 ALGORITHM: Hash Partitioning
HashPartition(file, B, key):
    output_buffers[B]                      // B output buffers + 1 input buffer
    for row in file:
        p = hash(row.key) % B
        output_buffers[p].add(row)
        if output_buffers[p].full:
            write(output_buffers[p] to dbfile_p)
            output_buffers[p].clear()
    flush_all(output_buffers)
Result: B partitions, each ≈ N/B pages
Cost: 2N IOs (read + write)
Key Property: Same key always goes to same partition
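To make the pseudocode concrete, here is a minimal Python sketch of the same loop, assuming rows arrive as dicts and each partition is spilled to a JSON-lines file; the file layout (`part_<p>.jsonl`), buffer size, and helper names are illustrative choices, not part of the algorithm.

```python
import json
from pathlib import Path

def hash_partition(rows, B, key, out_dir="partitions", buffer_size=1000):
    """Stream rows into B partition files; the same key always lands in the same file."""
    Path(out_dir).mkdir(exist_ok=True)
    buffers = [[] for _ in range(B)]          # B output buffers; `rows` plays the role of the input buffer

    def flush(p):
        if buffers[p]:
            with open(f"{out_dir}/part_{p}.jsonl", "a") as f:
                f.writelines(json.dumps(r) + "\n" for r in buffers[p])
            buffers[p].clear()

    for row in rows:                          # one pass over the input (read N pages)
        p = hash(row[key]) % B                # same key -> same partition
        buffers[p].append(row)
        if len(buffers[p]) >= buffer_size:    # buffer full: spill it to its partition file
            flush(p)

    for p in range(B):                        # flush_all: write out whatever remains
        flush(p)
```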
Example Problem: Breaking a Big Problem Into Small Problems
Strategy: When a problem is too large to solve at once, divide it into smaller, manageable pieces that can each be solved independently. Hash partitioning is the textbook example of this divide-and-conquer approach.
Why Can't We Just Process It All?
-- This query needs to group all data by user_id
SELECT user_id, COUNT(*) as play_count
FROM Listens
GROUP BY user_id;
-- Problem: Need all rows for each user in RAM at once!
-- With 500M rows, this won't fit
Two-Phase Execution: Applying the Algorithm
Now let's see how hash partitioning solves our example problem:
-- PROBLEM: This query requires all user data together
SELECT user_id, COUNT(*) as play_count
FROM Listens GROUP BY user_id;
-- But 64GB data won't fit in 6.4GB RAM!
SOLUTION: Two-Phase Approach
Phase 1: Partition → Hash data into 100 manageable files
- Apply hash(user_id) % 100 to each row
- Result: 100 DBFiles, each ~640MB (fits in RAM!)
- All data for each user stays in the same partition
Phase 2: Process → Aggregate each file independently
-- Process each partition file separately:
for each_partition_file:
    load_file_into_ram()          -- Now it fits!
    SELECT user_id, COUNT(*)
    FROM current_partition
    GROUP BY user_id;             -- Fast in-memory processing
-- Combine results from all 100 partitions = final answer
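Continuing the sketch above (and assuming the same `part_*.jsonl` layout it produced), Phase 2 might look like this: each partition is counted entirely in memory, and because no user_id ever spans two partitions, the per-partition results can simply be unioned into the final answer.

```python
import json
from collections import Counter
from pathlib import Path

def group_by_count(out_dir="partitions", key="user_id"):
    """Phase 2: aggregate each partition independently, then union the results."""
    results = {}
    for part in sorted(Path(out_dir).glob("part_*.jsonl")):
        counts = Counter()                 # fits in RAM: one partition at a time
        with open(part) as f:
            for line in f:
                counts[json.loads(line)[key]] += 1
        results.update(counts)             # no key appears in two partitions,
                                           # so a plain union is already the final answer
    return results
```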
Cost Analysis
Hash Partitioning Cost: C_r×N + C_w×N
- Read N pages from the original file → C_r×N cost
- Write N pages to B partitioned DBFiles → C_w×N cost
- Total: C_r×N + C_w×N IOs for the partitioning step
- For C_r = C_w = 1: 2N IOs (typical case)
Additional Processing on Partitioned Files
After partitioning, each smaller file can be processed independently:
- GROUP BY aggregation: Read each DBFile once → +C_r×N cost
- Sorting within partitions: Read + write each DBFile → +(C_r×N + C_w×N) cost
- Hash joins: Build hash tables from each DBFile → +C_r×N cost
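Putting numbers on these formulas for the running example (64GB of data, followed by a GROUP BY pass), and assuming 16KB pages since no page size is given above:

```python
PAGE_SIZE = 16 * 1024                  # assumed page size (not specified in the notes)
DATA_SIZE = 64 * 1024**3               # 64GB table
N = DATA_SIZE // PAGE_SIZE             # = 4,194,304 pages
C_r = C_w = 1                          # unit cost per page read / write

partition_cost = C_r * N + C_w * N     # read + write every page once: 2N
group_by_cost  = C_r * N               # one more read of every partition: +N
print(partition_cost + group_by_cost)  # 3N = 12,582,912 page IOs total
```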
Memory Requirement
- Minimum: B+1 pages (B output buffers + 1 input buffer)
- Optimal: 2B pages (double buffering)
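As a rough sanity check on how B interacts with memory (a sketch, assuming both memory M and the file size N are measured in pages): partitioning needs B+1 ≤ M buffer pages, and each resulting partition only fits in RAM afterwards if N/B ≤ M, so roughly N/M ≤ B ≤ M−1.

```python
import math

def choose_B(N_pages, M_pages):
    """Pick a partition count B (a sketch, not a tuned heuristic)."""
    min_B = math.ceil(N_pages / M_pages)   # each partition must fit in memory: N/B <= M
    max_B = M_pages - 1                    # need B output buffers + 1 input buffer: B+1 <= M
    if min_B > max_B:
        raise ValueError("one pass is not enough; partition recursively")
    return min_B                           # smallest B whose partitions still fit;
                                           # real systems add slack for skew

# Running example: 64GB file, 6.4GB RAM => N/M = 10, so B >= 10 (this section uses B = 100)
```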
Comparison
| Approach | Memory Needed | Hash Partition Cost | Additional Processing | Feasible? |
|---|---|---|---|---|
| Load everything | N pages (64GB) | N/A | C_r×N | ❌ Won't fit |
| Process page-by-page | 2 pages | N/A | Multiple passes | ❌ Complex |
| Hash partition | B+1 buffers (fits in 6.4GB) | C_r×N + C_w×N | +C_r×N to +(C_r×N + C_w×N) | ✅ Perfect! |
Use Cases & Real-World Systems
GROUP BY Operations
SELECT user_id, COUNT(*), AVG(rating)
FROM Listens
GROUP BY user_id;
Apache Spark automatically hash partitions by user_id:
df = spark.read.parquet("listens")
counts = df.groupBy("user_id").count()   # Auto-partitioned: hash shuffle on user_id
df = df.repartition(200, "user_id")      # Control the partition count explicitly
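Both `groupBy` and `repartition` trigger a hash shuffle: rows with the same `user_id` are routed to the same task, which is exactly the partitioning step above carried out across the cluster. Note that `repartition` returns a new DataFrame rather than modifying `df` in place, so its result must be assigned.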
Large Joins
PostgreSQL applies the same partitioning idea when a hash join is too big for memory:
EXPLAIN SELECT * FROM huge_table1 JOIN huge_table2 USING (key);
-- Hash Join: partition both tables, join matching partitions
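When the build side exceeds `work_mem`, PostgreSQL spills both inputs into hashed batch files on disk and joins them batch by batch, the same read-partition-process pattern described above.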
Distributed Processing
- BigQuery: 1000s of machines, each processes partitions in parallel
- MapReduce/Hadoop: the shuffle between map and reduce is hash partitioning by key
- Spark: each partition file (DBFile) is handled by a different executor
Handling Data Skew
# Problem: User 42 has 10M listens, User 99 has 10
# Solution: Multi-level partitioning or pre-splitting hot keys
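One common mitigation is to "salt" hot keys so their rows spread over several sub-partitions and are re-combined afterwards. A hedged sketch, where the hot-key set and salt factor are illustrative values rather than anything prescribed above:

```python
import random

HOT_KEYS = {42}          # keys known (or sampled) to be heavily skewed -- illustrative
SALT_FACTOR = 8          # spread each hot key across up to 8 sub-partitions

def salted_partition(key, B):
    """Route a row: hot keys get a random salt so no single partition receives all 10M rows."""
    if key in HOT_KEYS:
        return hash((key, random.randrange(SALT_FACTOR))) % B
    return hash(key) % B

# Caveat: aggregates over a salted key need a second combine step,
# because its rows now land in up to SALT_FACTOR different partitions.
```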
Key Takeaways
- Divide and Conquer: Split big problems into small ones
- Hash Ensures Completeness: All rows with the same key land in the same partition
- Linear IO Cost: 3N IOs (2N to partition + N to process each partition once) is much better than N²
- Foundation of Big Data: This is how PostgreSQL, BigQuery, Spark, and Hadoop work
- Memory Efficient: Process TBs of data with GBs of RAM