Hash Partitioning: Divide and Conquer

When Your Table is 10× Bigger Than RAM


The Hash Partitioning Algorithm

Core Idea

Break the data into B smaller chunks, ensuring each chunk contains rows for a subset of keys. This is not about elegance; it's about necessity when your table dwarfs your RAM.

📋 ALGORITHM: Hash Partitioning

HashPartition(file, B, key):
    output_buffers[B]    // B output buffers + 1 input buffer

    for row in file:
        p = hash(row.key) % B
        output_buffers[p].add(row)
        if output_buffers[p].full:
            write(output_buffers[p] to dbfile_p)
            output_buffers[p].clear()

    flush_all(output_buffers)

    Result: B partitions, each ≈ N/B pages
    Cost: 2N IOs (read + write)
    Key Property: Same key always goes to same partition
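The algorithm box above can be sketched as runnable Python. This is an in-memory simulation, not a real disk implementation: the `partitions` dict stands in for the B on-disk files (dbfile_p), and `page_rows` is an assumed buffer-page capacity.

```python
from collections import defaultdict

def hash_partition(rows, B, key, page_rows=4):
    # One in-memory output buffer per partition; a real system would
    # also reserve one input buffer and flush full pages to dbfile_p.
    buffers = defaultdict(list)
    partitions = defaultdict(list)  # stands in for the B on-disk files

    for row in rows:
        p = hash(key(row)) % B            # same key -> same partition
        buffers[p].append(row)
        if len(buffers[p]) == page_rows:  # buffer page is full: "write" it
            partitions[p].extend(buffers[p])
            buffers[p].clear()

    for p, buf in buffers.items():        # flush remaining partial pages
        partitions[p].extend(buf)
    return dict(partitions)
```

Note that the flush-on-full logic keeps memory usage bounded by B buffer pages regardless of file size, which is exactly what makes the algorithm work when the table dwarfs RAM.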

Example Problem: Break Big Problem Into Small Problems

Strategy: Tackle oversized problems by slicing them into digestible segments. Hash partitioning exemplifies this divide-and-conquer tactic.

The numbers at stake:

  - 64 GB: Spotify Listens table
  - 6.4 GB: available RAM
  - 10×: size mismatch
  - Solution: hash partition!

Why Can't We Just Process It All?

When your data is ten times the size of your RAM, brute force isn't an option: a 64GB table cannot be held in 6.4GB of memory, and any attempt to build an in-memory structure over the whole table will thrash. You need a plan that respects your hardware's limits.


Two-Phase Execution: Applying the Algorithm

Here's how hash partitioning addresses our oversized data dilemma:

SOLUTION: Two-Phase Approach

Phase 1: Partition → Hash data into 100 manageable files

Phase 2: Process → Each file independently
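The sizing behind the two phases can be checked with quick arithmetic. A sketch, assuming the hash function spreads keys evenly across partitions:

```python
import math

table_gb = 64.0    # Spotify Listens table
ram_gb = 6.4       # available memory
B = 100            # partitions produced in Phase 1

partition_gb = table_gb / B   # 0.64 GB per partition
assert partition_gb < ram_gb  # so Phase 2 can process each file in RAM

# The smallest partition count that makes each chunk fit:
B_min = math.ceil(table_gb / ram_gb)  # 10; B = 100 leaves generous headroom
```

Choosing B well above the minimum is a common safety margin: skewed keys can make some partitions larger than N/B, and the only cost of extra partitions is extra buffer pages during Phase 1.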


Cost Analysis

Hash Partitioning Cost: C_r×N + C_w×N

where N is the number of pages in the file and C_r, C_w are the per-page read and write costs. With C_r = C_w = 1 IO per page, this is the 2N IOs quoted in the algorithm box.

Additional Processing on Partitioned Files

After partitioning, each smaller file can be processed independently: with B = 100, each partition is about 640MB and fits comfortably in the 6.4GB of available RAM.
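The cost formula above can be made concrete. A sketch; the 4 KB page size and unit IO costs are assumptions for illustration, not values from the text:

```python
def partition_cost(N, C_r=1.0, C_w=1.0):
    """IO cost of one hash-partitioning pass: read every page
    once (C_r * N) and write every page once (C_w * N)."""
    return C_r * N + C_w * N

PAGE_BYTES = 4 * 1024                 # assumed 4 KB pages
N = (64 * 1024**3) // PAGE_BYTES      # 64 GB table -> 16,777,216 pages
assert partition_cost(N) == 2 * N     # matches the 2N IOs quoted above
```

The key point is that the cost is a fixed multiple of the file size: doubling the table doubles the partitioning cost, nothing worse.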


Key Takeaways

  1. Divide and Conquer: Split big problems into small ones. Process TB of data with GB of RAM.

  2. Linear IO Cost: a partitioning pass costs O(N) IOs, far better than the O(N²) of approaches that compare everything against everything

  3. Foundation of Big Data: This is how PostgreSQL, BigQuery, Spark, and Hadoop process data that exceeds memory