Hash Partitioning: Divide and Conquer

When Your Table is 10× Bigger Than RAM

[Figure: Hash Partitioning, N Pages → B Partitions]

Running example: the Listens table is N = 1000 pages (64GB, 500M rows, ~500K rows per page), read sequentially from disk. With B = 100 output buffers (6.4GB of RAM: 100 × 64MB), each row is routed by partition = hash(user_id) % 100; for instance hash(42) % 100 = 7 sends a row to buffer 7, and hash(89) % 100 = 34 sends one to buffer 34. Each buffer collects rows for exactly one partition and flushes, when full, to its own DBFile on disk (0.dbf, 1.dbf, ..., 99.dbf), a strict 1:1 correspondence between buffers and DBFiles.

DBFile 7.dbf, for example, holds user_id 42 and every other key with hash % 100 = 7. At ~640MB (about 10 pages' worth of data) it fits entirely in RAM and can be processed independently in Phase 2. Summed over all 100 DBFiles, the total is still the original 64GB: the data is only rearranged, never lost.

Pass 1 (Partition): read a page from the input; for each row compute p = hash(user_id) % B; add the row to buffer[p] and flush the buffer to its DBFile when full.

The Hash Partitioning Algorithm

Core Idea

Split the data into B smaller chunks where each chunk contains all rows for a subset of keys.

📋 ALGORITHM: Hash Partitioning

HashPartition(file, B, key):
    output_buffers[B]    // B output buffers + 1 input buffer

    for row in file:
        p = hash(row.key) % B
        output_buffers[p].add(row)
        if output_buffers[p].full:
            write(output_buffers[p] to dbfile_p)
            output_buffers[p].clear()

    flush_all(output_buffers)

    Result: B partitions, each ≈ N/B pages
    Cost: 2N IOs (read + write)
    Key Property: Same key always goes to same partition
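
As a concrete reference, here is a minimal Python sketch of the pseudocode above. It assumes the Listens data lives in a two-column CSV of (user_id, song_id) and that out_dir already exists; the file names, buffer size, and crc32 hash are illustrative choices, not the only option.

import csv
import zlib

def partition_of(key, B):
    # Stable hash so the same user_id always maps to the same partition
    return zlib.crc32(str(key).encode()) % B

def hash_partition(input_path, out_dir, B=100, buffer_rows=10_000):
    """Pass 1: stream the input once, routing each row to one of B partition files."""
    buffers = [[] for _ in range(B)]                       # one output buffer per partition
    files = [open(f"{out_dir}/{p}.dbf", "w", newline="") for p in range(B)]
    writers = [csv.writer(f) for f in files]

    with open(input_path, newline="") as f:                # the single input buffer
        for user_id, song_id in csv.reader(f):
            p = partition_of(user_id, B)
            buffers[p].append((user_id, song_id))
            if len(buffers[p]) >= buffer_rows:             # buffer full -> flush to dbfile_p
                writers[p].writerows(buffers[p])
                buffers[p].clear()

    for p in range(B):                                     # flush_all(output_buffers)
        writers[p].writerows(buffers[p])
        files[p].close()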

Example Problem: Breaking a Big Problem Into Small Problems

Strategy: When facing a problem too large to solve at once, divide it into smaller, manageable pieces that can each be solved independently. Hash partitioning is the perfect example of this divide-and-conquer approach.

• 64GB: Spotify Listens table
• 6.4GB: available RAM
• 10×: size mismatch
• Solution: hash partition!

Why Can't We Just Process It All?

-- This query needs to group all data by user_id
SELECT user_id, COUNT(*) as play_count
FROM Listens
GROUP BY user_id;

-- Problem: we need all rows for each user_id grouped together in memory
-- With 500M rows (64GB of data), that is 10× more than our 6.4GB of RAM
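
A quick back-of-the-envelope check, using only the numbers quoted in this section (the byte sizes are the section's figures, not measurements):

table_bytes = 64 * 2**30            # 64GB Listens table
ram_bytes = int(6.4 * 2**30)        # 6.4GB of RAM available for the query
rows = 500_000_000

print(table_bytes / ram_bytes)      # 10.0   -> the table is 10x larger than RAM
print(table_bytes / rows)           # ~137.4 -> average bytes per row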

Two-Phase Execution: Applying the Algorithm

Now let's see how hash partitioning solves our example problem:

-- PROBLEM: This query requires all user data together
SELECT user_id, COUNT(*) as play_count
FROM Listens GROUP BY user_id;
-- But 64GB data won't fit in 6.4GB RAM!

SOLUTION: Two-Phase Approach

Phase 1: Partition → Hash data into 100 manageable files

Phase 2: Process → Each file independently

-- Process each partition file separately:
for each partition_file:
    load_file_into_ram()           -- now the partition fits!
    SELECT user_id, COUNT(*)
    FROM current_partition
    GROUP BY user_id;              -- fast in-memory processing

-- Combine results from all 100 partitions = final answer
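
A matching Phase-2 sketch in Python, assuming the 0.dbf ... 99.dbf files were written by the partitioner sketched earlier. Because every user_id lives in exactly one partition, the per-partition counts can simply be combined without any reconciliation:

import csv
from collections import Counter

def group_count(out_dir, B=100):
    """Phase 2: run the GROUP BY on each partition independently, then combine."""
    play_counts = {}                                   # user_id -> COUNT(*)
    for p in range(B):
        counts = Counter()                             # one partition at a time: fits in RAM
        with open(f"{out_dir}/{p}.dbf", newline="") as f:
            for user_id, _song_id in csv.reader(f):
                counts[user_id] += 1                   # in-memory GROUP BY user_id
        play_counts.update(counts)                     # partitions never share a user_id
    return play_counts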

Cost Analysis

Hash Partitioning Cost: C_r×N + C_w×N (read every page once, write every page once)

Additional Processing on Partitioned Files

After partitioning, each smaller file can be processed independently: reading the partitions back costs another C_r×N, and if the per-partition results must themselves be written out, a further C_w×N (the "Additional Processing" column in the comparison below).
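
For concreteness, here is the arithmetic for the running example, assuming unit page IO costs (C_r = C_w = 1); this is where the 3N figure in the takeaways comes from:

N = 1000                               # pages in the Listens table
partition_io = N + N                   # Pass 1: read every page + write every page = 2N
reprocess_io = N                       # Phase 2: read every partitioned page back   = 1N
print(partition_io + reprocess_io)     # 3000 page IOs total, i.e. 3N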

Memory Requirement Comparison

Approach              | Memory Needed    | Hash Partition Cost | Additional Processing        | Feasible?
Load everything       | N pages (64GB)   | N/A                 | C_r×N                        | ❌ Won't fit
Process page-by-page  | 2 pages          | N/A                 | Multiple passes              | ❌ Complex
Hash partition        | B pages (6.4GB)  | C_r×N + C_w×N       | +C_r×N to +(C_r×N + C_w×N)   | ✅ Perfect!

Use Cases & Real-World Systems

GROUP BY Operations

SELECT user_id, COUNT(*), AVG(rating)
FROM Listens
GROUP BY user_id;

Apache Spark automatically hash partitions by user_id:

df = spark.read.parquet("listens")
counts = df.groupBy("user_id").count()   # the shuffle hash-partitions rows by user_id
df = df.repartition(200, "user_id")      # explicitly control the partition count

Large Joins

PostgreSQL uses the same trick for big joins: when the hash table will not fit in work_mem, the hash join splits both inputs into batches (partitions) and joins matching batches:

EXPLAIN SELECT * FROM huge_table1 JOIN huge_table2 USING (key);
-- Hash Join: partition both tables, join matching partitions
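
The partition-then-join idea (often called a Grace or hybrid hash join) can be sketched in a few lines of Python. This is an illustration of the technique, not PostgreSQL's implementation; a real executor spills the per-partition batches to disk instead of keeping them in dictionaries:

import zlib
from collections import defaultdict

def partition_of(key, B):
    return zlib.crc32(str(key).encode()) % B

def partitioned_hash_join(left_rows, right_rows, B=8):
    """Join two lists of tuples on their first field, one partition pair at a time."""
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for row in left_rows:                       # partition both inputs on the join key
        left_parts[partition_of(row[0], B)].append(row)
    for row in right_rows:
        right_parts[partition_of(row[0], B)].append(row)

    for p in range(B):                          # matching keys always share a partition
        lookup = defaultdict(list)
        for row in left_parts[p]:               # build a hash table on one side
            lookup[row[0]].append(row)
        for row in right_parts[p]:              # probe with the other side
            for match in lookup[row[0]]:
                yield match + row[1:]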

Distributed Processing

Spark and Hadoop apply the same idea across machines: a shuffle routes each row to the worker chosen by hash(key) % num_workers, so every row for a given key ends up on the same node (sketched below).
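
In code form, the only thing that changes is the divisor (the worker count below is made up):

import zlib

def worker_for(key, num_workers=8):
    # Same formula as before, but the result picks a machine instead of a file
    return zlib.crc32(str(key).encode()) % num_workers

print(worker_for(42))   # every row for user 42 is sent to the same worker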

Handling Data Skew

# Problem: User 42 has 10M listens, User 99 has 10
# Solution: Multi-level partitioning or pre-splitting hot keys
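
One common way to pre-split a hot key is salting: append a small suffix so one user's rows spread across several partitions, then merge those partial counts at the end. A minimal sketch (the hot-key set and salt width of 4 are made up for illustration):

import random
import zlib

B = 100
HOT_KEYS = {42}            # keys measured (or known) to be heavily skewed
SALTS = 4                  # spread each hot key across up to 4 partitions

def skew_aware_partition(user_id):
    salt = random.randrange(SALTS) if user_id in HOT_KEYS else 0
    return zlib.crc32(f"{user_id}#{salt}".encode()) % B

# Phase 2 then needs one extra merge step: COUNT(*) for user 42 is the sum of the
# partial counts coming from its salted partitions.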

Key Takeaways

  1. Divide and Conquer: Split big problems into small ones

  2. Hash Ensures Completeness: All related data stays together

  3. Linear IO Cost: 3N IOs (2N to partition + N to re-read the partitions) is much better than N²

  4. Foundation of Big Data: This is how PostgreSQL, BigQuery, Spark, and Hadoop work

  5. Memory Efficient: Process TB of data with GB of RAM