Hash Indexes

O(1) Lookups for Equality Queries

Hash indexes are the fastest way to find exact matches in a database.

  • O(1): Average lookup time
  • Equality only: WHERE id = 42
  • No sorting: Can't do ORDER BY
  • Memory based: Works best when the index fits in RAM

Prerequisites: Read Index Fundamentals to understand data pages, index pages, and pointers.


Visual: Physical Layout - Basic Index Structure

[Diagram: data pages of 64MB each (Page 0-4) holding rows such as Bob, Carol, Dan..., alongside hash index pages that map each value to its data page number, e.g. Alice → Page 2, Bob → Page 0, Dan → Pages 0, 1, 4. Lookup flow: hash(value) → find index page → look up page# → read data page (O(1) lookup).]

Two Ways to Organize Hash Indexes

Both store the exact same data: (value, page#) mappings. The difference is how they organize it for memory management.

Basic Hash Index

What it stores: (value, page#) pairs

# ONE giant dictionary
index = {
  "Alice": 2,     # Alice is in page 2
  "Bob": 0,       # Bob is in page 0  
  "Carol": 0,     # Carol is in page 0
  "Dan": [0,1,4], # Dan in pages 0,1,4
  # ... millions more entries
}

# Lookup is direct:
page_num = index["Alice"]  # Returns 2

The Problem:
  • 1 billion entries × 16 bytes each = 16 GB
  • More values? Won't fit in memory!
  • Need a different approach...

Partitioned Hash Index (Default)

What it stores: Same (value, page#) pairs
But: hash-partitioned across N smaller index pages

# Split into N=16 smaller indexes
index_page_0 = {"Alice": 2, ...}  # ~64MB
index_page_1 = {"Bob": 0, ...}    # ~64MB
index_page_2 = {"Carol": 0, ...}  # ~64MB
...
index_page_15 = {"Dan": [0,1,4], ...}

# Collect them so one can be picked by number
index_pages = [index_page_0, index_page_1, ..., index_page_15]

# Two-step lookup:
def lookup(key):
    # Step 1: Hash to find WHICH index page
    which_page = hash(key) % 16

    # Step 2: Look up in that page
    return index_pages[which_page][key]

Why this works:
  • Each index page is only ~64MB (fits in memory)
  • The hash instantly tells us which page to check
  • Still O(1) lookup time!
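
For completeness, here is a minimal runnable sketch of how such index pages could be built from a stream of (value, page#) pairs; the function names and the page count of 16 are illustrative choices, not taken from any particular engine.

N_PAGES = 16  # number of index pages (illustrative)

def build_partitioned_index(entries):
    # entries: iterable of (value, data_page_number) pairs
    index_pages = [{} for _ in range(N_PAGES)]
    for value, page_num in entries:
        which_page = hash(value) % N_PAGES
        # A value can live in several data pages, so keep a list of page#s
        index_pages[which_page].setdefault(value, []).append(page_num)
    return index_pages

def lookup(index_pages, value):
    which_page = hash(value) % N_PAGES
    return index_pages[which_page].get(value)   # list of page#s, or None

pages = build_partitioned_index(
    [("Alice", 2), ("Bob", 0), ("Dan", 0), ("Dan", 1), ("Dan", 4)])
print(lookup(pages, "Dan"))    # [0, 1, 4]
print(lookup(pages, "Zelda"))  # None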


Optional Extensions

Additional optimizations you can apply to either index type

  • Each extension solves a specific problem - use what you need
  • Extensions can be combined (e.g., Multi-Column + Bloom on partitioned index)
  • Trade-offs: Added complexity for specific performance gains
Extension: Bloom Filters
  • What it does: Probabilistic pre-filter
  • How it works: if bloom.might_contain(k): check_index()
  • When to use: Filter out non-existent keys; 10-100x size reduction; used by Cassandra, HBase
  • Trade-off: False positives possible

Extension: Multi-Column
  • What it does: Index multiple columns together
  • How it works: hash(lastname + zip)
  • When to use: Compound-key queries, e.g. WHERE lastname='X' AND zip='Y'
  • Trade-off: More space, faster queries

Extension: Sparse Indexes
  • What it does: Index every Nth value
  • How it works: Data: [1,2,3,4,5...] → Index: [1→p0, 5→p1...]
  • When to use: Sorted data only; N× space reduction; common in time-series DBs
  • Trade-off: Small scan needed after the index hit

Extension: Adaptive
  • What it does: Build the index on demand
  • How it works: if not indexed(k): add_to_index(k)
  • When to use: Unknown query patterns, or patterns that change over time
  • Trade-off: First queries are slower
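
As a concrete illustration of the Bloom filter row above, here is a minimal sketch of a probabilistic pre-filter in front of a hash index; SimpleBloomFilter is a hand-rolled toy for illustration, not a production implementation or a specific library's API.

import hashlib

class SimpleBloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Pre-filter in front of the hash index: definite "no" answers skip the index
index = {"Alice": 2, "Bob": 0, "Carol": 0}
bloom = SimpleBloomFilter()
for key in index:
    bloom.add(key)

def lookup(key):
    if not bloom.might_contain(key):
        return None            # definitely absent; no index page touched
    return index.get(key)      # might be present (false positives possible)

print(lookup("Alice"))   # 2
print(lookup("Zelda"))   # None, usually without touching the index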

Common Combinations

Combination                    Use Case                                        Example System
Partitioning + Bloom           Keep partitions small + filter non-existent    Cassandra, HBase
Partitioning + Multi-Column    Partition compound indexes                      Distributed SQL databases
Multi-Column + Sparse          Large sorted tables with compound keys          Analytical databases
Partitioning + Adaptive        Learn hot partitions over time                  Self-tuning databases
Sparse + Partitioning          Massive time-series data                        ClickHouse, InfluxDB
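
To make one of these combinations concrete, here is a small sketch of Partitioning + Multi-Column: build a compound key from two columns, then hash that key to pick both the partition and the entry. The column names and the partition count are illustrative.

N_PARTITIONS = 16
partitions = [{} for _ in range(N_PARTITIONS)]

def compound_key(lastname, zip_code):
    return (lastname, zip_code)

def insert(lastname, zip_code, page_num):
    key = compound_key(lastname, zip_code)
    partitions[hash(key) % N_PARTITIONS][key] = page_num

def lookup(lastname, zip_code):
    # Only answers queries that supply BOTH columns with '='
    key = compound_key(lastname, zip_code)
    return partitions[hash(key) % N_PARTITIONS].get(key)

insert("Smith", "90210", 3)
print(lookup("Smith", "90210"))  # 3
print(lookup("Smith", "10001"))  # None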

How Hash Indexes Work

PSEUDOCODE: Hash Index

// Build hash index
HashIndex(table, key_column):
    B = choose_bucket_count()    // Often prime number
    index = array[B] of lists

    for row in table:
        key = row[key_column]
        bucket = hash(key) % B
        index[bucket].append((key, row_pointer))

    return index

// Lookup exact match - O(1) average
Lookup(index, search_key):
    B = length(index)             // Same bucket count used at build time
    bucket = hash(search_key) % B
    for (key, pointer) in index[bucket]:
        if key == search_key:
            return pointer    // Found!
    return null              // Not found
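
The same logic as a runnable Python sketch; the bucket count and the in-memory list of dicts used as the "table" are illustrative choices, not how any specific engine stores rows.

class HashIndex:
    def __init__(self, table, key_column, buckets=1024):
        # table: list of dicts; store (key, row position) in the chosen bucket
        self.buckets = buckets
        self.index = [[] for _ in range(buckets)]
        for row_pos, row in enumerate(table):
            key = row[key_column]
            self.index[hash(key) % buckets].append((key, row_pos))

    def lookup(self, search_key):
        # O(1) average: hash to one bucket, then scan its (usually tiny) chain
        for key, row_pos in self.index[hash(search_key) % self.buckets]:
            if key == search_key:
                return row_pos
        return None

# Usage
users = [{"user_id": 42, "name": "Alice"}, {"user_id": 7, "name": "Bob"}]
idx = HashIndex(users, "user_id")
print(idx.lookup(42))   # 0  (Alice's row position)
print(idx.lookup(99))   # None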


When to Use Hash Indexes

✅ Perfect For:

  • Exact-match equality lookups (WHERE user_id = 42)
  • Primary-key and unique-key lookups
  • Key-value workloads: session stores, caches

❌ Cannot Handle:

  • Range queries (WHERE age > 21, WHERE price < 50)
  • Sorting (ORDER BY created_at)
  • Anything that needs the keys in sorted order


SQL Examples

-- Hash index WORKS for these:
SELECT * FROM users WHERE user_id = 42;
SELECT * FROM orders WHERE order_id = 'ABC-123';
DELETE FROM sessions WHERE session_key = 'xyz789';

-- Hash index FAILS for these (need B+Tree instead):
SELECT * FROM users WHERE age > 21;
SELECT * FROM orders ORDER BY created_at;
SELECT * FROM products WHERE price < 50;

Performance Comparison

  • Hash Index: 1 hash + 1 bucket lookup, regardless of table size
  • No Index: full table scan of every row

For 1 billion rows, that is the difference between reading a couple of pages and reading every page in the table.
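
One way to get a feel for the gap is a rough micro-benchmark of a Python dict lookup versus a linear scan over a million in-memory rows; this is only an analogy for index vs. full scan, and the numbers will vary by machine.

import time

n = 1_000_000
rows = [{"user_id": i, "name": f"user{i}"} for i in range(n)]
index = {row["user_id"]: pos for pos, row in enumerate(rows)}

start = time.perf_counter()
_ = index[n - 1]                                                   # hash lookup
hash_time = time.perf_counter() - start

start = time.perf_counter()
_ = next(p for p, r in enumerate(rows) if r["user_id"] == n - 1)   # full scan
scan_time = time.perf_counter() - start

print(f"hash lookup: {hash_time*1e6:.1f} µs, full scan: {scan_time*1e3:.1f} ms")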


Real-World Systems

PostgreSQL Hash Indexes

CREATE INDEX idx_user_hash ON users USING HASH (user_id);
-- Warning: Not crash-safe before PostgreSQL 10!

MySQL MEMORY Engine

CREATE TABLE sessions (
    session_id VARCHAR(32) PRIMARY KEY,
    user_id INT,
    data TEXT
) ENGINE=MEMORY;  -- Uses hash indexes by default

Redis

# Redis is essentially a giant hash table
import redis

r = redis.Redis()                        # connect to a Redis server
r.set("user:12345", '{"name":"Alice"}')  # O(1)
r.get("user:12345")                      # O(1)

Key Takeaways

  1. O(1) lookups - Fastest for exact match queries

  2. Memory-efficient partitioning - Split large indexes across pages

  3. Equality only - Use B+Tree for ranges

  4. Real-world usage - Redis, Memcached, database primary keys