Hash Indexes

O(1) Lookups for Equality Queries

Hash indexes are the fastest way to find exact matches in a database.

  • O(1): Average lookup time
  • Equality only: WHERE id = 42
  • No sorting: Can't do ORDER BY
  • Memory based: Works best when the index fits in RAM

Prerequisites: Read Index Fundamentals to understand data pages, index pages, and pointers.


Visual: Physical Layout - Basic Index Structure

[Diagram: data pages of 64MB each (Page 0-4) holding rows such as Bob, Carol, Dan..., alongside hash index pages that map each value to its data page number, e.g. Alice → Page 2, Bob → Page 0, Dan → Pages 0, 1, 4. Lookup flow: hash(value) → find index page → look up page# → read data page (O(1) lookup).]

Two Ways to Organize Hash Indexes

Both store the exact same data: (value, page#) mappings. The difference is how they organize it for memory management.

Basic Hash Index

What it stores: (value, page#) pairs

# ONE giant dictionary
index = {
  "Alice": 2,     # Alice is in page 2
  "Bob": 0,       # Bob is in page 0  
  "Carol": 0,     # Carol is in page 0
  "Dan": [0,1,4], # Dan in pages 0,1,4
  # ... millions more entries
}

# Lookup is direct:
page_num = index["Alice"]  # Returns 2

The Problem:
  • 1 billion entries × 16 bytes each = 16 GB
  • More values? Won't fit in memory!
  • Need a different approach...

Partitioned Hash Index (Default)

What it stores: Same (value, page#) pairs
But: hash-partitioned across N smaller index pages

# Split into N=16 smaller indexes
index_page_0 = {"Alice": 2, ...}  # ~64MB
index_page_1 = {"Bob": 0, ...}    # ~64MB
index_page_2 = {"Carol": 0, ...}  # ~64MB
...
index_page_15 = {"Dan": [0,1,4], ...}

# Collect them so one can be picked by number
index_pages = [index_page_0, index_page_1, ..., index_page_15]

# Two-step lookup:
def lookup(key):
    # Step 1: Hash to find WHICH index page
    which_page = hash(key) % 16

    # Step 2: Look up in that page
    return index_pages[which_page][key]

Why this works:
  • Each index page is only ~64MB (fits in memory)
  • The hash instantly tells us which page to check
  • Still O(1) lookup time!
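
For completeness, here is a minimal runnable sketch of how such index pages could be built from a stream of (value, page#) pairs; the function names and the page count of 16 are illustrative choices, not taken from any particular engine.

N_PAGES = 16  # number of index pages (illustrative)

def build_partitioned_index(entries):
    # entries: iterable of (value, data_page_number) pairs
    index_pages = [{} for _ in range(N_PAGES)]
    for value, page_num in entries:
        which_page = hash(value) % N_PAGES
        # A value can live in several data pages, so keep a list of page#s
        index_pages[which_page].setdefault(value, []).append(page_num)
    return index_pages

def lookup(index_pages, value):
    which_page = hash(value) % N_PAGES
    return index_pages[which_page].get(value)   # list of page#s, or None

pages = build_partitioned_index(
    [("Alice", 2), ("Bob", 0), ("Dan", 0), ("Dan", 1), ("Dan", 4)])
print(lookup(pages, "Dan"))    # [0, 1, 4]
print(lookup(pages, "Zelda"))  # None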


Optional Extensions

Additional optimizations you can apply to either index type

  • Each extension solves a specific problem - use what you need
  • Extensions can be combined (e.g., Multi-Column + Bloom on partitioned index)
  • Trade-offs: Added complexity for specific performance gains
Extension: Bloom Filters
  • What it does: Probabilistic pre-filter
  • How it works: if bloom.might_contain(k): check_index()
  • When to use: Filter out non-existent keys; 10-100x size reduction; used by Cassandra, HBase
  • Trade-off: False positives possible

Extension: Multi-Column
  • What it does: Index multiple columns together
  • How it works: hash(lastname + zip)
  • When to use: Compound-key queries, e.g. WHERE lastname='X' AND zip='Y'
  • Trade-off: More space, faster queries

Extension: Sparse Indexes
  • What it does: Index every Nth value
  • How it works: Data: [1,2,3,4,5...] → Index: [1→p0, 5→p1...]
  • When to use: Sorted data only; N× space reduction; common in time-series DBs
  • Trade-off: Small scan needed after the index hit

Extension: Adaptive
  • What it does: Build the index on demand
  • How it works: if not indexed(k): add_to_index(k)
  • When to use: Unknown query patterns, or patterns that change over time
  • Trade-off: First queries are slower
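
As a concrete illustration of the Bloom filter row above, here is a minimal sketch of a probabilistic pre-filter in front of a hash index; SimpleBloomFilter is a hand-rolled toy for illustration, not a production implementation or a specific library's API.

import hashlib

class SimpleBloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Pre-filter in front of the hash index: definite "no" answers skip the index
index = {"Alice": 2, "Bob": 0, "Carol": 0}
bloom = SimpleBloomFilter()
for key in index:
    bloom.add(key)

def lookup(key):
    if not bloom.might_contain(key):
        return None            # definitely absent; no index page touched
    return index.get(key)      # might be present (false positives possible)

print(lookup("Alice"))   # 2
print(lookup("Zelda"))   # None, usually without touching the index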

Common Combinations

Combination                    Use Case                                        Example System
Partitioning + Bloom           Keep partitions small + filter non-existent    Cassandra, HBase
Partitioning + Multi-Column    Partition compound indexes                      Distributed SQL databases
Multi-Column + Sparse          Large sorted tables with compound keys          Analytical databases
Partitioning + Adaptive        Learn hot partitions over time                  Self-tuning databases
Sparse + Partitioning          Massive time-series data                        ClickHouse, InfluxDB
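
To make one of these combinations concrete, here is a small sketch of Partitioning + Multi-Column: build a compound key from two columns, then hash that key to pick both the partition and the entry. The column names and the partition count are illustrative.

N_PARTITIONS = 16
partitions = [{} for _ in range(N_PARTITIONS)]

def compound_key(lastname, zip_code):
    return (lastname, zip_code)

def insert(lastname, zip_code, page_num):
    key = compound_key(lastname, zip_code)
    partitions[hash(key) % N_PARTITIONS][key] = page_num

def lookup(lastname, zip_code):
    # Only answers queries that supply BOTH columns with '='
    key = compound_key(lastname, zip_code)
    return partitions[hash(key) % N_PARTITIONS].get(key)

insert("Smith", "90210", 3)
print(lookup("Smith", "90210"))  # 3
print(lookup("Smith", "10001"))  # None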

How Hash Indexes Work

PSEUDOCODE: Hash Index

// Build hash index
HashIndex(table, key_column):
    B = choose_bucket_count()    // Often prime number
    index = array[B] of lists

    for row in table:
        key = row[key_column]
        bucket = hash(key) % B
        index[bucket].append((key, row_pointer))

    return index

// Lookup exact match - O(1) average
Lookup(index, search_key):
    B = length(index)             // Same bucket count used at build time
    bucket = hash(search_key) % B
    for (key, pointer) in index[bucket]:
        if key == search_key:
            return pointer    // Found!
    return null              // Not found
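
The same logic as a runnable Python sketch; the bucket count and the in-memory list of dicts used as the "table" are illustrative choices, not how any specific engine stores rows.

class HashIndex:
    def __init__(self, table, key_column, buckets=1024):
        # table: list of dicts; store (key, row position) in the chosen bucket
        self.buckets = buckets
        self.index = [[] for _ in range(buckets)]
        for row_pos, row in enumerate(table):
            key = row[key_column]
            self.index[hash(key) % buckets].append((key, row_pos))

    def lookup(self, search_key):
        # O(1) average: hash to one bucket, then scan its (usually tiny) chain
        for key, row_pos in self.index[hash(search_key) % self.buckets]:
            if key == search_key:
                return row_pos
        return None

# Usage
users = [{"user_id": 42, "name": "Alice"}, {"user_id": 7, "name": "Bob"}]
idx = HashIndex(users, "user_id")
print(idx.lookup(42))   # 0  (Alice's row position)
print(idx.lookup(99))   # None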


When to Use Hash Indexes

✅ Perfect For:

  • Exact-match equality lookups (WHERE user_id = 42)
  • Primary-key and unique-key lookups
  • Key-value workloads: session stores, caches

❌ Cannot Handle:

  • Range queries (WHERE age > 21, WHERE price < 50)
  • Sorting (ORDER BY created_at)
  • Anything that needs the keys in sorted order


SQL Examples

-- Hash index WORKS for these:
SELECT * FROM users WHERE user_id = 42;
SELECT * FROM orders WHERE order_id = 'ABC-123';
DELETE FROM sessions WHERE session_key = 'xyz789';

-- Hash index FAILS for these (need B+Tree instead):
SELECT * FROM users WHERE age > 21;
SELECT * FROM orders ORDER BY created_at;
SELECT * FROM products WHERE price < 50;

Performance Comparison

  • Hash Index: 1 hash + 1 bucket lookup, regardless of table size
  • No Index: full table scan of every row

For 1 billion rows, that is the difference between reading a couple of pages and reading every page in the table.
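
One way to get a feel for the gap is a rough micro-benchmark of a Python dict lookup versus a linear scan over a million in-memory rows; this is only an analogy for index vs. full scan, and the numbers will vary by machine.

import time

n = 1_000_000
rows = [{"user_id": i, "name": f"user{i}"} for i in range(n)]
index = {row["user_id"]: pos for pos, row in enumerate(rows)}

start = time.perf_counter()
_ = index[n - 1]                                                   # hash lookup
hash_time = time.perf_counter() - start

start = time.perf_counter()
_ = next(p for p, r in enumerate(rows) if r["user_id"] == n - 1)   # full scan
scan_time = time.perf_counter() - start

print(f"hash lookup: {hash_time*1e6:.1f} µs, full scan: {scan_time*1e3:.1f} ms")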


Real-World Systems

PostgreSQL Hash Indexes

CREATE INDEX idx_user_hash ON users USING HASH (user_id);
-- Warning: Not crash-safe before PostgreSQL 10!

MySQL MEMORY Engine

CREATE TABLE sessions (
    session_id VARCHAR(32) PRIMARY KEY,
    user_id INT,
    data TEXT
) ENGINE=MEMORY;  -- Uses hash indexes by default

Redis

# Redis is essentially a giant hash table
import redis

r = redis.Redis()                        # connect to a Redis server
r.set("user:12345", '{"name":"Alice"}')  # O(1)
r.get("user:12345")                      # O(1)

Key Takeaways

  1. O(1) lookups - Fastest for exact match queries

  2. Memory-efficient partitioning - Split large indexes across pages

  3. Equality only - Use B+Tree for ranges

  4. Real-world usage - Redis, Memcached, database primary keys