# Hash Indexes

*O(1) lookups for equality queries*

Hash indexes are the fastest way to find exact matches in a database.

**Prerequisites:** Read Index Fundamentals to understand data pages, index pages, and pointers.

*Visual: Physical Layout - Basic Index Structure*
## Two Ways to Organize Hash Indexes

### Basic Hash Index

**What it stores:** (value, page#) pairs

```python
# ONE giant dictionary
index = {
    "Alice": 2,        # Alice is in page 2
    "Bob": 0,          # Bob is in page 0
    "Carol": 0,        # Carol is in page 0
    "Dan": [0, 1, 4],  # Dan is in pages 0, 1, and 4
    # ... millions more entries
}

# Lookup is direct:
page_num = index["Alice"]  # Returns 2
```

**The problem:**

- 1 billion entries × 16 bytes each = 16 GB
- Add more values and the index no longer fits in memory
- Need a different approach...
### Partitioned Hash Index (Default)

**What it stores:** the same (value, page#) pairs

**But:** hash-partitioned across N smaller index pages

```python
# Split into N=16 smaller indexes
index_page_0 = {"Alice": 2, ...}    # ~64MB
index_page_1 = {"Bob": 0, ...}      # ~64MB
index_page_2 = {"Carol": 0, ...}    # ~64MB
...
index_page_15 = {"Dan": [0, 1, 4], ...}

# Two-step lookup:
def lookup(key):
    # Step 1: Hash to find WHICH index page
    which_page = hash(key) % 16
    # Step 2: Look up in that page
    return index_pages[which_page][key]
```

**Why this works:**

- Each index page is only ~64MB, so it fits in memory
- The hash instantly tells us which page to check
- Lookup time is still O(1)
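The two-step lookup above is only a fragment; here is a self-contained sketch of the same idea. The partition count of 16 and the sample keys are invented for illustration:

```python
# Partitioned hash index: N small dicts instead of one giant one.
N = 16
index_pages = [{} for _ in range(N)]

def insert(key, page_num):
    # Route each (value, page#) entry to a partition by hashing the key.
    index_pages[hash(key) % N][key] = page_num

def lookup(key):
    # Step 1: hash picks WHICH index page.
    # Step 2: ordinary dict lookup inside that page.
    return index_pages[hash(key) % N].get(key)

insert("Alice", 2)
insert("Bob", 0)
lookup("Alice")  # 2
lookup("Eve")    # None - key was never indexed
```

Because both insert and lookup hash the key the same way, each key always lands in (and is found in) the same partition, and each partition holds only ~1/N of the entries.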
## Optional Extensions

Additional optimizations you can apply to either index type:

- Each extension solves a specific problem - use what you need
- Extensions can be combined (e.g., Multi-Column + Bloom on a partitioned index)
- Trade-off: added complexity for specific performance gains
| Extension | What It Does | How It Works | When to Use | Trade-off |
|---|---|---|---|---|
| Bloom Filters | Probabilistic pre-filter | `if bloom.might_contain(k): check_index()` | Filter non-existent keys; 10-100x size reduction; used by Cassandra, HBase | False positives possible |
| Multi-Column | Index multiple columns | `hash(lastname + zip)` | Compound-key queries: `WHERE lastname='X' AND zip='Y'` | More space, faster queries |
| Sparse Indexes | Index every Nth value | Data: `[1,2,3,4,5...]` → Index: `[1→p0, 5→p1...]` | Sorted data only; N× space reduction; time-series DBs | Small scan needed |
| Adaptive | Build index on demand | `if not indexed(k): add_to_index(k)` | Unknown or changing query patterns | First queries slower |
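To make the Bloom filter row concrete, here is a minimal sketch of a Bloom filter used as a pre-filter. The sizes chosen here (m=1024 bits, k=3 hash functions) are arbitrary illustrative values, not what Cassandra or HBase actually use:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch: k hash positions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive k positions by hashing the key with k different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means DEFINITELY absent; True means "maybe present".
        return all(self.bits[p] for p in self._positions(key))

bloom = BloomFilter()
for k in ("Alice", "Bob", "Carol"):
    bloom.add(k)

bloom.might_contain("Alice")  # True
bloom.might_contain("Zeke")   # almost certainly False -> skip the index entirely
```

Because the filter can only answer "definitely absent" or "maybe present", a True answer still requires checking the real index; a False answer skips the index (and its disk reads) entirely.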
## Common Combinations
| Combination | Use Case | Example System |
|---|---|---|
| Partitioning + Bloom | Keep partitions small + filter non-existent | Cassandra, HBase |
| Partitioning + Multi-Column | Partition compound indexes | Distributed SQL databases |
| Multi-Column + Sparse | Large sorted tables with compound keys | Analytical databases |
| Partitioning + Adaptive | Learn hot partitions over time | Self-tuning databases |
| Sparse + Partitioning | Massive time-series data | ClickHouse, InfluxDB |
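The sparse-index idea from the tables above can be sketched in a few lines. The page size of 4 values and the data are invented for illustration, and note that this only works because the data is sorted:

```python
import bisect

# Sorted data split into pages of 4 values each (toy page size).
pages = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]

# Sparse index: only the FIRST key of each page, not every key.
first_keys = [page[0] for page in pages]  # [1, 5, 9]

def lookup(key):
    # Binary-search for the last page whose first key <= search key,
    # then do the "small scan" inside that one page.
    page_num = bisect.bisect_right(first_keys, key) - 1
    if page_num < 0:
        return None
    return key if key in pages[page_num] else None

lookup(7)   # 7 - found on page 1 after scanning 4 values
lookup(13)  # None
```

The index holds one entry per page instead of one per row, which is where the N× space reduction in the table comes from.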
## How Hash Indexes Work

Pseudocode for building and probing a hash index:

```
// Build hash index
HashIndex(table, key_column):
    B = choose_bucket_count()    // often a prime number
    index = array[B] of lists
    for (row_pointer, row) in table:
        key = row[key_column]
        bucket = hash(key) % B
        index[bucket].append((key, row_pointer))
    index.B = B                  // remember bucket count for lookups
    return index

// Lookup exact match - O(1) average
Lookup(index, search_key):
    bucket = hash(search_key) % index.B
    for (key, pointer) in index[bucket]:
        if key == search_key:
            return pointer       // Found!
    return null                  // Not found
```
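A runnable Python version of the pseudocode above; the sample table rows and the `user_id` column name are invented for illustration:

```python
class HashIndex:
    """Bucketed hash index: each bucket holds (key, row_pointer) pairs."""

    def __init__(self, table, key_column, buckets=13):  # small prime bucket count
        self.B = buckets
        self.buckets = [[] for _ in range(self.B)]
        for row_pointer, row in enumerate(table):
            key = row[key_column]
            self.buckets[hash(key) % self.B].append((key, row_pointer))

    def lookup(self, search_key):
        # O(1) average: hash to one bucket, then scan its (short) chain.
        for key, pointer in self.buckets[hash(search_key) % self.B]:
            if key == search_key:
                return pointer
        return None

users = [{"user_id": 7, "name": "Alice"},
         {"user_id": 42, "name": "Bob"}]
idx = HashIndex(users, "user_id")
idx.lookup(42)  # 1 - the row position of Bob
idx.lookup(99)  # None
```

A prime bucket count is a common convention: it tends to spread keys more evenly when hash values share common factors.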
## When to Use Hash Indexes

### ✅ Perfect For:

- **Primary key lookups:** `WHERE user_id = 12345`
- **Foreign key joins:** `JOIN ON orders.user_id = users.id`
- **Exact match filters:** `WHERE status = 'ACTIVE'`
- **In-memory databases:** Redis, Memcached

### ❌ Cannot Handle:

- **Range queries:** `WHERE age BETWEEN 18 AND 25`
- **Sorting:** `ORDER BY created_at`
- **Prefix matching:** `WHERE name LIKE 'John%'`
- **Inequality:** `WHERE price < 100`
## SQL Examples

```sql
-- Hash index WORKS for these:
SELECT * FROM users WHERE user_id = 42;
SELECT * FROM orders WHERE order_id = 'ABC-123';
DELETE FROM sessions WHERE session_key = 'xyz789';

-- Hash index FAILS for these (need a B+Tree instead):
SELECT * FROM users WHERE age > 21;
SELECT * FROM orders ORDER BY created_at;
SELECT * FROM products WHERE price < 50;
```
## Performance Comparison

For 1 billion rows:

- **Hash index:** ~1-2 operations (hash the key, probe one bucket)
- **Full table scan:** up to 1,000,000,000 operations
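The gap can be demonstrated by counting key comparisons directly; here 100,000 rows stand in for a billion:

```python
# Count key comparisons: full table scan vs. hash lookup.
rows = [{"user_id": i} for i in range(100_000)]
index = {row["user_id"]: pos for pos, row in enumerate(rows)}  # hash index

target = 99_999  # worst case for the scan: the last row

# Full scan: compares every key until it hits the match.
scan_comparisons = 0
for row in rows:
    scan_comparisons += 1
    if row["user_id"] == target:
        break

pos = index[target]  # hash index: one hash + one bucket probe

scan_comparisons  # 100000
pos               # 99999
```

The scan cost grows linearly with the table, while the hash lookup cost stays flat, which is exactly the O(n) vs. O(1) difference above.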
## Real-World Systems

### PostgreSQL Hash Indexes

```sql
CREATE INDEX idx_user_hash ON users USING HASH (user_id);
-- Warning: not crash-safe (not WAL-logged) before PostgreSQL 10!
```

### MySQL MEMORY Engine

```sql
CREATE TABLE sessions (
    session_id VARCHAR(32) PRIMARY KEY,
    user_id INT,
    data TEXT
) ENGINE=MEMORY; -- Uses hash indexes by default
```

### Redis

```python
# Redis is essentially a giant hash table
redis.set("user:12345", json_data)  # O(1)
redis.get("user:12345")             # O(1)
```
## Key Takeaways

- **O(1) lookups** - fastest for exact-match queries
- **Memory-efficient partitioning** - split large indexes across pages
- **Equality only** - use a B+Tree for ranges
- **Real-world usage** - Redis, Memcached, database primary keys