Hash Indexes
O(1) Lookups for Equality Queries
Hash indexes are the fastest way to find exact matches in a database.
Prerequisites: Read Index Fundamentals to understand data pages, index pages, and pointers.
Visual: Physical Layout - Basic Index Structure
Two Ways to Organize Hash Indexes
Basic Hash Index
What it stores: (value, page#) pairs
# ONE giant dictionary
index = {
"Alice": 2, # Alice is in page 2
"Bob": 0, # Bob is in page 0
"Carol": 0, # Carol is in page 0
"Dan": [0,1,4], # Dan in pages 0,1,4
# ... millions more entries
}
# Lookup is direct:
page_num = index["Alice"] # Returns 2
The Problem: - 1 billion entries × 16 bytes each = 16 GB - More values? Won't fit in memory! - Need a different approach...
Partitioned Hash Index (Default)
What it stores: Same (value, page#) pairs
But: Hash Partitioned across N smaller index pages
# Split into N=16 smaller indexes
index_page_0 = {"Alice": 2, ...} # ~64MB
index_page_1 = {"Bob": 0, ...} # ~64MB
index_page_2 = {"Carol": 0, ...} # ~64MB
...
index_page_15 = {"Dan": [0,1,4], ...}
# Two-step lookup:
def lookup(key):
# Step 1: Hash to find WHICH index page
which_page = hash(key) % 16
# Step 2: Look up in that page
return index_pages[which_page][key]
Why this works: - Each index page is only ~64MB (fits in memory) - Hash instantly tells us which page to check - Still O(1) lookup time!
How Hash Indexes Work
PSEUDOCODE: Hash Index
// Build hash index
HashIndex(table, key_column):
B = choose_bucket_count() // Often prime number
index = array[B] of lists
for row in table:
key = row[key_column]
bucket = hash(key) % B
index[bucket].append((key, row_pointer))
return index
// Lookup exact match - O(1) average
Lookup(index, search_key):
bucket = hash(search_key) % B
for (key, pointer) in index[bucket]:
if key == search_key:
return pointer // Found!
return null // Not found
SQL Examples
-- Hash index WORKS for these:
SELECT * FROM users WHERE user_id = 42;
SELECT * FROM orders WHERE order_id = 'ABC-123';
DELETE FROM sessions WHERE session_key = 'xyz789';
-- Hash index FAILS for these (need B+Tree instead):
SELECT * FROM users WHERE age > 21;
SELECT * FROM orders ORDER BY created_at;
SELECT * FROM products WHERE price < 50;
Performance Comparison
For 1 billion rows:
-
Hash Index: ~1-2 operations
-
Full Table Scan: 1,000,000,000 operations
Optional Extensions
| Extension | What It Does | How It Works | When to Use | Trade-off |
|---|---|---|---|---|
| Multi-Column | Index multiple columns | hash(lastname + zip) |
• Compound key queries • WHERE lastname='X' AND zip='Y' |
More space, faster queries |
| Sparse Indexes | Index every Nth value | Data: [1,2,3,4,5...]Index: [1→p0, 5→p1...] |
• Sorted data only • N× space reduction • Time-series DBs |
Small scan needed |
| --- |
Key Takeaways
-
O(1) lookups - Fastest for exact match queries
-
Memory-efficient partitioning - Split large indexes across pages
-
Real-world usage - Redis, Memcached, database primary keys