Hashing: From Data to Algorithms
Concept. A hash function maps a key to a fixed-size bucket id, so you can find or insert in expected O(1) by reading the page where the bucket lives.
Intuition. Hash-index the Listens table by user_id, and finding Mickey's 3 listens in a billion-row table costs a single IO: the page where Mickey's bucket lives. No scan, no sort.
The Bridge from Hardware to Algorithms
We just spent an entire chapter proving that disks are slow and network access is costly. You now understand the physics!
For the rest of this course, we are going to use Algorithms and Data Structures to abstract those physical bottlenecks.
- Ignore device differences: focus on IO patterns.
- Count IOs, not milliseconds.
- The goal: an algorithm that reads 100 pages costs 100 IOs; an algorithm that reads a million pages costs 1,000,000 IOs.
When we say "Algorithm A costs 5N IOs" vs "Algorithm B costs N log N IOs", we are comparing how many times they hit the disk. Hashing is our first weapon for driving that cost down: a single expected IO per lookup.
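To make the IO accounting concrete, here is a quick sketch (our own toy calculation, not part of the course code) comparing those two hypothetical cost models:

```python
import math

# Cost models from the text: Algorithm A does 5N IOs, Algorithm B does
# N log2 N IOs. The function names are ours, for illustration only.
def ios_algorithm_a(n_pages: int) -> int:
    return 5 * n_pages

def ios_algorithm_b(n_pages: int) -> int:
    return math.ceil(n_pages * math.log2(n_pages))

for n in (100, 1_000_000):
    print(n, ios_algorithm_a(n), ios_algorithm_b(n))
```

Note that for any N above 32 pages, 5N is the cheaper of the two; which model wins depends on N, which is exactly why we count IOs symbolically.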
Part 1: Basic Hashing
What is Hashing?
Hashing is your go-to method for turning data of any size into a neat, fixed-size value. It is a deterministic function: the same input always produces the same output.
- O(1): average access time
- Deterministic: same input → same output
- Fixed range: every output lands in [0, B-1]
- Uniform: keys spread evenly across buckets
Integer Hashing
Simple Modulo Hash
The simplest way to hash integers:
// Simple modulo hash
hash_modulo(x, B):
    return x % B

// MurmurHash3-style finalizer (better distribution in production)
// All arithmetic is unsigned 32-bit, so multiplications wrap around
hash_murmur(x, B):
    x ^= x >> 16
    x *= 0x85ebca6b
    x ^= x >> 13
    x *= 0xc2b2ae35
    x ^= x >> 16
    return x % B
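As a sanity check, the two functions above translate directly into runnable Python. This is our own sketch; the bit-masking is added because Python integers never overflow, whereas the MurmurHash finalizer assumes 32-bit wraparound:

```python
def hash_modulo(x: int, B: int) -> int:
    """Simplest integer hash: the bucket id is just the remainder."""
    return x % B

def hash_murmur(x: int, B: int) -> int:
    """MurmurHash3-style finalizer: mixes the bits of x before taking
    the remainder, for a more uniform bucket distribution."""
    x &= 0xFFFFFFFF                       # work in unsigned 32-bit space
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF     # & masks emulate 32-bit wraparound
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x % B

# Sequential keys fall into buckets in lockstep under plain modulo,
# but are scattered after bit mixing:
print([hash_modulo(k, 8) for k in range(8)])   # [0, 1, 2, 3, 4, 5, 6, 7]
print([hash_murmur(k, 8) for k in range(8)])
```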
Visual Example: Integer Hashing
String Hashing
From Characters to Numbers
Strings must first become numbers before they can be hashed. Here's how:
// Simple sum (bad: any two anagrams collide)
hash_sum(str, B):
    sum = 0
    for char in str:
        sum += ASCII(char)
    return sum % B

// Polynomial rolling hash (better: character order changes the result)
hash_poly(str, B):
    P = 31  // small prime multiplier
    hash = 0
    pow = 1
    for char in str:
        hash = (hash + ASCII(char) * pow) % B
        pow = (pow * P) % B
    return hash
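Here is the same pair in runnable Python (our sketch), showing why the sum hash is bad: anagrams such as "listen" and "silent" always collide, while the polynomial hash tells them apart here:

```python
def hash_sum(s: str, B: int) -> int:
    """Sum of character codes: anagrams always collide."""
    return sum(ord(c) for c in s) % B

def hash_poly(s: str, B: int, P: int = 31) -> int:
    """Polynomial rolling hash: character order changes the result."""
    h, power = 0, 1
    for c in s:
        h = (h + ord(c) * power) % B
        power = (power * P) % B
    return h

B = 1009  # a prime bucket count helps the polynomial hash spread keys
print(hash_sum("listen", B) == hash_sum("silent", B))    # True: anagrams collide
print(hash_poly("listen", B) == hash_poly("silent", B))  # False: order matters
```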
Visual Example: String Hashing
Hash Collisions
The Birthday Paradox
Just as 23 people in a room make a shared birthday more likely than not, collisions become likely long before a hash table fills up. They are unavoidable; plan for them.
- √B: roughly the number of items at which a collision reaches a 50% chance
- Chaining: a linked list of entries at each bucket
- Open addressing: probe for the next empty slot
- Load factor: keep n/B below ~0.75
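Putting collisions and chaining together, here is a minimal chained hash table in Python (our own sketch, not a production design):

```python
class ChainedHashTable:
    """Minimal hash table using chaining: each bucket is a list of
    (key, value) pairs whose keys hashed to the same slot."""

    def __init__(self, B: int = 8):
        self.B = B
        self.buckets = [[] for _ in range(B)]
        self.n = 0  # number of stored keys

    def load_factor(self) -> float:
        return self.n / self.B

    def put(self, key, value):
        bucket = self.buckets[hash(key) % self.B]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # existing key: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: chain onto the bucket
        self.n += 1

    def get(self, key):
        bucket = self.buckets[hash(key) % self.B]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(B=4)
table.put("mickey", 3)
table.put("minnie", 7)
print(table.get("mickey"))   # 3
print(table.load_factor())   # 0.5
```

A real implementation would also resize (rehash into more buckets) once the load factor crosses its threshold.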
Part 2: Advanced Hashing
Locality Sensitive Hashing (LSH)
Finding Similar Items Fast
LSH is your tool for making similar items land in the same bucket with high probability, the opposite goal of a normal hash function, which tries to scatter everything evenly.
// MinHash for document similarity
MinHash(document, k):  // k independent hash functions
    shingles = extract_shingles(document, n=3)
    signature = []
    for i in 0..k-1:
        min_hash = ∞
        for shingle in shingles:
            hash_val = hash_i(shingle)
            min_hash = min(min_hash, hash_val)
        signature.append(min_hash)
    return signature
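A runnable version of the MinHash idea, using salted MD5 digests as the k hash functions (the shingle size, salting scheme, and test sentences are our own choices):

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """All overlapping character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def hash_i(i: int, shingle: str) -> int:
    """The i-th hash function: an MD5 digest salted with i."""
    return int(hashlib.md5(f"{i}:{shingle}".encode()).hexdigest(), 16)

def minhash(text: str, k: int = 64) -> list:
    """Signature = the minimum hash value under each of the k functions."""
    sh = shingles(text)
    return [min(hash_i(i, s) for s in sh) for i in range(k)]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash("the quick brown fox jumps over the lazy dog")
b = minhash("the quick brown fox jumped over the lazy dog")
print(estimated_jaccard(a, a))  # 1.0: identical signatures
print(estimated_jaccard(a, b))  # high, close to the true shingle Jaccard
```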
// LSH for vector similarity (random projection)
LSH_Vector(vector, random_planes):
    hash_bits = []
    for plane in random_planes:
        dot_product = dot(vector, plane.normal)
        if dot_product >= 0:
            hash_bits.append(1)
        else:
            hash_bits.append(0)
    return bits_to_int(hash_bits)
// Facebook Faiss-style product quantization
PQ_Hash(vector, lookup_tables):
    d = len(vector)
    m = len(lookup_tables)  // number of subvectors
    subvector_dim = d / m
    codes = []
    for i in 0..m-1:
        subvector = vector[i*subvector_dim : (i+1)*subvector_dim]
        nearest_prototype = find_nearest(subvector, lookup_tables[i])
        codes.append(nearest_prototype.id)
    return codes  // compressed representation
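Of the three pseudocode sketches above, random projection is the easiest to run end to end. Here is our toy Python version; the seed, plane count, and vectors are made up for illustration:

```python
import random

def make_planes(num_planes: int, dim: int, seed: int = 42):
    """Random hyperplanes through the origin, one normal vector each."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_vector(vector, planes) -> int:
    """One bit per plane: 1 if the vector is on the positive side."""
    bits = 0
    for normal in planes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

planes = make_planes(num_planes=8, dim=4)
close_a = [1.0, 0.2, -0.5, 0.3]
close_b = [0.9, 0.25, -0.45, 0.35]   # nearly the same direction as close_a
opposite = [-1.0, -0.2, 0.5, -0.3]   # exactly -close_a

ha, hb, hc = (lsh_vector(v, planes) for v in (close_a, close_b, opposite))
print(bin(ha ^ hb).count("1"))  # few (often zero) differing bits
print(bin(ha ^ hc).count("1"))  # 8: an opposite vector flips every bit
```

Similar vectors disagree on a bit only when a random plane happens to slice between them, so the Hamming distance between hashes tracks the angle between vectors.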
What is Quantization?
Quantization compresses data by grouping similar values:
- Rounding numbers: 3.14159 → 3.14
- Color reduction: 16M colors → 256 colors
- Vector quantization: [0.823, 0.419, ...] → lookup table ID #42
Example: Product Quantization (PQ) in Action:
- Split: 512-dim vector → 8 chunks of 64 dims each
- Map: each chunk → nearest prototype from its lookup table
- Store: 8 prototype IDs instead of 512 floats
- Compression: 512 × 4 bytes → 8 × 1 byte, a 256× reduction!
The Trade-off: You lose exact values but keep the important similarities.
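The rounding and color-reduction bullets are instances of scalar quantization; here is a minimal sketch (our own toy example) that squeezes floats into 1-byte codes and shows the bounded error:

```python
def quantize(values, levels=256):
    """Map floats in [lo, hi] onto `levels` evenly spaced integer codes."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (levels - 1)
    return [round((v - lo) / step) for v in values], lo, step

def dequantize(codes, lo, step):
    """Recover approximate floats from the integer codes."""
    return [lo + c * step for c in codes]

values = [0.823, 0.419, -0.117, 0.5]
codes, lo, step = quantize(values)
approx = dequantize(codes, lo, step)

print(all(0 <= c <= 255 for c in codes))  # True: each value fits in one byte
print(max(abs(v - a) for v, a in zip(values, approx)) < step)  # True: bounded error
```

Product quantization applies this same round-to-nearest-representative idea per subvector, with learned prototypes instead of an evenly spaced grid.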
Visual: LSH for Similar Image Search
Vector/Embedding Hashing
Hashing High-Dimensional Data
Machine learning churns out high-dimensional embeddings. Here's how to hash them efficiently.
Product Quantization
Split vector, quantize parts
Random Projection
Project to lower dims
Learned Hashing
Neural network hash functions
Real-World Applications
Facebook (Meta)
- Faiss library for billion-scale similarity search
- Finding similar posts, images, and videos
- Image search using visual embeddings

YouTube
- Video recommendations via embedding similarity
Spotify
- Song recommendations via audio embeddings
- Playlist generation using LSH
OpenAI
- Semantic search over text embeddings
- Finding similar code snippets
Key Takeaways
- Basic hashing maps any data to a fixed range [0, B-1].
- Integer and string hashing form the backbone of hash tables.
- Collisions are inevitable; handle them with chaining or open addressing.
- Geospatial hashing (H3) powers location-based services.
- LSH finds similar items without exhaustive comparisons.
- Vector hashing makes ML embeddings searchable at scale.
The Punchline: Hashing transforms infinite key spaces into finite, manageable buckets while preserving the property you care about: exact lookup for hash tables, similarity for LSH.