Hashing: From Basics to Advanced
The Universal Data Structure Technique
Part 1: Basic Hashing
What is Hashing?
Hashing is your go-to method for turning data of any size into a neat, fixed-size value. It's a deterministic compression function: the same input always produces the same output.
Integer Hashing
Simple Modulo Hash
The simplest way to hash integers:
// Simple Modulo Hash
hash_modulo(x, B):
    return x % B

// MurmurHash-style (for better distribution in production)
hash_murmur(x, B):
    x ^= x >> 16
    x *= 0x85ebca6b
    x ^= x >> 13
    x *= 0xc2b2ae35
    x ^= x >> 16
    return x % B
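The constants in the second routine are the 32-bit finalizer ("fmix32") from MurmurHash3. A runnable sketch in Python — the `& 0xFFFFFFFF` masks are needed because Python integers are unbounded, whereas the original operates on 32-bit words:

```python
def hash_murmur(x, B):
    """MurmurHash3 fmix32 finalizer, then reduce into B buckets."""
    x &= 0xFFFFFFFF          # keep arithmetic in 32 bits
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x % B
```

Unlike plain `x % B`, which sends every multiple of B's factors to the same few buckets, the bit-mixing spreads patterned keys (e.g. 0, 10, 20, ...) across the whole range.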
Visual Example: Integer Hashing
String Hashing
From Characters to Numbers
Strings must first become numbers before they can be hashed. Here's how:
// Simple Sum (Bad - many collisions)
hash_sum(str, B):
    sum = 0
    for char in str:
        sum += ASCII(char)
    return sum % B

// Polynomial Rolling Hash (Better)
hash_poly(str, B):
    P = 31 // prime number
    hash = 0
    pow = 1
    for char in str:
        hash = (hash + ASCII(char) * pow) % B
        pow = (pow * P) % B
    return hash
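Both routines translate directly into Python. A quick way to see why the sum hash is bad: any two anagrams collide, while the polynomial hash weights each character by its position:

```python
def hash_sum(s, B):
    """Sum of character codes -- anagrams always collide."""
    return sum(ord(ch) for ch in s) % B

def hash_poly(s, B, P=31):
    """Polynomial rolling hash: ord(s[i]) * P^i summed mod B."""
    h, p = 0, 1
    for ch in s:
        h = (h + ord(ch) * p) % B
        p = (p * P) % B
    return h
```

With a large prime modulus like 10**9 + 7, `hash_sum("abc") == hash_sum("cba")`, but the polynomial hash separates them.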
Visual Example: String Hashing
Hash Collisions
The Birthday Paradox
Just like 23 people in a room make a birthday match more likely than not, hash collisions become likely long before the table is full: with B buckets, expect the first collision after only about √B insertions.
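The exact probability follows from multiplying the chances that each new key avoids every earlier one. A quick check in Python (`collision_probability` is an illustrative helper, not from the text):

```python
def collision_probability(n, B):
    """Probability that n uniformly hashed keys collide in B buckets."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (B - i) / B   # i-th key must miss the i occupied buckets
    return 1 - p_no_collision
```

Plugging in the birthday numbers: 23 keys into 365 buckets gives a collision probability just over 50%, while 22 keys stays just under it.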
Part 2: Advanced Hashing
Geospatial Hashing
Uber's H3: Hexagonal Hierarchical Spatial Index
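H3 tiles the globe with hexagonal cells at 16 resolutions and ships as its own library, so a faithful sketch is out of scope here. But the core idea — encode a location so that nearby points share a hash prefix, and longer hashes mean smaller cells — is the same one behind the older, rectangular geohash scheme. A minimal geohash encoder in Python (`geohash_encode` is an illustrative name; the bit-interleaving order and base-32 alphabet follow the public geohash convention):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=9):
    """Binary-search latitude and longitude, interleaving bits lon-first."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    result, ch, bit_count, even = [], 0, 0, True   # even bit -> longitude
    while len(result) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = ch * 2 + 1
            rng[0] = mid      # keep the upper half
        else:
            ch = ch * 2
            rng[1] = mid      # keep the lower half
        even = not even
        bit_count += 1
        if bit_count == 5:    # every 5 bits becomes one base-32 character
            result.append(BASE32[ch])
            bit_count, ch = 0, 0
    return "".join(result)
```

Truncating a geohash widens the cell, which is exactly what makes prefix matching work for "find everything near me" queries; H3's hexagons improve on this by having uniform neighbor distances.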
Locality Sensitive Hashing (LSH)
Finding Similar Items Fast
LSH is your tool for making similar items land in the same bucket with high probability — the opposite goal of a regular hash, which scatters even near-identical inputs.
// MinHash for Document Similarity
MinHash(document, k): // k hash functions
    shingles = extract_shingles(document, n=3)
    signature = []
    for i in 0..k-1:
        min_hash = ∞
        for shingle in shingles:
            hash_val = hash_i(shingle)
            min_hash = min(min_hash, hash_val)
        signature.append(min_hash)
    return signature
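A runnable sketch of the routine above in Python. The k independent hash functions are simulated by salting MD5 with the function index — a common trick; the helper names are my own. The fraction of matching signature positions estimates the Jaccard similarity of the shingle sets:

```python
import hashlib

def extract_shingles(text, n=3):
    """Set of character n-grams ('shingles') from the document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, k=64):
    """For each of k salted hash functions, keep the minimum over all shingles."""
    shingles = extract_shingles(text)
    signature = []
    for i in range(k):
        signature.append(min(
            int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two sentences differing by one word produce signatures that agree in most positions; unrelated text agrees almost nowhere.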
// LSH for Vector Similarity (Random Projection)
LSH_Vector(vector, random_planes): // pass the planes themselves, not just a count
    hash_bits = []
    for plane in random_planes:
        dot_product = dot(vector, plane.normal)
        if dot_product >= 0:
            hash_bits.append(1)
        else:
            hash_bits.append(0)
    return bits_to_int(hash_bits)
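In Python, the random hyperplanes can be drawn once from a Gaussian and reused for every vector (a minimal sketch with illustrative names; real systems use NumPy for the dot products):

```python
import random

def random_planes(dim, num_planes, seed=42):
    """One Gaussian normal vector per hyperplane, fixed by seed for reuse."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_hash(vector, planes):
    """One bit per plane: which side of the hyperplane the vector falls on."""
    bits = 0
    for normal in planes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits
```

Because only the sign of each dot product matters, the hash is scale-invariant — a vector and twice that vector get identical codes — while a vector and its negation get exactly complementary bits.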
// Facebook Faiss-style Product Quantization
PQ_Hash(vector, lookup_tables):
    d = len(vector)
    m = len(lookup_tables) // number of subvectors
    subvector_dim = d / m
    codes = []
    for i in 0..m-1:
        subvector = vector[i*subvector_dim : (i+1)*subvector_dim]
        nearest_prototype = find_nearest(subvector, lookup_tables[i])
        codes.append(nearest_prototype.id)
    return codes // compressed representation
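Assuming the lookup tables (codebooks) have already been trained — Faiss does this with k-means per subspace — the encode/decode steps above look like this in Python (`pq_encode`/`pq_decode` are illustrative names, not the Faiss API):

```python
def pq_encode(vector, codebooks):
    """Replace each subvector with the index of its nearest prototype."""
    m = len(codebooks)
    sub = len(vector) // m
    codes = []
    for i in range(m):
        chunk = vector[i * sub:(i + 1) * sub]
        codes.append(min(
            range(len(codebooks[i])),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(chunk, codebooks[i][j])),
        ))
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector by concatenating the chosen prototypes."""
    out = []
    for i, c in enumerate(codes):
        out.extend(codebooks[i][c])
    return out
```

Decoding returns the prototypes rather than the original values — that reconstruction error is exactly the trade-off discussed below.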
What is Quantization?
Quantization compresses data by grouping similar values:
- Rounding numbers: 3.14159 → 3.14
- Color reduction: 16M colors → 256 colors
- Vector quantization: [0.823, 0.419, ...] → Lookup Table ID #42

Example: Product Quantization (PQ) in Action:
- Split: 512-dim vector → 8 chunks of 64 dims each
- Map: Each chunk → nearest prototype from lookup table
- Store: 8 prototype IDs instead of 512 floats
- Compression: 512×4 bytes → 8×1 byte = 256x smaller!
The Trade-off: You lose exact values but keep the important similarities.
Visual: LSH for Similar Image Search
Vector/Embedding Hashing
Hashing High-Dimensional Data
Machine learning churns out high-dimensional embeddings. The LSH and quantization techniques above are what make them hashable — and searchable — at scale.
Real-World Applications
Facebook (Meta)
- Faiss library for billion-scale similarity search
- Used for finding similar posts, images, videos

Google
- YouTube recommendations via embedding similarity
- Image search using visual embeddings

Spotify
- Song recommendations via audio embeddings
- Playlist generation using LSH

OpenAI
- Semantic search over text embeddings
- Finding similar code snippets
Key Takeaways
- Basic hashing maps any data to a fixed range [0, B-1].
- Integer/String hashing forms the backbone of hash tables.
- Collisions are inevitable; handle them with chaining or open addressing.
- Geospatial hashing (H3) powers location-based services.
- LSH finds similar items without exhaustive comparisons.
- Vector hashing makes ML embeddings searchable at scale.
The Punchline: Hashing transforms infinite input spaces into finite, manageable buckets while preserving the properties you care about — equality for exact lookups, proximity for similarity search.