Hashing: From Basics to Advanced
The Universal Data Structure Technique
Part 1: Basic Hashing
What is Hashing?
Hashing is a technique to map data of arbitrary size to fixed-size values. Think of it as a deterministic compression function that always maps the same input to the same output.
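A minimal sketch of this idea in Python (SHA-256 is used purely as an illustration; any deterministic hash function works):

```python
import hashlib

def stable_hash(data: bytes, buckets: int) -> int:
    """Map arbitrary-size input to a fixed range [0, buckets - 1]."""
    digest = hashlib.sha256(data).digest()
    # Take the first 8 bytes as an integer, then reduce to the bucket range.
    return int.from_bytes(digest[:8], "big") % buckets

# Deterministic: the same input always lands in the same bucket.
assert stable_hash(b"hello", 100) == stable_hash(b"hello", 100)
```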
Integer Hashing
Simple Modulo Hash
The most basic hash function for integers:
PSEUDOCODE: Integer Hashing

// Simple Modulo Hash
hash_modulo(x, B):
    return x % B

// Multiplicative Hash (better distribution)
hash_mult(x, B):
    A = 0.6180339887   // (√5 - 1)/2, the golden-ratio conjugate
    return floor(B * ((x * A) % 1))

// MurmurHash3-style finalizer (for production)
hash_murmur(x, B):
    x ^= x >> 16
    x *= 0x85ebca6b
    x ^= x >> 13
    x *= 0xc2b2ae35
    x ^= x >> 16
    return x % B
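The three variants translate to Python like this (the murmur-style function uses 32-bit masking to emulate the integer overflow the pseudocode relies on):

```python
def hash_modulo(x: int, B: int) -> int:
    return x % B

def hash_mult(x: int, B: int) -> int:
    A = 0.6180339887  # (sqrt(5) - 1) / 2, the golden-ratio conjugate
    return int(B * ((x * A) % 1.0))

def hash_murmur(x: int, B: int) -> int:
    # 32-bit MurmurHash3 finalizer; the masks emulate 32-bit overflow.
    x &= 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x % B
```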
Visual Example: Integer Hashing
String Hashing
From Characters to Numbers
Strings need to be converted to numbers before hashing. Common approaches:
PSEUDOCODE: String Hashing

// Simple Sum (bad - order-insensitive, so anagrams collide)
hash_sum(str, B):
    sum = 0
    for char in str:
        sum += ASCII(char)
    return sum % B

// Polynomial Rolling Hash (better)
hash_poly(str, B):
    P = 31          // small prime
    hash = 0
    pow = 1
    for char in str:
        hash = (hash + ASCII(char) * pow) % B
        pow = (pow * P) % B
    return hash

// djb2 Hash (production quality)
hash_djb2(str, B):
    hash = 5381
    for char in str:
        hash = ((hash << 5) + hash) + ASCII(char)   // hash * 33 + char
    return hash % B
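In Python these become one-liners-plus-a-loop; note how the sum hash collides on anagrams while the polynomial hash distinguishes them:

```python
def hash_sum(s: str, B: int) -> int:
    # Order-insensitive: "abc" and "cba" collide.
    return sum(ord(c) for c in s) % B

def hash_poly(s: str, B: int, P: int = 31) -> int:
    # Polynomial rolling hash: position-sensitive, so anagrams differ.
    h, p = 0, 1
    for c in s:
        h = (h + ord(c) * p) % B
        p = (p * P) % B
    return h

def hash_djb2(s: str, B: int) -> int:
    h = 5381
    for c in s:
        h = (h * 33 + ord(c)) & 0xFFFFFFFF  # keep within 32 bits
    return h % B
```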
Visual Example: String Hashing
Hash Collisions
The Birthday Paradox
With just 23 people, there's a >50% chance that two share a birthday. The same math applies to hash buckets: with B buckets, expect the first collision after only about √B insertions - collisions are inevitable!
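The classic probability is easy to compute exactly (a short sketch; 365 "buckets" stand in for birthdays):

```python
def birthday_collision_prob(n: int, buckets: int = 365) -> float:
    """Probability that at least two of n items share a bucket,
    assuming each item lands in a uniformly random bucket."""
    p_unique = 1.0
    for i in range(n):
        p_unique *= (buckets - i) / buckets  # i-th item avoids the first i
    return 1.0 - p_unique
```

`birthday_collision_prob(23)` is just over 0.5, while `birthday_collision_prob(22)` is just under - 23 is exactly where the odds tip.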
Part 2: Advanced Hashing
Geospatial Hashing
Uber's H3: Hexagonal Hierarchical Spatial Index
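H3's hexagonal cells and 16 nested resolutions are beyond a few lines, but the core idea - nearby coordinates hash to the same cell ID - can be sketched with a square grid (a toy stand-in for illustration, not the H3 algorithm):

```python
def grid_hash(lat: float, lng: float, cell_deg: float = 0.01) -> tuple:
    """Bucket a coordinate into a square cell of side cell_deg degrees.
    H3 uses hexagons and a resolution hierarchy; this keeps only the
    essential property: nearby points share a cell ID."""
    return (int(lat // cell_deg), int(lng // cell_deg))
```

Two points a few meters apart land in the same cell, so "find nearby drivers" becomes a hash-table lookup on the cell ID rather than a scan over all coordinates.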
Locality Sensitive Hashing (LSH)
Finding Similar Items Fast
LSH is designed so similar items hash to the same buckets with high probability.
PSEUDOCODE: LSH for Similarity Search

// MinHash for Document Similarity
MinHash(document, k):                  // k hash functions
    shingles = extract_shingles(document, n=3)
    signature = []
    for i in 0..k-1:
        min_hash = ∞
        for shingle in shingles:
            hash_val = hash_i(shingle)
            min_hash = min(min_hash, hash_val)
        signature.append(min_hash)
    return signature

// LSH for Vector Similarity (Random Projection)
LSH_Vector(vector, random_planes):
    hash_bits = []
    for plane in random_planes:
        dot_product = dot(vector, plane.normal)
        if dot_product >= 0:
            hash_bits.append(1)
        else:
            hash_bits.append(0)
    return bits_to_int(hash_bits)

// Facebook Faiss-style Product Quantization
PQ_Hash(vector, lookup_tables):
    d = len(vector)
    m = len(lookup_tables)             // number of subvectors
    subvector_dim = d / m
    codes = []
    for i in 0..m-1:
        subvector = vector[i*subvector_dim : (i+1)*subvector_dim]
        nearest_prototype = find_nearest(subvector, lookup_tables[i])
        codes.append(nearest_prototype.id)
    return codes                       // compressed representation
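A runnable MinHash sketch in pure Python. The k hash functions are simulated by XOR-ing one base hash with k random masks - a common shortcut, not a formally min-wise independent family:

```python
import random

def shingles(text, n=3):
    """Overlapping character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, k=128, seed=0):
    """One value per simulated hash function:
    the minimum (masked) hash over all shingles."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(32) for _ in range(k)]  # k XOR masks = k hash fns

    def h(s):  # deterministic base hash (djb2)
        v = 5381
        for c in s:
            v = (v * 33 + ord(c)) & 0xFFFFFFFF
        return v

    return [min(h(s) ^ m for s in shingle_set) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate texts agree on most signature positions, so comparing two 128-entry signatures replaces comparing full shingle sets.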
What is Quantization?
Intuition: Quantization = lossy compression that groups similar values together.

Think of it like:
- Rounding numbers: 3.14159 → 3.14 (lose precision, save space)
- Color reduction: 16M colors → 256 colors (like GIF images)
- Vector quantization: [0.823, 0.419, ...] → lookup table ID #42

What's a "lookup table" in PQ?
- Think of it like a color palette with 256 colors
- Instead of storing millions of possible vectors...
- We pick the closest match from 256 "representative examples"
- Each lookup table is like a menu of pre-computed options

Product Quantization (PQ) in Action:
- Split: 512-dim vector → 8 chunks of 64 dims each
- Map: each chunk → nearest prototype from its lookup table
- Store: 8 prototype IDs instead of 512 floats
- Compression: 512 × 4 bytes → 8 × 1 byte = 256x smaller!

The Trade-off: Lose exact values, but preserve similarity relationships.
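The split/map/store steps above can be sketched in a few lines of pure Python. The tiny two-prototype codebooks here are illustrative stand-ins; real PQ (e.g. in Faiss) learns 256 prototypes per chunk via k-means:

```python
def nearest(code_book, sub):
    """Index of the closest prototype (squared Euclidean distance)."""
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, sub))
    return min(range(len(code_book)), key=lambda i: dist(code_book[i]))

def pq_encode(vector, code_books):
    """Split the vector into len(code_books) chunks; store one
    prototype index per chunk instead of the raw floats."""
    m = len(code_books)
    d = len(vector) // m
    return [nearest(code_books[i], vector[i * d:(i + 1) * d]) for i in range(m)]

def pq_decode(codes, code_books):
    """Approximate reconstruction: concatenate the chosen prototypes."""
    out = []
    for i, c in enumerate(codes):
        out.extend(code_books[i][c])
    return out
```

For example, with two chunks and two prototypes each, a 4-dim vector compresses to just two small integers, and decoding returns the nearest representable vector rather than the original.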
Visual: LSH for Similar Image Search
Vector/Embedding Hashing
Hashing High-Dimensional Data
Modern ML creates high-dimensional embeddings (512-2048 dims). We need efficient ways to hash them.
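The random-projection idea from the LSH pseudocode applies directly to embeddings. A pure-Python sketch (a real system would use numpy or Faiss for the dot products):

```python
import random

def random_planes(dim, num_planes, seed=0):
    """num_planes random hyperplane normals with components in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_bucket(vector, planes):
    """One bit per plane: which side of the hyperplane the vector falls on.
    Vectors at a small angle to each other tend to share a bit pattern."""
    bits = 0
    for normal in planes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits
```

With 8 planes every embedding maps to one of 256 buckets, and only vectors in the same bucket need a full similarity comparison. Note the hash depends only on direction: scaling a vector never changes its bucket.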
Real-World Applications
Facebook (Meta)
- Faiss library for billion-scale similarity search
- Used for finding similar posts, images, videos

Google
- YouTube recommendations via embedding similarity
- Image search using visual embeddings

Spotify
- Song recommendations via audio embeddings
- Playlist generation using LSH

OpenAI
- Semantic search over text embeddings
- Finding similar code snippets
Key Takeaways
- Basic hashing maps any data to a fixed range [0, B-1]
- Integer/string hashing forms the foundation of hash tables
- Collisions are inevitable - handle with chaining or open addressing
- Geospatial hashing (H3) enables location-based services
- LSH finds similar items without comparing everything
- Vector hashing makes ML embeddings searchable at scale
The Punchline: Hashing turns infinite input spaces into finite, manageable buckets while preserving the property you care about - exact lookups, similarity, or locality!