Hashing: From Data to Algorithms

Concept. A hash function maps a key to a fixed-size bucket id, so you can find or insert in expected O(1) by reading the page where the bucket lives.

Intuition. Hash-index the Listens table by user_id, and finding Mickey's 3 listens inside a billion-row table costs one IO: the read of the page where Mickey's bucket lives. No scan, no sort.

The Bridge from Hardware to Algorithms

We just spent an entire chapter proving that disks are slow and network access is costly. You now understand the physics!

For the rest of this course, we are going to use Algorithms and Data Structures to abstract those physical bottlenecks.

  • Ignore device differences: focus on IO patterns, not on which device serves them.

  • Count IOs, not milliseconds: an algorithm that reads 100 pages costs 100 IOs; an algorithm that reads a million pages costs 1,000,000 IOs.

  • The Goal: design algorithms that touch as few pages as possible.

When we say "Algorithm A costs 5N IOs" vs "Algorithm B costs N log N IOs", we are comparing how many times they hit the disk. Hashing is our first weapon for driving a lookup's cost down to a single expected IO.
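To make the counting concrete, here is a small sketch (my own illustration, not from the chapter) comparing the two cost formulas just mentioned:

```python
import math

def cost_a(n: int) -> int:
    """Algorithm A: 5N IOs (e.g., a handful of linear passes)."""
    return 5 * n

def cost_b(n: int) -> float:
    """Algorithm B: N log2 N IOs (e.g., sort-based processing)."""
    return n * math.log2(n)

# For tiny inputs B's log factor is below A's constant factor;
# once log2(N) exceeds 5 (N > 32), Algorithm A is cheaper.
for n in (16, 1_000, 1_000_000):
    print(n, cost_a(n), round(cost_b(n)))
```

The crossover sits exactly where log2 N = 5: constants matter as much as asymptotics when you are counting disk reads.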


Part 1: Basic Hashing

What is Hashing?

Hashing is your go-to method for turning data of any size into a neat, fixed-size package. It is a deterministic compression function: the same input always produces the same output.

  • O(1): average access time.

  • Deterministic: same input → same output.

  • Fixed Range: output falls in [0, B-1].

  • Uniform: keys spread evenly across buckets.


Integer Hashing

Simple Modulo Hash

The simplest way to hash integers:

// Simple Modulo Hash
hash_modulo(x, B):
    return x % B

// MurmurHash-style finalizer (better distribution in production)
// All arithmetic is on 32-bit unsigned integers (overflow wraps around)
hash_murmur(x, B):
    x ^= x >> 16
    x *= 0x85ebca6b
    x ^= x >> 13
    x *= 0xc2b2ae35
    x ^= x >> 16
    return x % B

Visual Example: Integer Hashing

Example: h(x) = x % 10

    Input integers: 42, 157, 1042, 89

    42   % 10 = 2
    157  % 10 = 7
    1042 % 10 = 2
    89   % 10 = 9

    Buckets [0-9]:  bucket 2 → {42, 1042},  bucket 7 → {157},  bucket 9 → {89}

    Collision: 42 and 1042 both hash to bucket 2.
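Both pseudocode hashes translate directly to Python. A sketch; the 32-bit masking is my addition, since Python integers are unbounded:

```python
def hash_modulo(x: int, B: int) -> int:
    """Simplest integer hash: reduce directly into B buckets."""
    return x % B

def hash_murmur(x: int, B: int) -> int:
    """MurmurHash3-style 32-bit finalizer, then reduce into B buckets."""
    M = 0xFFFFFFFF                     # mask to emulate 32-bit unsigned math
    x = (x ^ (x >> 16)) & M
    x = (x * 0x85EBCA6B) & M
    x = (x ^ (x >> 13)) & M
    x = (x * 0xC2B2AE35) & M
    x = (x ^ (x >> 16)) & M
    return x % B

# Reproduce the visual example: 42 and 1042 collide under x % 10
print(hash_modulo(42, 10), hash_modulo(1042, 10))   # both land in bucket 2
```

The finalizer's shifts and multiplies mix high and low bits, so nearby keys no longer land in nearby buckets the way they do under plain modulo.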

String Hashing

From Characters to Numbers

Strings must first become numbers before they can be hashed. Here's how:

// Simple Sum (Bad - many collisions)
hash_sum(str, B):
    sum = 0
    for char in str:
        sum += ASCII(char)
    return sum % B

// Polynomial Rolling Hash (Better)
hash_poly(str, B):
    P = 31    // small prime base
    hash = 0
    pow = 1
    for char in str:
        hash = (hash + ASCII(char) * pow) % B
        pow = (pow * P) % B
    return hash
// In practice: accumulate modulo a large prime M, then reduce by % B
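The same two string hashes in runnable Python. A sketch; the large prime bucket count B = 10**9 + 7 is my choice for illustration:

```python
def hash_sum(s: str, B: int) -> int:
    """Bad: anagrams always collide, since addition ignores order."""
    return sum(ord(c) for c in s) % B

def hash_poly(s: str, B: int) -> int:
    """Polynomial rolling hash with base P = 31: position matters."""
    P = 31
    h, pw = 0, 1
    for c in s:
        h = (h + ord(c) * pw) % B
        pw = (pw * P) % B
    return h

B = 10**9 + 7   # a large prime, chosen for illustration
print(hash_sum("listen", B) == hash_sum("silent", B))    # True: anagram collision
print(hash_poly("listen", B) == hash_poly("silent", B))  # False: order is encoded
```

The anagram pair makes the weakness of the sum hash visible: every permutation of a string hits the same bucket, while the polynomial hash weights each character by its position.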


Hash Collisions

The Birthday Paradox

Just as 23 people in a room make a shared birthday more likely than not, collisions become likely long before a hash table fills up. They are unavoidable; plan for them.
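The birthday numbers can be checked directly. A quick sketch, my own arithmetic, matching the classic 23-people result:

```python
import math

def p_collision(n: int, B: int) -> float:
    """Exact probability that n uniformly hashed items cause at
    least one collision among B buckets."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (B - i) / B      # i-th item avoids the i occupied buckets
    return 1.0 - p_distinct

print(round(p_collision(23, 365), 3))                 # classic paradox: just over 0.5
print(round(p_collision(1_000, 1_000_000), 3))        # ~sqrt(B) items: roughly 0.39
```

With B buckets, collision probability passes 50% at roughly 1.18·√B items, which is why the rule of thumb above says √B.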

  • Birthday bound: roughly √B items give a 50% collision chance.

  • Chaining: keep a linked list at each bucket.

  • Open Addressing: probe for the next empty slot.

  • Load Factor: keep n/B below ~0.75 before resizing.
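Chaining fits in a few lines. A minimal sketch (not production code; no resizing), using Python lists as the chains:

```python
class ChainedHashTable:
    """Hash table with separate chaining: each bucket holds
    a list of (key, value) pairs."""

    def __init__(self, B: int = 8):
        self.B = B
        self.buckets = [[] for _ in range(B)]

    def insert(self, key: int, value) -> None:
        bucket = self.buckets[key % self.B]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # duplicate key: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key: int):
        # Expected O(1): one short bucket scan if the load factor stays low
        for k, v in self.buckets[key % self.B]:
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.insert(42, "Mickey")
t.insert(1042, "Minnie")   # 42 and 1042 share bucket 2; chaining keeps both
print(t.lookup(42), t.lookup(1042), t.lookup(7))
```

Open addressing would instead store both pairs inline in the bucket array, probing forward from the collision slot; chaining is shown here because it is the shorter sketch.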


Part 2: Advanced Hashing

Locality-Sensitive Hashing (LSH)

Finding Similar Items Fast

LSH is your tool for ensuring similar items end up in the same bucket more often than not.

// MinHash for Document Similarity
MinHash(document, k):    // k hash functions
    shingles = extract_shingles(document, n=3)
    signature = []

    for i in 0..k-1:
        min_hash = ∞
        for shingle in shingles:
            hash_val = hash_i(shingle)
            min_hash = min(min_hash, hash_val)
        signature.append(min_hash)

    return signature

// LSH for Vector Similarity (Random Projection)
LSH_Vector(vector, num_planes):
    hash_bits = []

    for plane in random_planes:
        dot_product = dot(vector, plane.normal)
        if dot_product >= 0:
            hash_bits.append(1)
        else:
            hash_bits.append(0)

    return bits_to_int(hash_bits)

// Facebook Faiss-style Product Quantization
PQ_Hash(vector, lookup_tables):
    d = len(vector)
    m = len(lookup_tables)    // number of subvectors
    subvector_dim = d / m

    codes = []
    for i in 0..m-1:
        subvector = vector[i*subvector_dim : (i+1)*subvector_dim]
        nearest_prototype = find_nearest(subvector, lookup_tables[i])
        codes.append(nearest_prototype.id)

    return codes    // compressed representation
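The MinHash pseudocode above can be made concrete. A sketch assuming simple affine hash functions h_i(x) = (a_i·x + b_i) mod p over integer shingle ids (my choice; production systems use stronger hash families):

```python
import random

P = 2_147_483_647   # Mersenne prime modulus for the hash family

def make_hash_fns(k: int, seed: int = 0):
    """k random (a, b) pairs defining h(x) = (a*x + b) % P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash(shingles: set, fns) -> list:
    """One signature entry per hash function: the minimum over all shingles."""
    return [min((a * s + b) % P for s in shingles) for a, b in fns]

def estimated_jaccard(sig1: list, sig2: list) -> float:
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

fns = make_hash_fns(k=100)
a = {1, 2, 3, 4, 5, 6, 7, 8}
b = {1, 2, 3, 4, 5, 6, 9, 10}    # true Jaccard = 6/10 = 0.6
print(estimated_jaccard(minhash(a, fns), minhash(b, fns)))   # roughly 0.6
```

The key property: for any single hash function, the probability that two sets share the same minimum equals their Jaccard similarity, so averaging over k functions gives an unbiased estimate.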

What is Quantization?

Quantization compresses data by grouping similar values:

  • Rounding numbers: 3.14159 → 3.14

  • Color reduction: 16M colors → 256 colors

  • Vector quantization: [0.823, 0.419, ...] → Lookup Table ID #42

Example: Product Quantization (PQ) in Action:

  1. Split: 512-dim vector → 8 chunks of 64 dims each

  2. Map: Each chunk → nearest prototype from lookup table

  3. Store: 8 prototype IDs instead of 512 floats

  4. Compression: 512×4 bytes → 8×1 byte = 256x smaller!

The Trade-off: You lose exact values but keep the important similarities.
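The PQ pseudocode and the four steps above can be sketched end to end. A toy illustration, with tiny dimensions and made-up prototype tables purely to show the shape:

```python
import math

def find_nearest(subvector, prototypes) -> int:
    """Index of the prototype closest to the subvector (Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda i: math.dist(subvector, prototypes[i]))

def pq_encode(vector, lookup_tables) -> list:
    """Split the vector into m chunks; replace each chunk with the
    id of its nearest prototype."""
    m = len(lookup_tables)
    sub_dim = len(vector) // m
    return [
        find_nearest(vector[i * sub_dim:(i + 1) * sub_dim], lookup_tables[i])
        for i in range(m)
    ]

# Toy setup: 4-dim vectors, m = 2 subvectors, 2 prototypes per table
tables = [
    [(0.0, 0.0), (1.0, 1.0)],   # prototypes for dims 0-1
    [(0.0, 1.0), (1.0, 0.0)],   # prototypes for dims 2-3
]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], tables)
print(codes)   # [1, 0]: two small ids instead of four floats
```

Scale the same shape up to the chapter's numbers (512 dims, m = 8, 256 prototypes per table) and you store 8 one-byte ids instead of 2048 bytes of floats.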


Visual: LSH for Similar Image Search

LSH: similar images → same hash buckets.

    Feature vectors (512-dimensional embeddings):
        Cat1: [0.8, 0.7, 0.9, ...]
        Cat2: [0.7, 0.8, 0.8, ...]
        Dog:  [0.3, 0.9, 0.2, ...]
        Car:  [0.1, 0.0, 0.5, ...]

    4 random hyperplanes, each contributing one bit (above plane = 1, below = 0):
        Cat1: 1011 → bucket 11
        Cat2: 1011 → bucket 11
        Dog:  1001 → bucket 9
        Car:  0100 → bucket 4

    Query "find images similar to Cat1": check bucket 11 only → finds Cat2.
    Dog (bucket 9) and Car (bucket 4) are skipped. No need to compare
    against every image in the database.
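The hyperplane picture is easy to reproduce in code. A sketch of the random-projection hash from the Part 2 pseudocode, with seeded Gaussian planes (my illustration; the specific vectors are made up):

```python
import random

def make_planes(num_planes: int, dim: int, seed: int = 42):
    """Random hyperplane normals drawn from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_vector(vector, planes) -> int:
    """One bit per plane: 1 if the vector lies on the positive side, else 0."""
    bits = 0
    for normal in planes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

planes = make_planes(num_planes=4, dim=3)
cat1 = [0.8, 0.7, 0.9]
cat2 = [0.7, 0.8, 0.8]    # a near-duplicate: likely lands in the same bucket
dog  = [-0.3, 0.9, -0.2]
print(lsh_vector(cat1, planes), lsh_vector(cat2, planes), lsh_vector(dog, planes))
```

Nearby vectors fall on the same side of most planes, so their bit patterns, and hence their buckets, tend to agree; a vector pointing elsewhere flips several bits.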

Vector/Embedding Hashing

Hashing High-Dimensional Data

Machine learning churns out high-dimensional embeddings. Here's how to hash them efficiently.

  • Product Quantization: split the vector, quantize each part.

  • Random Projection: project to lower dimensions.

  • Learned Hashing: neural networks as hash functions.

Real-World Applications

Facebook (Meta)

  • Faiss library for billion-scale similarity search

  • Used for finding similar posts, images, videos

Google

  • YouTube recommendations via embedding similarity

  • Image search using visual embeddings

Spotify

  • Song recommendations via audio embeddings

  • Playlist generation using LSH

OpenAI

  • Semantic search over text embeddings

  • Finding similar code snippets


Key Takeaways

  1. Basic hashing maps any data to a fixed range [0, B-1].

  2. Integer/String hashing forms the backbone of hash tables.

  3. Collisions are inevitable; handle them with chaining or open addressing.

  4. Geospatial hashing (H3) powers location-based services.

  5. LSH finds similar items without exhaustive comparisons.

  6. Vector hashing makes ML embeddings searchable at scale.

The Punchline: Hashing transforms huge (even infinite) key spaces into finite, manageable buckets while preserving the property you care about: equality for exact lookups, or similarity for LSH.