Hashing: From Data to Algorithms

Concept. A hash function maps a key to a fixed-size bucket id, so you can find or insert in expected O(1) by reading the page where the bucket lives.

Intuition. Hash-index the Listens table by user_id, and finding Mickey's 3 listens inside a billion-row table costs one IO: the read of the page where Mickey's bucket lives. No scan, no sort.

The Bridge from Hardware to Algorithms

We just spent an entire chapter proving that disks are slow and network access is costly. You now understand the physics!

For the rest of this course, we are going to use Algorithms and Data Structures to abstract those physical bottlenecks.

  • Ignore device differences: focus on IO patterns, not on which device serves them.

  • Count IOs, not milliseconds: an algorithm that reads 100 pages costs 100 IOs; an algorithm that reads a million pages costs 1,000,000 IOs.

  • The Goal: design algorithms that touch as few pages as possible.

When we say "Algorithm A costs 5N IOs" vs "Algorithm B costs N log N IOs", we are comparing how many times they hit the disk. Hashing is our first weapon for driving a lookup's cost down to a single expected IO.
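To make the counting concrete, here is a small sketch (my own illustration, not from the chapter) comparing the two cost formulas just mentioned:

```python
import math

def cost_a(n: int) -> int:
    """Algorithm A: 5N IOs (e.g., a handful of linear passes)."""
    return 5 * n

def cost_b(n: int) -> float:
    """Algorithm B: N log2 N IOs (e.g., sort-based processing)."""
    return n * math.log2(n)

# For tiny inputs B's log factor is below A's constant factor;
# once log2(N) exceeds 5 (N > 32), Algorithm A is cheaper.
for n in (16, 1_000, 1_000_000):
    print(n, cost_a(n), round(cost_b(n)))
```

The crossover sits exactly where log2 N = 5: constants matter as much as asymptotics when you are counting disk reads.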


Part 1: Basic Hashing

What is Hashing?

Hashing is your go-to method for turning data of any size into a neat, fixed-size package. It is a deterministic compression function: the same input always produces the same output.

  • O(1): average access time.

  • Deterministic: same input → same output.

  • Fixed Range: output falls in [0, B-1].

  • Uniform: keys spread evenly across buckets.


Integer Hashing

Simple Modulo Hash

The simplest way to hash integers:

// Simple Modulo Hash
hash_modulo(x, B):
    return x % B

// MurmurHash-style finalizer (better distribution in production)
// All arithmetic is on 32-bit unsigned integers (overflow wraps around)
hash_murmur(x, B):
    x ^= x >> 16
    x *= 0x85ebca6b
    x ^= x >> 13
    x *= 0xc2b2ae35
    x ^= x >> 16
    return x % B

Visual Example: Integer Hashing

Example: h(x) = x % 10

    Input integers: 42, 157, 1042, 89

    42   % 10 = 2
    157  % 10 = 7
    1042 % 10 = 2
    89   % 10 = 9

    Buckets [0-9]:  bucket 2 → {42, 1042},  bucket 7 → {157},  bucket 9 → {89}

    Collision: 42 and 1042 both hash to bucket 2.
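Both pseudocode hashes translate directly to Python. A sketch; the 32-bit masking is my addition, since Python integers are unbounded:

```python
def hash_modulo(x: int, B: int) -> int:
    """Simplest integer hash: reduce directly into B buckets."""
    return x % B

def hash_murmur(x: int, B: int) -> int:
    """MurmurHash3-style 32-bit finalizer, then reduce into B buckets."""
    M = 0xFFFFFFFF                     # mask to emulate 32-bit unsigned math
    x = (x ^ (x >> 16)) & M
    x = (x * 0x85EBCA6B) & M
    x = (x ^ (x >> 13)) & M
    x = (x * 0xC2B2AE35) & M
    x = (x ^ (x >> 16)) & M
    return x % B

# Reproduce the visual example: 42 and 1042 collide under x % 10
print(hash_modulo(42, 10), hash_modulo(1042, 10))   # both land in bucket 2
```

The finalizer's shifts and multiplies mix high and low bits, so nearby keys no longer land in nearby buckets the way they do under plain modulo.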

String Hashing

From Characters to Numbers

Strings must first become numbers before they can be hashed. Here's how:

// Simple Sum (Bad - many collisions)
hash_sum(str, B):
    sum = 0
    for char in str:
        sum += ASCII(char)
    return sum % B

// Polynomial Rolling Hash (Better)
hash_poly(str, B):
    P = 31    // small prime base
    hash = 0
    pow = 1
    for char in str:
        hash = (hash + ASCII(char) * pow) % B
        pow = (pow * P) % B
    return hash
// In practice: accumulate modulo a large prime M, then reduce by % B
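The same two string hashes in runnable Python. A sketch; the large prime bucket count B = 10**9 + 7 is my choice for illustration:

```python
def hash_sum(s: str, B: int) -> int:
    """Bad: anagrams always collide, since addition ignores order."""
    return sum(ord(c) for c in s) % B

def hash_poly(s: str, B: int) -> int:
    """Polynomial rolling hash with base P = 31: position matters."""
    P = 31
    h, pw = 0, 1
    for c in s:
        h = (h + ord(c) * pw) % B
        pw = (pw * P) % B
    return h

B = 10**9 + 7   # a large prime, chosen for illustration
print(hash_sum("listen", B) == hash_sum("silent", B))    # True: anagram collision
print(hash_poly("listen", B) == hash_poly("silent", B))  # False: order is encoded
```

The anagram pair makes the weakness of the sum hash visible: every permutation of a string hits the same bucket, while the polynomial hash weights each character by its position.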


Hash Collisions

The Birthday Paradox

Just as 23 people in a room make a shared birthday more likely than not, collisions become likely long before a hash table fills up. They are unavoidable; plan for them.
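The birthday numbers can be checked directly. A quick sketch, my own arithmetic, matching the classic 23-people result:

```python
import math

def p_collision(n: int, B: int) -> float:
    """Exact probability that n uniformly hashed items cause at
    least one collision among B buckets."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (B - i) / B      # i-th item avoids the i occupied buckets
    return 1.0 - p_distinct

print(round(p_collision(23, 365), 3))                 # classic paradox: just over 0.5
print(round(p_collision(1_000, 1_000_000), 3))        # ~sqrt(B) items: roughly 0.39
```

With B buckets, collision probability passes 50% at roughly 1.18·√B items, which is why the rule of thumb above says √B.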

  • Birthday bound: roughly √B items give a 50% collision chance.

  • Chaining: keep a linked list at each bucket.

  • Open Addressing: probe for the next empty slot.

  • Load Factor: keep n/B below ~0.75 before resizing.
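Chaining fits in a few lines. A minimal sketch (not production code; no resizing), using Python lists as the chains:

```python
class ChainedHashTable:
    """Hash table with separate chaining: each bucket holds
    a list of (key, value) pairs."""

    def __init__(self, B: int = 8):
        self.B = B
        self.buckets = [[] for _ in range(B)]

    def insert(self, key: int, value) -> None:
        bucket = self.buckets[key % self.B]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # duplicate key: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key: int):
        # Expected O(1): one short bucket scan if the load factor stays low
        for k, v in self.buckets[key % self.B]:
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.insert(42, "Mickey")
t.insert(1042, "Minnie")   # 42 and 1042 share bucket 2; chaining keeps both
print(t.lookup(42), t.lookup(1042), t.lookup(7))
```

Open addressing would instead store both pairs inline in the bucket array, probing forward from the collision slot; chaining is shown here because it is the shorter sketch.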


Part 2: Advanced Hashing

Locality-Sensitive Hashing (LSH)

Finding Similar Items Fast

LSH is your tool for ensuring similar items end up in the same bucket more often than not.

// MinHash for Document Similarity
MinHash(document, k):    // k hash functions
    shingles = extract_shingles(document, n=3)
    signature = []

    for i in 0..k-1:
        min_hash = ∞
        for shingle in shingles:
            hash_val = hash_i(shingle)
            min_hash = min(min_hash, hash_val)
        signature.append(min_hash)

    return signature

// LSH for Vector Similarity (Random Projection)
LSH_Vector(vector, num_planes):
    hash_bits = []

    for plane in random_planes:
        dot_product = dot(vector, plane.normal)
        if dot_product >= 0:
            hash_bits.append(1)
        else:
            hash_bits.append(0)

    return bits_to_int(hash_bits)

// Facebook Faiss-style Product Quantization
PQ_Hash(vector, lookup_tables):
    d = len(vector)
    m = len(lookup_tables)    // number of subvectors
    subvector_dim = d / m

    codes = []
    for i in 0..m-1:
        subvector = vector[i*subvector_dim : (i+1)*subvector_dim]
        nearest_prototype = find_nearest(subvector, lookup_tables[i])
        codes.append(nearest_prototype.id)

    return codes    // compressed representation
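The MinHash pseudocode above can be made concrete. A sketch assuming simple affine hash functions h_i(x) = (a_i·x + b_i) mod p over integer shingle ids (my choice; production systems use stronger hash families):

```python
import random

P = 2_147_483_647   # Mersenne prime modulus for the hash family

def make_hash_fns(k: int, seed: int = 0):
    """k random (a, b) pairs defining h(x) = (a*x + b) % P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash(shingles: set, fns) -> list:
    """One signature entry per hash function: the minimum over all shingles."""
    return [min((a * s + b) % P for s in shingles) for a, b in fns]

def estimated_jaccard(sig1: list, sig2: list) -> float:
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

fns = make_hash_fns(k=100)
a = {1, 2, 3, 4, 5, 6, 7, 8}
b = {1, 2, 3, 4, 5, 6, 9, 10}    # true Jaccard = 6/10 = 0.6
print(estimated_jaccard(minhash(a, fns), minhash(b, fns)))   # roughly 0.6
```

The key property: for any single hash function, the probability that two sets share the same minimum equals their Jaccard similarity, so averaging over k functions gives an unbiased estimate.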

What is Quantization?

Quantization compresses data by grouping similar values:

  • Rounding numbers: 3.14159 → 3.14

  • Color reduction: 16M colors → 256 colors

  • Vector quantization: [0.823, 0.419, ...] → Lookup Table ID #42

Example: Product Quantization (PQ) in Action:

  1. Split: 512-dim vector → 8 chunks of 64 dims each

  2. Map: Each chunk → nearest prototype from lookup table

  3. Store: 8 prototype IDs instead of 512 floats

  4. Compression: 512×4 bytes → 8×1 byte = 256x smaller!

The Trade-off: You lose exact values but keep the important similarities.
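The PQ pseudocode and the four steps above can be sketched end to end. A toy illustration, with tiny dimensions and made-up prototype tables purely to show the shape:

```python
import math

def find_nearest(subvector, prototypes) -> int:
    """Index of the prototype closest to the subvector (Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda i: math.dist(subvector, prototypes[i]))

def pq_encode(vector, lookup_tables) -> list:
    """Split the vector into m chunks; replace each chunk with the
    id of its nearest prototype."""
    m = len(lookup_tables)
    sub_dim = len(vector) // m
    return [
        find_nearest(vector[i * sub_dim:(i + 1) * sub_dim], lookup_tables[i])
        for i in range(m)
    ]

# Toy setup: 4-dim vectors, m = 2 subvectors, 2 prototypes per table
tables = [
    [(0.0, 0.0), (1.0, 1.0)],   # prototypes for dims 0-1
    [(0.0, 1.0), (1.0, 0.0)],   # prototypes for dims 2-3
]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], tables)
print(codes)   # [1, 0]: two small ids instead of four floats
```

Scale the same shape up to the chapter's numbers (512 dims, m = 8, 256 prototypes per table) and you store 8 one-byte ids instead of 2048 bytes of floats.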


Visual: LSH for Similar Image Search

LSH: similar images → same hash buckets.

    Feature vectors (512-dimensional embeddings):
        Cat1: [0.8, 0.7, 0.9, ...]
        Cat2: [0.7, 0.8, 0.8, ...]
        Dog:  [0.3, 0.9, 0.2, ...]
        Car:  [0.1, 0.0, 0.5, ...]

    4 random hyperplanes, each contributing one bit (above plane = 1, below = 0):
        Cat1: 1011 → bucket 11
        Cat2: 1011 → bucket 11
        Dog:  1001 → bucket 9
        Car:  0100 → bucket 4

    Query "find images similar to Cat1": check bucket 11 only → finds Cat2.
    Dog (bucket 9) and Car (bucket 4) are skipped. No need to compare
    against every image in the database.
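The hyperplane picture is easy to reproduce in code. A sketch of the random-projection hash from the Part 2 pseudocode, with seeded Gaussian planes (my illustration; the specific vectors are made up):

```python
import random

def make_planes(num_planes: int, dim: int, seed: int = 42):
    """Random hyperplane normals drawn from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_vector(vector, planes) -> int:
    """One bit per plane: 1 if the vector lies on the positive side, else 0."""
    bits = 0
    for normal in planes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

planes = make_planes(num_planes=4, dim=3)
cat1 = [0.8, 0.7, 0.9]
cat2 = [0.7, 0.8, 0.8]    # a near-duplicate: likely lands in the same bucket
dog  = [-0.3, 0.9, -0.2]
print(lsh_vector(cat1, planes), lsh_vector(cat2, planes), lsh_vector(dog, planes))
```

Nearby vectors fall on the same side of most planes, so their bit patterns, and hence their buckets, tend to agree; a vector pointing elsewhere flips several bits.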

Vector/Embedding Hashing

Hashing High-Dimensional Data

Machine learning churns out high-dimensional embeddings. Here's how to hash them efficiently.

  • Product Quantization: split the vector, quantize each part.

  • Random Projection: project to lower dimensions.

  • Learned Hashing: neural networks as hash functions.

Real-World Applications

Facebook (Meta)

  • Faiss library for billion-scale similarity search

  • Used for finding similar posts, images, videos

Google

  • YouTube recommendations via embedding similarity

  • Image search using visual embeddings

Spotify

  • Song recommendations via audio embeddings

  • Playlist generation using LSH

OpenAI

  • Semantic search over text embeddings

  • Finding similar code snippets


Key Takeaways

  1. Basic hashing maps any data to a fixed range [0, B-1].

  2. Integer/String hashing forms the backbone of hash tables.

  3. Collisions are inevitable; handle them with chaining or open addressing.

  4. Geospatial hashing (H3) powers location-based services.

  5. LSH finds similar items without exhaustive comparisons.

  6. Vector hashing makes ML embeddings searchable at scale.

The Punchline: Hashing transforms huge (even infinite) key spaces into finite, manageable buckets while preserving the property you care about: equality for exact lookups, or similarity for LSH.