Hashing: From Basics to Advanced

The Universal Data Structure Technique


Part 1: Basic Hashing

What is Hashing?

Hashing is your go-to method for turning data of any size into a neat, fixed-size value. It's a deterministic compression function: the same input always produces the same output.

O(1): average access time
Deterministic: same input → same output
Fixed range: output lands in [0, B-1]
Uniform: keys spread evenly across buckets

Integer Hashing

Simple Modulo Hash

The simplest way to hash integers:

// Simple Modulo Hash
hash_modulo(x, B):
    return x % B

// MurmurHash3-style finalizer ("fmix32") — better distribution in production
hash_murmur(x, B):
    x ^= x >> 16
    x *= 0x85ebca6b
    x ^= x >> 13
    x *= 0xc2b2ae35
    x ^= x >> 16
    return x % B
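A runnable Python version of both hashes; this is a sketch assuming 32-bit inputs — the constants are the finalization ("fmix32") steps of MurmurHash3, and the 0xFFFFFFFF masks emulate 32-bit overflow on top of Python's arbitrary-precision integers:

```python
MASK32 = 0xFFFFFFFF  # keep intermediate values in 32-bit range

def hash_modulo(x: int, B: int) -> int:
    """Simple modulo hash: maps x into [0, B-1]."""
    return x % B

def hash_murmur(x: int, B: int) -> int:
    """MurmurHash3 32-bit finalizer (fmix32), then reduce to [0, B-1]."""
    x &= MASK32
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & MASK32
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & MASK32
    x ^= x >> 16
    return x % B

# Sequential keys map to sequential buckets under plain modulo,
# but get scattered by the mixing steps:
print([hash_modulo(x, 8) for x in range(4)])   # [0, 1, 2, 3]
print([hash_murmur(x, 8) for x in range(4)])
```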

Visual Example: Integer Hashing


String Hashing

From Characters to Numbers

Strings must first become numbers before they can be hashed. Here's how:

// Simple Sum (Bad - many collisions)
hash_sum(str, B):
    sum = 0
    for char in str:
        sum += ASCII(char)
    return sum % B

// Polynomial Rolling Hash (Better)
hash_poly(str, B):
    P = 31    // prime number
    hash = 0
    pow = 1
    for char in str:
        hash = (hash + ASCII(char) * pow) % B
        pow = (pow * P) % B
    return hash
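The same two string hashes as runnable Python. One deviation from the pseudocode, worth flagging: the rolling hash here accumulates modulo a large prime M = 10^9 + 7 (an assumption, not in the original) and only reduces to B at the end, so the powers of P don't cycle inside a small table:

```python
def hash_sum(s: str, B: int) -> int:
    """Bad: anagrams like "abc" and "cab" always collide."""
    return sum(ord(c) for c in s) % B

def hash_poly(s: str, B: int, P: int = 31, M: int = 10**9 + 7) -> int:
    """Polynomial rolling hash: treats the string as digits in base P."""
    h, power = 0, 1
    for c in s:
        h = (h + ord(c) * power) % M
        power = (power * P) % M
    return h % B

print(hash_sum("abc", 64), hash_sum("cab", 64))    # same bucket: collision
print(hash_poly("abc", 64), hash_poly("cab", 64))  # order matters: different
```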

Visual Example: String Hashing


Hash Collisions

The Birthday Paradox

Just like 23 people in a room make a shared birthday more likely than not, hash collisions become inevitable surprisingly fast.

√B: insertions before a ~50% collision chance
Chaining: linked list at each bucket
Open addressing: probe for the next empty slot
Load factor: keep n/B below ~0.75
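A minimal chaining hash table in Python, sketching the bucket-of-lists idea; resizing once the load factor passes 0.75 is deliberately left out to keep the sketch short:

```python
class ChainedHashTable:
    """Each bucket holds a list of (key, value) pairs."""

    def __init__(self, B: int = 8):
        self.B = B
        self.buckets = [[] for _ in range(B)]
        self.n = 0  # number of stored keys

    def _bucket(self, key):
        return self.buckets[hash(key) % self.B]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key exists: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: append to the chain
        self.n += 1

    def get(self, key, default=None):
        for k, v in self._bucket(key):    # walk the chain for this bucket
            if k == key:
                return v
        return default

    def load_factor(self) -> float:
        return self.n / self.B

t = ChainedHashTable()
t.put("apple", 1)
t.put("banana", 2)
print(t.get("apple"))        # 1
print(t.load_factor())
```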

Part 2: Advanced Hashing

Geospatial Hashing

Uber's H3: Hexagonal Hierarchical Spatial Index
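H3 proper uses a hexagonal hierarchy of cells, but the underlying spatial-hashing idea — quantize a coordinate to a grid cell so that nearby points share a bucket — can be sketched with a plain square grid. The cell size and tuple IDs below are illustrative stand-ins, not H3's actual scheme:

```python
def grid_hash(lat: float, lng: float, cell_deg: float = 0.01) -> tuple:
    """Quantize (lat, lng) to a square grid cell of cell_deg degrees.

    Nearby points fall into the same cell, so a dict keyed by cell IDs
    can answer "what's near me?" without scanning every point.
    """
    return (int(lat // cell_deg), int(lng // cell_deg))

# Two points a few hundred meters apart share a cell; a distant one doesn't.
sf_a = grid_hash(37.7749, -122.4194)
sf_b = grid_hash(37.7750, -122.4195)
nyc  = grid_hash(40.7128, -74.0060)
print(sf_a == sf_b, sf_a == nyc)   # True False
```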


Locality Sensitive Hashing (LSH)

Finding Similar Items Fast

LSH flips the usual goal: instead of scattering keys evenly, it is built so that similar items land in the same bucket with high probability.

// MinHash for Document Similarity
MinHash(document, k):    // k hash functions
    shingles = extract_shingles(document, n=3)
    signature = []

    for i in 0..k-1:
        min_hash = ∞
        for shingle in shingles:
            hash_val = hash_i(shingle)
            min_hash = min(min_hash, hash_val)
        signature.append(min_hash)

    return signature
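A runnable MinHash sketch; Python's built-in hash salted with the function index i stands in for the k independent hash functions (an illustrative shortcut, not how production MinHash picks its functions):

```python
def shingles(text: str, n: int = 3) -> set:
    """Character n-grams of the document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text: str, k: int = 16) -> list:
    """For each of k hash functions, keep the minimum hash of any shingle."""
    sig = []
    for i in range(k):
        sig.append(min(hash((i, s)) for s in shingles(text)))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumped over the lazy dog")
c = minhash_signature("completely unrelated text about hashing")
print(estimated_jaccard(a, b), estimated_jaccard(a, c))
```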

// LSH for Vector Similarity (Random Projection)
LSH_Vector(vector, num_planes):
    hash_bits = []

    for plane in random_planes:
        dot_product = dot(vector, plane.normal)
        if dot_product >= 0:
            hash_bits.append(1)
        else:
            hash_bits.append(0)

    return bits_to_int(hash_bits)
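The random-projection hash in runnable Python; each random hyperplane contributes one bit of the code, so vectors pointing in similar directions tend to share codes. The plane count, dimension, and seed below are illustrative:

```python
import random

def make_planes(num_planes: int, dim: int, seed: int = 42) -> list:
    """Random hyperplane normals through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_vector(vector: list, planes: list) -> int:
    """One bit per plane: which side of the hyperplane the vector falls on."""
    bits = 0
    for normal in planes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

planes = make_planes(num_planes=8, dim=4)
print(lsh_vector([1.0, 0.2, 0.0, 0.5], planes))
print(lsh_vector([1.1, 0.2, 0.1, 0.5], planes))    # similar vector, similar code
print(lsh_vector([-1.0, -0.2, 0.0, -0.5], planes)) # opposite: complement code
```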

// Facebook Faiss-style Product Quantization
PQ_Hash(vector, lookup_tables):
    d = len(vector)
    m = len(lookup_tables)    // number of subvectors
    subvector_dim = d / m

    codes = []
    for i in 0..m-1:
        subvector = vector[i*subvector_dim : (i+1)*subvector_dim]
        nearest_prototype = find_nearest(subvector, lookup_tables[i])
        codes.append(nearest_prototype.id)

    return codes    // compressed representation

What is Quantization?

Quantization compresses data by grouping similar values:

Example: Product Quantization (PQ) in Action:

  1. Split: 512-dim vector → 8 chunks of 64 dims each

  2. Map: Each chunk → nearest prototype from lookup table

  3. Store: 8 prototype IDs instead of 512 floats

  4. Compression: 512×4 bytes → 8×1 byte = 256x smaller!

The Trade-off: You lose exact values but keep the important similarities.
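The four PQ steps above, as a pure-Python sketch. Real systems such as Faiss learn the prototype codebooks with k-means; the tiny hand-picked codebooks here exist only to show the encode/decode round trip:

```python
def nearest(sub: list, prototypes: list) -> int:
    """Index of the prototype closest to the subvector (squared Euclidean)."""
    return min(range(len(prototypes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(sub, prototypes[j])))

def pq_encode(vector: list, codebooks: list) -> list:
    """Split the vector into len(codebooks) chunks; store one prototype ID each."""
    m = len(codebooks)
    sub_dim = len(vector) // m
    return [nearest(vector[i * sub_dim:(i + 1) * sub_dim], codebooks[i])
            for i in range(m)]

def pq_decode(codes: list, codebooks: list) -> list:
    """Approximate reconstruction: concatenate the chosen prototypes."""
    return [x for code, book in zip(codes, codebooks) for x in book[code]]

# Toy setup: 4-dim vector → 2 subvectors of 2 dims, 4 prototypes per codebook.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
    [[0.5, 0.5], [2.0, 2.0], [0.0, 2.0], [2.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 1.9, 2.2], codebooks)
print(codes)                       # [1, 1] — two small IDs instead of 4 floats
print(pq_decode(codes, codebooks)) # [1.0, 1.0, 2.0, 2.0] — close, not exact
```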


Visual: LSH for Similar Image Search


Vector/Embedding Hashing

Hashing High-Dimensional Data

Machine learning churns out high-dimensional embeddings. Here's how to hash them efficiently.

Product Quantization: split the vector, quantize each part
Random Projection: project to lower dimensions, one bit per hyperplane
Learned Hashing: neural networks as hash functions

Real-World Applications

Facebook (Meta)

Google

Spotify

OpenAI


Key Takeaways

  1. Basic hashing maps any data to a fixed range [0, B-1].

  2. Integer/String hashing forms the backbone of hash tables.

  3. Collisions are inevitable; handle them with chaining or open addressing.

  4. Geospatial hashing (H3) powers location-based services.

  5. LSH finds similar items without exhaustive comparisons.

  6. Vector hashing makes ML embeddings searchable at scale.

The Punchline: Hashing transforms infinite spaces into finite, manageable buckets while preserving crucial properties like exact lookups and similarity.