Hashing: From Basics to Advanced
The Universal Data Structure Technique
Part 1: Basic Hashing
What is Hashing?
Hashing is a technique to map data of arbitrary size to fixed-size values. Think of it as a deterministic compression function that always maps the same input to the same output.
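A minimal sketch of this idea in Python (SHA-256 is used purely as an illustration; any deterministic hash function works):

```python
import hashlib

def stable_hash(data: bytes, buckets: int) -> int:
    """Map arbitrary-size input to a fixed range [0, buckets - 1]."""
    digest = hashlib.sha256(data).digest()
    # Take the first 8 bytes as an integer, then reduce to the bucket range.
    return int.from_bytes(digest[:8], "big") % buckets

# Deterministic: the same input always lands in the same bucket.
assert stable_hash(b"hello", 100) == stable_hash(b"hello", 100)
```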
Integer Hashing
Simple Modulo Hash
The most basic hash function for integers:
PSEUDOCODE: Integer Hashing

// Simple Modulo Hash
hash_modulo(x, B):
    return x % B

// Multiplicative Hash (better distribution)
hash_mult(x, B):
    A = 0.6180339887   // (√5 - 1)/2, the golden-ratio conjugate
    return floor(B * ((x * A) % 1))

// MurmurHash3-style finalizer (for production)
hash_murmur(x, B):
    x ^= x >> 16
    x *= 0x85ebca6b
    x ^= x >> 13
    x *= 0xc2b2ae35
    x ^= x >> 16
    return x % B
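The three variants translate to Python like this (the murmur-style function uses 32-bit masking to emulate the integer overflow the pseudocode relies on):

```python
def hash_modulo(x: int, B: int) -> int:
    return x % B

def hash_mult(x: int, B: int) -> int:
    A = 0.6180339887  # (sqrt(5) - 1) / 2, the golden-ratio conjugate
    return int(B * ((x * A) % 1.0))

def hash_murmur(x: int, B: int) -> int:
    # 32-bit MurmurHash3 finalizer; the masks emulate 32-bit overflow.
    x &= 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x % B
```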
Visual Example: Integer Hashing
String Hashing
From Characters to Numbers
Strings need to be converted to numbers before hashing. Common approaches:
PSEUDOCODE: String Hashing

// Simple Sum (bad - order-insensitive, so anagrams collide)
hash_sum(str, B):
    sum = 0
    for char in str:
        sum += ASCII(char)
    return sum % B

// Polynomial Rolling Hash (better)
hash_poly(str, B):
    P = 31          // small prime
    hash = 0
    pow = 1
    for char in str:
        hash = (hash + ASCII(char) * pow) % B
        pow = (pow * P) % B
    return hash

// djb2 Hash (production quality)
hash_djb2(str, B):
    hash = 5381
    for char in str:
        hash = ((hash << 5) + hash) + ASCII(char)   // hash * 33 + char
    return hash % B
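In Python these become one-liners-plus-a-loop; note how the sum hash collides on anagrams while the polynomial hash distinguishes them:

```python
def hash_sum(s: str, B: int) -> int:
    # Order-insensitive: "abc" and "cba" collide.
    return sum(ord(c) for c in s) % B

def hash_poly(s: str, B: int, P: int = 31) -> int:
    # Polynomial rolling hash: position-sensitive, so anagrams differ.
    h, p = 0, 1
    for c in s:
        h = (h + ord(c) * p) % B
        p = (p * P) % B
    return h

def hash_djb2(s: str, B: int) -> int:
    h = 5381
    for c in s:
        h = (h * 33 + ord(c)) & 0xFFFFFFFF  # keep within 32 bits
    return h % B
```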
Visual Example: String Hashing
Hash Collisions
The Birthday Paradox
With just 23 people, there's a >50% chance that two share a birthday. The same math applies to hash buckets: with B buckets, expect the first collision after only about √B insertions - collisions are inevitable!
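The classic probability is easy to compute exactly (a short sketch; 365 "buckets" stand in for birthdays):

```python
def birthday_collision_prob(n: int, buckets: int = 365) -> float:
    """Probability that at least two of n items share a bucket,
    assuming each item lands in a uniformly random bucket."""
    p_unique = 1.0
    for i in range(n):
        p_unique *= (buckets - i) / buckets  # i-th item avoids the first i
    return 1.0 - p_unique
```

`birthday_collision_prob(23)` is just over 0.5, while `birthday_collision_prob(22)` is just under - 23 is exactly where the odds tip.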
Part 2: Advanced Hashing
Geospatial Hashing
Uber's H3: Hexagonal Hierarchical Spatial Index
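H3's hexagonal cells and 16 nested resolutions are beyond a few lines, but the core idea - nearby coordinates hash to the same cell ID - can be sketched with a square grid (a toy stand-in for illustration, not the H3 algorithm):

```python
def grid_hash(lat: float, lng: float, cell_deg: float = 0.01) -> tuple:
    """Bucket a coordinate into a square cell of side cell_deg degrees.
    H3 uses hexagons and a resolution hierarchy; this keeps only the
    essential property: nearby points share a cell ID."""
    return (int(lat // cell_deg), int(lng // cell_deg))
```

Two points a few meters apart land in the same cell, so "find nearby drivers" becomes a hash-table lookup on the cell ID rather than a scan over all coordinates.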
Locality Sensitive Hashing (LSH)
Finding Similar Items Fast
LSH is designed so similar items hash to the same buckets with high probability.
PSEUDOCODE: LSH for Similarity Search

// MinHash for Document Similarity
MinHash(document, k):                  // k hash functions
    shingles = extract_shingles(document, n=3)
    signature = []
    for i in 0..k-1:
        min_hash = ∞
        for shingle in shingles:
            hash_val = hash_i(shingle)
            min_hash = min(min_hash, hash_val)
        signature.append(min_hash)
    return signature

// LSH for Vector Similarity (Random Projection)
LSH_Vector(vector, random_planes):
    hash_bits = []
    for plane in random_planes:
        dot_product = dot(vector, plane.normal)
        if dot_product >= 0:
            hash_bits.append(1)
        else:
            hash_bits.append(0)
    return bits_to_int(hash_bits)

// Facebook Faiss-style Product Quantization
PQ_Hash(vector, lookup_tables):
    d = len(vector)
    m = len(lookup_tables)             // number of subvectors
    subvector_dim = d / m
    codes = []
    for i in 0..m-1:
        subvector = vector[i*subvector_dim : (i+1)*subvector_dim]
        nearest_prototype = find_nearest(subvector, lookup_tables[i])
        codes.append(nearest_prototype.id)
    return codes                       // compressed representation
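A runnable MinHash sketch in pure Python. The k hash functions are simulated by XOR-ing one base hash with k random masks - a common shortcut, not a formally min-wise independent family:

```python
import random

def shingles(text, n=3):
    """Overlapping character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, k=128, seed=0):
    """One value per simulated hash function:
    the minimum (masked) hash over all shingles."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(32) for _ in range(k)]  # k XOR masks = k hash fns

    def h(s):  # deterministic base hash (djb2)
        v = 5381
        for c in s:
            v = (v * 33 + ord(c)) & 0xFFFFFFFF
        return v

    return [min(h(s) ^ m for s in shingle_set) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate texts agree on most signature positions, so comparing two 128-entry signatures replaces comparing full shingle sets.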
What is Quantization?
Intuition: Quantization = lossy compression that groups similar values together.

Think of it like:
- Rounding numbers: 3.14159 → 3.14 (lose precision, save space)
- Color reduction: 16M colors → 256 colors (like GIF images)
- Vector quantization: [0.823, 0.419, ...] → lookup table ID #42

What's a "lookup table" in PQ?
- Think of it like a color palette with 256 colors
- Instead of storing millions of possible vectors...
- We pick the closest match from 256 "representative examples"
- Each lookup table is like a menu of pre-computed options

Product Quantization (PQ) in Action:
- Split: 512-dim vector → 8 chunks of 64 dims each
- Map: each chunk → nearest prototype from its lookup table
- Store: 8 prototype IDs instead of 512 floats
- Compression: 512 × 4 bytes → 8 × 1 byte = 256x smaller!

The Trade-off: Lose exact values, but preserve similarity relationships.
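The split/map/store steps above can be sketched in a few lines of pure Python. The tiny two-prototype codebooks here are illustrative stand-ins; real PQ (e.g. in Faiss) learns 256 prototypes per chunk via k-means:

```python
def nearest(code_book, sub):
    """Index of the closest prototype (squared Euclidean distance)."""
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, sub))
    return min(range(len(code_book)), key=lambda i: dist(code_book[i]))

def pq_encode(vector, code_books):
    """Split the vector into len(code_books) chunks; store one
    prototype index per chunk instead of the raw floats."""
    m = len(code_books)
    d = len(vector) // m
    return [nearest(code_books[i], vector[i * d:(i + 1) * d]) for i in range(m)]

def pq_decode(codes, code_books):
    """Approximate reconstruction: concatenate the chosen prototypes."""
    out = []
    for i, c in enumerate(codes):
        out.extend(code_books[i][c])
    return out
```

For example, with two chunks and two prototypes each, a 4-dim vector compresses to just two small integers, and decoding returns the nearest representable vector rather than the original.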
Visual: LSH for Similar Image Search
Vector/Embedding Hashing
Hashing High-Dimensional Data
Modern ML creates high-dimensional embeddings (512-2048 dims). We need efficient ways to hash them.
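The random-projection idea from the LSH pseudocode applies directly to embeddings. A pure-Python sketch (a real system would use numpy or Faiss for the dot products):

```python
import random

def random_planes(dim, num_planes, seed=0):
    """num_planes random hyperplane normals with components in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_bucket(vector, planes):
    """One bit per plane: which side of the hyperplane the vector falls on.
    Vectors at a small angle to each other tend to share a bit pattern."""
    bits = 0
    for normal in planes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits
```

With 8 planes every embedding maps to one of 256 buckets, and only vectors in the same bucket need a full similarity comparison. Note the hash depends only on direction: scaling a vector never changes its bucket.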
Real-World Applications
Facebook (Meta)
- Faiss library for billion-scale similarity search
- Used for finding similar posts, images, videos

Google
- YouTube recommendations via embedding similarity
- Image search using visual embeddings

Spotify
- Song recommendations via audio embeddings
- Playlist generation using LSH

OpenAI
- Semantic search over text embeddings
- Finding similar code snippets
Key Takeaways
- Basic hashing maps any data to a fixed range [0, B-1]
- Integer/string hashing forms the foundation of hash tables
- Collisions are inevitable - handle with chaining or open addressing
- Geospatial hashing (H3) enables location-based services
- LSH finds similar items without comparing everything
- Vector hashing makes ML embeddings searchable at scale
The Punchline: Hashing turns infinite input spaces into finite, manageable buckets while preserving the property you care about - exact lookups, similarity, or locality!