Hashing: From Basics to Advanced

The Universal Data Structure Technique

[Figure: Hash functions map an infinite input space (integers like 42, strings like "hello world", geo coordinates, vectors, arbitrary objects) through h(x) into a finite hash space of buckets [0, B-1]. Properties: ✓ Deterministic (same input → same output) ✓ Uniform (even distribution) ✓ Fast (O(1) computation) ✓ Fixed range [0, B-1].]

Part 1: Basic Hashing

What is Hashing?

Hashing is a technique to map data of arbitrary size to fixed-size values. Think of it as a deterministic compression function that always maps the same input to the same output.

O(1): average access time
Deterministic: same input → same output
Fixed Range: output in [0, B-1]
Uniform: even distribution across buckets
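
These properties are easy to check directly. Below is a minimal Python sketch (the bucket helper and the choice of SHA-256 are illustrative, not any particular library's API); a stable digest is used because Python's built-in hash() is salted per process for strings:

PYTHON: Checking the Properties

import hashlib

def bucket(key, B):
    # Deterministic: the same key always yields the same digest
    digest = hashlib.sha256(key.encode()).digest()
    # Fixed range: reduce the digest into [0, B-1]
    return int.from_bytes(digest[:8], "big") % B

print(bucket("hello world", 16))   # same output on every run
print(bucket("hello world", 16))   # deterministic
print(bucket("hello worlds", 16))  # small change, (usually) a different bucket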

Integer Hashing

Simple Modulo Hash

The most basic hash function for integers:

PSEUDOCODE: Integer Hashing

// Simple Modulo Hash
hash_modulo(x, B):
    return x % B

// Multiplicative Hash (better distribution)
hash_mult(x, B):
    A = 0.6180339887  // (√5 - 1)/2 - the golden ratio conjugate
    return floor(B * ((x * A) % 1))

// MurmurHash3-style finalizer (for production)
// Note: all arithmetic is on 32-bit unsigned integers (wraps on overflow)
hash_murmur(x, B):
    x ^= x >> 16
    x *= 0x85ebca6b
    x ^= x >> 13
    x *= 0xc2b2ae35
    x ^= x >> 16
    return x % B

Visual Example: Integer Hashing

Example: h(x) = x % 10

Input integers: 42, 157, 1042, 89

Hash calculation:
    42   % 10 = 2
    157  % 10 = 7
    1042 % 10 = 2
    89   % 10 = 9

Hash buckets [0-9]: bucket 2 → {42, 1042}, bucket 7 → {157}, bucket 9 → {89}, all others empty

⚠️ Collision: 42 and 1042 both hash to bucket 2
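
The worked example translates directly into code; running this sketch reproduces the collision between 42 and 1042:

PYTHON: Modulo Hashing and a Collision

def hash_modulo(x, B):
    return x % B

for x in (42, 157, 1042, 89):
    print(x, "->", hash_modulo(x, 10))
# 42 -> 2, 157 -> 7, 1042 -> 2 (collides with 42), 89 -> 9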

String Hashing

From Characters to Numbers

Strings need to be converted to numbers before hashing. Common approaches:

PSEUDOCODE: String Hashing

// Simple Sum (Bad - many collisions)
hash_sum(str, B):
    sum = 0
    for char in str:
        sum += ASCII(char)
    return sum % B

// Polynomial Rolling Hash (Better)
hash_poly(str, B):
    P = 31    // prime number
    hash = 0
    pow = 1
    for char in str:
        hash = (hash + ASCII(char) * pow) % B
        pow = (pow * P) % B
    return hash

// djb2 Hash (Production Quality)
hash_djb2(str, B):
    hash = 5381
    for char in str:
        hash = ((hash << 5) + hash) + ASCII(char)  // hash * 33 + c
    return hash % B

Visual Example: String Hashing

Polynomial Hash: "hello" → hash value

String "hello" as ASCII codes: 'h' = 104, 'e' = 101, 'l' = 108, 'l' = 108, 'o' = 111

Polynomial hash trace (P = 31, B = 100), starting from hash = 0, pow = 1:
    'h': hash = (0 + 104*1)   % 100 = 4,  pow = 31
    'e': hash = (4 + 101*31)  % 100 = 35, pow = 61
    'l': hash = (35 + 108*61) % 100 = 23, pow = 91
    'l': hash = (23 + 108*91) % 100 = 51, pow = 21
    'o': hash = (51 + 111*21) % 100 = 82

Final: h("hello") = 82

Different strings, different hashes:
    "hello" → 82
    "world" → 42
    "Hello" → 50    // case sensitive!
    "olleh" → 22    // order matters!
    "hi"    → 59

String hash properties: ✓ order sensitive ✓ case sensitive ✓ length affects the hash ✓ anagrams get different hashes ✓ good distribution with a prime P
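
The trace is easy to verify; this sketch implements the same polynomial hash and reproduces the values above:

PYTHON: Verifying the Polynomial Hash

def hash_poly(s, B, P=31):
    h, p = 0, 1
    for ch in s:
        h = (h + ord(ch) * p) % B
        p = (p * P) % B
    return h

for s in ("hello", "world", "Hello", "olleh", "hi"):
    print(s, "->", hash_poly(s, 100))
# hello -> 82, world -> 42, Hello -> 50, olleh -> 22, hi -> 59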

Hash Collisions

The Birthday Paradox

With just 23 people, there's a greater than 50% chance that two share a birthday. Hash collisions are just as inevitable: they become likely long before the table is full, as the sketch below shows.
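
A short sketch that computes the exact collision probability for n items hashed into B buckets:

PYTHON: Collision Probability

def p_collision(n, B):
    # P(no collision) = (B/B) * ((B-1)/B) * ... * ((B-n+1)/B)
    p_unique = 1.0
    for i in range(n):
        p_unique *= (B - i) / B
    return 1 - p_unique

print(p_collision(23, 365))          # ~0.507: the classic birthday paradox
print(p_collision(1178, 1_000_000))  # ~0.500: about 1.18 * sqrt(B) items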

√B: roughly √B items for a 50% collision chance
Chaining: a linked list of entries at each bucket
Open Addressing: probe for the next empty slot
Load Factor: keep n/B below ~0.75 (typical), then resize
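
To make chaining and the load factor concrete, here is a minimal sketch of a chained hash table that doubles in size once n/B exceeds 0.75 (the class and method names are illustrative):

PYTHON: Hash Table with Chaining

class ChainedHashTable:
    def __init__(self, B=8):
        self.B, self.n = B, 0
        self.buckets = [[] for _ in range(B)]

    def _index(self, key):
        return hash(key) % self.B

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:               # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))     # collision: extend the chain
        self.n += 1
        if self.n / self.B > 0.75:     # load factor check
            self._resize(2 * self.B)

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self, new_B):
        items = [kv for chain in self.buckets for kv in chain]
        self.B, self.n = new_B, 0
        self.buckets = [[] for _ in range(new_B)]
        for k, v in items:
            self.put(k, v)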

Part 2: Advanced Hashing

Geospatial Hashing

Uber's H3: Hexagonal Hierarchical Spatial Index

[Figure: H3 hexagonal grid over the San Francisco Bay Area, showing adjacent cells 881e620993fffff, 881e620991fffff, and 881e620995fffff.]

Hexagon properties:
    • A resolution-8 cell covers ~0.74 km²
    • 16 resolution levels (0-15)

H3 hashing process, from lat/lng coordinates (37.7749°, -122.4194°) to a 64-bit index:
    1. Project the coordinate onto a face of an icosahedron
    2. Find the containing hexagon at the chosen resolution
    3. Generate the 64-bit index, e.g. 0x881e620993fffff

Uber use cases:
    🚗 Surge pricing: group demand by hexagon
    📍 Driver matching: find drivers in nearby hexes
    📊 Analytics: aggregate stats by region
    🗺️ Visualization: heatmaps at any zoom level
    ⚡ Fast lookups: O(1) hex neighbor finding

H3 in action (h3-py v3 API; v4 renames these functions to latlng_to_cell, grid_disk, and cell_to_boundary):

    import h3

    lat, lng = 37.7749, -122.4194             # San Francisco
    hex_id = h3.geo_to_h3(lat, lng, 8)        # '881e620993fffff'
    neighbors = h3.k_ring(hex_id, 1)          # the cell itself plus its 6 neighbors
    boundary = h3.h3_to_geo_boundary(hex_id)  # hexagon vertices as (lat, lng) pairs
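
As a sketch of the surge-pricing use case, grouping ride requests by hexagon is a one-liner once each coordinate is hashed to a cell (the request list here is hypothetical):

PYTHON: Demand per Hexagon

from collections import Counter
import h3

requests = [(37.7749, -122.4194), (37.7750, -122.4195), (37.8044, -122.2712)]
demand = Counter(h3.geo_to_h3(lat, lng, 8) for lat, lng in requests)
# Counter mapping each hex ID to its request count; the first two requests
# are only meters apart, so they likely land in the same resolution-8 cell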

Locality Sensitive Hashing (LSH)

Finding Similar Items Fast

LSH is designed so similar items hash to the same buckets with high probability.

PSEUDOCODE: LSH for Similarity Search

// MinHash for Document Similarity
MinHash(document, k):    // k hash functions
    shingles = extract_shingles(document, n=3)
    signature = []

    for i in 0..k-1:
        min_hash = ∞
        for shingle in shingles:
            hash_val = hash_i(shingle)
            min_hash = min(min_hash, hash_val)
        signature.append(min_hash)

    return signature

// LSH for Vector Similarity (Random Projection)
LSH_Vector(vector, random_planes):    // takes the list of random hyperplanes
    hash_bits = []

    for plane in random_planes:
        dot_product = dot(vector, plane.normal)
        if dot_product >= 0:
            hash_bits.append(1)
        else:
            hash_bits.append(0)

    return bits_to_int(hash_bits)

// Facebook Faiss-style Product Quantization
PQ_Hash(vector, lookup_tables):
    d = len(vector)
    m = len(lookup_tables)    // number of subvectors
    subvector_dim = d / m    // integer division; assumes m divides d evenly

    codes = []
    for i in 0..m-1:
        subvector = vector[i*subvector_dim : (i+1)*subvector_dim]
        nearest_prototype = find_nearest(subvector, lookup_tables[i])
        codes.append(nearest_prototype.id)

    return codes    // compressed representation
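
The random-projection scheme above is only a few lines of NumPy. In this sketch (function names are illustrative), nearby vectors very likely fall on the same side of every hyperplane and so receive the same hash:

PYTHON: Random-Projection LSH

import numpy as np

rng = np.random.default_rng(42)

def make_planes(num_planes, dim):
    # Each row is the normal vector of a random hyperplane through the origin
    return rng.standard_normal((num_planes, dim))

def lsh_hash(vector, planes):
    # One bit per plane: 1 if the vector lies on the positive side
    bits = planes @ vector >= 0
    return int("".join("1" if b else "0" for b in bits), 2)

planes = make_planes(4, 512)
v1 = rng.standard_normal(512)
v2 = v1 + 0.01 * rng.standard_normal(512)  # a near-duplicate of v1
print(lsh_hash(v1, planes), lsh_hash(v2, planes))  # very likely equal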

What is Quantization?

Intuition: Quantization = Lossy compression that groups similar values together

Think of it like rounding every price to the nearest dollar: you lose the cents, but similar prices still map to the same value.

What's a "lookup table" in PQ?

A lookup table (codebook) is a small set of prototype subvectors, typically 256 per chunk learned with k-means; each chunk of an input vector is replaced by the ID of its nearest prototype.

Product Quantization (PQ) in Action:

  1. Split: 512-dim vector β†’ 8 chunks of 64 dims each

  2. Map: Each chunk β†’ nearest prototype from lookup table

  3. Store: 8 prototype IDs instead of 512 floats

  4. Compression: 512Γ—4 bytes β†’ 8Γ—1 byte = 256x smaller!

The Trade-off: Lose exact values, but preserve similarity relationships
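
Here is a minimal NumPy sketch of the encode step. A real system learns each codebook with k-means; random prototypes stand in here for brevity, and all names are illustrative:

PYTHON: Product Quantization Encoding

import numpy as np

rng = np.random.default_rng(0)
d, m, k = 512, 8, 256   # vector dim, number of chunks, prototypes per chunk
sub = d // m            # 64 dimensions per chunk

# One codebook (lookup table) of k prototype subvectors per chunk
codebooks = rng.standard_normal((m, k, sub))

def pq_encode(vector):
    codes = []
    for i in range(m):
        chunk = vector[i * sub:(i + 1) * sub]
        dists = np.linalg.norm(codebooks[i] - chunk, axis=1)
        codes.append(int(np.argmin(dists)))  # ID of the nearest prototype
    return codes                             # 8 one-byte codes, not 512 floats

print(pq_encode(rng.standard_normal(d)))     # e.g. [17, 203, 5, ...]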


Visual: LSH for Similar Image Search

[Figure: LSH routes similar images to the same hash buckets.]

Input images: 🐱 Cat 1, 😺 Cat 2, 🐕 Dog, 🚗 Car

Feature vectors (512-dimensional embeddings):
    Cat1: [0.8, 0.7, 0.9, ...]
    Cat2: [0.7, 0.8, 0.8, ...]
    Dog:  [0.3, 0.9, 0.2, ...]
    Car:  [0.1, 0.0, 0.5, ...]

Random hyperplanes: 4 random planes, each contributing one bit (above the plane = 1, below = 0)

LSH hash values:
    Cat1: 1011 → Bucket 11
    Cat2: 1011 → Bucket 11    // similar → same bucket!
    Dog:  1001 → Bucket 9
    Car:  0100 → Bucket 4

Query: find images similar to 🐱
    ✓ Fast: check bucket 11 only
    ✓ Found: 😺 (the similar cat)
    ✗ Missed: 🐕 (bucket 9)
    ✗ Missed: 🚗 (bucket 4)

No need to compare against every image in the database!

Vector/Embedding Hashing

Hashing High-Dimensional Data

Modern ML creates high-dimensional embeddings (512-2048 dims). We need efficient ways to hash them.

Product Quantization: split the vector, quantize each part
Random Projection: project into fewer dimensions
Learned Hashing: neural networks as hash functions
Applications: search, recommender systems, deduplication

Real-World Applications

Facebook (Meta): Faiss searches billions of embeddings using product-quantization codes like those sketched above.

Google: S2 cells for geospatial indexing and SimHash-style LSH for near-duplicate page detection.

Spotify: Annoy builds random-projection trees for approximate nearest-neighbor music recommendations.

OpenAI: embedding models whose vectors are typically indexed at scale with ANN techniques such as PQ and LSH.


Key Takeaways

  1. Basic hashing maps any data to fixed range [0, B-1]

  2. Integer/String hashing forms foundation of hash tables

  3. Collisions are inevitable - handle with chaining or open addressing

  4. Geospatial hashing (H3) enables location-based services

  5. LSH finds similar items without comparing everything

  6. Vector hashing makes ML embeddings searchable at scale

The Punchline: Hashing turns infinite input spaces into finite, manageable buckets while preserving the properties you care about, whether that's exact lookup, similarity, or locality!