Hashing: From Basics to Advanced

The Universal Data Structure Technique

Figure: a hash function h(x) maps an infinite input space (integers like 42, strings like "hello world", geo-coordinates like (37.7749, -122.4194), vectors, arbitrary objects) into the finite range [0, B-1], spreading items across buckets. Properties: deterministic (same input → same output), uniform (even distribution), fast (O(1) computation), fixed range [0, B-1].

Part 1: Basic Hashing

What is Hashing?

Hashing is a technique to map data of arbitrary size to fixed-size values. Think of it as a deterministic compression function that always maps the same input to the same output.

O(1): average access time
Deterministic: same input → same output
Fixed range: output always lands in [0, B-1]
Uniform: even distribution across buckets
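
A minimal sketch of these properties in Python (B = 10 and the sample inputs are arbitrary choices for illustration):

B = 10   # arbitrary bucket count for the sketch

def h(x):
    return hash(x) % B   # Python's built-in hash, folded into [0, B-1]

# Deterministic within a run (Python salts string hashing per process):
assert h("hello world") == h("hello world")
# Fixed range, for many input types:
for x in (42, "hello world", (37.7749, -122.4194)):
    assert 0 <= h(x) < B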

Integer Hashing

Simple Modulo Hash

The most basic hash function for integers:

Optional: Integer Hashing

# Simple modulo hash
def hash_modulo(x, B):
    return x % B

# MurmurHash3-style finalizer (better bit mixing for production use)
def hash_murmur(x, B):
    x &= 0xFFFFFFFF          # treat x as a 32-bit unsigned integer
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x % B
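
A quick check of why the extra mixing matters; the even-key pattern and B = 8 are made-up values for illustration:

# Sequential even keys (a common real-world pattern) with B = 8:
keys = range(0, 100, 2)
print(len({hash_modulo(k, 8) for k in keys}))   # 4: only even buckets are used
print(len({hash_murmur(k, 8) for k in keys}))   # typically all 8 buckets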

Visual Example: Integer Hashing

Example: h(x) = x % 10

Input integers: 42, 157, 1042, 89

Hash calculation:
42 % 10 = 2
157 % 10 = 7
1042 % 10 = 2
89 % 10 = 9

Hash buckets [0-9]: bucket 2 → {42, 1042}, bucket 7 → {157}, bucket 9 → {89}; the rest stay empty.

Collision: 42 and 1042 both hash to bucket 2.

String Hashing

From Characters to Numbers

Strings need to be converted to numbers before hashing. Common approaches:

Optional: String Hashing

# Simple sum of character codes (bad: many collisions, e.g. all anagrams)
def hash_sum(s, B):
    return sum(ord(c) for c in s) % B

# Polynomial rolling hash (better: position-sensitive)
def hash_poly(s, B):
    P = 31                       # small prime multiplier
    h, p = 0, 1
    for c in s:
        h = (h + ord(c) * p) % B
        p = (p * P) % B
    return h
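
To see the difference, compare anagrams under both functions (B = 100 to match the worked example below):

print(hash_sum("hello", 100), hash_sum("olleh", 100))     # 32 32: anagrams collide
print(hash_poly("hello", 100), hash_poly("olleh", 100))   # 82 22: order matters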

Visual Example: String Hashing

Polynomial hash of "hello" (P = 31, B = 100):

String: "hello", character codes 'h' = 104, 'e' = 101, 'l' = 108, 'l' = 108, 'o' = 111

hash = 0, pow = 1
'h': hash = (0 + 104*1) % 100 = 4, pow = 31
'e': hash = (4 + 101*31) % 100 = 35, pow = 61
'l': hash = (35 + 108*61) % 100 = 23, pow = 91
'l': hash = (23 + 108*91) % 100 = 51, pow = 21
'o': hash = (51 + 111*21) % 100 = 82

Final: h("hello") = 82

Different strings, different hashes:
"hello" → 82
"world" → 42
"Hello" → 50    // case sensitive!
"olleh" → 22    // order matters!
"hi"    → 59

String hash properties: order sensitive, case sensitive, length affects the hash, anagrams get different hashes, and a prime P gives good distribution.

Hash Collisions

The Birthday Paradox

With just 23 people, there's a greater than 50% chance that two share a birthday. Hash collisions are just as inevitable: with B buckets, expect a collision after only about √B random insertions.

√B: items inserted before a ~50% collision chance
Chaining: a linked list at each bucket
Open addressing: probe for the next empty slot
Load factor: keep n/B below a threshold, typically 0.75
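
A minimal sketch of chaining, assuming Python's built-in hash as the hash function (toy code, no resizing):

class ChainedHashTable:
    def __init__(self, B=16):
        # one list ("chain") per bucket; colliding keys share a chain
        self.B = B
        self.buckets = [[] for _ in range(B)]

    def put(self, key, value):
        chain = self.buckets[hash(key) % self.B]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # key already present: overwrite
                return
        chain.append((key, value))        # new key (or collision): extend chain

    def get(self, key):
        for k, v in self.buckets[hash(key) % self.B]:
            if k == key:
                return v
        raise KeyError(key)

Open addressing would instead probe (h + i) % B for i = 1, 2, ... until an empty slot turns up, and a production table resizes once n/B crosses its load-factor threshold.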

Part 2: Advanced Hashing

Geospatial Hashing

Uber's H3: Hexagonal Hierarchical Spatial Index

H3 tiles the globe with hexagons at 16 resolution levels (0-15); at resolution 8, each cell covers roughly 0.74 km². Nearby San Francisco Bay Area cells get indexes like 881e620993fffff, 881e620991fffff, and 881e620995fffff.

H3 hashing process, for lat/lng input such as (37.7749°, -122.4194°):

  1. Project the point onto the corresponding icosahedron face

  2. Find the containing hexagon at the requested resolution

  3. Encode the result as a 64-bit index, e.g. 0x881e620993fffff

Uber use cases:

🚗 Surge pricing: group demand by hexagon
📍 Driver matching: find drivers in nearby hexes
📊 Analytics: aggregate stats by region
🗺️ Visualization: heatmaps at any zoom level
⚡ Fast lookups: O(1) hex neighbor finding

H3 in action (Python, using the h3-py library):

import h3

lat, lng = 37.7749, -122.4194                  # San Francisco
hex_id = h3.geo_to_h3(lat, lng, resolution=8)  # returns '881e620993fffff'
neighbors = h3.k_ring(hex_id, k=1)             # the cell plus its 6 neighbors
boundary = h3.h3_to_geo_boundary(hex_id)       # the hexagon's vertices

Locality Sensitive Hashing (LSH)

Finding Similar Items Fast

LSH is designed so similar items hash to the same buckets with high probability.

Optional: LSH for Similarity Search

import math

# MinHash for document similarity: similar shingle sets get similar signatures
def extract_shingles(text, n=3):
    # character n-grams; production systems often shingle words instead
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(document, k):
    # k independent hash functions, simulated by salting Python's hash with i
    shingles = extract_shingles(document, n=3)
    return [min(hash((i, s)) for s in shingles) for i in range(k)]

# LSH for vector similarity (random projection)
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lsh_vector(vector, random_planes):
    # one bit per hyperplane: 1 if the vector lies on its positive side
    bits = [1 if dot(vector, plane) >= 0 else 0 for plane in random_planes]
    return int("".join(map(str, bits)), 2)

# Facebook Faiss-style product quantization
def pq_hash(vector, lookup_tables):
    m = len(lookup_tables)              # number of subvectors
    sub_dim = len(vector) // m
    codes = []
    for i in range(m):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        # id of the nearest prototype in this chunk's codebook
        nearest = min(range(len(lookup_tables[i])),
                      key=lambda j: math.dist(sub, lookup_tables[i][j]))
        codes.append(nearest)
    return codes                        # compressed representation
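
As a usage sketch, the fraction of agreeing signature slots estimates the Jaccard similarity of the two shingle sets (toy strings, k = 128 picked arbitrarily):

sig_a = minhash("the quick brown fox", k=128)
sig_b = minhash("the quick brown dog", k=128)
matches = sum(a == b for a, b in zip(sig_a, sig_b))
print(f"estimated Jaccard similarity: {matches / 128:.2f}")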

What is Quantization?

Intuition: Quantization = Lossy compression that groups similar values together

Think of it like reducing a photo's color palette: thousands of slightly different shades collapse onto a few representative colors, and shades that were close together end up sharing the same representative.

Example: Product Quantization (PQ) in Action:

  1. Split: 512-dim vector β†’ 8 chunks of 64 dims each

  2. Map: Each chunk β†’ nearest prototype from lookup table

  3. Store: 8 prototype IDs instead of 512 floats

  4. Compression: 512Γ—4 bytes β†’ 8Γ—1 byte = 256x smaller!

The Trade-off: Lose exact values, but preserve similarity relationships
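
A toy run of the pq_hash sketch above; the tiny sizes (8 dims, 4 chunks, 4 prototypes) are made up so the codebooks stay readable, while real systems use e.g. 256 prototypes so each code fits in one byte:

import random

random.seed(0)
d, m, n_protos = 8, 4, 4
# one codebook per chunk: n_protos prototype subvectors of dimension d // m
codebooks = [[[random.gauss(0, 1) for _ in range(d // m)]
              for _ in range(n_protos)] for _ in range(m)]

vector = [random.gauss(0, 1) for _ in range(d)]
print(pq_hash(vector, codebooks))   # 4 small prototype ids instead of 8 floats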


Visual: LSH for Similar Image Search

LSH: similar images land in the same hash bucket.

Input images: 🐱 Cat 1, 😺 Cat 2, 🐕 Dog, 🚗 Car

Feature vectors (512-dimensional embeddings):
Cat1: [0.8, 0.7, 0.9, ...]
Cat2: [0.7, 0.8, 0.8, ...]
Dog:  [0.3, 0.9, 0.2, ...]
Car:  [0.1, 0.0, 0.5, ...]

Random hyperplanes: 4 random planes, each contributing one bit (above the plane = 1, below = 0).

LSH hash values:
Cat1: 1011 → bucket 11
Cat2: 1011 → bucket 11    (similar → same bucket!)
Dog:  1001 → bucket 9
Car:  0100 → bucket 4

Query: find images similar to 🐱 Cat 1. Check bucket 11 only: found 😺 Cat 2, never touched 🐕 Dog (bucket 9) or 🚗 Car (bucket 4). No need to compare with all images in the database!
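
A self-contained sketch of that pipeline; the 4-dim vectors and plane count are invented for illustration:

import random

random.seed(42)
NUM_PLANES, DIM = 4, 4
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def lsh_bucket(vec):
    # one bit per plane: 1 if vec lies on the plane's positive side
    bits = [1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
            for plane in planes]
    return int("".join(map(str, bits)), 2)

images = {"cat1": [0.8, 0.7, 0.9, 0.6], "cat2": [0.7, 0.8, 0.8, 0.7],
          "dog":  [0.3, 0.9, 0.2, 0.1], "car":  [0.1, 0.0, 0.5, 0.9]}
buckets = {}
for name, vec in images.items():
    buckets.setdefault(lsh_bucket(vec), []).append(name)
print(buckets)   # near-duplicate vectors tend to share a bucket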

Vector/Embedding Hashing

Hashing High-Dimensional Data

Modern ML creates high-dimensional embeddings (512-2048 dims). We need efficient ways to hash them.

Product quantization: split the vector, quantize each part
Random projection: project to lower dimensions
Learned hashing: neural network hash functions

Real-World Applications

Facebook (Meta): Faiss uses product quantization to index and search billions of embeddings

Google: SimHash-based near-duplicate detection for web crawling

Spotify: Annoy, its open-source approximate nearest neighbor library, powers music recommendations

OpenAI: embeddings designed for vector search and semantic retrieval


Key Takeaways

  1. Basic hashing maps any data to fixed range [0, B-1]

  2. Integer/String hashing forms foundation of hash tables

  3. Collisions are inevitable - handle with chaining or open addressing

  4. Geospatial hashing (H3) enables location-based services

  5. LSH finds similar items without comparing everything

  6. Vector hashing makes ML embeddings searchable at scale

The Punchline: Hashing turns infinite input spaces into finite, manageable buckets while preserving the property you care about, whether that's exact lookup, similarity, or locality!