Bloom Filters

Concept. A Bloom filter is a bit-array hashed by k functions. It answers "definitely not in the set" or "probably in the set" in constant time and a few KB of RAM, with no false negatives.

Intuition. A Bloom filter over Listens.song_id fits in a few KB. Before reading any page, ask the filter "is song X here?" If it says no, skip the page. If it says maybe, do the real read. Cuts wasted IO ~99% of the time.

Space-efficient probabilistic data structure

Tells you "definitely not in set" or "maybe in set".

10x

Space savings

O(1)

Insert & lookup

1%

Typical false positive rate

0%

False negatives


Visual: How Bloom Filters Work

Bloom Filter: Bit Array + Hash Functions 16-bit array (real ones use millions of bits) 0 [0] 0 [1] 0 [2] 1 [3] 0 [4] 0 [5] 0 [6] 1 [7] 0 [8] 1 [9] 0 [10] 1 [11] 0 [12] 0 [13] 1 [14] 0 [15] INSERT "alice" hash1("alice") % 16 = 3 → bit[3] = 1 hash2("alice") % 16 = 7 → bit[7] = 1 hash3("alice") % 16 = 14 → bit[14] = 1 Bits 3, 7, 14 are now set. Bits 9 and 11 came from prior inserts. LOOKUP "bob" hash1("bob") % 16 = 3 → bit[3] = 1 ✓ hash2("bob") % 16 = 9 → bit[9] = 1 ✓ hash3("bob") % 16 = 2 → bit[2] = 0 ✗ bit[2] = 0 ⇒ NOT IN SET (guaranteed).

Problem-Solving Applications

For detailed examples of how Bloom filters solve real-world problems at scale, see the Data Structures Problem Solving guide, which covers:

  • CDN Cache Optimization (Cloudflare): How Cloudflare uses Bloom Filters to prevent "One-Hit Wonder" files (assets requested only once) from ever being written to their SSD caching layer. This intercepts billions of useless IO write operations, preventing physical hardware degradation of their global SSDs (tying back to our Storage Hierarchy constraints).

  • Web Crawler Efficiency: How Google and Bing prevent duplicate page crawling

  • Real-time Username Validation: How social platforms provide instant availability feedback


Optional

// Bloom Filter - 20 lines that save 10x space
class BloomFilter:
    bits = BitArray(size=m)      // m bits, all 0
    k = 3                         // number of hash functions

    Add(item):
        for i in 0..k-1:
            index = hash_i(item) % m
            bits[index] = 1

    Contains(item):
        for i in 0..k-1:
            index = hash_i(item) % m
            if bits[index] == 0:
                return False      // Definitely not in set
        return True               // Maybe in set

// Optimal parameters (math magic)
m = -n * ln(p) / (ln(2)^2)      // bits needed
k = m/n * ln(2)                  // hash functions

Example: 1M items, 1% false positive
→ m = 9.6M bits = 1.2MB
→ k = 7 hash functions

The Trade-off

Space per Item False Positive Rate Hash Functions (k)
4 bits 10% 3
10 bits 1% 7
14 bits 0.1% 10

Rule of thumb: 10 bits per item = 1% false positives = sweet spot


Key Takeaway

Bloom filters trade a tiny false positive rate for massive space savings. They're perfect when:

  • "Definitely no" is useful (cache misses, crawled URLs)

  • Dataset is huge (billions of items)

  • Some false positives are acceptable

System Design Note: Bloom filters are the standard solution for "have we seen this?" queries at scale. For examples of how Bloom filters solve real-world problems at scale, see the Data Structures Problem Solving guide.