Bloom Filters

Concept. A Bloom filter is a bit-array hashed by k functions. It answers "definitely not in the set" or "probably in the set" in constant time and a few KB of RAM, with no false negatives.

Intuition. A Bloom filter over Listens.song_id fits in a few KB. Before reading any page, ask the filter "is song X here?" If it says no, skip the page. If it says maybe, do the real read. Cuts wasted IO ~99% of the time.

Space-efficient probabilistic data structure

Tells you "definitely not in set" or "maybe in set".

10x

Space savings

O(1)

Insert & lookup

Typical false positive rate

False negatives

Visual: How Bloom Filters Work

Figure

Bloom filter: a bit array plus k hash functions. Insert sets bits; lookup tests them.

Bit array (top): 16 bits, drawn as circles. 0 means clear, 1 means set. Real Bloom filters use millions of bits, not 16.
INSERT "alice": hash alice with each of the k = 3 hash functions, take each output mod 16, set those bits. Bits 3, 7, 14 turn on. Bits 9 and 11 in the array are already on from earlier inserts that the figure does not show.
LOOKUP "bob": hash bob with the same three functions and check the bits. Two out of three are 1 (bits 3 and 9, both set by other items), but bit[2] = 0. Since at least one bit is 0, bob is guaranteed not to be in the set.
The two outcomes:
- If any bit is 0, the item is definitely not in the set. No false negatives.
- If all bits are 1, the item is maybe in the set. False positives are possible because earlier inserts could have collectively set those exact bits.
False positive rate: p ≈ (1 − e^−kn/m)^k, where m is the bit count, n is the inserted-item count, k is the hash count. Optimal k = (m/n) · ln 2.
Practical sizing: at 10 bits per item with optimal k ≈ 7, the false positive rate is about 0.8% (the often-quoted "1%"). 1 billion items needs 1.25 GB of bits (decimal; 10⁹ × 10 / 8 ÷ 10⁹). To get a tighter ~0.2% rate, bump to ~13 bits/item ≈ 1.6 GB.
vs in-memory sets: a Java HashSet<Long> for the same 1 billion items is roughly 80 GB in practice (48 B per HashMap.Node + 24 B per boxed Long + a 16 GB bucket array at load factor 0.75). A tightly-packed primitive long set is more like 16–24 GB. The Bloom filter is 60-70× smaller, at the cost of a small false-positive rate.

Problem-Solving Applications

For detailed examples of how Bloom filters solve real-world problems at scale, see the Data Structures Problem Solving guide, which covers:

CDN Cache Optimization (Cloudflare): How Cloudflare uses Bloom Filters to prevent "One-Hit Wonder" files (assets requested only once) from ever being written to their SSD caching layer. This intercepts billions of useless IO write operations, preventing physical hardware degradation of their global SSDs (tying back to our Storage Hierarchy constraints).
Web Crawler Efficiency: How Google and Bing prevent duplicate page crawling
Real-time Username Validation: How social platforms provide instant availability feedback

Optional

// Bloom Filter - 20 lines that save 10x space
class BloomFilter:
    bits = BitArray(size=m)      // m bits, all 0
    k = 3                         // number of hash functions

    Add(item):
        for i in 0..k-1:
            index = hash_i(item) % m
            bits[index] = 1

    Contains(item):
        for i in 0..k-1:
            index = hash_i(item) % m
            if bits[index] == 0:
                return False      // Definitely not in set
        return True               // Maybe in set

// Optimal parameters (math magic)
m = -n * ln(p) / (ln(2)^2)      // bits needed
k = m/n * ln(2)                  // hash functions

Example: 1M items, 1% false positive
→ m = 9.6M bits = 1.2MB
→ k = 7 hash functions

The Trade-off

Space per Item	False Positive Rate	Hash Functions (k)
4 bits	10%	3
10 bits	1%	7
14 bits	0.1%	10

Rule of thumb: 10 bits per item = 1% false positives = sweet spot

Key Takeaway

Bloom filters trade a tiny false positive rate for massive space savings. They're perfect when:

"Definitely no" is useful (cache misses, crawled URLs)
Dataset is huge (billions of items)
Some false positives are acceptable

System Design Note: Bloom filters are the standard solution for "have we seen this?" queries at scale. For examples of how Bloom filters solve real-world problems at scale, see the Data Structures Problem Solving guide.