Bloom Filters
Concept. A Bloom filter is a bit-array hashed by k functions. It answers "definitely not in the set" or "probably in the set" in constant time and a few KB of RAM, with no false negatives.
Intuition. A Bloom filter over Listens.song_id fits in a few KB. Before reading any page, ask the filter "is song X here?" If it says no, skip the page. If it says maybe, do the real read. Cuts wasted IO ~99% of the time.
Space-efficient probabilistic data structure
Tells you "definitely not in set" or "maybe in set".
10x
Space savings
O(1)
Insert & lookup
1%
Typical false positive rate
0%
False negatives
Visual: How Bloom Filters Work
Problem-Solving Applications
For detailed examples of how Bloom filters solve real-world problems at scale, see the Data Structures Problem Solving guide, which covers:
-
CDN Cache Optimization (Cloudflare): How Cloudflare uses Bloom Filters to prevent "One-Hit Wonder" files (assets requested only once) from ever being written to their SSD caching layer. This intercepts billions of useless IO write operations, preventing physical hardware degradation of their global SSDs (tying back to our Storage Hierarchy constraints).
-
Web Crawler Efficiency: How Google and Bing prevent duplicate page crawling
-
Real-time Username Validation: How social platforms provide instant availability feedback
Optional
// Bloom Filter - 20 lines that save 10x space
class BloomFilter:
bits = BitArray(size=m) // m bits, all 0
k = 3 // number of hash functions
Add(item):
for i in 0..k-1:
index = hash_i(item) % m
bits[index] = 1
Contains(item):
for i in 0..k-1:
index = hash_i(item) % m
if bits[index] == 0:
return False // Definitely not in set
return True // Maybe in set
// Optimal parameters (math magic)
m = -n * ln(p) / (ln(2)^2) // bits needed
k = m/n * ln(2) // hash functions
Example: 1M items, 1% false positive
→ m = 9.6M bits = 1.2MB
→ k = 7 hash functions
The Trade-off
| Space per Item | False Positive Rate | Hash Functions (k) |
|---|---|---|
| 4 bits | 10% | 3 |
| 10 bits | 1% | 7 |
| 14 bits | 0.1% | 10 |
Rule of thumb: 10 bits per item = 1% false positives = sweet spot
Key Takeaway
Bloom filters trade a tiny false positive rate for massive space savings. They're perfect when:
-
"Definitely no" is useful (cache misses, crawled URLs)
-
Dataset is huge (billions of items)
-
Some false positives are acceptable
System Design Note: Bloom filters are the standard solution for "have we seen this?" queries at scale. For examples of how Bloom filters solve real-world problems at scale, see the Data Structures Problem Solving guide.