Case Study 2.3: How does Google Chrome protect 3 billion users using Bloom Filters?
How does Google Chrome protect 3 billion users from malicious sites?
Objective: Achieve 100x data reduction for secure, privacy-conscious lookups.
Google's Safe Browsing service flags over 100 million malicious URLs, ranging from phishing to malware. The mission: shield every Chrome user from these threats. But two hurdles stand in the way: Scale and Privacy.
The Challenges: Scale and Privacy
-
Scale: Storing 100 million URLs, each averaging 80 characters, demands roughly 8 GB of storage. Expecting every device to download this blacklist is impractical.
-
Privacy: Constantly pinging Google's servers with every URL you visit raises privacy red flags. Nobody wants a tech giant tracking their every click.
-
Infrastructure: With 3 billion users online, querying Google's servers for each URL would flood the system with trillions of requests daily, straining bandwidth and risking server overload. A single point of failure isn't an option.
The Solution: The 80MB Local Filter
Chrome opts for a probabilistic method. Instead of the full list, it downloads a compressed Bloom filter of URL hashes.
The "100x" Compression Advantage:
-
Raw List: 100M URLs × 80 Bytes = 8,000 MB (8 GB)
-
Bloom Filter: 100M items × 6.4 bits/item = 640M bits = 80 MB
-
Efficiency: The Bloom filter is 100 times smaller than the raw list.
Technical Workflow: The Privacy-Preserving Check
The Bloom filter enables Chrome to assess "is this site bad?" directly on your device.
-
Local Check: Chrome hashes the URL and checks against its local 80MB Bloom filter.
-
Confidential Path: If the filter returns "Not in set," the URL is safe. Chrome doesn't inform Google, keeping your browsing history private.
-
"K-Anonymity" Trick: If the filter indicates "Maybe in set" (potential threat), Chrome sends only a prefix (e.g., first 32 bits) of the hash. Google responds with all malicious hashes sharing that prefix, and Chrome completes the check locally.
Python Implementation: Malicious URL Filter
import mmh3 # MurmurHash3
FROM bitarray import bitarray
class SafeBrowsingFilter:
def __init__(self, expected_items, fpr):
# Math: bits_per_item = -1.44 * log2(fpr)
# For 1% FPR (0.01), we need ~10 bits per item
self.size = expected_items * 10
self.bit_array = bitarray(self.size)
self.bit_array.setall(0)
self.hash_count = 7 # Optimal for 1% error
def add_malicious_url(self, url):
for seed IN range(self.hash_count):
index = mmh3.hash(url, seed) % self.size
self.bit_array[index] = 1
def check_url(self, url):
for seed IN range(self.hash_count):
index = mmh3.hash(url, seed) % self.size
if self.bit_array[index] == 0:
return "SAFE" # 100% Guaranteed
return "MAYBE_THREAT" # Perform prefix-check validation
# Setup filter for 1M malicious URLs
sbf = SafeBrowsingFilter(1_000_000, 0.01)
sbf.add_malicious_url("phish-bank.ru")
# Instant local lookup
print(f"'google.com': {sbf.check_url('google.com')}") # SAFE
print(f"'phish-bank.ru/is-this-is-a-malicious-link.html': {sbf.check_url('phish-bank.ru/is-this-a-malicious-link.html')}") # MAYBE_THREAT
Comparison: Why not just use a Database?
| Method | Space (100M URLs) | Privacy Level | User Experience |
|---|---|---|---|
| Full DB Download | ~8 GB | High (Local) | Slow / Data Heavy |
| Cloud Lookup | 0 MB | Zero (Google sees all) | Network Lag |
| Bloom Filter | ~80 MB | High (Google sees <1%) | Instant |
Conclusion: The Bloom filter serves as a "Privacy Shield," allowing the browser to handle 99.9% of browsing locally. It only consults the cloud for a prefix check when a threat is likely.