Selectivity & Statistics: How Databases Estimate Data Flow
Concept. Selectivity is the fraction of rows a predicate keeps; the database's statistics (histograms, distinct-value counts) let the query planner estimate selectivity to choose the cheapest plan.
Intuition. WHERE user_id = 1 keeps a tiny fraction of Listens (a handful of rows out of a billion); WHERE rating > 0 keeps ~50%. The optimizer needs to know which is which: the highly selective predicate should use an index, the 50% predicate should scan, and statistics are how it tells them apart.
Core Concepts: Selectivity & Cardinality
Selectivity Basics
Definition: Fraction passing filter (0.0-1.0)
- user_id = 123 → 0.0001
- genre = 'Rock' → 0.30
- rating IS NOT NULL → 0.95
Combining:
AND: S₁ × S₂
OR: S₁ + S₂ - (S₁ × S₂)
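Both rules assume the predicates are statistically independent, which is how most planners combine them by default. A minimal sketch (function names are illustrative):

```python
def and_selectivity(s1: float, s2: float) -> float:
    """P(A and B) under the independence assumption."""
    return s1 * s2

def or_selectivity(s1: float, s2: float) -> float:
    """P(A or B) by inclusion-exclusion, again assuming independence."""
    return s1 + s2 - s1 * s2

# genre = 'Rock' (0.30) combined with rating IS NOT NULL (0.95):
and_selectivity(0.30, 0.95)  # ≈ 0.285
or_selectivity(0.30, 0.95)   # ≈ 0.965
```

When predicates are correlated (e.g. city and zip code), the independence assumption breaks down, which is a classic source of bad estimates.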
Cardinality Estimation
Basic Formula:
Output = Input × Selectivity
Example:
1M rows × 0.01 = 10K rows
Join Cardinality:
FK Join: max(|R|, |S|)
M:N Join: |R|×|S| / max(V(R,key), V(S,key))
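These formulas can be sketched directly; the M:N form here divides by the larger of the two distinct counts, a common textbook refinement (function names are illustrative):

```python
def output_rows(input_rows: int, selectivity: float) -> float:
    # Output = Input × Selectivity
    return input_rows * selectivity

def fk_join_rows(fact_rows: int, dim_rows: int) -> int:
    # Key/foreign-key join: each fact row matches at most one dim row,
    # so the output is bounded by the larger (foreign-key) side
    return max(fact_rows, dim_rows)

def mn_join_rows(r_rows: int, s_rows: int, v_r: int, v_s: int) -> float:
    # M:N join: |R|×|S| / max(V(R,key), V(S,key))
    return r_rows * s_rows / max(v_r, v_s)

output_rows(1_000_000, 0.01)  # ≈ 10,000 rows, as in the example above
```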
Why It Matters
- Algorithm choice: Hash vs nested loop
- Access method: Index vs scan
- Memory allocation: Size buffers correctly
- Parallelism: How many workers?
Bad estimates → 10-1000x slower plans
Step 1: Gathering Statistics
How do we learn about our data?
The Statistics Pipeline
1. Sample/Scan Data → 2. Build Structures → 3. Store & Use
Sampling Techniques
Reservoir Sampling
Uniform sample of k rows
import random

def reservoir_sample(stream, k):
    sample = []
    for i, row in enumerate(stream):
        if i < k:
            sample.append(row)
        else:
            j = random.randint(0, i)  # uniform over [0, i], inclusive
            if j < k:
                sample[j] = row
    return sample
- ✓ Unbiased
- ✓ Fixed memory
- ✗ Requires full scan
Use when: Need exact uniform distribution
Block Sampling
Sample entire pages
ANALYZE TABLE SAMPLE 1000 PAGES;
# Faster I/O: reads 1000 whole pages,
# not 1000 random rows
- ✓ Fast I/O
- ✓ Cache-friendly
- ✗ Clustering bias
Use when: Data is well-distributed
Progressive Sampling
Adaptive sample size
sample_size = 100
while variance > threshold:
    sample_size *= 10
    gather_more_samples()
# stop when the estimate is confident
- ✓ Adaptive accuracy
- ✓ Efficient for simple data
- ✗ May need many passes
Use when: Unknown data distribution
Step 2: Storing Statistics
After gathering data, we build and store compressed representations
Table-Level Stats
Songs_Stats {
row_count: 1,000,000
page_count: 10,000
avg_row_len: 128 bytes
last_analyzed: '2024-01-15'
staleness: 5%
}
Stored in:
- PostgreSQL: pg_class
- MySQL: innodb_table_stats
- BigQuery: INFORMATION_SCHEMA
Column-Level Stats
artist_column {
null_fraction: 0.001
distinct_count: 50,000
most_common: ['Swift', 'Drake']
frequencies: [0.03, 0.02]
histogram: [bounds...]
}
Enables:
- Range query estimates
- Equality selectivity
- NULL handling
- Join cardinality
Data Structures for Distribution Modeling
Key Insight: We can't store every value, so we use compressed data structures that trade space for accuracy
Histograms: Modeling Value Distribution
Height-Balanced
Equal rows per bucket, varying ranges
Rating distribution:
[1.0-2.5] ████ 25%
[2.5-3.5] ████ 25%
[3.5-4.2] ████ 25%
[4.2-5.0] ████ 25%
Best for: Skewed data
Query: rating > 4.2
→ Selectivity = 0.25
Equi-Width
Equal ranges, varying row counts
Year distribution:
[1950-1975] █ 5%
[1975-2000] ███ 15%
[2000-2025] ████████████ 80%
Best for: Uniform data
Query: year > 2020
→ Interpolate: ~0.16
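The interpolation step assumes values are spread uniformly within each bucket; a minimal sketch of that estimate (bucket layout and names are illustrative):

```python
def equi_width_selectivity(buckets, threshold):
    """Estimate P(value > threshold) from an equi-width histogram.

    buckets: list of (low, high, fraction_of_rows) tuples.
    Assumes uniform distribution within each bucket (linear interpolation).
    """
    sel = 0.0
    for low, high, frac in buckets:
        if threshold <= low:
            sel += frac                                # whole bucket passes
        elif threshold < high:
            sel += frac * (high - threshold) / (high - low)  # partial bucket
    return sel

year_hist = [(1950, 1975, 0.05), (1975, 2000, 0.15), (2000, 2025, 0.80)]
equi_width_selectivity(year_hist, 2020)  # 0.80 × 5/25 = 0.16, as above
```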
Wavelet (Advanced)
Hierarchical compression
Multi-resolution:
Level 0: ████████ (coarse)
Level 1: ██ ██ ██ ██
Level 2: █ █ █ █ █ █ █ █
Best for: Complex patterns
Benefits: Compact, multi-dimensional, range queries
Probabilistic Data Structures
Bloom Filter
Membership testing
# Space: O(m) bits
# False positive: ~1%
insert(x):
    for h in hash_funcs:
        bits[h(x)] = 1

contains(x):
    return all(bits[h(x)]
               for h in hash_funcs)
Use: "Might be in set"
Example: Skip partitions in distributed query
Size: 10 bits per item for 1% FP
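The pseudocode above, made runnable: a minimal sketch that derives its k hash positions by salting a single strong hash (class and parameter names are my own):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive k pseudo-independent positions by salting one hash
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def contains(self, item: str) -> bool:
        # True means "might be in set"; False is definitive
        return all(self.bits[p] for p in self._positions(item))
```

A production filter would pack bits densely and size m at roughly 10 bits per item for the ~1% false-positive rate noted above.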
HyperLogLog
Cardinality estimation
# Space: O(m) = 1.5KB
# Error: ±1%
add(value):
    h = hash(value)
    bucket = h & (m - 1)                    # low log2(m) bits pick a bucket
    rank = leading_zeros(h >> log2(m)) + 1  # rank of the remaining bits
    buckets[bucket] = max(buckets[bucket], rank)
Use: COUNT(DISTINCT)
Example: Unique visitors without storing IDs
Size: 1.5KB for billions
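A compact, runnable version of the same idea; the bias-correction constant follows the standard HyperLogLog formulation, and the small-range correction is omitted for brevity (function name and defaults are my own):

```python
import hashlib

def hll_estimate(values, b: int = 10) -> float:
    """Estimate the number of distinct values using 2**b buckets."""
    m = 1 << b                     # number of buckets (1024 here, ~1KB)
    buckets = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], "big")
        bucket = h & (m - 1)       # low b bits pick a bucket
        w = h >> b                 # remaining 64-b bits
        rank = (64 - b) - w.bit_length() + 1  # position of leftmost 1-bit
        buckets[bucket] = max(buckets[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)          # bias correction for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in buckets)
```

With b=10 the sketch uses about 1KB and the typical relative error is around 1.04/√m ≈ 3%.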
Count-Min Sketch
Frequency counting
# Space: O(w×d)
# Error: ε with prob δ
increment(item):
    for i, h in enumerate(hash_funcs):
        table[i][h(item)] += 1

estimate(item):
    return min(table[i][h(item)]
               for i, h in enumerate(hash_funcs))
Use: Top-K queries
Example: Most frequent queries
Size: 10KB for 0.1% error
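A runnable sketch of the same structure, again deriving one hash per row by salting (class and parameter names are my own):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int, depth: int):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item: str):
        # One salted hash per row, mapped to a column
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.w

    def increment(self, item: str, count: int = 1) -> None:
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Never underestimates; hash collisions can only inflate counts
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))
```

Taking the minimum across rows is what bounds the overestimate: a collision would have to occur in every row to survive the min.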
Step 3: Using Statistics
Common Selectivity Patterns
High Selectivity (< 1%)
- Primary key = value
- Unique constraint
- Rare value lookup
→ Use index
Medium (1-10%)
- Common value
- Date range (month)
- Category filter
→ Consider bitmap scan
Low (> 10%)
- NOT NULL check
- Broad range
- Popular category
→ Sequential scan
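As a rule of thumb, the three bands above reduce to a simple decision function; the fixed thresholds are illustrative, since a real planner compares estimated costs rather than hard cutoffs:

```python
def choose_access_method(selectivity: float) -> str:
    if selectivity < 0.01:       # high selectivity: very few rows pass
        return "index scan"
    if selectivity <= 0.10:      # medium: batch the page fetches
        return "bitmap scan"
    return "sequential scan"     # low selectivity: most rows pass anyway
```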
Key Takeaways
The Complete Statistics Pipeline
1. Gather
Sample or scan data
Balance cost vs accuracy
2. Store
Build histograms & sketches
Compress intelligently
3. Use
Estimate selectivity
Choose optimal plans
4. Update
Monitor staleness
Re-analyze when needed
Without statistics, even the best optimizer is just guessing.
With good statistics, it can choose plans that are 10-1000x faster!