Selectivity & Statistics: How Databases Estimate Data Flow
The 0.01% vs 50% Problem
Core Concepts: Selectivity & Cardinality
Selectivity Basics
Definition: Fraction passing filter (0.0-1.0)
user_id = 123 → 0.0001
genre = 'Rock' → 0.30
rating IS NOT NULL → 0.95
AND: S₁ × S₂
OR: S₁ + S₂ - (S₁ × S₂)
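A quick Python sketch of the combination rules, using the selectivities listed above; the independence assumption is exactly what breaks down on correlated columns:

def and_selectivity(s1, s2):
    return s1 * s2                  # assumes the predicates are independent

def or_selectivity(s1, s2):
    return s1 + s2 - s1 * s2        # inclusion-exclusion

# genre = 'Rock' combined with rating IS NOT NULL
print(and_selectivity(0.30, 0.95))  # ≈ 0.285
print(or_selectivity(0.30, 0.95))   # ≈ 0.965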
Cardinality Estimation
Basic Formula:
Output = Input × Selectivity
Example:
1M rows × 0.01 = 10K rows
FK Join: max(|R|, |S|)
M:N Join: |R| × |S| / V(key)
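The same formulas as plain arithmetic; the 1M rows and 0.01 come from the example above, while the second table's size and V(key) below are made up for illustration:

filtered = 1_000_000 * 0.01            # 10,000 rows survive the filter

# FK join: each row on the many side matches exactly one key row
songs, artists = 1_000_000, 50_000
fk_join = max(songs, artists)          # 1,000,000 joined rows

# M:N join: V(key) = number of distinct join-key values
r_rows, s_rows, v_key = 1_000_000, 200_000, 50_000
mn_join = r_rows * s_rows // v_key     # 4,000,000 joined rows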
Why It Matters
- Algorithm choice: Hash vs nested loop
- Access method: Index vs scan
- Memory allocation: Size buffers correctly
- Parallelism: How many workers?
Bad estimates → 10-1000x slower plans
Step 1: Gathering Statistics
How do we learn about our data?
The Statistics Pipeline
Sampling Techniques
Reservoir Sampling
Uniform sample of k rows
import random

def reservoir_sample(stream, k):
    sample = []
    for i, row in enumerate(stream):
        if i < k:
            sample.append(row)
        else:
            j = random.randint(0, i)   # keep row i with probability k / (i + 1)
            if j < k:
                sample[j] = row
    return sample
- ✓ Unbiased
- ✓ Fixed memory
- ✗ Requires full scan
Use when: Need exact uniform distribution
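A quick check of the reservoir_sample function above on a synthetic stream (the stream itself is invented for illustration):

stream = ({"song_id": i, "rating": i % 5} for i in range(1_000_000))
sample = reservoir_sample(stream, k=1_000)
print(len(sample))    # always 1000, no matter how long the stream is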
Block Sampling
Sample entire pages
ANALYZE TABLE SAMPLE 1000 PAGES;
# Faster I/O:
# read 1000 pages,
# not 1000 random rows
- ✓ Fast I/O
- ✓ Cache-friendly
- ✗ Clustering bias
Use when: Data is well-distributed
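A minimal Python sketch of the same idea, assuming a hypothetical read_page(page_id) that returns every row stored on that page:

import random

def block_sample(page_count, pages_wanted, read_page):
    # One sequential read per sampled page: cheap I/O, but rows that share
    # a page tend to be correlated, which is where the clustering bias comes from.
    chosen = random.sample(range(page_count), pages_wanted)
    rows = []
    for page_id in chosen:
        rows.extend(read_page(page_id))
    return rows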
Progressive Sampling
Adaptive sample size
sample_size = 100
while variance > threshold:
    sample_size *= 10
    gather_more_samples()
# Stop when confident
- ✓ Adaptive accuracy
- ✓ Efficient for simple data
- ✗ May need many passes
Use when: Unknown data distribution
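A runnable sketch of the adaptive loop, here growing the sample until the standard error of a numeric column's mean drops below a target (the threshold and growth factor are placeholders):

import random
import statistics

def progressive_sample(values, target_stderr=0.01, start=100, growth=10):
    size = start
    while True:
        sample = random.sample(values, min(size, len(values)))
        stderr = statistics.stdev(sample) / len(sample) ** 0.5
        # Stop when confident (or once the whole column has been sampled)
        if stderr <= target_stderr or size >= len(values):
            return sample
        size *= growth        # gather more samples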
Step 2: Storing Statistics
After gathering data, we build and store compressed representations
Table-Level Stats
Songs_Stats {
row_count: 1,000,000
page_count: 10,000
avg_row_len: 128 bytes
last_analyzed: '2024-01-15'
staleness: 5%
}
Stored in:
- PostgreSQL: pg_class
- MySQL: innodb_table_stats
- BigQuery: INFORMATION_SCHEMA
Column-Level Stats
artist_column {
null_fraction: 0.001
distinct_count: 50,000
most_common: ['Swift', 'Drake']
frequencies: [0.03, 0.02]
histogram: [bounds...]
}
Enables:
- Range query estimates
- Equality selectivity
- NULL handling
- Join cardinality
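A sketch of how those column-level numbers could be computed from a sample; the field names mirror the slide, not any particular system's catalog:

from collections import Counter

def column_stats(values, num_mcv=2, num_buckets=4):
    non_null = sorted(v for v in values if v is not None)
    counts = Counter(non_null)
    mcv = counts.most_common(num_mcv)
    # equal-height histogram bounds: one boundary per quantile
    step = (len(non_null) - 1) / num_buckets
    bounds = [non_null[int(i * step)] for i in range(num_buckets + 1)]
    return {
        "null_fraction": 1 - len(non_null) / len(values),
        "distinct_count": len(counts),
        "most_common": [v for v, _ in mcv],
        "frequencies": [c / len(values) for _, c in mcv],
        "histogram": bounds,
    }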
Data Structures for Distribution Modeling
Histograms: Modeling Value Distribution
Height-Balanced
Equal rows per bucket, varying ranges
Rating distribution:
[1.0-2.5] ████ 25%
[2.5-3.5] ████ 25%
[3.5-4.2] ████ 25%
[4.2-5.0] ████ 25%
Best for: Skewed data
Query: rating > 4.2
→ Selectivity = 0.25
Equi-Width
Equal ranges, varying row counts
Year distribution:
[1950-1975] █ 5%
[1975-2000] ███ 15%
[2000-2025] ████████████ 80%
Best for: Uniform data
Query: year > 2020
→ Interpolate: ~0.16
Wavelet (Advanced)
Hierarchical compression
Multi-resolution:
Level 0: ████████ (coarse)
Level 1: ██ ██ ██ ██
Level 2: █ █ █ █ █ █ █ █
Best for: Complex patterns
Benefits: Compact, multi-dimensional, range queries
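The interpolation from the equi-width example, as a small estimator over (low, high, fraction) buckets; it assumes values are spread uniformly within each bucket:

def selectivity_gt(buckets, value):
    sel = 0.0
    for low, high, fraction in buckets:
        if value <= low:
            sel += fraction                                  # bucket entirely above
        elif value < high:
            sel += fraction * (high - value) / (high - low)  # partial bucket
    return sel

year_hist = [(1950, 1975, 0.05), (1975, 2000, 0.15), (2000, 2025, 0.80)]
print(selectivity_gt(year_hist, 2020))    # ≈ 0.16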
Probabilistic Data Structures
Bloom Filter
Membership testing
# Space: O(m) bits
# False positive: ~1%
insert(x):
    for h in hash_funcs:
        bits[h(x)] = 1

contains(x):
    return all(bits[h(x)]
               for h in hash_funcs)
Use: "Might be in set"
Example: Skip partitions in distributed query
Size: 10 bits per item for 1% FP
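To make the partition-skipping example concrete, a self-contained toy: one small Bloom filter per partition, built over a hypothetical user_id column (not any engine's actual API):

import hashlib

class BloomFilter:
    def __init__(self, m_bits=8192, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits)      # one byte per bit, for simplicity

    def _hashes(self, x):
        digest = hashlib.sha256(str(x).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def insert(self, x):
        for h in self._hashes(x):
            self.bits[h] = 1

    def might_contain(self, x):
        return all(self.bits[h] for h in self._hashes(x))

# One filter per partition; a lookup only scans partitions that *might* match.
partitions = {"p0": range(0, 1000), "p1": range(1000, 2000)}
filters = {}
for name, user_ids in partitions.items():
    bf = BloomFilter()
    for uid in user_ids:
        bf.insert(uid)
    filters[name] = bf

print([p for p, bf in filters.items() if bf.might_contain(1234)])   # usually just ['p1']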
HyperLogLog
Cardinality estimation
# Space: O(m) = 1.5KB
# Error: ±1%
add(value):
    h = hash(value)
    bucket = h & (m - 1)                    # low log2(m) bits pick a bucket
    rank = leading_zeros(h >> log2(m)) + 1  # rank of the remaining bits
    buckets[bucket] = max(buckets[bucket], rank)
Use: COUNT(DISTINCT)
Example: Unique visitors without storing IDs
Size: 1.5KB for billions
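The pseudocode above only covers add(); the matching estimate step is the standard raw HyperLogLog formula, shown without the small- and large-range corrections:

def hll_estimate(buckets):
    m = len(buckets)
    alpha = 0.7213 / (1 + 1.079 / m)      # bias constant, valid for m >= 128
    return alpha * m * m / sum(2.0 ** -rank for rank in buckets)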
Count-Min Sketch
Frequency counting
# Space: O(w × d)
# Error: ε with prob δ
increment(item):
    for i, h in enumerate(hash_funcs):
        table[i][h(item)] += 1

estimate(item):
    return min(table[i][h(item)]
               for i, h in enumerate(hash_funcs))
Use: Top-K queries
Example: Most frequent queries
Size: 10KB for 0.1% error
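A self-contained sketch that sizes the table from the target error, using the standard bounds w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ (the hash scheme is just one convenient choice):

import hashlib
import math

class CountMinSketch:
    def __init__(self, eps=0.001, delta=0.01):
        self.w = math.ceil(math.e / eps)          # width controls the error
        self.d = math.ceil(math.log(1 / delta))   # depth controls the confidence
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def increment(self, item):
        for row in range(self.d):
            self.table[row][self._hash(item, row)] += 1

    def estimate(self, item):
        return min(self.table[row][self._hash(item, row)] for row in range(self.d))

cms = CountMinSketch()
for q in ["top_songs", "top_songs", "by_genre"]:
    cms.increment(q)
print(cms.estimate("top_songs"))    # 2 (may overestimate, never underestimates)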
Step 3: Using Statistics
Common Selectivity Patterns
High (< 1%)
- Primary key = value
- Unique constraint
- Rare value lookup
→ Index scan
Medium (1-10%)
- Common value
- Date range (month)
- Category filter
→ Consider bitmap scan
Low (> 10%)
- NOT NULL check
- Broad range
- Popular category
→ Sequential scan
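A toy decision rule mirroring the thresholds above; real optimizers compare estimated costs rather than hard-coding cutoffs:

def choose_access_method(selectivity):
    if selectivity < 0.01:
        return "index scan"
    if selectivity <= 0.10:
        return "bitmap scan"
    return "sequential scan"

print(choose_access_method(15_000 / 1_000_000))    # 'bitmap scan'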
Real-World Example
PostgreSQL Statistics in Action
Gather Statistics
ANALYZE songs;
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'songs';
See the Impact
EXPLAIN ANALYZE SELECT * FROM songs WHERE genre = 'Jazz';

Seq Scan (cost=0..2500 rows=15000 estimated)
         (actual rows=14823)
Result: Estimate within 1.2% of actual!
Key Takeaways
The Complete Statistics Pipeline
- Gather: balance cost vs. accuracy
- Store: compress intelligently
- Use: choose optimal plans
- Maintain: re-analyze when needed
Without statistics, even the best optimizer is just guessing.
With good statistics, it can choose plans that are 10-1000x faster!