Selectivity & Statistics: How Databases Estimate Data Flow

The 0.01% vs 50% Problem

Selectivity Estimation: The Foundation of Query Optimization

Scenario A: WHERE listen_date = '2024-12-25' on 1,000,000 Listens
  • Selectivity: 0.01% → the filter passes ~100 rows
  • ✓ Use an index scan: ~100 random IOs beat a full table scan

Scenario B: WHERE rating >= 3 on 1,000,000 Listens
  • Selectivity: 50% → the filter passes ~500,000 rows
  • ✗ Avoid the index: ~500,000 random IOs are worse than a full table scan

Key Insight: Same Query Structure, Different Plans! Good selectivity estimates (0.01% vs 50%) lead to completely different execution strategies. Without statistics, the optimizer is flying blind and may choose plans that are 100x slower.

Cost Impact of the Wrong Choice:
  • Scenario A with a seq scan: 10,000 IOs (100x worse)
  • Scenario B with an index scan: 500,000 IOs (50x worse)

How does the optimizer know? Statistics, histograms, samples, and cardinality estimates!
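A quick back-of-the-envelope check of those numbers, as a minimal sketch that assumes ~100 rows per page (so 1,000,000 rows ā‰ˆ 10,000 pages) and one random page fetch per matching row:

rows, pages = 1_000_000, 10_000

seq_scan_ios = pages                    # a full scan reads every page once
index_ios_a = int(rows * 0.0001)        # 0.01% selectivity -> 100 random IOs
index_ios_b = int(rows * 0.50)          # 50% selectivity -> 500,000 random IOs

print(index_ios_a, "vs", seq_scan_ios)  # 100 vs 10,000: index wins ~100x
print(index_ios_b, "vs", seq_scan_ios)  # 500,000 vs 10,000: seq scan wins ~50x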

Core Concepts: Selectivity & Cardinality

Selectivity Basics

Definition: Fraction passing filter (0.0-1.0)

  • user_id = 123 → 0.0001
  • genre = 'Rock' → 0.30
  • rating IS NOT NULL → 0.95
Combining (sketched below):
AND: S₁ Ɨ Sā‚‚
OR: S₁ + Sā‚‚ - (S₁ Ɨ Sā‚‚)
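As a minimal sketch, the two combining rules in Python; the independence assumption behind them is exactly what optimizers fall back on when they have no multi-column statistics:

def sel_and(s1, s2):
  return s1 * s2                # P(A and B) under independence

def sel_or(s1, s2):
  return s1 + s2 - s1 * s2      # inclusion-exclusion

print(sel_and(0.30, 0.0001))    # genre = 'Rock' AND user_id = 123 -> 0.00003
print(sel_or(0.30, 0.95))       # genre = 'Rock' OR rating IS NOT NULL -> 0.965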

Cardinality Estimation

Basic Formula:

Output = Input Ɨ Selectivity

Example:
1M rows Ɨ 0.01 = 10K rows

Join Cardinality:
FK Join: max(|R|, |S|)
M:N Join: |R| Ɨ |S| / max(V(R,key), V(S,key))
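The same formulas as a small sketch, with illustrative numbers for a Listens-Songs join (V is the number of distinct join-key values on each side):

def filter_card(rows, selectivity):
  return rows * selectivity

def join_card(r_rows, s_rows, v_r, v_s):
  return r_rows * s_rows / max(v_r, v_s)        # many-to-many estimate

print(filter_card(1_000_000, 0.01))             # 1M rows x 0.01 -> 10,000 rows
print(join_card(1_000_000, 50_000, 50_000, 50_000))
# -> 1,000,000: every listen matches one song, same as the FK rule max(|R|, |S|)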

Why It Matters

  • Algorithm choice: Hash vs nested loop
  • Access method: Index vs scan
  • Memory allocation: Size buffers correctly
  • Parallelism: How many workers?
Bad estimates →
10-1000x slower plans

Step 1: Gathering Statistics

How do we learn about our data?

The Statistics Pipeline

1. Sample/Scan Data → 2. Build Structures → 3. Store & Use

Sampling Techniques

Reservoir Sampling

Uniform sample of k rows

import random

def reservoir_sample(stream, k):
  sample = []
  for i, row in enumerate(stream):
    if i < k:
      sample.append(row)          # fill the reservoir first
    else:
      j = random.randint(0, i)    # keep row i with probability k/(i+1)
      if j < k:
        sample[j] = row
  return sample
  • ✓ Unbiased
  • ✓ Fixed memory
  • ✗ Requires full scan

Use when: Need exact uniform distribution

Block Sampling

Sample entire pages

-- Illustrative syntax; the exact
-- command varies by engine
ANALYZE TABLE songs
SAMPLE 1000 PAGES;

-- Faster I/O: read 1000 whole
-- pages, not 1000 random rows
  • ✓ Fast I/O
  • ✓ Cache-friendly
  • ✗ Clustering bias

Use when: Data is well-distributed
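A sketch of what page-level sampling does, run over an in-memory stand-in for a table (a list of pages, each a list of rows); the clustered layout is contrived to show the bias noted above:

import random

def block_sample(pages, n_pages):
  chosen = random.sample(range(len(pages)), n_pages)
  # every row on a chosen page is taken: cheap IO, but values that are
  # clustered by page skew the sample
  return [row for p in chosen for row in pages[p]]

pages = [[("Rock" if p < 50 else "Jazz", i) for i in range(100)] for p in range(100)]
sample = block_sample(pages, 10)
print(sum(1 for genre, _ in sample if genre == "Rock") / len(sample))  # often far from 0.5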

Progressive Sampling

Adaptive sample size

sample_size = 100
sample = gather_samples(sample_size)
while variance(sample) > threshold:
  sample_size *= 10
  sample = gather_samples(sample_size)

# Stop once the estimate is stable
  • ✓ Adaptive accuracy
  • ✓ Efficient for simple data
  • ✗ May need many passes

Use when: Unknown data distribution
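A runnable sketch of the idea: keep growing the sample until the standard error of the statistic we care about (here, the mean rating) falls below a target. Function names and thresholds are illustrative:

import random
import statistics

def progressive_sample(population, target_stderr=0.01, start=100, growth=10):
  size = start
  while True:
    sample = random.sample(population, min(size, len(population)))
    stderr = statistics.stdev(sample) / len(sample) ** 0.5
    if stderr <= target_stderr or len(sample) == len(population):
      return sample               # confident enough (or out of data)
    size *= growth                # not confident yet: take a 10x larger sample

ratings = [random.uniform(1, 5) for _ in range(1_000_000)]
sample = progressive_sample(ratings)
print(len(sample), round(statistics.mean(sample), 2))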


Step 2: Storing Statistics

After gathering data, we build and store compressed representations

Table-Level Stats

Songs_Stats {
  row_count: 1,000,000
  page_count: 10,000
  avg_row_len: 128 bytes
  last_analyzed: '2024-01-15'
  staleness: 5%
}

Stored in:

  • PostgreSQL: pg_class
  • MySQL: innodb_table_stats
  • BigQuery: INFORMATION_SCHEMA

Column-Level Stats

artist_column {
  null_fraction: 0.001
  distinct_count: 50,000
  most_common: ['Swift', 'Drake']
  frequencies: [0.03, 0.02]
  histogram: [bounds...]
}

Enables:

  • Range query estimates
  • Equality selectivity
  • NULL handling
  • Join cardinality
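For example, equality selectivity can be read almost directly off these fields. The sketch below roughly follows PostgreSQL's most-common-values approach; the stats dict mirrors the example above and eq_selectivity is a hypothetical helper:

stats = {
  "null_fraction": 0.001,
  "distinct_count": 50_000,
  "most_common": ["Swift", "Drake"],
  "frequencies": [0.03, 0.02],
}

def eq_selectivity(value, s):
  if value in s["most_common"]:
    return s["frequencies"][s["most_common"].index(value)]
  # remaining non-null, non-MCV rows are assumed uniform over the
  # remaining distinct values
  remaining_frac = 1.0 - s["null_fraction"] - sum(s["frequencies"])
  remaining_distinct = s["distinct_count"] - len(s["most_common"])
  return remaining_frac / remaining_distinct

print(eq_selectivity("Drake", stats))     # 0.02, straight from the MCV list
print(eq_selectivity("Nirvana", stats))   # ~1.9e-05, uniform over the rest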

Data Structures for Distribution Modeling

Key Insight: We can't store every value, so we use compressed data structures that trade space for accuracy

Histograms: Modeling Value Distribution

Height-Balanced

Equal rows per bucket, varying ranges

Rating distribution:
[1.0-2.5] ████ 25%
[2.5-3.5] ████ 25%
[3.5-4.2] ████ 25%
[4.2-5.0] ████ 25%

Best for: Skewed data

Query: rating > 4.2
→ Selectivity = 0.25

Equi-Width

Equal ranges, varying row counts

Year distribution:
[1950-1975] █ 5%
[1975-2000] ███ 15%
[2000-2025] ████████████ 80%

Best for: Uniform data

Query: year > 2020
→ Interpolate: 0.80 Ɨ 5/25 ā‰ˆ 0.16

Wavelet (Advanced)

Hierarchical compression

Multi-resolution:
Level 0: ████████ (coarse)
Level 1: ██ ██ ██ ██
Level 2: █ █ █ █ █ █ █ █

Best for: Complex patterns

Benefits: Compact, multi-dimensional, range queries
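A minimal sketch of range estimation from the height-balanced rating histogram above; bucket bounds are copied from the example, and linear interpolation inside a bucket is the usual simplifying assumption:

bounds = [1.0, 2.5, 3.5, 4.2, 5.0]        # 4 buckets, 25% of rows each

def gt_selectivity(x, bounds):
  per_bucket = 1.0 / (len(bounds) - 1)
  sel = 0.0
  for lo, hi in zip(bounds, bounds[1:]):
    if x <= lo:
      sel += per_bucket                           # bucket lies entirely above x
    elif lo < x < hi:
      sel += per_bucket * (hi - x) / (hi - lo)    # partial bucket, interpolated
  return sel

print(gt_selectivity(4.2, bounds))   # 0.25, matches the "rating > 4.2" example
print(gt_selectivity(4.6, bounds))   # 0.125, half of the last bucket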

Probabilistic Data Structures

Bloom Filter

Membership testing

# Space: O(m) bits
# False positive: ~1%
insert(x):
  for h in hash_funcs:
    bits[h(x)] = 1

contains(x):
  return all(bits[h(x)] 
    for h in hash_funcs)

Use: "Might be in set"
Example: Skip partitions in distributed query
Size: 10 bits per item for 1% FP
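A minimal pure-Python Bloom filter along the lines of the pseudocode above; the hashing scheme and sizes are illustrative, and production filters use packed bit arrays and faster hashes:

import hashlib

M = 10_000            # bits: ~10 bits/item for ~1% FP at 1,000 items
K = 7                 # hash functions: ~(M/n) * ln 2 is near-optimal
bits = [0] * M

def _indexes(x):
  for seed in range(K):
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    yield int.from_bytes(digest[:8], "big") % M

def insert(x):
  for i in _indexes(x):
    bits[i] = 1

def contains(x):
  # never a false negative; "yes" means "probably present"
  return all(bits[i] for i in _indexes(x))

for artist in ("Swift", "Drake", "Adele"):
  insert(artist)
print(contains("Swift"), contains("Unknown Artist"))   # True, almost surely False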

HyperLogLog

Cardinality estimation

# Space: O(m) = 1.5KB
# Error: ~2% (typical)
add(value):
  h = hash(value)
  bucket = h & (m-1)                # low log2(m) bits pick a bucket
  rank = 1 + leading_zeros(h >> log2(m))
  buckets[bucket] = max(buckets[bucket], rank)

Use: COUNT(DISTINCT)
Example: Unique visitors without storing IDs
Size: 1.5KB for billions
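A runnable (if simplified) HyperLogLog that adds the estimation step the pseudocode above leaves out; it uses 2,048 registers (~1.5 KB at 6 bits each) and skips the small- and large-range corrections of the full algorithm:

import hashlib

P = 11                                      # 2^11 = 2,048 buckets
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)            # bias-correction constant for m >= 128
buckets = [0] * M

def _hash64(value):
  digest = hashlib.sha256(str(value).encode()).digest()
  return int.from_bytes(digest[:8], "big")

def add(value):
  h = _hash64(value)
  bucket = h & (M - 1)                      # low P bits pick the bucket
  rest = h >> P                             # remaining 64 - P bits
  rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
  buckets[bucket] = max(buckets[bucket], rank)

def estimate():
  harmonic = sum(2.0 ** -r for r in buckets)
  return ALPHA * M * M / harmonic

for i in range(100_000):
  add(f"user-{i}")
print(round(estimate()))                    # typically within a few percent of 100,000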

Count-Min Sketch

Frequency counting

# Space: O(w Ɨ d)
# Error: ≤ εN with prob 1-δ
increment(item):
  for i, h in enumerate(hash_funcs):
    table[i][h(item)] += 1

estimate(item):
  return min(table[i][h(item)]
    for i, h in enumerate(hash_funcs))

Use: Top-K queries
Example: Most frequent queries
Size: 10KB for 0.1% error
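A minimal Count-Min sketch in the same spirit as the pseudocode above; width, depth, and the hashing scheme are illustrative:

import hashlib

W, D = 2000, 5                        # width x depth
table = [[0] * W for _ in range(D)]

def _index(row, item):
  digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
  return int.from_bytes(digest[:8], "big") % W

def increment(item, count=1):
  for row in range(D):
    table[row][_index(row, item)] += count

def estimate(item):
  # collisions only ever add counts, so the minimum is the tightest estimate
  return min(table[row][_index(row, item)] for row in range(D))

increment("play:song_42", 100)
increment("play:song_7", 3)
print(estimate("play:song_42"))       # >= 100; exact unless collisions inflate it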


Step 3: Using Statistics

Common Selectivity Patterns

High Selectivity (< 1%)

  • Primary key = value
  • Unique constraint
  • Rare value lookup

→ Use index

Medium (1-10%)

  • Common value
  • Date range (month)
  • Category filter

→ Consider bitmap scan

Low (> 10%)

  • NOT NULL check
  • Broad range
  • Popular category

→ Sequential scan
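A toy cost model that reproduces these decision rules; the 4:1 random-to-sequential page cost ratio mirrors PostgreSQL's defaults, the rest is illustrative:

SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0

def choose_access_path(row_count, page_count, selectivity):
  matching_rows = row_count * selectivity
  seq_cost = page_count * SEQ_PAGE_COST           # read every page once, sequentially
  index_cost = matching_rows * RANDOM_PAGE_COST   # ~one random page fetch per match
  return "index scan" if index_cost < seq_cost else "seq scan"

print(choose_access_path(1_000_000, 10_000, 0.0001))  # index scan  (Scenario A)
print(choose_access_path(1_000_000, 10_000, 0.50))    # seq scan    (Scenario B)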

Real-World Example

PostgreSQL Statistics in Action

Gather Statistics

ANALYZE songs;

SELECT attname, n_distinct, 
       null_frac
FROM pg_stats 
WHERE tablename = 'songs';

See the Impact

EXPLAIN ANALYZE SELECT * FROM songs
WHERE genre = 'Jazz';

Seq Scan (cost=0..2500 
  rows=15000 estimated)
  (actual rows=14823)

Result: Estimate within 1.2% of actual!


Key Takeaways

The Complete Statistics Pipeline

1. Gather
Sample or scan data
Balance cost vs accuracy

2. Store
Build histograms & sketches
Compress intelligently

3. Use
Estimate selectivity
Choose optimal plans

4. Update
Monitor staleness
Re-analyze when needed

Without statistics, even the best optimizer is just guessing.
With good statistics, it can choose plans that are 10-1000x faster!