Selectivity & Statistics: How Databases Estimate Data Flow

Concept. Selectivity is the fraction of rows a predicate keeps; the database's statistics (histograms, distinct-value counts) let the query planner estimate selectivity to choose the cheapest plan.

Intuition. WHERE user_id = 1 keeps a vanishingly small fraction of Listens (a few hundred rows out of a billion); WHERE rating > 0 keeps ~50%. The optimizer needs to know which is which: the near-zero predicate should use an index, the 50% predicate should scan, and statistics are what tell them apart.

Core Concepts: Selectivity & Cardinality

Selectivity Basics

Definition: Fraction passing filter (0.0-1.0)

  • user_id = 123 → 0.0001
  • genre = 'Rock' → 0.30
  • rating IS NOT NULL → 0.95

Combining (assumes independent predicates; sketch below):
AND: S₁ × S₂
OR: S₁ + S₂ - (S₁ × S₂)
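
Both rules assume the predicates are statistically independent; correlated columns (e.g. genre and artist) push the true AND selectivity away from the product. A minimal sketch with made-up selectivities:

# Illustrative per-predicate selectivities
s_genre  = 0.30                             # genre = 'Rock'
s_rating = 0.05                             # rating > 4.5

s_and = s_genre * s_rating                  # 0.015 (independence assumption)
s_or  = s_genre + s_rating - s_and          # 0.335 (inclusion-exclusion)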

Cardinality Estimation

Basic Formula:

Output = Input × Selectivity

Example:
1M rows × 0.01 = 10K rows

Join Cardinality:
FK join: ≈ max(|R|, |S|)  (each FK row matches one PK row)
M:N join: |R| × |S| / max(V(R, key), V(S, key))
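
A small worked example of these formulas in Python; the row counts and distinct-value counts below are illustrative, not taken from a real catalog:

songs   = 1_000_000        # |Songs|
listens = 1_000_000_000    # |Listens|

# Filter: Output = Input × Selectivity
filtered = int(songs * 0.01)                  # 10,000 rows

# FK join Listens → Songs: each listen matches exactly one song
fk_join = max(songs, listens)                 # ≈ 1e9 rows

# M:N join on a shared key
v_r, v_s = 50_000, 40_000                     # distinct key values per side
mn_join = songs * listens // max(v_r, v_s)    # |R|×|S| / max(V(R), V(S))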

Why It Matters

  • Algorithm choice: Hash vs nested loop
  • Access method: Index vs scan
  • Memory allocation: Size buffers correctly
  • Parallelism: How many workers?

Bad estimates → 10-1000x slower plans


Step 1: Gathering Statistics

How do we learn about our data?

The Statistics Pipeline

1. Sample/Scan Data → 2. Build Structures → 3. Store & Use

Sampling Techniques

Reservoir Sampling

Uniform sample of k rows

import random

sample = []
for i, row in enumerate(stream):
  if i < k:
    sample.append(row)          # fill the reservoir first
  else:
    j = random.randint(0, i)    # inclusive: keep row with prob k/(i+1)
    if j < k:
      sample[j] = row
  • ✓ Unbiased
  • ✓ Fixed memory
  • ✗ Requires full scan

Use when: Need exact uniform distribution

Block Sampling

Sample entire pages

# Illustrative syntax; page-level
# sampling commands vary by engine
ANALYZE TABLE songs
SAMPLE 1000 PAGES;

# Faster I/O: read 1000 whole pages,
# not 1000 scattered random rows
  • ✓ Fast I/O
  • ✓ Cache-friendly
  • ✗ Clustering bias

Use when: Data is well-distributed

Progressive Sampling

Adaptive sample size

sample_size = 100
sample = gather_samples(sample_size)
while variance(sample) > threshold:
  sample_size *= 10
  sample = gather_samples(sample_size)

# Stop once the estimate is stable
  • ✓ Adaptive accuracy
  • ✓ Efficient for simple data
  • ✗ May need many passes

Use when: Unknown data distribution
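
A hedged, self-contained sketch of the same idea: grow a uniform sample until the statistic of interest stops moving. draw_random_rows is a hypothetical callable standing in for whatever sampling primitive the engine exposes.

import statistics

def progressive_estimate(draw_random_rows, threshold=0.01,
                         start=100, max_rows=1_000_000):
  """Grow the sample until two successive estimates agree within threshold."""
  size, prev = start, None
  while size <= max_rows:
    sample = draw_random_rows(size)    # hypothetical sampling primitive
    est = statistics.fmean(sample)     # statistic of interest (here: the mean)
    if prev is not None and abs(est - prev) <= threshold * max(abs(prev), 1e-9):
      return est, size                 # stable enough: stop
    prev, size = est, size * 10        # otherwise sample 10x more rows
  return prev, size // 10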


Step 2: Storing Statistics

After gathering data, we build and store compressed representations

Table-Level Stats

Songs_Stats {
  row_count: 1,000,000
  page_count: 10,000
  avg_row_len: 128 bytes
  last_analyzed: '2024-01-15'
  staleness: 5%
}

Stored in:

  • PostgreSQL: pg_class
  • MySQL: innodb_table_stats
  • BigQuery: INFORMATION_SCHEMA

Column-Level Stats

artist_column {
  null_fraction: 0.001
  distinct_count: 50,000
  most_common: ['Swift', 'Drake']
  frequencies: [0.03, 0.02]
  histogram: [bounds...]
}

Enables:

  • Range query estimates
  • Equality selectivity
  • NULL handling
  • Join cardinality
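
A rough sketch of how a planner can turn these column stats into an equality-selectivity estimate; PostgreSQL's estimator follows a similar shape (use the MCV frequency if the value is a known heavy hitter, otherwise spread the leftover probability mass over the remaining distinct values). The numbers mirror the artist_column example above.

def eq_selectivity(value, null_fraction, distinct_count,
                   most_common, frequencies):
  if value in most_common:                  # known heavy hitter: use its frequency
    return frequencies[most_common.index(value)]
  # Mass not covered by NULLs or the MCV list, spread evenly
  # over the remaining distinct values
  remaining = 1.0 - null_fraction - sum(frequencies)
  return remaining / max(distinct_count - len(most_common), 1)

# artist = 'Swift' → 0.03 (from the MCV list)
# artist = 'Enya'  → (1 - 0.001 - 0.05) / (50,000 - 2) ≈ 1.9e-5
print(eq_selectivity('Enya', 0.001, 50_000, ['Swift', 'Drake'], [0.03, 0.02]))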

Data Structures for Distribution Modeling

Key Insight: We can't store every value, so we use compressed data structures that trade space for accuracy

Histograms: Modeling Value Distribution

Height-Balanced

Equal rows per bucket, varying ranges

Rating distribution:
[1.0-2.5] ████ 25%
[2.5-3.5] ████ 25%  
[3.5-4.2] ████ 25%
[4.2-5.0] ████ 25%

Best for: Skewed data

Query: rating > 4.2
→ Selectivity = 0.25

Equi-Width

Equal ranges, varying row counts

Year distribution:
[1950-1975] █ 5%
[1975-2000] ███ 15%
[2000-2025] ████████████ 80%

Best for: Uniform data

Query: year > 2020
→ Interpolate: ~0.16

Wavelet (Advanced)

Hierarchical compression

Multi-resolution:
Level 0: ████████ (coarse)
Level 1: ██ ██ ██ ██
Level 2: █ █ █ █ █ █ █ █

Best for: Complex patterns

Benefits: Compact, multi-dimensional, range queries
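
To make the height-balanced case concrete, a minimal sketch of building equi-depth bucket bounds from a sample and estimating a > predicate, using the usual linear-interpolation assumption inside a bucket:

def build_equi_depth(values, num_buckets):
  """Bucket bounds such that each bucket holds roughly the same number of rows."""
  vals = sorted(values)
  step = len(vals) / num_buckets
  return [vals[min(int(i * step), len(vals) - 1)] for i in range(num_buckets + 1)]

def selectivity_gt(bounds, x):
  """Estimate selectivity of col > x from equi-depth bucket bounds."""
  num_buckets = len(bounds) - 1
  if x <= bounds[0]:
    return 1.0
  if x >= bounds[-1]:
    return 0.0
  for b in range(num_buckets):
    lo, hi = bounds[b], bounds[b + 1]
    if lo <= x < hi:
      frac_of_bucket = (hi - x) / (hi - lo)   # linear interpolation in-bucket
      return (frac_of_bucket + (num_buckets - b - 1)) / num_buckets
  return 0.0

# rating > 4.2 with bounds [1.0, 2.5, 3.5, 4.2, 5.0] → 0.25
print(selectivity_gt([1.0, 2.5, 3.5, 4.2, 5.0], 4.2))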

Probabilistic Data Structures

Bloom Filter

Membership testing

# Space: O(m) bits
# False positive: ~1%
def insert(x):
  for h in hash_funcs:
    bits[h(x) % m] = 1          # set k bit positions

def contains(x):
  return all(bits[h(x) % m]     # all set → "probably present"
             for h in hash_funcs)

Use: "Might be in set"
Example: Skip partitions in distributed query
Size: 10 bits per item for 1% FP

HyperLogLog

Cardinality estimation

# Space: m 6-bit registers ≈ 1.5KB
# Std error: ~2% (≈ 1.04/√m)
def add(value):
  h = hash(value)
  bucket = h & (m - 1)          # low log2(m) bits pick a register
  rank = leading_zeros(h >> log2_m) + 1   # rank of first 1-bit in the rest
  buckets[bucket] = max(buckets[bucket], rank)

Use: COUNT(DISTINCT)
Example: Unique visitors without storing IDs
Size: 1.5KB for billions
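
The add() path above only fills registers; the count itself (not shown) combines them with a harmonic mean. A hedged sketch of the raw estimate, omitting the small- and large-range corrections real implementations apply:

def hll_estimate(buckets):
  """Raw HyperLogLog estimate: alpha_m · m² / Σ 2^(-register)."""
  m = len(buckets)
  alpha = 0.7213 / (1 + 1.079 / m)   # bias-correction constant for m ≥ 128
  return alpha * m * m / sum(2.0 ** -r for r in buckets)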

Count-Min Sketch

Frequency counting

# Space: O(w × d) counters
# Error: overcount ≤ εN with prob 1-δ
def increment(item):
  for i, h in enumerate(hash_funcs):
    table[i][h(item) % w] += 1

def estimate(item):
  return min(table[i][h(item) % w]   # min across rows bounds the overcount
             for i, h in enumerate(hash_funcs))

Use: Top-K queries
Example: Most frequent queries
Size: 10KB for 0.1% error
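
For contrast with the compact pseudocode above, a small self-contained Count-Min sketch; the per-row hashes are simulated by salting Python's built-in hash, which is fine for a demo but not what production systems use:

import math

class CountMinSketch:
  def __init__(self, epsilon=0.001, delta=0.01):
    self.w = math.ceil(math.e / epsilon)       # columns: control the ε error
    self.d = math.ceil(math.log(1 / delta))    # rows: control the δ confidence
    self.table = [[0] * self.w for _ in range(self.d)]

  def _cols(self, item):
    return [hash((i, item)) % self.w for i in range(self.d)]

  def increment(self, item, count=1):
    for i, c in enumerate(self._cols(item)):
      self.table[i][c] += count

  def estimate(self, item):
    return min(self.table[i][c] for i, c in enumerate(self._cols(item)))

cms = CountMinSketch()
for song in ['song_a'] * 100 + ['song_b'] * 3:
  cms.increment(song)
print(cms.estimate('song_a'))   # ≈ 100; the sketch never underestimates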


Step 3: Using Statistics

Common Selectivity Patterns

Highly selective (< 1% of rows)

  • Primary key = value
  • Unique constraint
  • Rare value lookup

→ Use index

Moderately selective (1-10% of rows)

  • Common value
  • Date range (month)
  • Category filter

→ Consider bitmap scan

Weakly selective (> 10% of rows)

  • NOT NULL check
  • Broad range
  • Popular category

→ Sequential scan
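
These thresholds are rules of thumb rather than hard cutoffs; a real optimizer costs each alternative against table size, clustering, and available indexes. As a sketch:

def pick_access_method(selectivity):
  """Rule-of-thumb mapping from estimated selectivity to access method."""
  if selectivity < 0.01:
    return 'index scan'         # few rows: random I/O pays off
  if selectivity <= 0.10:
    return 'bitmap index scan'  # collect matches, then read pages in order
  return 'sequential scan'      # touching most of the table anyway: stream it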


Key Takeaways

The Complete Statistics Pipeline

1. Gather

Sample or scan data
Balance cost vs accuracy

2. Store

Build histograms & sketches
Compress intelligently

3. Use

Estimate selectivity
Choose optimal plans

4. Update

Monitor staleness
Re-analyze when needed

Without statistics, even the best optimizer is just guessing.
With good statistics, it can choose plans that are 10-1000x faster!