Selectivity & Statistics: How Databases Estimate Data Flow
Concept. Selectivity is the fraction of rows a predicate keeps; the database's statistics (histograms, distinct-value counts) let the query planner estimate selectivity to choose the cheapest plan.
Intuition. WHERE user_id = 1 keeps a tiny sliver of Listens, one user's rows out of billions, far under 1%; WHERE rating IS NOT NULL keeps ~95%. The optimizer needs to know which is which: a highly selective predicate should use the index, a near-everything predicate should scan. Statistics tell it which.
Core Concepts: Selectivity & Cardinality
| Selectivity | Cardinality | |
|---|---|---|
| Definition | Fraction passing the filter (0.0 – 1.0) | Number of rows in the output |
| Examples | user_id = 123 → 0.0001genre = 'Rock' → 0.30rating IS NOT NULL → 0.95 |
Output = Input × Selectivitye.g. 1M × 0.01 = 10K rows |
| Combining | AND: S₁ × S₂OR: S₁ + S₂ − (S₁ × S₂) |
FK join: ≈ \|R\| (R = FK side)M:N join: \|R\|×\|S\| / max(V_R, V_S) |
Why it matters: bad estimates → 10–1000× slower plans. Selectivity drives algorithm choice (hash vs. nested loop), access method (index vs. scan), buffer sizing, and parallelism.
Step 1: Gathering Statistics
The pipeline: sample / scan → build structures → store & use.
Sampling Techniques
Figure 1. Reservoir sampling is the default when you need an unbiased uniform sample in one streaming pass over a table too big to shuffle. Keep the first k rows; for each later row i, swap it into a random slot with probability k/i, otherwise discard it. The result is exact: every one of the N rows is in the final sample with equal probability k/N, using only O(k) memory and no second pass. Two cheaper alternatives trade exactness for speed: block sampling reads whole pages (fast I/O, but biased when rows are clustered on disk), and progressive sampling starts small and grows until the estimate's variance is low enough (cheap when the data is easy, more passes when it is not).
Step 2: Storing Statistics
After gathering, build compressed representations.
Table-level vs. column-level stats
| Scope | What's stored | Stored where |
|---|---|---|
| Table | row_count, page_count, avg_row_len, last_analyzed, staleness |
PostgreSQL: pg_class · MySQL: innodb_table_stats · BigQuery: INFORMATION_SCHEMA |
| Column | null_fraction, distinct_count, most-common values + frequencies, histogram bounds |
PostgreSQL: pg_statistic |
Column-level stats enable range estimates, equality selectivity, NULL handling, and join cardinality.
Key trade-off. We can't store every value, so we use compressed structures that trade space for accuracy. Histograms model distributions; sketches answer specific queries (membership, distinct count, frequency).
Histograms: Modeling Value Distribution
Figure 2. An equi-width histogram splits the value range into equal slices; it is trivial to build but on skewed data one bucket swallows almost all the rows, so any selectivity read off it is wildly wrong. A height-balanced histogram instead gives every bucket the same row count (each holds 1/B of the rows), so bucket ranges automatically shrink where data is dense, which keeps estimates accurate exactly where skew would otherwise hurt. A third kind, the wavelet histogram, stores a multi-resolution summary for complex or multi-dimensional distributions.
Probabilistic sketches
| Sketch | Answers | Space | Error | Use case |
|---|---|---|---|---|
| Bloom Filter | "Might item x be in the set?" |
~10 bits/item for 1% FP | 1% false positive | Skip partitions in distributed query |
| HyperLogLog | COUNT(DISTINCT x) |
~1.5 KB total | ±1% | Unique visitors over billions |
| Count-Min Sketch | "How often did item x appear?" |
~10 KB for 0.1% error | ε with prob δ | Top-K / heavy hitters |
# Bloom filter (space O(m) bits)
def INSERT(x):
for h IN hash_funcs:
bits[h(x)] = 1
def contains(x):
return ALL(bits[h(x)] for h IN hash_funcs)
# HyperLogLog (space ~1.5KB total)
def add(value):
h = hash(value)
bucket = h & (m - 1)
zeros = leading_zeros(h)
buckets[bucket] = max(buckets[bucket], zeros)
# Count-Min Sketch (space O(w×d))
def increment(item):
for i, h in enumerate(hash_funcs):
table[i][h(item)] += 1
def estimate(item):
return min(table[i][h(item)] for i, h in enumerate(hash_funcs))
Step 3: Using Statistics for Selectivity Patterns
| Selectivity | Examples | Recommended access |
|---|---|---|
| Low (< 1%) | Primary key lookup · unique constraint · rare value | Index |
| Medium (1 – 10%) | Common value · single-month date range · category filter | Bitmap scan |
| High (> 10%) | IS NOT NULL · broad date range · popular category |
Sequential scan |
One thing the three steps above do not fix: statistics go stale as the table changes. When they drift far enough, ANALYZE (or autovacuum) re-gathers them; until then every estimate above is computed against the old shape of the data.
Next: IO Cost Summary puts these selectivity estimates to work, comparing two query plans by hand using the sort, hash, and join cost formulas from Module 3 nanoDB: IO Algorithms.