Problem Solving: Saving $550,000 on Queries
The Bill
Goal: Understand how Indexing, Columnar Storage, and Compression multiply to solve a "Query Disaster."
You review your AWS bill. One query, a weekly report on "Trending Genres," costs Spotify $1,500 every single day. Each run takes 9 hours, and it runs 100 times daily.
Total cost? $550,000 per year.
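A quick sanity check on that annual figure, as a minimal Python sketch (the numbers come straight from the bill above):

```python
# $1,500 per day for the trending-genres report, every day of the year.
daily_cost = 1_500
print(f"${daily_cost * 365:,} per year")  # $547,500, i.e. roughly $550,000
```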
System Archeology: The Schema
To fix the disaster, let's understand the data's physical footprint.
| Component | Stats | Footprint (Standard Pages) |
|---|---|---|
| Songs Table | ~50M songs | 400 pages (25 GB) |
| Listens Table | ~10B total listens | 16,000 pages (1 TB) |
| Weekly Scope | Oct 21-27 Filter | 150 pages (~1% of data) |
| Memory (B) | Buffer Pool | 20 pages (1.28 GB) |
We use a standard page size of 64 MB for all calculations in this study.
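To see where the page counts come from, here is a minimal sketch assuming the 64 MB page size above; the table rounds, so the figures agree only approximately:

```python
# Back-of-the-envelope conversion between raw sizes and 64 MB pages.
PAGE_MB = 64

songs_pages   = 25 * 1024 // PAGE_MB    # ~25 GB -> 400 pages
listens_pages = 1024 * 1024 // PAGE_MB  # ~1 TB  -> 16,384 pages (table rounds to 16,000)
buffer_gb     = 20 * PAGE_MB / 1000     # 20 buffer pages -> 1.28 GB of memory
print(songs_pages, listens_pages, buffer_gb)  # 400 16384 1.28
```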
Problem 0: The Baseline Investigation
Problem: The weekly trending report is taking 9 hours. Before optimizing, we need to calculate the current "Baseline" cost. What is the IO cost of a Hash Partition Join (HPJ) on these two tables? Should the optimizer choose a Block Nested Loop Join (BNLJ) instead?
Show Solution
Disaster Option 1: Block Nested Loop Join (BNLJ)
If the query optimizer chose BNLJ, the results would be catastrophic.
- Formula: M + (M / B) × N
- Math: 400 + (400 / 20) × 16,000 = 320,400 IOs
- The Why: BNLJ forces the database to rescan the 1 TB table for every buffer-sized block of the 25 GB table. It is the absolute "Disaster" scenario for big data.
Disaster Option 2: Hash Partition Join (HPJ)
HPJ is much better but still flawed for this use case.
- Formula: 3 × (M + N)
- Math: 3 × (400 + 16,000) = 49,200 IOs
- The Why: HPJ is efficient for joining two large tables, but here we only need 1% of the data. Scanning the entire 1 TB to find that 1% is a massive waste. (Both join costs are worked through in the sketch below.)
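A minimal Python sketch of both baseline costs, using the page counts from the schema table (the variable names M, N, B follow the formulas above):

```python
# Problem 0: IO cost of the two candidate join plans.
M = 400     # Songs pages (25 GB)
N = 16_000  # Listens pages (1 TB)
B = 20      # buffer pool pages

# BNLJ: read Songs once, then rescan Listens for every buffer-full of Songs.
bnlj_io = M + (M // B) * N
# HPJ: read, partition, and re-read both tables (~3 passes over each).
hpj_io = 3 * (M + N)

print(f"BNLJ: {bnlj_io:,} IOs")  # 320,400 IOs
print(f"HPJ:  {hpj_io:,} IOs")   # 49,200 IOs
```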
Problem 1: The Index Win
Problem: We know that only ~150 pages out of the 16,000 contain the data for Oct 21-27. How do we skip the other pages, and what is the new IO cost?
Show Solution
- Logic: Create a B+ Tree Index on Listens(listen_time).
- Design: The index allows a targeted range scan rather than a full join or full table scan.
- IO Math (worked through in the sketch below):
  - Scan Songs: 400 IOs
  - Index Lookup: ~4 IOs
  - Fetch filtered Listens: 150 IOs
  - Total: ~554 IOs
- The Why: Indexes prune the search space. By jumping directly to the start date and reading only the 1% target data, we achieve a ~90x improvement vs. plain HPJ.
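The same arithmetic for the indexed plan, as a small sketch (the ~4-IO lookup is the assumed height of the B+ tree):

```python
# Problem 1: cost of the index-driven plan.
songs_scan    = 400  # still scan the full Songs table
index_lookup  = 4    # root-to-leaf traversal of the B+ tree on listen_time
listens_fetch = 150  # only the pages covering Oct 21-27 (~1% of 16,000)

index_io = songs_scan + index_lookup + listens_fetch
print(f"Indexed plan: ~{index_io} IOs")              # ~554 IOs
print(f"Speedup vs HPJ: ~{49_200 / index_io:.0f}x")  # ~89x, i.e. roughly 90x
```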
Problem 2: The Column Store Win
Problem: Even with an index, row-based storage loads 150 bytes per listen just to read the 24 bytes we need (song_id, user_id, time, duration). How do we eliminate this "column waste," and what is the estimated IO win?
Show Solution
- Logic: Switch to a Columnar Store (e.g., BigQuery, Parquet).
- Design: Physically store each field in its own set of pages. This lets the engine skip the IP addresses and device IDs entirely.
- IO Math (see the sketch below):
  - We effectively read only 1/6th of the row width (24 of ~150 bytes).
  - Total: ~60 IOs (Listens: 25 IOs + Songs: 35 IOs)
- The Why: Columnar storage is optimized for aggregation. By reading only the columns named in the SELECT and WHERE clauses, we cut IO by another ~9x.
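A sketch of the columnar arithmetic (the 1/6 ratio and the 35-page Songs figure are taken from the write-up above):

```python
# Problem 2: read only the ~24 needed bytes of each ~150-byte row (about 1/6).
listens_io = 150 // 6   # 150 filtered Listens pages -> ~25 pages of needed columns
songs_io   = 35         # Songs shrinks similarly, per the write-up above

columnar_io = listens_io + songs_io
print(f"Columnar plan: ~{columnar_io} IOs")                        # ~60 IOs
print(f"Speedup vs indexed row store: ~{554 / columnar_io:.0f}x")  # ~9x
```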
Problem 3: The Final Compression
Problem: How can we reduce the footprint of the genre, rating, and user ID columns without losing data?
Show Solution
- Technique 1 (Dictionary Encoding): Map the 200 unique Genres to 1-byte IDs (16 pages -> 1 page).
- Technique 2 (Bit Packing): Since ratings are 1-5, store them in 4 bits instead of a 32-bit integer (7 pages -> 1 page).
- Technique 3 (RLE): Since the data is sorted, run-length encode the repeating User IDs (13 pages -> 5 pages).
- Final Victory: ~35 IOs | 20 Seconds | ~$400 / Year
- The Why: Domain-specific compression exploits what we know about each column (only 200 genres, a 1-5 rating range, sorted repeats) to shrink pages without losing any data. (All three techniques are sketched below.)
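A toy sketch of the three techniques on made-up values (none of this is Spotify's real schema; it simply shows that each encoding is lossless):

```python
# Toy columns: a handful of rows, sorted so the user IDs repeat in runs.
genres  = ["pop", "rock", "pop", "jazz", "pop", "rock"]
ratings = [5, 4, 5, 3, 5, 4]
users   = [7, 7, 7, 9, 9, 12]

# 1) Dictionary encoding: store each genre string once, then 1-byte IDs per row.
dictionary = {g: i for i, g in enumerate(sorted(set(genres)))}
genre_ids  = [dictionary[g] for g in genres]

# 2) Bit packing: ratings 1-5 fit in 4 bits, so two ratings share a single byte.
packed = bytes((a << 4) | b for a, b in zip(ratings[0::2], ratings[1::2]))

# 3) Run-length encoding: store (value, run_length) pairs for repeating user IDs.
rle = []
for u in users:
    if rle and rle[-1][0] == u:
        rle[-1] = (u, rle[-1][1] + 1)
    else:
        rle.append((u, 1))

print(genre_ids)      # [1, 2, 1, 0, 1, 2] with {'jazz': 0, 'pop': 1, 'rock': 2}
print(packed.hex())   # '545354' -- each byte holds two 4-bit ratings
print(rle)            # [(7, 3), (9, 2), (12, 1)]
```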
Summary: The Compound ROI
We optimized the $550,000 disaster down to roughly $400 per year by multiplying three discrete LEGO blocks:
| Stage | IO Cost | Annual Cost | Win Factor (vs. previous stage) |
|---|---|---|---|
| Baseline (HPJ) | ~50,000 IOs | $550,000 | 1x |
| + Index | ~550 IOs | $6,500 | 90x |
| + Columnar | ~60 IOs | $730 | 9x |
| + Compression | ~35 IOs | $365 | 2x |
Total Impact: a ~1,400x improvement (from ~50,000 IOs down to ~35).
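Finally, the compound math behind the summary table, as a sketch (the IO figures are copied from the table; the last ratio is what the text rounds to ~1,400x):

```python
# How the per-stage wins compound into the total improvement.
stages = {"Baseline (HPJ)": 50_000, "+ Index": 550, "+ Columnar": 60, "+ Compression": 35}

baseline = stages["Baseline (HPJ)"]
for name, io in stages.items():
    print(f"{name:15s} {io:>7,} IOs  ->  {baseline / io:>8,.1f}x vs baseline")
# The last line prints ~1,428.6x, which the write-up rounds to ~1,400x.
```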