Problem Solving: Saving $550,000 on Queries

Putting It All Together

Reading Time: 15 mins

The Bill

Goal: Understand how Indexing, Columnar Storage, and Compression multiply to solve a "Query Disaster."

You review your AWS bill. One query (a weekly report on "Trending Genres") costs Spotify $1,500 every single day. It takes 9 hours to run, and runs 100 times daily.

Total cost? $550,000 per year.
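
The annual figure is just that daily spend carried across a year; a quick Python check using the numbers above:

```python
# Annualize the daily spend on the "Trending Genres" report.
daily_cost = 1_500                 # USD per day, across all runs of the query
annual_cost = daily_cost * 365
print(f"${annual_cost:,} per year")   # $547,500, i.e. roughly $550,000
```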


System Archeology: The Schema

To fix the disaster, let's understand the data's physical footprint.

| Component | Stats | Disk Footprint (Standard Pages) |
| --- | --- | --- |
| Songs Table | ~50M songs | 400 pages (25 GB) |
| Listens Table | ~10B total listens | 16,000 pages (1 TB) |
| Weekly Scope | Oct 21-27 filter | 150 pages (~1% of data) |
| Memory (B) | Buffer pool | 20 pages (1.28 GB) |

We use a standard page size of 64 MB for all calculations in this study.
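
As a sanity check on these footprints, here is a minimal Python sketch converting the page counts above into raw sizes, assuming that 64 MB standard page:

```python
# Sanity-check the disk footprints above, assuming 64 MB standard pages.
PAGE_SIZE_MB = 64

tables = {
    "Songs":        400,      # pages
    "Listens":      16_000,   # pages
    "Weekly scope": 150,      # pages (~1% of Listens)
    "Buffer pool":  20,       # pages held in memory (B)
}

for name, pages in tables.items():
    size_gb = pages * PAGE_SIZE_MB / 1000   # decimal GB, matching the table
    print(f"{name:13s}: {pages:>6,} pages ≈ {size_gb:,.2f} GB")
# Songs ≈ 25.6 GB, Listens ≈ 1,024 GB (~1 TB), Buffer pool = 1.28 GB
```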


Problem 0: The Baseline Investigation

Problem: The weekly trending report is taking 9 hours. Before optimizing, we need to calculate the current "Baseline" cost. What is the IO cost of a Hash Partition Join (HPJ) on these two tables? Should the optimizer choose a Block Nested Loop Join (BNLJ) instead?

πŸ” Show Solution

Disaster Option 1: Block Nested Loop Join (BNLJ)

If the query optimizer chose BNLJ, results would be catastrophic.

  • Formula: M + (M / B) * N

  • Math: 400 + (400 / 20) * 16,000 = 320,400 IOs

  • The Why: BNLJ forces the database to scan the 1TB table for every block of the 25GB table. It is the absolute "Disaster" scenario for big data.

Disaster Option 2: Hash Partition Join (HPJ)

HPJ is much better but still flawed for this use case.

  • Formula: 3 * (M + N)

  • Math: 3 * (400 + 16,000) = 49,200 IOs

  • The Why: HPJ is efficient for joining two large tables, but here we only need 1% of the data. Scanning the entire 1 TB table to find that 1% is a massive waste. (Both estimates are reproduced in the sketch below.)
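
A minimal Python sketch of both baseline estimates, with M, N, and B taken from the schema table and the formulas stated above:

```python
# IO cost estimates for the two baseline join strategies.
M = 400      # pages in Songs (outer table)
N = 16_000   # pages in Listens (inner table)
B = 20       # buffer pages available

def bnlj_io(M, N, B):
    """Block Nested Loop Join: M + (M / B) * N."""
    return M + (M // B) * N

def hpj_io(M, N):
    """Hash Partition Join: 3 * (M + N) -- partition read, partition write, probe read."""
    return 3 * (M + N)

print(f"BNLJ: {bnlj_io(M, N, B):,} IOs")   # 320,400 IOs
print(f"HPJ : {hpj_io(M, N):,} IOs")       # 49,200 IOs
```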


Problem 1: The Index Win

Problem: We know that only ~150 pages out of the 16,000 contain the data for Oct 21-27. How do we skip the other pages, and what is the new IO cost?

πŸ” Show Solution
  • Logic: Create a B+ Tree Index on Listens(listen_time).

  • Design: The index lets us seek straight to the first Oct 21 entry and read only the matching pages, rather than joining or scanning the full Listens table.

  • IO Math:

    • Scan Songs: 400 IOs
    • Index Lookup: ~4 IOs
    • Fetch filtered Listens: 150 IOs
    • Total: ~554 IOs
  • The Why: Indexes prune the search space. By jumping directly to the start date and reading only the 1% target data, we achieve a 90x improvement vs plain HPJ.
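
A small sketch of the indexed-plan arithmetic, assuming a B+ tree descent of roughly 4 IOs as stated above:

```python
# IO cost with a B+ Tree index on Listens(listen_time).
songs_scan      = 400   # still scan Songs in full
index_traversal = 4     # root-to-leaf descent of the B+ tree (~height 4)
matching_pages  = 150   # Listens pages covering Oct 21-27 (~1% of 16,000)

total = songs_scan + index_traversal + matching_pages
print(f"Indexed plan: ~{total} IOs")               # ~554 IOs
print(f"Speedup vs HPJ: ~{49_200 / total:.0f}x")   # ~89x, i.e. the roughly 90x above
```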


Problem 2: The Column Store Win

Problem: Even with an index, row-based storage loads 150 bytes per listen just to read the 24 bytes we need (song_id, user_id, time, duration). How do we eliminate this "column waste," and what is the estimated IO win?

πŸ” Show Solution
  • Logic: Switch to a Columnar Store (e.g., BigQuery, Parquet).

  • Design: Physically store each field in its own set of pages. The query can then skip the pages holding IP addresses, device IDs, and every other column it never touches.

  • IO Math:

    • We effectively read only 1/6th of the row width, so the 150 filtered Listens pages shrink to ~25.
    • Songs shrinks the same way once only its needed columns are read.
    • Total: ~60 IOs (Listens: 25 IOs + Songs: 35 IOs)
  • The Why: Columnar storage is optimized for aggregation. By only reading the columns mentioned in the SELECT and WHERE clauses, we reduce IO by another 9x.
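
A minimal sketch of the column-pruning arithmetic; the ~1/6 width fraction comes from needing 24 of the 150 bytes per row, and the 35-IO Songs figure assumes its unused columns are skipped the same way:

```python
# IO cost after switching to columnar storage.
bytes_needed  = 24    # song_id, user_id, time, duration
bytes_per_row = 150   # full row width in the row store

listens_pages_row = 150                    # indexed row-store pages for Oct 21-27
fraction_read     = 1 / 6                  # ≈ 24 needed bytes / 150-byte rows
listens_pages_col = round(listens_pages_row * fraction_read)   # 25 pages

songs_pages_col = 35   # Songs, reduced to just the columns the report needs

total = listens_pages_col + songs_pages_col
print(f"Columnar plan: ~{total} IOs")   # ~60 IOs, another ~9x over the indexed plan
```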


Problem 3: The Final Compression

Problem: How can we reduce the footprint of the genre and rating columns without losing data?

πŸ” Show Solution
  • Technique 1 (Dictionary): Map the 200 unique Genres to 1-byte IDs (16 pages -> 1 page).

  • Technique 2 (Bit Packing): Since ratings are 1-5, store them in 4 bits instead of a 32-bit integer (7 pages -> 1 page).

  • Technique 3 (RLE): Since data is sorted, compress repeating User IDs (13 pages -> 5 pages).

  • Final Victory: ~35 IOs | 20 Seconds | $400 / Year

  • The Why: Domain-specific compression exploits the narrow value ranges and heavy repetition in each column, shrinking the remaining pages without losing any data (see the toy sketch below).
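
None of these encodings is exotic. A toy Python sketch of all three, applied to made-up sample values rather than the real columns:

```python
# 1) Dictionary encoding: ~200 distinct genre strings -> 1-byte IDs.
genres = ["rock", "pop", "rock", "jazz", "pop", "rock"]
dictionary = {g: i for i, g in enumerate(sorted(set(genres)))}   # genre -> small int
encoded_genres = bytes(dictionary[g] for g in genres)            # 1 byte per value

# 2) Bit packing: ratings 1-5 fit in 4 bits, so two ratings share one byte.
ratings = [5, 3, 4, 1, 2, 5]
packed = bytearray()
for i in range(0, len(ratings), 2):
    hi = ratings[i] << 4
    lo = ratings[i + 1] if i + 1 < len(ratings) else 0
    packed.append(hi | lo)

# 3) Run-length encoding: sorted user_ids repeat, so store (value, run_length).
user_ids = [7, 7, 7, 9, 9, 12, 12, 12, 12]
rle = []
for uid in user_ids:
    if rle and rle[-1][0] == uid:
        rle[-1][1] += 1
    else:
        rle.append([uid, 1])

print(encoded_genres)   # b'\x02\x01\x02\x00\x01\x02' (jazz=0, pop=1, rock=2)
print(bytes(packed))    # three bytes: 0x53, 0x41, 0x25
print(rle)              # [[7, 3], [9, 2], [12, 4]]
```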


Summary: The Compound ROI

We optimized the $550,000 disaster to $400 by multiplying three discrete LEGO blocks:

| Stage | IO Cost | Annual Cost | Win Factor |
| --- | --- | --- | --- |
| Baseline (HPJ) | 50,000 IOs | $550,000 | 1x |
| + Index | 550 IOs | $6,500 | 90x |
| + Columnar | 60 IOs | $730 | 9x |
| + Compression | 35 IOs | $365 | 2x |

Total Impact: 1,400x improvement.
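
The compounding is plain multiplication; a quick sketch using the rounded IO figures from the table:

```python
# Compound improvement, using the rounded IO figures from the summary table.
io_by_stage = [
    ("Baseline (HPJ)", 50_000),
    ("+ Index",        550),
    ("+ Columnar",     60),
    ("+ Compression",  35),
]

for (_, prev_io), (name, io) in zip(io_by_stage, io_by_stage[1:]):
    print(f"{name:15s}: {prev_io / io:>5.1f}x fewer IOs than the previous stage")

baseline, final = io_by_stage[0][1], io_by_stage[-1][1]
print(f"End to end     : ~{baseline / final:.0f}x fewer IOs")   # ~1,429x, the "1,400x" above
```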