Query Optimizer: Building Execution from Lego Blocks

Concept. The query optimizer transforms a SQL query into an executable plan by enumerating possible join orders and access paths, estimating each one's IO cost, and picking the cheapest.

Intuition. For a 3-table join (Users, Listens, Songs) there are six possible join orders and dozens of access paths (full scan vs index scan, hash join vs nested loop). The optimizer asks "what does each cost?" using statistics, then runs the winning plan. A 100× speedup over the naive ordering is normal.

The Magic Behind Every SQL Query

Your SQL isn't just text; it's a trigger for a complex optimizer that evaluates millions of execution strategies in milliseconds. It's like a chess grandmaster, considering countless moves before making the best one.

SQL Text

What you write

1000s of Plans

What optimizer considers

Best Plan

What executes

10-100x Faster

vs naive execution


The Journey: SQL to Execution

Query Optimizer: From SQL to Execution Plan SQL Query SELECT a.name, COUNT(*) FROM artists a JOIN songs s ON ... JOIN listens l ON ... WHERE l.timestamp > '2024' Parse Parse Tree SELECT JOIN GROUP orders customers Logical Plan Logical Plan π (PROJECT) γ (GROUP BY) ⋈ (JOIN) Optimize Physical Plan Options Option 1: Hash Join + Seq Scan Option 2: Sort-Merge + Index Option 3: Nested Loop + B+Tree Cost: 523 Cost: 412 Cost: 892 ✓ Cheapest! Algorithm Lego Blocks The optimizer combines these building blocks to create execution plans 1. Access Methods Sequential Scan Read all pages Good for >10% rows Index Scan B+Tree lookup Good for <1% rows Bitmap Scan Multiple indexes Good for 1-10% rows 2. Break Into Small Problems Hash Partition Big Sort 3. Aggregation Methods Hash Aggregate Hash table for groups Sort Aggregate Sort then group 4. Join Algorithms Hash Join Build hash table O(N+M) with memory Sort-Merge Join Sort both inputs O(N log N + M log M) Nested Loop Join For each row, scan O(N×M) but simple Cost Model: How the Optimizer Decides Cost Formula Cost = CPU_cost + IO_cost × 1000 IO is 1000x slower than CPU Random IO = 10ms, Sequential = 0.1ms Table Statistics • Rows: 1,000,000 • Pages: 10,000 • Distinct values: 50,000 • Histogram: value distribution Example: Join 1M × 10K rows • Hash Join: 1M + 10K = 1.01M IOs • Sort-Merge: 2M log 1M + 20K log 10K = 42M IOs • Nested Loop: 1M × 10K = 10B IOs → Optimizer picks Hash Join! Final Plan: Hash Join + Sequential Scan + Hash Aggregate Estimated: 412 cost units | Actual: 47ms | 10x faster than naive nested loops

Query Parsing: SQL → Tree

Your SQL query starts its life as a parse tree, a structured representation that sets the stage for optimization.

SELECT * FROM users WHERE age > 21

Becomes:

        SELECT
           |
        PROJECT(*)
           |
        FILTER(age > 21)
           |
        SCAN(users)

Logical Planning: Tree → Algebra

The parse tree morphs into relational algebra operations:

  • Selection (σ): Filters rows

  • Projection (π): Chooses columns

  • Join (⋈): Merges tables

  • Aggregation (γ): Groups data

Optimization rules kick in:

  • Push filters down to process them early

  • Cut out unnecessary projections

  • Simplify expressions


The Search Space Problem

3 tables to join = 3! = 6 possible join orders
4 tables = 4! = 24 orders
5 tables = 5! = 120 orders
10 tables = 10! = 3,628,800 orders!

Solution: Use dynamic programming to discard inefficient plans early.


Key Takeaways

  1. SQL is declarative - You define the result, the optimizer handles the method.

  2. Algorithms are lego blocks - Combine them for optimal performance.

  3. Cost-based decisions - Statistics guide which algorithms to use.

  4. Exponential search space - Efficient pruning is crucial.

  5. 10-100x performance difference - Between well-optimized and poor plans.

The query optimizer is your silent partner, transforming simple SQL into lightning-fast execution.