SQL to Parallel Execution: The 47ms Journey

Same 10 algorithms, different scales

Whether it's 5GB in SQLite or 5TB in BigQuery - it's still hash joins and sort-merge under the hood

The Challenge

Process 5.5TB of music streaming data to find the top 10 most-played songs since 2024.

The Query

SELECT s.title, s.artist, COUNT(*) AS plays
FROM songs s 
JOIN listens l ON s.song_id = l.song_id
WHERE l.timestamp > '2024-01-01'
GROUP BY s.title, s.artist
ORDER BY plays DESC
LIMIT 10;

The Journey: From Text to Results in 47ms

[Figure: the 47ms journey from SQL text to results. Timeline: Parse (0-5ms), Build Plan (5-12ms), Distribute (12-15ms), Parallel Execution (15-45ms), Result Assembly (45-47ms). The execution tree reads from bottom to top: scan Songs and Listens, join matching rows, filter to 2024+, count plays per song, sort by plays, take the top 10. Work is split into hash partitions across 100 machines running in parallel. Example output: 1. Anti-Hero 2.3M, 2. Flowers 2.1M, 3. Unholy 1.9M, 4. As It Was 1.8M, 5. Boy's a Liar 1.7M, 6. Lavender 1.6M, 7. Kill Bill 1.5M, 8. Creepin' 1.4M, 9. Calm Down 1.3M, 10. Sure Thing 1.2M. Scale: 5.5TB of data, 55 billion records, 100 machines × 8 cores = 800 parallel workers.]

What You're Seeing

Phase 1: Parsing (0-5ms)

The SQL text is tokenized and parsed into an Abstract Syntax Tree (AST). The query optimizer analyzes the AST to determine the most efficient execution strategy based on statistics about table sizes, indexes, and data distribution.
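
To make the first step concrete, here is a toy tokenizer in Python. It is a sketch only: the regular expression and token shapes are invented for illustration and are not how a production parser actually works.

import re

# Toy tokenizer sketch: split the query text into tokens, the step that
# precedes building an Abstract Syntax Tree. (Illustrative only; real SQL
# parsers are generated from a full grammar.)
SQL = """SELECT s.title, s.artist, COUNT(*) AS plays
FROM songs s
JOIN listens l ON s.song_id = l.song_id
WHERE l.timestamp > '2024-01-01'
GROUP BY s.title, s.artist
ORDER BY plays DESC
LIMIT 10;"""

TOKEN = re.compile(r"'[^']*'|\w+(?:\.\w+)?|\*|[(),;<>=]")
tokens = TOKEN.findall(SQL)
print(tokens[:8])  # ['SELECT', 's.title', ',', 's.artist', ',', 'COUNT', '(', '*']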

Phase 2: Tree Building (5-12ms)

The logical plan is transformed into a physical execution tree. Each node represents a specific algorithmic operation, and the tree structure determines the order of operations and data flow. Notice how LIMIT sits at the top: once the counts are computed, the sort only ever has to keep the 10 best rows rather than materialize a full ordering.
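
Here is a minimal sketch of what that tree might look like for this query, with LIMIT at the root and the table scans at the leaves. The operator names are illustrative, not any particular engine's.

# Minimal sketch of the physical plan for the query above.
# Operator names are illustrative, not a specific engine's API.
plan = ("Limit", 10,
         ("Sort", "plays DESC",
           ("HashAggregate", "s.title, s.artist",
             ("Filter", "l.timestamp > '2024-01-01'",
               ("HashJoin", "s.song_id = l.song_id",
                 ("Scan", "songs"),
                 ("Scan", "listens"))))))

def show(node, depth=0):
    # Print the tree root-first; data flows bottom-up from the Scan leaves.
    op, detail, *children = node
    print("  " * depth + f"{op}: {detail}")
    for child in children:
        show(child, depth + 1)

show(plan)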

Phase 3: Distribution (12-15ms)

The execution tree is decomposed into tasks that can run in parallel. Hash partitioning ensures that related data (same song_id for joins, same grouping keys) ends up on the same machine. This minimizes network traffic during execution.
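
A sketch of the partitioning rule follows. The function name is made up, and Python's built-in hash stands in for the stable hash functions real systems use.

NUM_MACHINES = 100  # cluster size from the example above

def machine_for(song_id: int) -> int:
    # The same song_id always maps to the same machine, so both the join
    # and the per-song counts can run locally without extra shuffling.
    return hash(song_id) % NUM_MACHINES

assert machine_for(42) == machine_for(42)   # deterministic routing
print(machine_for(42))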

Phase 4: Parallel Execution (15-45ms)

100 machines work simultaneously, each processing its own partition of the data.

Each machine processes ~55GB of data using the same algorithms we'll learn in CS145, just distributed across multiple nodes.
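
As a per-machine sketch (toy rows and a made-up worker_top10 function, not a real executor): each node filters its partition, counts plays per song, and keeps only a local top 10.

import heapq
from collections import Counter

def worker_top10(listen_rows, cutoff="2024-01-01"):
    # Each worker sees only its own partition of listens (~55GB in the example).
    counts = Counter()
    for song_id, timestamp in listen_rows:
        if timestamp > cutoff:       # WHERE l.timestamp > '2024-01-01'
            counts[song_id] += 1     # GROUP BY ... COUNT(*)
    return heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])

# Tiny stand-in partition; a real worker streams billions of rows.
local_rows = [(1, "2024-03-05"), (1, "2024-03-06"), (2, "2023-12-31")]
print(worker_top10(local_rows))      # [(1, 2)]: song 2 is filtered out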

Phase 5: Result Assembly (45-47ms)

Partial results from all machines converge to a coordinator node. The coordinator performs a final merge-sort to identify the global top 10 songs from the local top results of each machine.
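
A coordinator sketch, with the per-machine outputs invented from the figures in the diagram above. Because rows were hash-partitioned by song, each song's complete count arrives from exactly one machine, so the coordinator only has to pick the 10 largest of at most 100 × 10 = 1,000 candidates.

import heapq

# Each machine ships only its local top 10; the coordinator merges them.
local_results = [
    [("Anti-Hero", 2_300_000), ("Flowers", 2_100_000)],    # machine 1
    [("Unholy", 1_900_000), ("As It Was", 1_800_000)],     # machine 2
    # ... 98 more small lists
]

candidates = [row for top in local_results for row in top]
global_top10 = heapq.nlargest(10, candidates, key=lambda kv: kv[1])
print(global_top10)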


Scale Reference

5.5TB   Total Data Processed
100     Parallel Machines
800     CPU Cores Active
47ms    Total Execution Time
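
As a quick back-of-envelope check on these figures (round numbers assumed):

# Back-of-envelope check on the scale figures above.
total_bytes = 5.5e12           # 5.5 TB
machines = 100
cores_per_machine = 8

per_machine_gb = total_bytes / machines / 1e9   # ~55 GB per machine
workers = machines * cores_per_machine          # 800 parallel workers
print(f"{per_machine_gb:.0f} GB per machine, {workers} workers")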

Comparison: Single Machine vs Distributed

Aspect            Single Machine    100 Machines
Data Size         55GB              5.5TB
Execution Time    ~4.7 seconds      47ms
Parallelism       8 cores           800 cores
Bottleneck        CPU/Memory        Coordination
Complexity        Simple            Distributed protocols

Real-World Applications

This same pattern powers everything from a quick query in SQLite to warehouse-scale analytics: the journey from SELECT * FROM songs to processing petabytes at Google or Meta builds on the same foundations.