Distributed Query Planning

Scaling Queries Across Multiple Machines

When data grows beyond one machine, queries must be fragmented, distributed, and reassembled efficiently.

100 TB: too big for one node
1,000 nodes: parallel execution
Shuffle: data movement between nodes
100x faster: versus single-node execution

What is "Shuffle"?

Shuffle is moving data between nodes to ensure related records end up on the same machine for processing. It's essentially hash partitioning over the network.

Local Hash Partitioning

hash(user_id) % 100 → Local file 7
All user_42 data → Same local file
Process file 7 in RAM

Distributed Shuffle

hash(user_id) % 100 → Network → Node 7  
All user_42 data → Same remote node
Process on Node 7

Key Point: Same algorithm, different destination (local disk vs remote node)

Connection to Hash Partitioning: Shuffle is the network version of the hash partitioning algorithm we learned earlier. Instead of writing hash(key) % B to local files, we send hash(key) % nodes over the network. Same concept, distributed execution!

Review: the Hash Partitioning section shows the local version of this algorithm.
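To make the parallel concrete, here is a minimal Python sketch, with assumed bucket and node counts, that routes the same records both ways. Only the destination differs; the hash routing is identical.

```python
from collections import defaultdict

NUM_BUCKETS = 100   # local hash partitioning: number of spill files
NUM_NODES = 100     # distributed shuffle: number of nodes (assumed)

records = [
    {"user_id": 42, "song_id": 7},
    {"user_id": 42, "song_id": 9},
    {"user_id": 17, "song_id": 3},
]

# Local hash partitioning: hash(key) % B picks a local file (bucket).
local_buckets = defaultdict(list)
for rec in records:
    bucket = hash(rec["user_id"]) % NUM_BUCKETS
    local_buckets[bucket].append(rec)        # "write to local file <bucket>"

# Distributed shuffle: the same hash picks a remote node instead.
node_queues = defaultdict(list)
for rec in records:
    node = hash(rec["user_id"]) % NUM_NODES
    node_queues[node].append(rec)            # "send over the network to node <node>"

# All of user 42's records land in exactly one bucket and on exactly one node.
user_42_buckets = {b for b, rs in local_buckets.items() if any(r["user_id"] == 42 for r in rs)}
user_42_nodes = {n for n, rs in node_queues.items() if any(r["user_id"] == 42 for r in rs)}
assert len(user_42_buckets) == 1 and len(user_42_nodes) == 1
```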


The Journey: One Query, Many Nodes

Figure: Distributed query execution for the original request "Top Artists by Play Count" (Join + Filter + Group By + Sort), fragmented and parallelized over data distributed across Node 0 through Node 999.

Phase 1: Local Execution (Partial Aggregation). Every node (Node 0, Node 1, ..., Node 999) runs the same steps on its own slice: 1. Filter listens (today), 2. Join song metadata, 3. Count plays per artist.

Phase 2: Network Shuffle. Redistribute the partial results by artist ID; the goal is that all "Taylor Swift" counts end up on Node 7.

Phase 3: Global Aggregation. Coordinator nodes merge the partial counts: Taylor Swift sum 3.5M plays, Drake sum 3.2M plays, Bad Bunny sum 2.9M plays.

Final step: Sort and deliver, producing the Top 10 chart.
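For reference, the logical query behind this figure could be written as the PySpark sketch below. The table paths and column names are assumptions; the point is that the user writes one query, and the engine's planner fragments it into the phases shown above.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("top-artists").getOrCreate()

listens = spark.read.parquet("s3://warehouse/listens/")   # assumed path
songs = spark.read.parquet("s3://warehouse/songs/")       # assumed path

top_artists = (
    listens
    .filter(F.col("listen_date") == F.current_date())     # Phase 1: local filter
    .join(songs, "song_id")                                # join song metadata
    .groupBy("artist_id", "artist_name")                   # partial counts per node,
    .agg(F.count("*").alias("plays"))                      # then shuffle + global merge
    .orderBy(F.desc("plays"))
    .limit(10)                                             # final sort and top 10
)

top_artists.show()
```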

Core Concepts of Distributed Planning

Fragmentation

The optimizer splits the logical plan into independent fragments. Each node executes the same code but on its own slice of data.
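A toy illustration of this idea in plain Python, with invented node slices and a single filter-and-project fragment: the planner ships one function, and every node runs it over its own rows.

```python
# One fragment of the plan (filter + project), expressed as an ordinary function.
def execute_fragment(local_rows):
    """Runs identically on every node, but only over that node's slice of data."""
    return [
        {"artist_id": r["artist_id"], "plays": 1}
        for r in local_rows
        if r["listen_date"] == "2024-06-01"        # assumed filter predicate
    ]

# Simulated cluster: each entry is one node's local slice of the listens table.
node_slices = [
    [{"artist_id": 1, "listen_date": "2024-06-01"}],
    [{"artist_id": 2, "listen_date": "2024-05-31"},
     {"artist_id": 1, "listen_date": "2024-06-01"}],
]

# Same code, different data: the coordinator simply collects per-node outputs.
partial_results = [execute_fragment(rows) for rows in node_slices]
print(partial_results)
```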


The Shuffle: Moving Data Efficiently

Shuffle is the network version of a hash partition. When two records need to be processed together (e.g., for a Join or a Group By) but live on different nodes, we must move them.

Shuffle Strategies

| Strategy | Mechanism | Best For | Network Cost |
|---|---|---|---|
| Broadcast Shuffle | Send a copy of the entire table to every node. | Joining a tiny table with a massive one. | Small Table Size × # Nodes |
| Hash Shuffle | Redistribute both tables by a common hash key. | Joining two large tables. | Total Size of Both Tables |
| Co-located (Zero Shuffle) | No movement; data is already partitioned correctly. | Tables pre-partitioned on the same key. | Zero Network I/O |
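These cost formulas are simple enough to encode directly. The following toy chooser (sizes and the decision rule are illustrative, not taken from any real engine) compares them:

```python
def choose_shuffle_strategy(left_bytes, right_bytes, num_nodes, copartitioned=False):
    """Pick a join shuffle strategy from the estimated network cost of each option."""
    if copartitioned:
        return "co-located", 0                     # data already lives where it is needed

    small, large = sorted((left_bytes, right_bytes))
    broadcast_cost = small * num_nodes             # ship the small table to every node
    hash_cost = left_bytes + right_bytes           # repartition both tables once

    if broadcast_cost < hash_cost:
        return "broadcast", broadcast_cost
    return "hash", hash_cost


# 10 MB dimension table joined with a 1 TB fact table on 1,000 nodes: broadcast wins.
print(choose_shuffle_strategy(10 * 2**20, 2**40, 1000))
# Two 1 TB fact tables: hash shuffle wins.
print(choose_shuffle_strategy(2**40, 2**40, 1000))
```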

Reaggregation Patterns: The Two-Phase Execution

Problem: Attempting to group data globally across 1,000 nodes would overwhelm the network with billions of tiny records.

Phase 1: Local Pre-Aggregation

Each node performs the grouping and aggregation on its local data first.

Phase 2: Global Merge

The partial results are shuffled by the grouping key (e.g., artist_id) so that each coordinator node can merge the counts for the keys it owns.
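Here is a compact Python simulation of the two phases; the three-node cluster and the play data are invented for illustration.

```python
from collections import Counter, defaultdict

NUM_MERGE_NODES = 3  # nodes that own the global merge, chosen by hash(artist_id)

# Each inner list is one node's local slice of today's listens (one artist_id per play).
node_slices = [
    ["taylor_swift", "drake", "taylor_swift"],
    ["bad_bunny", "taylor_swift"],
    ["drake", "drake", "bad_bunny"],
]

# Phase 1: local pre-aggregation. Each node counts plays over its own slice, so it
# sends one (artist, count) pair per artist instead of one record per play.
partials = [Counter(slice_) for slice_ in node_slices]

# Phase 2: shuffle the partials by artist_id, then merge on the owning node.
merge_inputs = defaultdict(Counter)
for partial in partials:
    for artist, count in partial.items():
        owner = hash(artist) % NUM_MERGE_NODES     # same routing rule as any shuffle
        merge_inputs[owner][artist] += count       # global merge: just add the partials

totals = sum(merge_inputs.values(), Counter())
print(totals.most_common())   # e.g. [('taylor_swift', 3), ('drake', 3), ('bad_bunny', 2)]
```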


Optional: Real Systems Comparison

| System | Architecture | Key Optimization | Shuffle Strategy |
|---|---|---|---|
| Apache Spark | In-memory RDDs | Catalyst optimizer + code gen | Hash & broadcast joins |
| BigQuery (Dremel) | Columnar tree | Hierarchical aggregation | Minimize data movement |
| Snowflake | Storage/compute separation | Micro-partitions + caching | Automatic clustering |
| Presto/Trino | Pipelined execution | Dynamic filtering | Statistics-driven |