IO Algorithms: Cost Analysis & Intuition

Concept. The IO cost of every algorithm in this module (BigSort, HashPartition, BNLJ, SMJ, HPJ) can be expressed as a function of table size, buffer size, and the read/write cost ratio.

Intuition. Different algorithms win in different regimes. Small buffer favors HPJ; large buffer favors BNLJ; pre-sorted input favors SMJ. This page pulls all five formulas together so you can compare them side by side.

CS145 Equations Sheet

Real-World Examples: Choosing the Right Join Algorithm

Assume C_r = C_w = 1. For the JOINs below, we always add OUT (for the size of the output result). Always use the Equation Sheet for the full formulae.

Test Setup: B = 100 pages (≈ 6.4 GB RAM), Page Size = 64 MB

These examples use realistic table sizes from actual database workloads. Costs shown are simplified for clarity; real optimizer costs include additional factors.

Scenario Overview

Example	Description	Table R	Table S	Special Conditions	Key Insight
Ex1	Small Tables	Songs 10 GB	Listens 10 GB	none	Tables fit mostly in RAM
Ex2	Large Tables	Songs 100 GB	Listens 2 TB	none	Tables >> buffer size
Ex3	Pre-sorted Tables	Songs ↑ 100 GB sorted	Listens ↑ 2 TB sorted	Pre-sorted	SMJ skips sorting phase
Ex4	Self-Join	Listens 2 TB	Listens 2 TB (same)	Self-join	Can't optimize table order
Ex5	Song Similarity	Listens 100B rows	Listens 100B rows	Massive	Real-world similarity query

Algorithm Performance Comparison (Log Scale)

Edge cases not shown in the table:

SMJ bad-backup. When many duplicates exist, SMJ may have to back up and re-scan.
HPJ bad-skew. When data is heavily skewed, some hash partitions become too large to fit in RAM.
These edge cases can increase costs by 2–10× in the worst case.

Key Observations

Example 1: Small Tables Win with BNLJ

Songs (10GB) and Listens (10GB) are small relative to buffer size (6.4GB)
BNLJ performs the join mostly in RAM without preprocessing overhead
No sorting or partitioning needed → Lower cost than SMJ/HPJ

Examples 2-5: Large Tables Need Smart Algorithms

When tables >> buffer size, BNLJ becomes prohibitively expensive
SMJ and HPJ scale much better with data size
Pre-sorted data (Ex3) gives SMJ a huge advantage

Self-Joins Are Common and Expensive

Examples 4-5 show real patterns (finding similar songs, user correlations)
Self-joins can't optimize by choosing smaller outer relation
SMJ/HPJ are essential for these massive operations

Buffer Size (B)	C_r	C_w	BNLJ	BNLJ-rev	SMJ-best	HPJ-best	Winner
10 (tiny)	1	1	2.6M	2.6M	159K	54K	HPJ
	1	10	2.6M	2.6M	792K	216K	HPJ
	10	1	26M	26M	954K	378K	HPJ
	10	10	26M	26M	1.6M	540K	HPJ
100 (small)	1	1	264K	279K	90K	54K	HPJ
	1	10	264K	279K	414K	216K	HPJ
	10	1	2.6M	2.8M	575K	378K	HPJ
	10	10	2.6M	2.8M	899K	540K	HPJ
1,000 (medium)	1	1	34K	44K	87K	54K	BNLJ
	1	10	34K	44K	396K	216K	BNLJ
	10	1	344K	436K	558K	378K	BNLJ
	10	10	344K	436K	867K	540K	BNLJ
10,000 (large)	1	1	18K	20K	54K	54K	BNLJ
	1	10	18K	20K	216K	216K	BNLJ
	10	1	180K	196K	378K	378K	BNLJ
	10	10	180K	196K	540K	540K	BNLJ

IO Algorithms: Cost Analysis & Intuition

CS145 Equations Sheet

Real-World Examples: Choosing the Right Join Algorithm

Test Setup: B = 100 pages (≈ 6.4 GB RAM), Page Size = 64 MB

Scenario Overview

Algorithm Performance Comparison (Log Scale)

Key Observations

Example 1: Small Tables Win with BNLJ

Examples 2-5: Large Tables Need Smart Algorithms

Self-Joins Are Common and Expensive

The Query Optimizer's Decision Process

Exercise 2: Impact of Machine Configuration on IO Costs

Scenario: Songs3 (100GB) × Listens3 (1TB)

Cost Analysis Across Different Configurations

Key Observations

Buffer Size Impact

Read/Write Cost Impact

Practical Implications

Key Takeaways

Next