IO Cost Model: Algorithmic Complexity for Big Data
IO Complexity Fundamentals
Small Data vs Big Data: Different Worlds
Small Data
Assumption: All data fits in RAM
Bottleneck: CPU operations dominate
Optimization: Faster algorithms (O(n log n) vs O(n²))
Scale: <1GB datasets
Big Data
Reality: IO Cost >> CPU Cost
Difference: ~1,000,000× (disk access in milliseconds vs CPU operations in nanoseconds)
Optimization: Minimize disk access patterns
Scale: >100GB datasets
Three Examples: Building Your Intuition
Key Notation for IO Costs
For complete formulas and device specifications, see: IO Reference Sheet
- C_r, C_w: Time cost to read/write one page (calculated from access time + transfer time)
- T(R): Number of tuples (rows) in table R (the n of standard algorithms notation)
- P(R): Number of pages in table R
- IO Cost: Total cost = numPages × C_r (or C_w)
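These quantities translate directly into a back-of-the-envelope calculator. Below is a minimal Python sketch; the 10 ms per-page cost and the 1,000-page table size are assumed illustration values, not figures from the IO Reference Sheet.

    C_R = 0.010   # assumed: seconds to read one page (access + transfer time)
    C_W = 0.010   # assumed: seconds to write one page

    def io_cost(pages_read, pages_written=0):
        """Total IO time in seconds: numPages x C_r (or C_w)."""
        return pages_read * C_R + pages_written * C_W

    P_R = 1_000               # assumed: P(R), pages in table R
    print(io_cost(P_R))       # one full scan: 1,000 x 0.010 = 10.0 seconds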
Example 1: Basic Read/Write
- Reads P(R) pages
- Writes P(R) pages
C_r × P(R) + C_w × P(R)
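Worked illustration (assumed numbers): with P(R) = 1,000 pages and C_r = C_w = 10 ms per page, the cost is 10 ms × 1,000 + 10 ms × 1,000 = 20 seconds of pure IO, before any CPU work is counted.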
Example 2: Multi-Pass
- Reads P(R) pages 7 times
- Writes P(R) pages 3 times
- CPU: 20 × T(R) operations
C_r × 7 × P(R) + C_w × 3 × P(R)
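With the same assumed numbers plus T(R) = 100,000 tuples: IO costs 10 ms × 7 × 1,000 + 10 ms × 3 × 1,000 = 100 seconds, while the 20 × 100,000 = 2,000,000 CPU operations finish in roughly milliseconds on modern hardware. That is why the cost formula drops the CPU term entirely.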
Example 3: Logarithmic
- Reads P(R) pages log P(R) times
- Writes 0.1 × P(R) pages T(R) times
- Like binary search iterations
C_r × log(P(R)) × P(R) + C_w × 0.1 × P(R) × T(R)
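A minimal Python sketch tying the three formulas together. The constants (10 ms per page, P(R) = 1,000 pages, T(R) = 100,000 tuples) and the base-2 log are assumptions for illustration, not values from the IO Reference Sheet.

    import math

    C_R = C_W = 0.010   # assumed: 10 ms per page read/write
    P = 1_000           # assumed: P(R), pages in table R
    T = 100_000         # assumed: T(R), tuples in table R

    ex1 = C_R * P + C_W * P                              # Example 1: one read + one write pass
    ex2 = C_R * 7 * P + C_W * 3 * P                      # Example 2: CPU term dropped as negligible
    ex3 = C_R * math.log2(P) * P + C_W * 0.1 * P * T     # Example 3 (log base assumed 2)

    print(f"Example 1: {ex1:,.0f} s")   # 20 s
    print(f"Example 2: {ex2:,.0f} s")   # 100 s
    print(f"Example 3: {ex3:,.0f} s")   # ~100,100 s: the write term dominates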
Common IO Patterns in Database Algorithms
Pattern 1: Sequential Scan
- Cost: P(R) × C_r
- When used: Table scans, aggregations
- Example: Finding all records matching a condition (sketched below)
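A minimal sketch of this pattern, with a hypothetical read_page() standing in for one C_r-priced disk read over assumed in-memory pages; the point is simply that the loop touches each of the P(R) pages exactly once.

    # Assumed stand-in for a table stored as pages of tuples.
    PAGES = [[1, 5, 9], [12, 3, 7], [8, 20, 4]]

    def read_page(i):
        return PAGES[i]   # hypothetical: one page read, priced at C_r

    def sequential_scan(num_pages, predicate):
        for i in range(num_pages):      # exactly P(R) page reads, no more
            for t in read_page(i):      # per-tuple CPU work: effectively free
                if predicate(t):
                    yield t

    print(list(sequential_scan(len(PAGES), lambda t: t > 6)))   # [9, 12, 7, 8, 20]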
Pattern 2: Multiple Passes
- Cost: k × P(R) × C_r (k = number of passes)
- When used: Sorting, hash partitioning
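For a sense of how large k gets: in classic external merge sort with B pages of buffer memory, the pass count is roughly 1 + ceil(log_{B-1}(P(R) / B)), so an assumed 500,000-page table sorted with 1,000 buffer pages needs only 2 passes. In practice, k grows very slowly with table size.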
Pattern 3: Nested Access
- Cost: P(R) × P(S) × C_r (catastrophic!)
- When used: Naive nested loop join
- Why avoided: Quadratic cost is unacceptable (see the sketch below)
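A sketch of why the naive version explodes, using small assumed in-memory pages and a counter in place of real IO: every page of S is re-read once for every page of R.

    R_PAGES = [[(1, "a"), (2, "b")], [(3, "c")]]    # assumed: P(R) = 2 pages
    S_PAGES = [[(1, "x")], [(3, "y"), (4, "z")]]    # assumed: P(S) = 2 pages
    reads = 0

    def read_page(pages, i):
        global reads
        reads += 1          # each call models one C_r-priced page read
        return pages[i]

    result = []
    for i in range(len(R_PAGES)):            # P(R) outer page reads
        r_page = read_page(R_PAGES, i)
        for j in range(len(S_PAGES)):        # P(S) inner page reads PER outer page
            s_page = read_page(S_PAGES, j)
            for r in r_page:
                for s in s_page:
                    if r[0] == s[0]:         # join condition on the key
                        result.append((r, s))

    print(result)   # [((1, 'a'), (1, 'x')), ((3, 'c'), (3, 'y'))]
    print(reads)    # 2 + 2*2 = 6; the P(R) * P(S) term dominates at scale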
Pattern 4: Logarithmic Access
- Cost: log(P(R)) × C_r
- When used: Index lookups
- Example: Finding a specific record in sorted data (all four patterns are priced in the sketch below)
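To make the gap between the four patterns concrete, here is a sketch pricing each one on an assumed 100,000-page table (with P(S) = P(R) for the nested pattern, k = 3 passes, and 10 ms per page read); the exact figures are illustrative, but the orders of magnitude are the lesson.

    import math

    C_R = 0.010     # assumed: seconds per page read
    P = 100_000     # assumed: P(R) = P(S)
    k = 3           # assumed: pass count for the multi-pass pattern

    costs = {
        "sequential scan": P * C_R,
        "multiple passes": k * P * C_R,
        "nested access":   P * P * C_R,
        "logarithmic":     math.log2(P) * C_R,
    }
    for name, seconds in costs.items():
        print(f"{name:>16}: {seconds:>15,.2f} s")
    # sequential scan:        1,000.00 s  (~17 minutes)
    # multiple passes:        3,000.00 s  (~50 minutes)
    # nested access:    100,000,000.00 s  (over 3 years: "catastrophic")
    # logarithmic:                0.17 s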
Key Takeaways
The Big Picture:
- Forget O(n log n) - That's CPU complexity, irrelevant for big data
- Count IOs: N, 2N, 3N - This determines actual runtime
- GPUs are data hungry - Need smart data algorithms
- 1 saved IO = 10 million saved CPU/GPU ops - Still true in 2024
- Smart algorithms minimize disk access, whether feeding CPUs or GPUs
The Punchline:
GPUs gave us more parallel compute. But the game is still about feeding the beast efficiently.