IO Cost Model: Algorithmic Complexity for Big Data
IO Complexity Fundamentals
Small Data vs Big Data: Different Worlds
Small Data
Measure with CPU Complexity
Example: O(n log n) for sorting n values
Assumption: All data fits in RAM
Bottleneck: CPU operations dominate
Optimization: Faster algorithms (O(n log n) vs O(n²))
Scale: <1GB datasets
Big Data
Measure with IO Complexity
Formula: Total cost ā IO Cost
Reality: IO Cost >> CPU Cost
Difference: 1,000,000Ć (ms vs ns)
Optimization: Minimize disk access patterns
Scale: >100GB datasets
Three Examples: Building Your Intuition
Key Notation for IO Costs
For complete formulas and device specifications, see: IO Reference Sheet
- C_r, C_w: Time cost to read/write one page (calculated from access time + transfer time)
- T(R) = n: Number of tuples (rows) in table R (or n in algorithms notation)
- P(R): Number of pages in table R
- IO Cost: Total cost = numPages Ć C_r (or C_w)
Example 1: Basic Read/Write
Algorithm A1:
⢠Reads P(R) pages
⢠Writes P(R) pages
⢠Reads P(R) pages
⢠Writes P(R) pages
C_r Ć P(R) + C_w Ć P(R)
Example 2: Multi-Pass
Algorithm A2:
⢠Reads P(R) pages 7 times
⢠Writes P(R) pages 3 times
⢠CPU: 20 à T(R) operations
⢠Reads P(R) pages 7 times
⢠Writes P(R) pages 3 times
⢠CPU: 20 à T(R) operations
C_r Ć 7 Ć P(R) + C_w Ć 3 Ć P(R)
Ignore CPU: too fast!
Example 3: Logarithmic
Algorithm A3:
⢠Reads P(R) pages log P(R) times
⢠Writes 0.1ĆP(R) pages T(R) times
⢠Like binary search iterations
⢠Reads P(R) pages log P(R) times
⢠Writes 0.1ĆP(R) pages T(R) times
⢠Like binary search iterations
C_r Ć log(P(R)) Ć P(R) + C_w Ć 0.1 Ć P(R) Ć T(R)
Key Takeaways
The Big Picture:
-
Forget O(n log n) - That's CPU complexity, irrelevant for big data
-
Count IOs: N, 2N, 3N - This determines actual runtime
-
Smart algorithms minimize disk access, whether feeding CPUs or GPUs. 1 saved IO = 10 million saved CPU/GPU ops