IO Cost Model: Algorithmic Complexity for Big Data
IO Complexity Fundamentals
Small Data vs Big Data: Different Worlds
Small Data
Assumption: All data fits in RAM
Bottleneck: CPU operations dominate
Optimization: Faster algorithms (O(n log n) vs O(n²))
Scale: <1GB datasets
Big Data
Reality: IO Cost >> CPU Cost
Difference: ~1,000,000× (disk access in milliseconds vs CPU operations in nanoseconds)
Optimization: Minimize disk access patterns
Scale: >100GB datasets
Three Examples: Building Your Intuition
Key Notation for IO Costs
For complete formulas and device specifications, see: IO Reference Sheet
- C_r, C_w: Time cost to read/write one page (calculated from access time + transfer time)
- T(R): Number of tuples (rows) in table R (the n of standard algorithms notation)
- P(R): Number of pages in table R
- IO Cost: Total cost = numPages × C_r (or C_w)
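These quantities translate directly into a back-of-the-envelope calculator. Below is a minimal Python sketch; the 10 ms per-page cost and the 1,000-page table size are assumed illustration values, not figures from the IO Reference Sheet.

    C_R = 0.010   # assumed: seconds to read one page (access + transfer time)
    C_W = 0.010   # assumed: seconds to write one page

    def io_cost(pages_read, pages_written=0):
        """Total IO time in seconds: numPages x C_r (or C_w)."""
        return pages_read * C_R + pages_written * C_W

    P_R = 1_000               # assumed: P(R), pages in table R
    print(io_cost(P_R))       # one full scan: 1,000 x 0.010 = 10.0 seconds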
Example 1: Basic Read/Write
- Reads P(R) pages
- Writes P(R) pages
C_r × P(R) + C_w × P(R)
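Worked illustration (assumed numbers): with P(R) = 1,000 pages and C_r = C_w = 10 ms per page, the cost is 10 ms × 1,000 + 10 ms × 1,000 = 20 seconds of pure IO, before any CPU work is counted.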
Example 2: Multi-Pass
- Reads P(R) pages 7 times
- Writes P(R) pages 3 times
- CPU: 20 × T(R) operations
C_r × 7 × P(R) + C_w × 3 × P(R)
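With the same assumed numbers plus T(R) = 100,000 tuples: IO costs 10 ms × 7 × 1,000 + 10 ms × 3 × 1,000 = 100 seconds, while the 20 × 100,000 = 2,000,000 CPU operations finish in roughly milliseconds on modern hardware. That is why the cost formula drops the CPU term entirely.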
Example 3: Logarithmic
- Reads P(R) pages log P(R) times
- Writes 0.1 × P(R) pages T(R) times
- Like binary search iterations
C_r × log(P(R)) × P(R) + C_w × 0.1 × P(R) × T(R)
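A minimal Python sketch tying the three formulas together. The constants (10 ms per page, P(R) = 1,000 pages, T(R) = 100,000 tuples) and the base-2 log are assumptions for illustration, not values from the IO Reference Sheet.

    import math

    C_R = C_W = 0.010   # assumed: 10 ms per page read/write
    P = 1_000           # assumed: P(R), pages in table R
    T = 100_000         # assumed: T(R), tuples in table R

    ex1 = C_R * P + C_W * P                              # Example 1: one read + one write pass
    ex2 = C_R * 7 * P + C_W * 3 * P                      # Example 2: CPU term dropped as negligible
    ex3 = C_R * math.log2(P) * P + C_W * 0.1 * P * T     # Example 3 (log base assumed 2)

    print(f"Example 1: {ex1:,.0f} s")   # 20 s
    print(f"Example 2: {ex2:,.0f} s")   # 100 s
    print(f"Example 3: {ex3:,.0f} s")   # ~100,100 s: the write term dominates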
Common IO Patterns in Database Algorithms
Pattern 1: Sequential Scan
- Cost: P(R) × C_r
- When used: Table scans, aggregations
- Example: Finding all records matching a condition (sketched below)
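A minimal sketch of this pattern, with a hypothetical read_page() standing in for one C_r-priced disk read over assumed in-memory pages; the point is simply that the loop touches each of the P(R) pages exactly once.

    # Assumed stand-in for a table stored as pages of tuples.
    PAGES = [[1, 5, 9], [12, 3, 7], [8, 20, 4]]

    def read_page(i):
        return PAGES[i]   # hypothetical: one page read, priced at C_r

    def sequential_scan(num_pages, predicate):
        for i in range(num_pages):      # exactly P(R) page reads, no more
            for t in read_page(i):      # per-tuple CPU work: effectively free
                if predicate(t):
                    yield t

    print(list(sequential_scan(len(PAGES), lambda t: t > 6)))   # [9, 12, 7, 8, 20]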
Pattern 2: Multiple Passes
- Cost: k × P(R) × C_r (k = number of passes)
- When used: Sorting, hash partitioning
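For a sense of how large k gets: in classic external merge sort with B pages of buffer memory, the pass count is roughly 1 + ceil(log_{B-1}(P(R) / B)), so an assumed 500,000-page table sorted with 1,000 buffer pages needs only 2 passes. In practice, k grows very slowly with table size.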
Pattern 3: Nested Access
- Cost: P(R) × P(S) × C_r (catastrophic!)
- When used: Naive nested loop join
- Why avoided: Quadratic cost is unacceptable (see the sketch below)
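A sketch of why the naive version explodes, using small assumed in-memory pages and a counter in place of real IO: every page of S is re-read once for every page of R.

    R_PAGES = [[(1, "a"), (2, "b")], [(3, "c")]]    # assumed: P(R) = 2 pages
    S_PAGES = [[(1, "x")], [(3, "y"), (4, "z")]]    # assumed: P(S) = 2 pages
    reads = 0

    def read_page(pages, i):
        global reads
        reads += 1          # each call models one C_r-priced page read
        return pages[i]

    result = []
    for i in range(len(R_PAGES)):            # P(R) outer page reads
        r_page = read_page(R_PAGES, i)
        for j in range(len(S_PAGES)):        # P(S) inner page reads PER outer page
            s_page = read_page(S_PAGES, j)
            for r in r_page:
                for s in s_page:
                    if r[0] == s[0]:         # join condition on the key
                        result.append((r, s))

    print(result)   # [((1, 'a'), (1, 'x')), ((3, 'c'), (3, 'y'))]
    print(reads)    # 2 + 2*2 = 6; the P(R) * P(S) term dominates at scale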
Pattern 4: Logarithmic Access
- Cost: log(P(R)) × C_r
- When used: Index lookups
- Example: Finding a specific record in sorted data (all four patterns are priced in the sketch below)
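To make the gap between the four patterns concrete, here is a sketch pricing each one on an assumed 100,000-page table (with P(S) = P(R) for the nested pattern, k = 3 passes, and 10 ms per page read); the exact figures are illustrative, but the orders of magnitude are the lesson.

    import math

    C_R = 0.010     # assumed: seconds per page read
    P = 100_000     # assumed: P(R) = P(S)
    k = 3           # assumed: pass count for the multi-pass pattern

    costs = {
        "sequential scan": P * C_R,
        "multiple passes": k * P * C_R,
        "nested access":   P * P * C_R,
        "logarithmic":     math.log2(P) * C_R,
    }
    for name, seconds in costs.items():
        print(f"{name:>16}: {seconds:>15,.2f} s")
    # sequential scan:        1,000.00 s  (~17 minutes)
    # multiple passes:        3,000.00 s  (~50 minutes)
    # nested access:    100,000,000.00 s  (over 3 years: "catastrophic")
    # logarithmic:                0.17 s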
Key Takeaways
The Big Picture:
- Forget O(n log n) - That's CPU complexity, irrelevant for big data
- Count IOs: N, 2N, 3N - This determines actual runtime
- GPUs are data hungry - Need smart data algorithms
- 1 saved IO = 10 million saved CPU/GPU ops - Still true in 2024
- Smart algorithms minimize disk access, whether feeding CPUs or GPUs
The Punchline:
GPUs gave us more parallel compute. But the game is still about feeding the beast efficiently.