IO Cost Model: Algorithmic Complexity for Big Data


IO Complexity Fundamentals

Small Data vs Big Data: Different Worlds

Small Data

Measure with CPU Complexity
Example: O(n log n) for sorting n values

Assumption: All data fits in RAM

Bottleneck: CPU operations dominate

Optimization: Faster algorithms (O(n log n) vs O(n²))

Scale: <1GB datasets

Big Data

Measure with IO Complexity
Formula: Total cost ā‰ˆ IO Cost

Reality: IO Cost >> CPU Cost

Difference: 1,000,000Ɨ (ms vs ns)

Optimization: Minimize disk access patterns

Scale: >100GB datasets

Three Examples: Building Your Intuition

Key Notation for IO Costs

For complete formulas and device specifications, see: IO Reference Sheet

  • C_r, C_w: Time cost to read/write one page (calculated from access time + transfer time)
  • T(R) = n: Number of tuples (rows) in table R (or n in algorithms notation)
  • P(R): Number of pages in table R
  • IO Cost: Total cost = numPages Ɨ C_r (or C_w)

Example 1: Basic Read/Write

Algorithm A1:
• Reads P(R) pages
• Writes P(R) pages

C_r Ɨ P(R) + C_w Ɨ P(R)

Example 2: Multi-Pass

Algorithm A2:
• Reads P(R) pages 7 times
• Writes P(R) pages 3 times
• CPU: 20 Ɨ T(R) operations

C_r Ɨ 7 Ɨ P(R) + C_w Ɨ 3 Ɨ P(R)

Ignore CPU: too fast!

Example 3: Logarithmic

Algorithm A3:
• Reads P(R) pages log P(R) times
• Writes 0.1ƗP(R) pages T(R) times
• Like binary search iterations

C_r Ɨ log(P(R)) Ɨ P(R) + C_w Ɨ 0.1 Ɨ P(R) Ɨ T(R)


Common IO Patterns in Database Algorithms

Pattern 1: Sequential Scan

Pattern 2: Multiple Passes

Pattern 3: Nested Access

Pattern 4: Logarithmic Access


Key Takeaways

The Big Picture:

  1. Forget O(n log n) - That's CPU complexity, irrelevant for big data

  2. Count IOs: N, 2N, 3N - This determines actual runtime

  3. GPUs are data hungry - Need smart data algorithms

  4. 1 saved IO = 10 million saved CPU/GPU ops - Still true in 2024

  5. Smart algorithms minimize disk access, whether feeding CPUs or GPUs

The Punchline:
GPUs gave us more parallel compute. But the game is still about feeding the beast efficiently.