IO Cost Model: Algorithmic Complexity for Big Data


IO Complexity Fundamentals

Small Data vs Big Data: Different Worlds

In the land of small data, it's all about CPU complexity. Think O(n log n) for sorting n values. The assumption here? Everything fits in RAM, and the CPU is the bottleneck. The game is about finding faster algorithms, like opting for O(n log n) over O(n²). This world deals with datasets under 1GB.

Now, step into the big data arena, where IO complexity reigns supreme. Here, the cost of moving data dwarfs CPU costs—by a factor of a million. It's not about crunching numbers faster; it's about minimizing how often you need to touch the disk. We're talking datasets over 100GB, where disk access patterns are the real optimization target.
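The gap between a CPU operation and a disk access can be sketched with back-of-the-envelope arithmetic. The latency numbers below are illustrative assumptions (roughly 1 ns per CPU operation, roughly 10 ms per random disk IO), not measurements:

```python
# Back-of-the-envelope latency gap between CPU work and disk IO.
# Both constants are illustrative assumptions, not measurements.
CPU_OP_SECONDS = 1e-9   # ~1 nanosecond per CPU operation (assumed)
DISK_IO_SECONDS = 1e-2  # ~10 milliseconds per random disk IO (assumed)

ratio = DISK_IO_SECONDS / CPU_OP_SECONDS
print(f"One disk IO costs roughly {ratio:,.0f} CPU operations")
# With these assumptions, one IO ~ 10,000,000 CPU operations.
```

Under these assumed numbers, a single saved IO buys you about ten million CPU operations, which is why IO counts, not CPU counts, dominate big-data runtimes.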

Three Examples: Building Your Intuition

Key Notation for IO Costs

For those who crave the details, check the IO Reference Sheet for complete formulas and device specs. Here's the shorthand:

- P(R): the number of disk pages relation R occupies
- T(R): the number of tuples in R
- C_r: the cost to read one page from disk
- C_w: the cost to write one page to disk

Example Algorithms

Example 1: Basic Read/Write

Algorithm A1 reads and writes P(R) pages. The cost? Simple:

\[ C_r \times P(R) + C_w \times P(R) \]
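This formula can be turned directly into a tiny cost calculator. The device costs and relation size below are hypothetical numbers chosen only for illustration:

```python
def io_cost_a1(pages: int, c_r: float, c_w: float) -> float:
    """Cost of A1: one full read pass plus one full write pass over P(R) pages."""
    return c_r * pages + c_w * pages

# Hypothetical values: 1,000 pages, 10 ms per page read, 20 ms per page write.
cost = io_cost_a1(pages=1_000, c_r=0.01, c_w=0.02)
print(f"A1 cost: {cost:.1f} seconds")  # 10 s of reads + 20 s of writes = 30.0 s
```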

Example 2: Multi-Pass

Algorithm A2 reads P(R) pages seven times and writes them three times. It also performs 20 × T(R) CPU operations, but those are too fast to matter:

\[ C_r \times 7 \times P(R) + C_w \times 3 \times P(R) \]
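A sketch of the same multi-pass cost, again with hypothetical device costs. Note what the model leaves out:

```python
def io_cost_a2(pages: int, c_r: float, c_w: float) -> float:
    """Cost of A2: seven read passes and three write passes over P(R) pages.
    The 20 * T(R) CPU operations are deliberately omitted: in the IO cost
    model they are negligible next to the disk traffic."""
    return c_r * 7 * pages + c_w * 3 * pages

# Same hypothetical numbers as before: 1,000 pages, 10 ms reads, 20 ms writes.
print(f"A2 cost: {io_cost_a2(1_000, 0.01, 0.02):.1f} seconds")  # 70 + 60 = 130.0 s
```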

Example 3: Logarithmic

Algorithm A3 reads P(R) pages log P(R) times and writes 0.1 × P(R) pages T(R) times. The logarithmic read factor is akin to binary search iterations:

\[ C_r \times \log(P(R)) \times P(R) + C_w \times 0.1 \times P(R) \times T(R) \]
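A sketch of the logarithmic cost formula. The base of the logarithm is not stated in the formula above; base 2 is assumed here to match the binary-search analogy, and the sizes and device costs are again hypothetical:

```python
import math

def io_cost_a3(pages: int, tuples: int, c_r: float, c_w: float) -> float:
    """Cost of A3: log(P(R)) read passes over all P(R) pages, plus writing
    0.1 * P(R) pages T(R) times. Base-2 log is an assumption made here,
    matching the binary-search analogy."""
    return c_r * math.log2(pages) * pages + c_w * 0.1 * pages * tuples

# Hypothetical values: 1,024 pages, 100 tuples, 10 ms reads, 20 ms writes.
# Reads: 0.01 * 10 * 1024 = 102.4 s; writes: 0.02 * 0.1 * 1024 * 100 = 204.8 s.
print(f"A3 cost: {io_cost_a3(1024, 100, 0.01, 0.02):.1f} seconds")
```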


Key Takeaways

The Big Picture

  1. Forget O(n log n) - That's for CPU complexity, not big data.

  2. Count IOs: N, 2N, 3N - This is what really determines runtime.

  3. Smart algorithms minimize disk access. Save one IO, and you've saved 10 million CPU/GPU operations.