IO Cost Model: Algorithmic Complexity for Big Data


IO Complexity Fundamentals

Small Data vs Big Data: Different Worlds

In the land of small data, it's all about CPU complexity. Think O(n log n) for sorting n values. The assumption here? Everything fits in RAM, and the CPU is the bottleneck. The game is about finding faster algorithms, like opting for O(n log n) over O(n²). This world deals with datasets under 1GB.

Now, step into the big data arena, where IO complexity reigns supreme. Here, the cost of moving data dwarfs CPU costs—by a factor of a million. It's not about crunching numbers faster; it's about minimizing how often you need to touch the disk. We're talking datasets over 100GB, where disk access patterns are the real optimization target.
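The gap between a CPU operation and a disk access can be sketched with back-of-the-envelope arithmetic. The latency numbers below are illustrative assumptions (roughly 1 ns per CPU operation, roughly 10 ms per random disk IO), not measurements:

```python
# Back-of-the-envelope latency gap between CPU work and disk IO.
# Both constants are illustrative assumptions, not measurements.
CPU_OP_SECONDS = 1e-9   # ~1 nanosecond per CPU operation (assumed)
DISK_IO_SECONDS = 1e-2  # ~10 milliseconds per random disk IO (assumed)

ratio = DISK_IO_SECONDS / CPU_OP_SECONDS
print(f"One disk IO costs roughly {ratio:,.0f} CPU operations")
# With these assumptions, one IO ~ 10,000,000 CPU operations.
```

Under these assumed numbers, a single saved IO buys you about ten million CPU operations, which is why IO counts, not CPU counts, dominate big-data runtimes.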

Three Examples: Building Your Intuition

Key Notation for IO Costs

For those who crave the details, check the IO Reference Sheet for complete formulas and device specs. Here's the shorthand:

- P(R): the number of disk pages relation R occupies
- T(R): the number of tuples in R
- C_r: the cost to read one page from disk
- C_w: the cost to write one page to disk

Example Algorithms

Example 1: Basic Read/Write

Algorithm A1 reads and writes P(R) pages. The cost? Simple:

\[ C_r \times P(R) + C_w \times P(R) \]
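This formula can be turned directly into a tiny cost calculator. The device costs and relation size below are hypothetical numbers chosen only for illustration:

```python
def io_cost_a1(pages: int, c_r: float, c_w: float) -> float:
    """Cost of A1: one full read pass plus one full write pass over P(R) pages."""
    return c_r * pages + c_w * pages

# Hypothetical values: 1,000 pages, 10 ms per page read, 20 ms per page write.
cost = io_cost_a1(pages=1_000, c_r=0.01, c_w=0.02)
print(f"A1 cost: {cost:.1f} seconds")  # 10 s of reads + 20 s of writes = 30.0 s
```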

Example 2: Multi-Pass

Algorithm A2 reads P(R) pages seven times and writes them three times. It also performs 20 × T(R) CPU operations, but those are too fast to matter:

\[ C_r \times 7 \times P(R) + C_w \times 3 \times P(R) \]
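A sketch of the same multi-pass cost, again with hypothetical device costs. Note what the model leaves out:

```python
def io_cost_a2(pages: int, c_r: float, c_w: float) -> float:
    """Cost of A2: seven read passes and three write passes over P(R) pages.
    The 20 * T(R) CPU operations are deliberately omitted: in the IO cost
    model they are negligible next to the disk traffic."""
    return c_r * 7 * pages + c_w * 3 * pages

# Same hypothetical numbers as before: 1,000 pages, 10 ms reads, 20 ms writes.
print(f"A2 cost: {io_cost_a2(1_000, 0.01, 0.02):.1f} seconds")  # 70 + 60 = 130.0 s
```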

Example 3: Logarithmic

Algorithm A3 reads P(R) pages log P(R) times and writes 0.1 × P(R) pages T(R) times. The logarithmic read factor is akin to binary search iterations:

\[ C_r \times \log(P(R)) \times P(R) + C_w \times 0.1 \times P(R) \times T(R) \]
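A sketch of the logarithmic cost formula. The base of the logarithm is not stated in the formula above; base 2 is assumed here to match the binary-search analogy, and the sizes and device costs are again hypothetical:

```python
import math

def io_cost_a3(pages: int, tuples: int, c_r: float, c_w: float) -> float:
    """Cost of A3: log(P(R)) read passes over all P(R) pages, plus writing
    0.1 * P(R) pages T(R) times. Base-2 log is an assumption made here,
    matching the binary-search analogy."""
    return c_r * math.log2(pages) * pages + c_w * 0.1 * pages * tuples

# Hypothetical values: 1,024 pages, 100 tuples, 10 ms reads, 20 ms writes.
# Reads: 0.01 * 10 * 1024 = 102.4 s; writes: 0.02 * 0.1 * 1024 * 100 = 204.8 s.
print(f"A3 cost: {io_cost_a3(1024, 100, 0.01, 0.02):.1f} seconds")
```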


Key Takeaways

The Big Picture

  1. Forget O(n log n) - That's for CPU complexity, not big data.

  2. Count IOs: N, 2N, 3N - This is what really determines runtime.

  3. Smart algorithms minimize disk access. Save one IO, and you've saved 10 million CPU/GPU operations.