Storage & Paging: The IO Cost Calculator

How Does One Machine Process 128GB with Only 16GB RAM?

[Diagram: The Paging Pipeline (CPU ← RAM ← SSD). The CPU processes one page at a time and can only access data in RAM; the RAM buffer pool (16GB) holds 256 pages of 64MB each, with LRU eviction when full; the SSD (128GB) stores the table as 2,048 pages of 64MB each. RAM access ≈ 100ns; SSD access ≈ 10μs plus transfer time.]

Key Insight:
• The CPU can only process data that is in RAM
• RAM holds 256 pages (16GB); the table has 2,048 pages (128GB)
• Pages are 64MB each
• LRU eviction when RAM is full → constant paging needed!
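To make the paging loop concrete, here is a minimal sketch of a buffer pool with LRU eviction, assuming 64MB pages and a 256-page pool as in the diagram; the BufferPool class and the read_page_from_ssd callback are illustrative, not a real DBMS API.

from collections import OrderedDict

PAGE_SIZE_MB = 64      # page size assumed from the diagram
POOL_PAGES   = 256     # 16GB RAM ÷ 64MB pages

class BufferPool:
    """Minimal LRU buffer pool sketch (illustrative only)."""
    def __init__(self, capacity=POOL_PAGES):
        self.capacity = capacity
        self.pages = OrderedDict()              # page_id -> page bytes, kept in LRU order

    def get_page(self, page_id, read_page_from_ssd):
        if page_id in self.pages:               # hit: page already in RAM
            self.pages.move_to_end(page_id)     # mark as most recently used
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:    # miss with a full pool:
            self.pages.popitem(last=False)      # evict the least recently used page
        data = read_page_from_ssd(page_id)      # slow path: ~10μs access + ~13ms transfer
        self.pages[page_id] = data              # cache the newly read page
        return data

Scanning all 2,048 pages through a 256-page pool forces at least 1,792 evictions per pass, which is exactly the constant paging the diagram warns about.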

Storage Hierarchy: Speed, Cost, and Capacity

| Storage Level     | Access Latency | Throughput | Cost per TB | Typical Capacity | Use Case          |
|-------------------|----------------|------------|-------------|------------------|-------------------|
| CPU Registers     | 1 cycle        | -          | -           | < 1KB            | Immediate values  |
| L1/L2 Cache       | 1-10ns         | -          | -           | 64KB - 8MB       | Hot instructions  |
| RAM (Buffer Pool) | 100ns          | 100 GB/s   | $3,500      | 16GB             | Working set pages |
| SSD               | 10μs           | 5 GB/s     | $75         | 512GB            | Active tables     |
| HDD               | 10ms           | 100 MB/s   | $25         | 4TB              | Cold storage      |
| Network Storage   | 1μs            | 10 GB/s    | Variable    | -                | Distributed cache |

Key Observations:
• RAM is 100× faster than SSD and 100,000× faster than HDD (by access latency)
• Network RAM (1μs) beats local HDD (10ms)
• Cost per TB rises as latency falls: the faster the storage, the more each TB costs
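Those ratios fall straight out of the latency column; a quick back-of-envelope check using only the rough figures from the table above:

latency_s = {              # rough figures from the table above
    "ram":     100e-9,     # 100ns
    "ssd":     10e-6,      # 10μs
    "hdd":     10e-3,      # 10ms
    "network": 1e-6,       # 1μs network storage
}

print(latency_s["ssd"] / latency_s["ram"])      # 100.0    -> RAM is 100× faster than SSD
print(latency_s["hdd"] / latency_s["ram"])      # 100000.0 -> RAM is 100,000× faster than HDD
print(latency_s["hdd"] / latency_s["network"])  # 10000.0  -> network storage beats local HDD by 10,000×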


IO Cost Definitions

Access Latency: Time to initiate an IO operation before data transfer begins
- The fixed overhead for starting any IO operation
- Examples: RAM access (100ns), SSD access (10μs), HDD seek (10ms)

Throughput: Data transfer rate once the operation begins
- The sustained rate at which data moves after access starts
- Examples: RAM (100 GB/s), SSD (5 GB/s), HDD (100 MB/s)

Key Insight: For large pages (64MB), transfer time dominates access time. For small pages, access time dominates.
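The cost model behind that insight is simply: IO time = access latency + size ÷ throughput. A minimal sketch using the SSD figures from the table above (the function and constants are illustrative, not a library API):

def io_time_s(size_bytes, latency_s, throughput_bytes_per_s):
    """Estimated time for one IO: fixed access latency plus transfer time."""
    return latency_s + size_bytes / throughput_bytes_per_s

MB = 1024 * 1024
SSD_LATENCY_S, SSD_THROUGHPUT = 10e-6, 5e9      # 10μs access, 5 GB/s transfer

# 64MB page on SSD: transfer time dominates
print(io_time_s(64 * MB, SSD_LATENCY_S, SSD_THROUGHPUT))   # ~0.0134 s  (10μs access + ~13.4ms transfer)

# 4KB page on SSD: access latency dominates
print(io_time_s(4 * 1024, SSD_LATENCY_S, SSD_THROUGHPUT))  # ~1.08e-05 s (10μs access + ~0.8μs transfer)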

Refresher: Review your OS materials on how the OS's IO controllers work; databases rely on the OS for those details.


Modern Reality: CPUs/GPUs Can't Escape the Disk Bottleneck

The 10,000,000× Gap

• HDD seek time: 10ms
• SSD latency: 10μs
• RAM access: 100ns
• CPU/GPU cycle: 1ns

The Math: 1 HDD seek = 10,000,000 GPU operations!
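Checking that factor with the numbers above (pure arithmetic, not a measurement):

hdd_seek_s  = 10e-3     # 10ms HDD seek
gpu_cycle_s = 1e-9      # 1ns CPU/GPU cycle
print(hdd_seek_s / gpu_cycle_s)   # 10000000.0: cycles a GPU could run while one seek completes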

GPUs Are Data Hungry

# Training a 100GB model
load_data_from_ssd_s = 100 / 5    # 20 seconds: 100GB ÷ 5 GB/s SSD read
transfer_to_gpu_s    = 100 / 32   # ~3 seconds: 100GB ÷ 32 GB/s (PCIe 4.0)
gpu_training_epoch_s = 0.5        # 0.5 seconds: blazing fast compute!

# Where does the time go? (~23.5 seconds total)
#  ~85% loading data from SSD
#  ~13% transferring to GPU over PCIe
#   ~2% actual GPU compute

The Bottleneck Chain:

  1. Data lives on disk - your 1TB dataset won't fit in 80GB of GPU memory
  2. PCIe links between hardware components are narrow - 32GB/s seems fast until you have 1TB to move
  3. GPU memory is limited - even an A100 has only 80GB
  4. Compute is effectively free - at 312 TFLOPS, compute time is negligible next to the IO (a rough sketch follows this list)
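A rough sketch tying these four points together for a 1TB dataset streamed through the chain; the bandwidth figures are the ones used above, and the compute time simply scales the 0.5s-per-100GB epoch figure, so treat it as an assumption rather than a benchmark:

DATASET_GB = 1000     # 1TB dataset, far larger than 80GB of GPU memory

stage_time_s = {
    "ssd_read":      DATASET_GB / 5,            # 200 s at 5 GB/s
    "pcie_transfer": DATASET_GB / 32,           # ~31 s at 32 GB/s (PCIe 4.0)
    "gpu_compute":   0.5 * (DATASET_GB / 100),  # ~5 s, scaling the epoch figure above
}

bottleneck = max(stage_time_s, key=stage_time_s.get)
print(bottleneck, stage_time_s[bottleneck])     # ssd_read 200.0: the disk, not the GPU, sets the pace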