Storage Layout: Row vs Columnar

How Tables Live on Disk: From OS Files to DbFiles

OS Files: The operating system supports basic files (any size, unstructured bytes).

DbFiles: the database's own files, structured as a sequence of fixed-size 64MB pages. Every database table lives on disk as a DbFile.
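
A minimal sketch of that abstraction (plain Python; the DbFile class name and PAGE_SIZE constant are illustrative, not from any real engine):

import os

PAGE_SIZE = 64 * 1024 * 1024  # 64MB: the fixed page size described above

class DbFile:
    """Illustrative wrapper: a DbFile is just an OS file viewed as fixed 64MB pages."""

    def __init__(self, path: str):
        self.path = path

    def num_pages(self) -> int:
        return os.path.getsize(self.path) // PAGE_SIZE

    def read_page(self, page_no: int) -> bytes:
        # One page read is one bulk I/O: seek to page_no * 64MB, read 64MB.
        with open(self.path, "rb") as f:
            f.seek(page_no * PAGE_SIZE)
            return f.read(PAGE_SIZE)

# listens.dbf from the example below: page 3 starts at byte offset 3 * 64MB.
# listens = DbFile("listens.dbf"); page3 = listens.read_page(3)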

Spotify Example:

Listens Table → listens.dbf (massive DbFile, 1000 pages = 64GB total)
├── Page 0: Contains 64MB of Listens data  
├── Page 1: Contains 64MB of Listens data
├── Page 2: Contains 64MB of Listens data  
├── Page 3: Contains 64MB of Listens data
└── ... (997 more pages)

Songs Table → songs.dbf (smaller DbFile, 10 pages = 640MB total)
├── Page 0: Contains 64MB of Songs data
├── Page 1: Contains 64MB of Songs data  
└── ... (8 more pages)

Why DbFiles Matter: Each 64MB page is the unit of I/O. When you query data, the database reads entire 64MB pages from disk to memory, not individual records.

Why 64MB pages? Recall I/O costs: I/O devices are slow, and every request pays a fixed startup cost (seek, rotational delay, request overhead) regardless of size. Bulk operations amortize that cost: one 64MB I/O instead of ~2,000 separate I/Os for 32KB pages.
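
A back-of-envelope sketch of that arithmetic (Python, with assumed device numbers: 5ms startup per random I/O, 200MB/s sequential transfer; the point is the amortization, not the exact figures):

SEEK_MS = 5.0                    # assumed fixed cost to start one random I/O
TRANSFER_MS_PER_MB = 1000 / 200  # 200 MB/s -> 5ms per MB transferred

def io_cost_ms(num_ios: int, mb_per_io: float) -> float:
    """Each I/O pays one fixed startup cost plus transfer time for its size."""
    return num_ios * (SEEK_MS + mb_per_io * TRANSFER_MS_PER_MB)

one_big   = io_cost_ms(1, 64)            # one 64MB read:  5 + 320 = 325ms
many_tiny = io_cost_ms(2048, 32 / 1024)  # 2048 x 32KB reads: ~10,560ms
print(one_big, many_tiny)                # the tiny reads are ~32x slower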

Inside a 64MB Page: The Storage Layout Question

Now that we understand DbFiles are made of 64MB pages, the critical question becomes: How do we organize data within each 64MB page?

The visualization below shows the two fundamental approaches for organizing data inside a single 64MB page from our Listens table:

One 64MB Listens Table Page: Two Storage Layouts

Row Storage (Traditional)

Row 1: [1, 1, 1, 4.5, "2024-08-30 14:35:00"]
Row 2: [2, 1, 2, 4.2, NULL]
Row 3: [3, 1, 6, 3.9, "2024-08-29 10:15:00"]
Row 4: [4, 2, 2, 4.7, NULL]
... 5 more rows ...

Page Metadata:
  • Record count: T(R) = 9
  • Free space: 63.9MB
  • Row pointers: [0, 35, 70, 105, ...]
  • Checksum: 0xABCD1234
  • Page version: 3

Access Pattern: Read Entire Rows
SELECT * FROM Listens WHERE user_id = 1
→ Reads 3 complete rows (all columns)

Row Storage Metrics: ✓ Fast single-record access ✓ Efficient for OLTP
Inefficiencies: ✗ Wastes I/O for analytics ✗ Poor compression

Columnar Storage (Analytics)

listen_id:   [1,2,3,4,5,6,7,8,9] (compressed: T(R)×4B → 12B)
user_id:     [1,1,1,2,2,2,3,3,3] (RLE: "1×3, 2×3, 3×3" → 6B)
song_id:     [1,2,6,2,7,8,1,2,6] (dictionary: {1,2,6,7,8} → 5B + T(R)×1B)
rating:      [4.5,4.2,3.9,4.7,4.6,3.9,2.9,4.9,NULL] (float compression: T(R)×4B → 18B)
listen_time: ["2024-08-30 14:35:00", NULL, "2024-08-29 10:15:00", ...] (timestamp compression: T(R)×20B → 60B)

Column Statistics:
  • Min/Max values per column
  • Null bitmaps
  • Compression: 101B total (vs. row layout's T(R)×48B = 432B)
  • Encodings: RLE, Dictionary, Delta

Access Pattern: Scan Columns
SELECT AVG(rating) FROM Listens WHERE user_id = 1
→ Reads only 2 columns (user_id, rating)

Columnar Benefits: ✓ 84% compression ratio ✓ Vectorized operations
Trade-offs: ✗ Slow single-record access ✗ Poor for updates
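
To make the two layouts concrete before worrying about bytes on disk, here is a minimal in-memory sketch (plain Python, data taken from the figure above) of the same records stored row-wise versus column-wise, and what each access pattern has to touch:

rows = [  # row storage: each record's fields sit together
    (1, 1, 1, 4.5, "2024-08-30 14:35:00"),
    (2, 1, 2, 4.2, None),   # None = NULL
    (3, 1, 6, 3.9, "2024-08-29 10:15:00"),
    (4, 2, 2, 4.7, None),
]

columns = {  # columnar storage: each attribute's values sit together
    "listen_id":   [1, 2, 3, 4],
    "user_id":     [1, 1, 1, 2],
    "song_id":     [1, 2, 6, 2],
    "rating":      [4.5, 4.2, 3.9, 4.7],
    "listen_time": ["2024-08-30 14:35:00", None, "2024-08-29 10:15:00", None],
}

# SELECT * WHERE user_id = 1: row layout hands back whole records in one touch.
session = [r for r in rows if r[1] == 1]

# SELECT AVG(rating) WHERE user_id = 1: columnar touches only 2 of 5 columns.
vals = [r for u, r in zip(columns["user_id"], columns["rating"])
        if u == 1 and r is not None]
avg_rating = sum(vals) / len(vals)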

Compression Magic for Columnar: The Scale Effect

Using our Spotify Listens table, the illustrative example below shows how compression ratios improve dramatically with scale:

Column       Small Scale (T(R) = 9)   Massive Scale (T(R) = 100M)   Compression Technique & Scaling
user_id      36B → 6B (83%)           400MB → 8MB (98%)             RLE: [1×3, 2×3, 3×3] → [1×500, 2×300, ...]
song_id      36B → 14B (61%)          400MB → 50MB (88%)            Dictionary: 5 unique → 10M unique; top 1K = 80% of listens
rating       36B → 18B (50%)          400MB → 40MB (90%)            Quantization: NULL bitmap → 0.1 precision = 1 byte
listen_time  180B → 60B (67%)         2GB → 200MB (90%)             Delta: random timestamps → binge sessions compress 10:1
Total        432B → 98B (77%)         4.8GB → 298MB (94%)           More data = more repetition patterns

The Scaling Law: Small datasets (77% compression) → Massive datasets (94%+ compression)

Why? Repetition patterns emerge at scale: popular songs are played millions of times, active users accumulate thousands of listens, and binge-listening sessions create predictable timestamp sequences. Rows compress too, but far less, because a row packs mixed data types side by side, leaving little repetition for the encoder to exploit.
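
A tiny run-length encoder makes the scale effect visible (illustrative Python; the data is made up, and note that RLE pays off only when the column is sorted or clustered by that value, as user_id is here):

from itertools import groupby

def rle(values):
    """Run-length encode: [1,1,1,2,2,2] -> [(1,3), (2,3)]."""
    return [(v, len(list(g))) for v, g in groupby(values)]

small = [1, 1, 1, 2, 2, 2, 3, 3, 3]        # 9 values  -> 3 (value, count) pairs
big   = [1] * 500 + [2] * 300 + [3] * 200  # 1000 values -> still 3 pairs
print(len(rle(small)), len(rle(big)))      # 3 3: runs grow, run count doesn't
# The longer a user's listening history, the more values each (value, count)
# pair absorbs, so the compression ratio improves as T(R) grows.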

This is why BigQuery, Snowflake, and Redshift can scan petabytes efficiently - they're reading highly compressed columnar data, not just columnar data.
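
Dictionary encoding, which the table above applies to song_id, is just as simple in sketch form (illustrative Python):

def dict_encode(values):
    # Store each distinct value once; replace every occurrence with a small code.
    dictionary = {v: code for code, v in enumerate(dict.fromkeys(values))}
    codes = [dictionary[v] for v in values]
    return list(dictionary), codes

song_ids = [1, 2, 6, 2, 7, 8, 1, 2, 6]
dictionary, codes = dict_encode(song_ids)
print(dictionary)  # [1, 2, 6, 7, 8]: 5 distinct songs stored once
print(codes)       # [0, 1, 2, 1, 3, 4, 0, 1, 2]: one small code per listen
# At 100M listens where the top ~1K songs cover 80% of plays, codes fit in
# 1-2 bytes while raw 4B song_ids do not: roughly the 400MB -> 50MB above.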


The Core Trade-off: Query Patterns Drive Everything

Different query patterns need different storage layouts. Here's the fundamental divide:

OLTP: Row Storage Wins

-- Get full user listening session
SELECT * FROM Listens 
WHERE user_id = 1 
  AND listen_time > '2024-08-30';

-- Update a single rating
UPDATE Listens 
SET rating = 5.0 
WHERE listen_id = 1;

-- Insert new listen
INSERT INTO Listens VALUES 
(10, 1, 3, 4.8, NOW());

Why Row Storage?

  • Need all columns together
  • Frequent updates/inserts
  • Single record operations
  • Examples: PostgreSQL, MySQL

Performance Reality:

Operation               Row Storage Performance
SELECT * WHERE id = 1   10μs ✓
UPDATE SET rating = 5   10μs ✓
INSERT single row       10μs ✓

Row storage dominates for transactional workloads

Analytics: Columnar Dominates

-- Average rating per genre
SELECT genre, AVG(rating)
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
GROUP BY genre;

-- Top songs last month
SELECT song_id, COUNT(*) as plays
FROM Listens
WHERE listen_time > '2024-07-01'
GROUP BY song_id
ORDER BY plays DESC;

Why Columnar?

  • Only need specific columns
  • Aggregations over many rows
  • Read-heavy, append-only
  • Examples: BigQuery, Snowflake

Performance Reality:

Operation            Columnar Storage Performance
SELECT AVG(rating)   5ms ✓
GROUP BY artist      100ms ✓
Bulk load 1M rows    1s ✓

Columnar storage dominates for analytical workloads
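
A rough bytes-touched comparison (illustrative Python, reusing the 48B/row and ~90% rating-compression figures from earlier) shows why each side wins its own workload:

ROW_BYTES = 48            # one full Listens record, per the page-layout example
T_R = 100_000_000         # 100M listens

# SELECT AVG(rating) on row storage: every full record crosses the I/O path.
row_scan = T_R * ROW_BYTES                 # 4.8GB scanned

# On columnar storage: only the rating column, compressed ~90% (table above).
col_scan = T_R * 4 * (1 - 0.90)            # ~40MB scanned

print(f"{row_scan / col_scan:.0f}x less data scanned")  # ~120x
# The asymmetry flips for OLTP: fetching one row from columnar storage means
# probing every column chunk to reassemble it, while row storage reads one page.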


Real-World Systems: From Single Machine to Distributed

Storage format and scale are independent choices. Let's see how they combine in practice:

Single Machine Systems

Row Storage: Single Machine

OLTP Systems
PostgreSQL, MySQL, SQLite

┌──────────────────────────────────────────┐
│             64MB Page in RAM             │
├──────────────────────────────────────────┤
│ Row 1: [listen_id=1, user_id=1, ...]     │
│ Row 2: [listen_id=2, user_id=1, ...]     │
│ Row 3: [listen_id=3, user_id=1, ...]     │
│ Row 4: [listen_id=4, user_id=2, ...]     │
│ Row 5: [listen_id=5, user_id=2, ...]     │
│        ... thousands more rows ...       │
└──────────────────────────────────────────┘

Use Cases:
- Transactions, < 1TB data
- Mobile apps (SQLite)
- Web applications
- Single-user workloads

Columnar Storage: Single Machine

Local Analytics
DuckDB, Parquet files

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Page 1    │  │   Page 2    │  │   Page 3    │
│   user_id   │  │   song_id   │  │   rating    │
├─────────────┤  ├─────────────┤  ├─────────────┤
│ [1,1,1,2,2, │  │ [1,2,6,2,7, │  │ [4.5,4.2,   │
│  2,3,3,3,1, │  │  8,1,2,6,3, │  │  3.9,4.7,   │
│  1,1,2,2,3] │  │  4,5,7,8,9] │  │  4.6,3.9,…] │
└─────────────┘  └─────────────┘  └─────────────┘
        ↓                ↓                ↓
   Compressed       Dictionary        Quantized

Use Cases:
- Data science on a laptop
- Parquet: standard file format
- Local analytics
- Research projects
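
As a concrete taste, a few lines with the pyarrow library (assuming it is installed; the file name is illustrative) show both the columnar write and the column pruning on read:

import pyarrow as pa
import pyarrow.parquet as pq

listens = pa.table({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "song_id": [1, 2, 6, 2, 7, 8, 1, 2, 6],
    "rating":  [4.5, 4.2, 3.9, 4.7, 4.6, 3.9, 2.9, 4.9, None],
})
pq.write_table(listens, "listens.parquet")  # column chunks, dictionary/RLE-encoded

# Column pruning: only the bytes for user_id and rating are read from disk.
subset = pq.read_table("listens.parquet", columns=["user_id", "rating"])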

Distributed Systems

Row Storage: Distributed

OLTP at Scale
Aurora, Google Spanner

    Machine 1                    Machine 2
┌─────────────────────┐    ┌─────────────────────┐
│    64MB Pages       │    │    64MB Pages       │
├─────────────────────┤    ├─────────────────────┤
│ Rows 1-50K          │    │ Rows 50K-100K       │
│ [1,1,1,4.5,...]     │    │ [2,2,2,4.7,...]     │
│ [2,1,2,4.2,...]     │    │ [3,2,7,3.9,...]     │
│ [3,1,6,3.9,...]     │    │ [4,3,8,4.1,...]     │
│ ... thousands ...   │    │ ... thousands ...   │
└─────────────────────┘    └─────────────────────┘
         │                           │
    ┌────┴────┐                 ┌────┴────┐
    │ Replica │                 │ Replica │
    └─────────┘                 └─────────┘

Use Cases:
- Global transactions
- 99.999% availability
- Banking, e-commerce
- Multi-region ACID
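
A minimal sketch of the row-assignment logic the diagram implies (illustrative Python; range partitioning with made-up constants):

ROWS_PER_MACHINE = 50_000   # per the diagram: machine 1 owns rows 1-50K

def machine_for(listen_id: int) -> int:
    # Range partitioning: each machine owns one contiguous slice of full rows,
    # so a single-record read or update touches exactly one machine (plus replicas).
    return (listen_id - 1) // ROWS_PER_MACHINE

print(machine_for(1), machine_for(50_000), machine_for(50_001))  # 0 0 1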

Columnar Storage: Distributed

OLAP at Scale
BigQuery, Snowflake, Redshift

    Machine 1                    Machine 2
┌─────────────────────┐    ┌─────────────────────┐
│   Column Chunks     │    │   Column Chunks     │
├─────────────────────┤    ├─────────────────────┤
│ user_id (100M vals) │    │ song_id (100M vals) │
│ [1,1,1,2,2,2,...]   │    │ [1,2,6,2,7,8,...]   │
│ Compressed: 2MB     │    │ Compressed: 15MB    │
├─────────────────────┤    ├─────────────────────┤
│ rating (100M vals)  │    │ timestamps (100M)   │
│ [4.5,4.2,3.9,...]   │    │ [2024-08-30,...]    │
│ Compressed: 10MB    │    │ Compressed: 20MB    │
└─────────────────────┘    └─────────────────────┘
         ↑                           ↑
   50GB → 47MB                  50GB → 35MB

Use Cases:
- PB-scale analytics
- Data warehousing
- Real-time reporting
- Machine learning
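
On the query side the pattern is scatter-gather: each machine aggregates its own compressed column chunks, and only tiny partial results cross the network. A minimal sketch (illustrative Python, made-up chunk data):

def partial_agg(ratings):
    """Each machine computes (sum, count) over its local rating chunk."""
    vals = [r for r in ratings if r is not None]
    return sum(vals), len(vals)

machine_chunks = [
    [4.5, 4.2, 3.9, 4.7],        # machine 1's rating chunk (decompressed on the fly)
    [4.6, 3.9, 2.9, 4.9, None],  # machine 2's rating chunk
]
partials = [partial_agg(chunk) for chunk in machine_chunks]
total, count = map(sum, zip(*partials))
print(total / count)  # global AVG(rating) from a few bytes of partial state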

The Pattern

Storage format and deployment scale are orthogonal choices: pick row vs. columnar based on your query pattern, then pick single machine vs. distributed based on your data volume and availability needs.

Key Takeaways

  1. Visual Layout Matters: How you organize data in 64MB pages determines everything

  2. Query Pattern = Storage Choice:
     - OLTP (many small updates) → Row storage (PostgreSQL, MySQL)
     - OLAP (few large scans) → Columnar storage (BigQuery, Snowflake)

  3. Compression Magic: Columnar storage climbs from 77% to 94%+ compression as data scales

  4. Performance Reality: Each format dominates its own use case (see the comparison tables)

  5. Scale Independently: Row/columnar and single-machine/distributed are separate choices

  6. Modern Analytics: Columnar + compression + distribution = scan petabytes in seconds