Storage Layout: Row vs Columnar

How Tables Live on Disk: From OS Files to DbFiles

OS Files: The operating system supports basic files (any size, unstructured bytes).

DbFiles: the database's own files, structured as a sequence of fixed-size 64MB pages. Every database table lives on disk as a DbFile.

Spotify Example:

Listens Table → listens.dbf (massive DbFile, 1000 pages = 64GB total)
├── Page 0: Contains 64MB of Listens data  
├── Page 1: Contains 64MB of Listens data
├── Page 2: Contains 64MB of Listens data  
├── Page 3: Contains 64MB of Listens data
└── ... (997 more pages)

Songs Table → songs.dbf (smaller DbFile, 10 pages = 640MB total)
├── Page 0: Contains 64MB of Songs data
├── Page 1: Contains 64MB of Songs data  
└── ... (8 more pages)

Why DbFiles Matter: Each 64MB page is the unit of I/O. When you query data, the database reads entire 64MB pages from disk to memory, not individual records.
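A minimal sketch of what fetching one page looks like, assuming a hypothetical read_page helper over an ordinary OS file (real engines add buffering, caching, and error handling):

PAGE_SIZE = 64 * 2**20  # 64MB: the unit of I/O

def read_page(path: str, page_no: int) -> bytes:
    # Seek straight to the page boundary, then pull the whole
    # 64MB page into memory with a single read.
    with open(path, "rb") as f:
        f.seek(page_no * PAGE_SIZE)
        return f.read(PAGE_SIZE)

# read_page("listens.dbf", 2) returns bytes [128MB, 192MB) of the DbFile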

Why 64MB pages? Recall I/O costs: I/O devices are slow, and every request pays a fixed overhead. Bulk operations (e.g. one 64MB I/O vs. ~2048 separate I/Os for 32KB pages) amortize that overhead, as the sketch below shows.
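A back-of-envelope model makes the amortization visible. The latency and bandwidth figures below are illustrative assumptions, not measurements of any particular device:

# Each request pays a fixed latency, then streams at the device bandwidth.
LATENCY_S   = 0.005          # assumed ~5ms per random I/O
BANDWIDTH   = 200 * 2**20    # assumed ~200MB/s sequential transfer
TOTAL_BYTES = 64 * 2**20     # the same 64MB, read two different ways

def cost(n_requests: int) -> float:
    # Total time = per-request latency overhead + pure transfer time
    return n_requests * LATENCY_S + TOTAL_BYTES / BANDWIDTH

print(f"one 64MB I/O:     {cost(1):.2f}s")     # ~0.33s
print(f"2048 x 32KB I/Os: {cost(2048):.2f}s")  # ~10.56s: latency dominates

The transfer time (0.32s) is identical in both cases; the small-page version loses purely to per-request overhead.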

Inside a 64MB Page: The Storage Layout Question

  1. Row Storage

  2. Columnar Storage

  3. Hybrid Storage

1. Row-Oriented Storage: The Slotted Page

Inside a 64MB page, the database uses a slotted-page design to handle variable-length rows and fragmented free space.

Incoming raw records (full rows):
RAW: [ID:001, User:42, Name:"Lullaby", Rate:4.5], [ID:002, User:8, Name:"Intro"...]

Page 1 (slotted layout)
├── Fixed-size header
├── Row 1 (96 bytes)   ← data grows down
├── Row 2 (112 bytes)
├── Row 3 (88 bytes)
├── ... free space ...
└── Slot pointers (grow up from the end): Slot 0 → Row 1, Slot 1 → Row 2, Slot 2 → Row 3
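A toy slotted page in Python (illustrative only, not any real engine's on-disk format; the slot directory is modeled as a separate list rather than stored at the page's tail):

class SlottedPage:
    def __init__(self, size: int):
        self.size = size
        self.data = bytearray(size)      # the raw 64MB page
        self.free_start = 0              # where the next row's bytes go
        self.slots = []                  # directory of (offset, length) entries

    def free_space(self) -> int:
        # assume 8 bytes of page space per slot-directory entry
        return self.size - self.free_start - 8 * len(self.slots)

    def insert(self, row: bytes) -> int:
        if len(row) + 8 > self.free_space():
            raise ValueError("page full")
        offset = self.free_start
        self.data[offset:offset + len(row)] = row
        self.free_start += len(row)
        self.slots.append((offset, len(row)))
        return len(self.slots) - 1       # slot id: stable even if rows move

    def get(self, slot_id: int) -> bytes:
        offset, length = self.slots[slot_id]
        return bytes(self.data[offset:offset + length])

page = SlottedPage(64 * 2**20)
sid = page.insert(b'ID:001,User:42,Name:"Lullaby",Rate:4.5')
print(page.get(sid))

Because readers go through the slot directory rather than raw byte offsets, the page can compact its variable-length rows without invalidating any record IDs.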

2. Columnar Storage: Segments & Compression

Instead of one page containing many complete rows, one page typically contains the values of one specific column across many rows.

Incoming raw data stream (repetitive/redundant):

COLUMN: user_id (RLE encoding)
├── RAW: [1, 1, 1, 1, 1, 2, 2, ...]
├── ENCODED: 0x001 [Count: 520], 0x002 [Count: 140], 0x042 [Count: 88], ...
└── Billions of IDs packed; paginated: Page 1 → 2400

COLUMN: rating (bit-packing)
├── RAW: [4.5, 4.2, 5.0, 3.8, ...]
├── ENCODED: 101101 (4.5), 101010 (4.2), 110010 (5.0), ...
└── Quantized ratings; paginated: Page 1 → 800

COLUMN: song_name (dictionary encoding)
├── RAW: ["Lullaby", "Intro", "Lullaby", ...]
├── DICT MAP (shared meta): 0: "Lullaby" | 1: "Intro" | 2: "Blues" | 3: "Jazz" ... (pruning overhead)
├── ENCODED: 0, 0, 1, 0, 2, 0, 1... 1, 1, 0, 3, 2, 2, 0...
└── Billions of song names packed into dict indices; paginated: Page 1 → 15000
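Two of the encodings in the diagram are simple to sketch (a simplified illustration, not any engine's actual implementation):

def rle_encode(values):
    # Run-length encoding: collapse runs of equal values into (value, count).
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, c) for v, c in runs]

def dict_encode(values):
    # Dictionary encoding: store each distinct string once; the column
    # itself becomes a stream of small integer codes.
    mapping, codes = {}, []
    for v in values:
        codes.append(mapping.setdefault(v, len(mapping)))
    return mapping, codes

print(rle_encode([1, 1, 1, 1, 1, 2, 2]))            # [(1, 5), (2, 2)]
print(dict_encode(["Lullaby", "Intro", "Lullaby"]))
# ({'Lullaby': 0, 'Intro': 1}, [0, 1, 0])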

Compression Magic for Columnar: The Scale Effect

Compression ratios improve dramatically with more data. Example Spotify Listens table:

| Column      | Small Scale (T(R) = 9) | Massive Scale (T(R) = 100M) | Compression Technique & Scaling                            |
|-------------|------------------------|-----------------------------|------------------------------------------------------------|
| user_id     | 36B → 6B (83%)         | 400MB → 8MB (98%)           | RLE: [1×3, 2×3, 3×3] → [1×500, 2×300, ...]                 |
| song_name   | 36B → 14B (61%)        | 400MB → 50MB (88%)          | Dictionary: 5 unique → 10M unique; top 1K = 80% of listens |
| rating      | 36B → 18B (50%)        | 400MB → 40MB (90%)          | Quantization: NULL bitmap → 0.1 precision = 1 byte         |
| listen_time | 180B → 60B (67%)       | 2GB → 200MB (90%)           | Delta: random timestamps → binge sessions compress 10:1    |
| Total       | 432B → 98B (77%)       | 4.8GB → 298MB (94%)         | More data = more repetition patterns                       |

The Scaling Law: Small datasets (77% compression) → Massive datasets (94%+ compression)

Why? Repetition patterns emerge at scale: popular songs played millions of times, active users with thousands of listens, predictable timestamp sequences. Rows can compress too, but not nearly as much, because each row interleaves different data types.
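You can demonstrate this with a general-purpose compressor. The sketch below serializes the same synthetic data row-wise and column-wise, then compresses each with zlib; the columnar layout should compress noticeably better because similar values sit next to each other (all data here is made up for illustration):

import random, zlib

random.seed(0)
n = 100_000
user_ids = [random.randint(1, 50) for _ in range(n)]
songs    = [random.choice(["Lullaby", "Intro", "Blues"]) for _ in range(n)]
ratings  = [random.choice([3.5, 4.0, 4.5, 5.0]) for _ in range(n)]

# Row layout: every record interleaves all three columns.
rows = "".join(f"{u},{s},{r};" for u, s, r in zip(user_ids, songs, ratings)).encode()

# Columnar layout: each column stored contiguously.
cols = "|".join([",".join(map(str, user_ids)),
                 ",".join(songs),
                 ",".join(map(str, ratings))]).encode()

for name, blob in [("row", rows), ("columnar", cols)]:
    packed = zlib.compress(blob, 9)
    print(f"{name:8s} {len(blob):>8} B -> {len(packed):>7} B "
          f"({100 * (1 - len(packed) / len(blob)):.0f}% saved)")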

This is why BigQuery, Snowflake, and Redshift can scan petabytes efficiently - they're reading highly compressed columnar data, not just columnar data.


The Core Trade-off: Query Patterns Drive Everything

Different query patterns need different storage layouts. Here's the fundamental divide:

OLTP: Row Storage Wins

-- Get full user listening session
SELECT * FROM Listens 
WHERE user_id = 1 
  AND listen_time > '2024-08-30';

-- Update a single rating
UPDATE Listens 
SET rating = 5.0 
WHERE listen_id = 1;

-- Insert new listen
INSERT INTO Listens VALUES 
(10, 1, 3, 4.8, NOW());

Why Row Storage?

  • Need all columns together
  • Frequent updates/inserts
  • Single record operations
  • Examples: PostgreSQL, MySQL

OLAP: Columnar Wins

-- Average rating per genre
SELECT genre, AVG(rating)
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
GROUP BY genre;

-- Top songs last month
SELECT song_id, COUNT(*) as plays
FROM Listens
WHERE listen_time > '2024-07-01'
GROUP BY song_id
ORDER BY plays DESC;

Why Columnar?

  • Only need specific columns
  • Aggregations over many rows
  • Read-heavy, append-only
  • Examples: BigQuery, Snowflake
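A toy page-count model makes the divide concrete (all sizes below are assumptions chosen for illustration):

PAGE = 64 * 2**20              # 64MB page
T, C, V = 100_000_000, 5, 8    # assumed: rows, columns, bytes per value

rows_per_page = PAGE // (C * V)    # full rows that fit in one page
vals_per_page = PAGE // V          # single-column values per page

# OLTP: fetch one full row (assume an index locates its page directly).
oltp_row = 1                       # the whole record sits in one page
oltp_col = C                       # one page per column to rebuild the row

# OLAP: aggregate one column over every row (ceiling division).
olap_row = -(-T // rows_per_page)  # must scan every page of the table
olap_col = -(-T // vals_per_page)  # only that column's pages

print(f"point lookup: row store {oltp_row} page,  columnar {oltp_col} pages")
print(f"column scan:  row store {olap_row} pages, columnar {olap_col} pages")

Under these numbers the row store wins the point lookup 1 page to 5, while the columnar scan touches 12 pages instead of 60: exactly the C-fold saving of reading one column out of five.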

Key Takeaways

  1. Layout Matters: How you organize data inside 64MB pages determines what each I/O actually delivers

  2. Query Pattern = Storage Choice: OLTP (many small updates) → Row storage (PostgreSQL, MySQL). OLAP (few large scans) → Columnar storage (BigQuery, Snowflake)

  3. Compression Magic: Columnar can go from 77% to 94%+ compression as data scales

Modern Analytics: Columnar + compression + distribution = scan petabytes in seconds