Storage Layout: Row vs Columnar
How Tables Live on Disk: From OS Files to DbFiles
OS Files: The operating system supports basic files (any size, unstructured bytes).
DbFiles: The database's own files, structured as a sequence of fixed-size 64MB pages. Every database table lives on disk as a DbFile.
Spotify Example:
Listens Table → listens.dbf (massive DbFile, 1000 pages = 64GB total)
├── Page 0: Contains 64MB of Listens data
├── Page 1: Contains 64MB of Listens data
├── Page 2: Contains 64MB of Listens data
├── Page 3: Contains 64MB of Listens data
└── ... (996 more pages)
Songs Table → songs.dbf (smaller DbFile, 10 pages = 640MB total)
├── Page 0: Contains 64MB of Songs data
├── Page 1: Contains 64MB of Songs data
└── ... (8 more pages)
Why DbFiles Matter: Each 64MB page is the unit of I/O. When you query data, the database reads entire 64MB pages from disk to memory, not individual records.
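In code terms, "reading a page" is just a seek to a fixed offset followed by one large sequential read. A minimal sketch (the file name and helper are hypothetical, not part of any real engine):

```python
PAGE_SIZE = 64 * 1024 * 1024  # 64MB: the unit of I/O

def read_page(path: str, page_no: int) -> bytes:
    """Fetch one whole page from a DbFile; pages live at fixed offsets."""
    with open(path, "rb") as f:
        f.seek(page_no * PAGE_SIZE)  # page N starts at byte N * 64MB
        return f.read(PAGE_SIZE)     # one bulk read, never a single record

# e.g. page = read_page("listens.dbf", 2)  # all 64MB of page 2
```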
Why 64MB pages? Recall I/O costs: I/O devices are slow, and every operation carries fixed overhead. Bulk operations (e.g. one 64MB I/O instead of 2,048 separate 32KB I/Os) amortize that overhead.
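Back-of-the-envelope arithmetic makes the amortization concrete. The device numbers below are assumptions (spinning-disk-ish: ~5ms of fixed overhead per I/O, ~200MB/s sequential transfer), not measurements:

```python
SEEK_MS = 5.0            # assumed fixed overhead per I/O
BANDWIDTH_MB_S = 200.0   # assumed sequential transfer rate

def io_cost_ms(num_ios: int, total_mb: float) -> float:
    """Total time = per-I/O overhead + time to move the bytes."""
    return num_ios * SEEK_MS + total_mb / BANDWIDTH_MB_S * 1000

print(io_cost_ms(1, 64))     # one 64MB I/O:          5 + 320 =    325 ms
print(io_cost_ms(2048, 64))  # 2,048 x 32KB I/Os: 10,240 + 320 = 10,560 ms
```

Same bytes moved, roughly 32x slower when the fixed per-I/O overhead isn't amortized.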
Inside a 64MB Page: The Storage Layout Question
- Row Storage
- Columnar Storage
- Hybrid Storage
1. Row-Oriented Storage: The Slotted Page
Inside a 64MB page, the database uses a slotted-page design to handle variable-length rows and fragmented space:
- Record data grows from the top of the page and the slot array grows from the bottom, with free space in between.
- If a row is updated and stays the same size, we update it in place.
- If it grows, we might have to move rows within the page, but record IDs stay valid because we only update the pointer in the slot array (see the sketch below).
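Here's a toy slotted page to make the pointer indirection concrete. It's a sketch under simplifying assumptions: the slot array is a Python list rather than bytes packed at the far end of the page, and freed space is never reclaimed:

```python
class SlottedPage:
    """Toy slotted page: a slot array of (offset, length) entries indexing
    into a data region. Slot ids stay stable even when records move."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.slots: list[tuple[int, int]] = []  # slot_id -> (offset, length)
        self.free_ptr = 0                       # record data grows from the top

    def insert(self, record: bytes) -> int:
        off = self.free_ptr
        self.data[off:off + len(record)] = record
        self.free_ptr += len(record)
        self.slots.append((off, len(record)))
        return len(self.slots) - 1              # stable slot id

    def read(self, slot_id: int) -> bytes:
        off, length = self.slots[slot_id]
        return bytes(self.data[off:off + length])

    def update(self, slot_id: int, record: bytes) -> None:
        off, length = self.slots[slot_id]
        if len(record) <= length:                      # same size or smaller:
            self.data[off:off + len(record)] = record  # update in place
            self.slots[slot_id] = (off, len(record))
        else:                                   # grew: relocate the bytes...
            new_off = self.free_ptr
            self.data[new_off:new_off + len(record)] = record
            self.free_ptr += len(record)
            self.slots[slot_id] = (new_off, len(record))  # ...update pointer only

page = SlottedPage(4096)
rid = page.insert(b"user=1,song=3,rating=4.8")
page.update(rid, b"user=1,song=3,rating=4.8,device=phone")  # record moved,
assert page.read(rid).endswith(b"device=phone")             # rid still valid
```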
2. Columnar Storage: Segments & Compression
Instead of one page containing many rows, one page typically contains values for one specific column for many rows.
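For intuition, here are the same three (hypothetical) listens laid out both ways:

```python
# Row storage: whole records adjacent -- good for "fetch everything
# about one listen".
rows = [
    (1, "Bohemian Rhapsody", 4.8),
    (1, "Bohemian Rhapsody", 5.0),
    (2, "Levitating", 3.5),
]

# Columnar storage: each column's values adjacent -- good for AVG(rating)
# over millions of rows, and runs of equal values compress well.
columns = {
    "user_id":   [1, 1, 2],
    "song_name": ["Bohemian Rhapsody", "Bohemian Rhapsody", "Levitating"],
    "rating":    [4.8, 5.0, 3.5],
}
```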
Compression Magic for Columnar: The Scale Effect
Compression ratios improve dramatically with more data. Example: the Spotify Listens table:
| Column | Small Scale (T(R) = 9) | Massive Scale (T(R) = 100M) | Compression Technique & Scaling |
|---|---|---|---|
| user_id | 36B → 6B (83%) | 400MB → 8MB (98%) | RLE: [1×3, 2×3, 3×3] → [1×500, 2×300, ...] |
| song_name | 36B → 14B (61%) | 400MB → 50MB (88%) | Dictionary: 5 unique → 10M unique, top 1K = 80% of listens |
| rating | 36B → 18B (50%) | 400MB → 40MB (90%) | Quantization: NULL bitmap → 0.1 precision = 1 byte |
| listen_time | 180B → 60B (67%) | 2GB → 200MB (90%) | Delta: Random timestamps → binge sessions compress 10:1 |
| Total | 432B → 98B (77%) | 4.8GB → 298MB (94%) | More data = more repetition patterns |

(The totals exceed the sum of the four columns shown because they also include the listen_id and song_id key columns, ~16B per row.)
The Scaling Law: Small datasets (77% compression) → Massive datasets (94%+ compression)
Why? Repetition patterns emerge at scale: popular songs are played millions of times, active users accumulate thousands of listens, and timestamps follow predictable sequences. Rows can be compressed too, but not nearly as well, because each row mixes different data types.
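A toy run-length encoder shows the scale effect on a sorted user_id column. The byte widths below (4B per raw value, 8B per run header) are assumptions, and the exact ratios differ from the table's, but the trend is the point: short runs barely pay for their headers, while long runs compress almost completely. Dictionary and delta encoding benefit from scale for the same reason:

```python
from itertools import groupby

def rle(values):
    """Run-length encode: [1, 1, 1, 2, 2] -> [(1, 3), (2, 2)]."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def savings(values, value_bytes=4, run_bytes=8):
    """Fraction of space saved vs. storing every value uncompressed."""
    raw = len(values) * value_bytes
    compressed = len(rle(values)) * run_bytes
    return 1 - compressed / raw

small = [1, 1, 1, 2, 2, 2, 3, 3, 3]                    # T(R) = 9
big = [u for u in range(1_000) for _ in range(1_000)]  # T(R) = 1M: 1,000 listens/user

print(f"{savings(small):.0%}")  # 33% -- runs of 3 barely beat the run headers
print(f"{savings(big):.1%}")    # 99.8% -- long runs amortize the headers away
```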
This is why BigQuery, Snowflake, and Redshift can scan petabytes efficiently - they're reading highly compressed columnar data, not just columnar data.
The Core Trade-off: Query Patterns Drive Everything
Different query patterns need different storage layouts. Here's the fundamental divide:
OLTP: Row Storage Wins
-- Get full user listening session
SELECT * FROM Listens
WHERE user_id = 1
AND listen_time > '2024-08-30';
-- Update a single rating
UPDATE Listens
SET rating = 5.0
WHERE listen_id = 1;
-- Insert new listen
INSERT INTO Listens VALUES
(10, 1, 3, 4.8, NOW());
Why Row Storage?
- Need all columns together
- Frequent updates/inserts
- Single record operations
- Examples: PostgreSQL, MySQL
OLAP: Columnar Wins
-- Average rating per genre
SELECT genre, AVG(rating)
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
GROUP BY genre;
-- Top songs last month
SELECT song_id, COUNT(*) as plays
FROM Listens
WHERE listen_time > '2024-07-01'
GROUP BY song_id
ORDER BY plays DESC;
Why Columnar?
- Only need specific columns
- Aggregations over many rows
- Read-heavy, append-only
- Examples: BigQuery, Snowflake
Key Takeaways
- Layout Matters: How you organize data inside 64MB pages determines which queries are cheap.
- Query Pattern = Storage Choice: OLTP (many small updates) → row storage (PostgreSQL, MySQL). OLAP (few large scans) → columnar storage (BigQuery, Snowflake).
- Compression Magic: Columnar compression improves from ~77% to 94%+ as data scales.