Storage Layout: Row vs Columnar
How Tables Live on Disk: From OS Files to DbFiles
OS Files: Think of these as the OS's basic building blocks—unstructured bytes with no rules.
DbFiles: These are the database's tailored files, neatly packed into fixed 64MB pages. Every table you care about exists as a DbFile.
Why DbFiles Matter: Each 64MB page is your I/O workhorse. When you query, the database hauls in whole 64MB chunks, not piecemeal records.
Why 64MB pages? I/O devices are sloths. Bulk operations—like one 64MB read versus 2000 separate 32KB reads—keep costs in check.
Inside a 64MB Page: The Storage Layout Question
-
Row Storage
-
Columnar Storage
-
Hybrid Storage
1. Row-Oriented Storage: The Slotted Page
In a 64MB page, the database uses a Slots design to juggle variable-length rows and fragmented space.
-
Data grows from the top; slots from the bottom.
-
Update a row in place if its size doesn't change.
-
If it grows, other rows might move, but only the Slot Array pointer updates.
2. Columnar Storage: Segments & Compression
Here, a page typically holds values for one specific column across many rows.
Compression Magic for Columnar: The Scale Effect
More data means better compression. Take Spotify's Listens table:
| Column | Small Scale (T(R) = 9) | Massive Scale (T(R) = 100M) | Compression Technique & Scaling |
|---|---|---|---|
| user_id | 36B → 6B (83%) | 400MB → 8MB (98%) | RLE: [1×3, 2×3, 3×3] → [1×500, 2×300, ...] |
| song_name | 36B → 14B (61%) | 400MB → 50MB (88%) | Dictionary: 5 unique → 10M unique, top 1K = 80% of listens |
| rating | 36B → 18B (50%) | 400MB → 40MB (90%) | Quantization: NULL bitmap → 0.1 precision = 1 byte |
| listen_time | 180B → 60B (67%) | 2GB → 200MB (90%) | Delta: Random timestamps → binge sessions compress 10:1 |
| Total | 432B → 98B (77%) | 4.8GB → 298MB (94%) | More data = more repetition patterns |
The Scaling Law: Small datasets (77% compression) → Massive datasets (94%+ compression)
Why? Repetition patterns emerge at scale: Popular songs played millions of times, active users with thousands of listens, predictable timestamp sequences. Rows can compress, but not nearly as much (mix of different data types inside a row).
This is why BigQuery, Snowflake, and Redshift can scan petabytes efficiently—they're reading highly compressed columnar data, not just columnar data.
The Core Trade-off: Query Patterns Drive Everything
Different query patterns demand different storage layouts. Here's the fundamental divide:
OLTP: Row Storage Wins
Why Row Storage?
-
Need all columns together
-
Frequent updates/inserts
-
Single record operations
-
Examples: PostgreSQL, MySQL
OLAP: Columnar Wins
Why Columnar?
-
Only need specific columns
-
Aggregations over many rows
-
Read-heavy, append-only
-
Examples: BigQuery, Snowflake
Key Takeaways
-
Layout Matters: How you organize data in 64MB pages.
-
Query Pattern = Storage Choice: OLTP (many small updates) → Row storage (PostgreSQL, MySQL). OLAP (few large scans) → Columnar storage (BigQuery, Snowflake).
-
Compression Magic: Columnar can get 77% → 99%+ compression as data scales.