Storage Layout: Row vs Columnar
How Tables Live on Disk: From OS Files to DbFiles
OS Files: The operating system supports basic files (any size, unstructured bytes).
DbFiles: the database's own files, structured as a sequence of fixed-size 64MB pages. Every database table lives on disk as a DbFile.
Spotify Example:
Listens Table → listens.dbf (massive DbFile, 1000 pages = 64GB total)
├── Page 0: Contains 64MB of Listens data
├── Page 1: Contains 64MB of Listens data
├── Page 2: Contains 64MB of Listens data
├── Page 3: Contains 64MB of Listens data
└── ... (996 more pages)
Songs Table → songs.dbf (smaller DbFile, 10 pages = 640MB total)
├── Page 0: Contains 64MB of Songs data
├── Page 1: Contains 64MB of Songs data
└── ... (8 more pages)
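Fixed-size pages make random access within a DbFile trivial: page N starts at byte offset N × 64MB. Here is a minimal sketch of a page reader (a hypothetical helper for illustration, not a real DbFile API):

```python
PAGE_SIZE = 64 * 2**20  # fixed 64MB pages, as described above

def read_page(path: str, page_no: int) -> bytes:
    """Read one page: with fixed-size pages, page N lives at byte N * PAGE_SIZE."""
    with open(path, "rb") as f:
        f.seek(page_no * PAGE_SIZE)
        return f.read(PAGE_SIZE)

# e.g. fetch page 3 of the Listens table:
# page_bytes = read_page("listens.dbf", 3)
```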
Why DbFiles Matter: Each 64MB page is the unit of I/O. When you query data, the database reads entire 64MB pages from disk into memory, not individual records.
Why 64MB pages? Recall I/O costs: I/O devices are slow, and every request pays a fixed per-request overhead. Bulk operations amortize that overhead: one 64MB I/O replaces roughly 2,048 separate I/Os at a 32KB page size.
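To make the amortization concrete, here is a back-of-the-envelope cost model. The 5 ms per-request overhead and 100 MB/s bandwidth are assumed round numbers for a spinning disk, not measurements:

```python
# Simple I/O cost model: every request pays a fixed overhead, then pays
# for the bytes transferred at some sequential bandwidth.
SEEK_OVERHEAD_S = 0.005      # assumed ~5 ms fixed cost per I/O request
BANDWIDTH_BPS = 100 * 2**20  # assumed ~100 MB/s transfer rate

def io_time(num_requests: int, bytes_per_request: int) -> float:
    """Total seconds to read num_requests chunks of bytes_per_request bytes each."""
    transfer = num_requests * bytes_per_request / BANDWIDTH_BPS
    return num_requests * SEEK_OVERHEAD_S + transfer

# Reading the same 64MB either way:
print(io_time(1, 64 * 2**20))     # ~0.65 s: one request, transfer dominates
print(io_time(2048, 32 * 2**10))  # ~10.9 s: 2,048 fixed overheads dominate
```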
Inside a 64MB Page: The Storage Layout Question
Now that we understand DbFiles are made of 64MB pages, the critical question becomes: How do we organize data within each 64MB page?
There are two fundamental approaches to organizing data inside a single 64MB page of our Listens table: row layout (complete records stored one after another) and columnar layout (all values of each column stored together).
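A minimal illustration of the two layouts, using plain Python lists to stand in for the raw bytes of a page (the records are made up):

```python
# Three Listens records: (listen_id, user_id, song_id, rating)
records = [(1, 1, 1, 4.5), (2, 1, 2, 4.2), (3, 1, 6, 3.9)]

# Row layout: each record's fields are stored contiguously, one record
# after another. Great for fetching a whole record at once.
row_page = [field for record in records for field in record]
# -> [1, 1, 1, 4.5, 2, 1, 2, 4.2, 3, 1, 6, 3.9]

# Columnar layout: all values of one column are stored contiguously.
# Great for scanning one column; repetition (user_id) also compresses well.
col_page = {
    "listen_id": [r[0] for r in records],  # [1, 2, 3]
    "user_id":   [r[1] for r in records],  # [1, 1, 1]  <- repetition
    "song_id":   [r[2] for r in records],  # [1, 2, 6]
    "rating":    [r[3] for r in records],  # [4.5, 4.2, 3.9]
}

print(row_page[4:8])  # the full second record: [2, 1, 2, 4.2]
print(sum(col_page["rating"]) / len(col_page["rating"]))  # AVG(rating) = 4.2
```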
Compression Magic for Columnar: The Scale Effect
Using our Spotify Listens table, the illustrative example below shows how compression ratios improve dramatically with scale:
| Column | Small Scale (T(R) = 9) | Massive Scale (T(R) = 100M) | Compression Technique & Scaling |
|---|---|---|---|
| user_id | 36B → 6B (83%) | 400MB → 8MB (98%) | RLE: [1×3, 2×3, 3×3] → [1×500, 2×300, ...] |
| song_id | 36B → 14B (61%) | 400MB → 50MB (88%) | Dictionary: 5 unique → 10M unique, top 1K = 80% of listens |
| rating | 36B → 18B (50%) | 400MB → 40MB (90%) | Quantization: NULL bitmap → 0.1 precision = 1 byte |
| listen_time | 180B → 60B (67%) | 2GB → 200MB (90%) | Delta: Random timestamps → binge sessions compress 10:1 |
| Total | 432B → 98B (77%) | 4.8GB → 298MB (94%) | More data = more repetition patterns |
The Scaling Law: Small datasets (77% compression) → Massive datasets (94%+ compression)
Why? Repetition patterns emerge at scale: popular songs are played millions of times, active users accumulate thousands of listens, and binge-listening sessions create predictable timestamp sequences. Rows can be compressed too, but far less effectively, because adjacent values within a row belong to different columns with different data types, leaving little repetition to exploit.
This is why BigQuery, Snowflake, and Redshift can scan petabytes efficiently - they're reading highly compressed columnar data, not just columnar data.
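Two of the techniques from the table, run-length encoding (RLE) and delta encoding, are simple enough to sketch directly. These are toy Python implementations for illustration, not a production codec:

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encoding: collapse each run of equal values to (value, count)."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def delta_encode(values):
    """Delta encoding: keep the first value, then store successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# A sorted user_id column: the RLE output stays tiny as the input grows,
# because more rows just means longer runs.
small = [1, 1, 1, 2, 2, 2, 3, 3, 3]        # 9 values   -> 3 (value, count) pairs
big = [1] * 500 + [2] * 300 + [3] * 200    # 1,000 values -> still 3 pairs
print(rle_encode(small))      # [(1, 3), (2, 3), (3, 3)]
print(len(rle_encode(big)))   # 3

# Binge-session timestamps: huge absolute values, tiny gaps. The deltas
# fit in far fewer bits than the raw 8-byte timestamps.
listen_times = [1725000000, 1725000185, 1725000370, 1725000551]
print(delta_encode(listen_times))  # [1725000000, 185, 185, 181]
```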
The Core Trade-off: Query Patterns Drive Everything
Different query patterns need different storage layouts. Here's the fundamental divide:
OLTP: Row Storage Wins
-- Get full user listening session
SELECT * FROM Listens
WHERE user_id = 1
AND listen_time > '2024-08-30';
-- Update a single rating
UPDATE Listens
SET rating = 5.0
WHERE listen_id = 1;
-- Insert new listen
INSERT INTO Listens VALUES
(10, 1, 3, 4.8, NOW());
Why Row Storage?
- Need all columns together
- Frequent updates/inserts
- Single record operations
- Examples: PostgreSQL, MySQL
Performance Reality:
| Operation | Row Storage Performance |
|---|---|
| SELECT * WHERE id = 1 | 10 μs ⚡ |
| UPDATE SET rating = 5 | 10 μs ⚡ |
| INSERT single row | 10 μs ⚡ |
Row storage dominates for transactional workloads
Analytics: Columnar Dominates
-- Average rating per genre
SELECT genre, AVG(rating)
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
GROUP BY genre;
-- Top songs last month
SELECT song_id, COUNT(*) as plays
FROM Listens
WHERE listen_time > '2024-07-01'
GROUP BY song_id
ORDER BY plays DESC;
Why Columnar?
- Only need specific columns
- Aggregations over many rows
- Read-heavy, append-only
- Examples: BigQuery, Snowflake
Performance Reality:
| Operation | Columnar Storage Performance |
|---|---|
| SELECT AVG(rating) | 5 ms ⚡ |
| GROUP BY artist | 100 ms ⚡ |
| Bulk load 1M rows | 1 s ⚡ |
Columnar storage dominates for analytical workloads
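One way to see why the gap is this large is to count the bytes a full scan must touch. A rough sketch of the arithmetic, assuming 100M rows, a 24-byte row, and a 4-byte rating column (illustrative sizes, not measurements from any real system):

```python
# Bytes read by a full-table scan for SELECT AVG(rating), before compression.
ROWS = 100_000_000
ROW_BYTES = 24     # assumed: all columns of one Listens record
RATING_BYTES = 4   # assumed: one float32 rating value

row_scan = ROWS * ROW_BYTES      # a row store must read every column
col_scan = ROWS * RATING_BYTES   # a column store reads only `rating`
print(f"row store:    {row_scan / 2**30:.1f} GB")  # ~2.2 GB
print(f"column store: {col_scan / 2**30:.1f} GB")  # ~0.4 GB

# Add the ~90% columnar compression from the table above and the column
# store reads ~0.04 GB, roughly 60x fewer bytes than the row store.
```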
Real-World Systems: From Single Machine to Distributed
Storage format and scale are independent choices. Let's see how they combine in practice:
Single Machine Systems
Row Storage: Single Machine
OLTP Systems
PostgreSQL, MySQL, SQLite
┌───────────────────────────────────────────┐
│ 64MB Page in RAM                          │
├───────────────────────────────────────────┤
│ Row 1: [listen_id=1, user_id=1, ...]      │
│ Row 2: [listen_id=2, user_id=1, ...]      │
│ Row 3: [listen_id=3, user_id=1, ...]      │
│ Row 4: [listen_id=4, user_id=2, ...]      │
│ Row 5: [listen_id=5, user_id=2, ...]      │
│ ... thousands more rows ...               │
└───────────────────────────────────────────┘
Use Cases:
- Transactions, < 1TB data
- Mobile apps (SQLite)
- Web applications
- Single-user workloads
Columnar Storage: Single Machine
Local Analytics
DuckDB, Parquet files
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│   Page 1    │ │   Page 2    │ │   Page 3    │
│   user_id   │ │   song_id   │ │   rating    │
├─────────────┤ ├─────────────┤ ├─────────────┤
│ [1,1,1,2,2, │ │ [1,2,6,2,7, │ │ [4.5,4.2,   │
│  2,3,3,3,1, │ │  8,1,2,6,3, │ │  3.9,4.7,   │
│  1,1,2,2,3] │ │  4,5,7,8,9] │ │  4.6,3.9,…] │
└─────────────┘ └─────────────┘ └─────────────┘
       ↓               ↓               ↓
  Compressed      Dictionary      Quantized
Use Cases:
- Data science on laptop
- Parquet: Standard format
- Local analytics
- Research projects
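A minimal local example of this stack, assuming the duckdb and pyarrow packages are installed (the file name and data are made up):

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny Listens table to a columnar Parquet file.
listens = pa.table({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "song_id": [1, 2, 6, 2, 7, 8, 1, 2, 6],
    "rating":  [4.5, 4.2, 3.9, 4.7, 4.6, 3.9, 4.1, 4.8, 4.4],
})
pq.write_table(listens, "listens.parquet")

# DuckDB scans the Parquet file in place and only reads the columns
# the query touches (here: rating).
duckdb.sql("SELECT AVG(rating) AS avg_rating FROM 'listens.parquet'").show()
```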
Distributed Systems
Row Storage: Distributed
OLTP at Scale
Aurora, Google Spanner
      Machine 1               Machine 2
┌─────────────────────┐ ┌─────────────────────┐
│     64MB Pages      │ │     64MB Pages      │
├─────────────────────┤ ├─────────────────────┤
│ Rows 1-50K          │ │ Rows 50K-100K       │
│ [1,1,1,4.5,...]     │ │ [2,2,2,4.7,...]     │
│ [2,1,2,4.2,...]     │ │ [3,2,7,3.9,...]     │
│ [3,1,6,3.9,...]     │ │ [4,3,8,4.1,...]     │
│ ... thousands ...   │ │ ... thousands ...   │
└─────────────────────┘ └─────────────────────┘
           │                       │
      ┌────┴────┐             ┌────┴────┐
      │ Replica │             │ Replica │
      └─────────┘             └─────────┘
Use Cases:
- Global transactions
- 99.999% availability
- Banking, e-commerce
- Multi-region ACID
Columnar Storage: Distributed
OLAP at Scale
BigQuery, Snowflake, Redshift
      Machine 1               Machine 2
┌─────────────────────┐ ┌─────────────────────┐
│    Column Chunks    │ │    Column Chunks    │
├─────────────────────┤ ├─────────────────────┤
│ user_id (100M vals) │ │ song_id (100M vals) │
│ [1,1,1,2,2,2,...]   │ │ [1,2,6,2,7,8,...]   │
│ Compressed: 2MB     │ │ Compressed: 15MB    │
├─────────────────────┤ ├─────────────────────┤
│ rating (100M vals)  │ │ timestamps (100M)   │
│ [4.5,4.2,3.9,...]   │ │ [2024-08-30,...]    │
│ Compressed: 10MB    │ │ Compressed: 20MB    │
└─────────────────────┘ └─────────────────────┘
           ↓                       ↓
      50GB → 12MB             50GB → 35MB
Use Cases:
- PB-scale analytics
- Data warehousing
- Real-time reporting
- Machine learning
The Pattern
- Storage format (row vs columnar) = query pattern (OLTP vs OLAP)
- Scale (single vs distributed) = data size + availability needs
- Real systems like BigQuery combine columnar + compression + distribution
Key Takeaways
- Physical Layout Matters: How you organize data within 64MB pages determines everything
- Query Pattern = Storage Choice:
  - OLTP (many small updates) → Row storage (PostgreSQL, MySQL)
  - OLAP (few large scans) → Columnar storage (BigQuery, Snowflake)
- Compression Magic: Columnar compression improves from 77% to 94%+ as data scales
- Performance Reality: Each format dominates its own use case (see the comparison tables)
- Scale Independently: Row/columnar and single-machine/distributed are separate choices
- Modern Analytics: Columnar + compression + distribution = scan petabytes in seconds