Storage Layout: Row vs Columnar

How Tables Live on Disk: From OS Files to DbFiles

OS Files: The operating system supports basic files (any size, unstructured bytes).

DbFiles: the database's own files, structured as a sequence of fixed-size 64MB pages. Every database table lives on disk as a DbFile.
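
A minimal sketch of that abstraction (plain Python; the DbFile class name and PAGE_SIZE constant are illustrative, not from any real engine):

import os

PAGE_SIZE = 64 * 1024 * 1024  # 64MB: the fixed page size described above

class DbFile:
    """Illustrative wrapper: a DbFile is just an OS file viewed as fixed 64MB pages."""

    def __init__(self, path: str):
        self.path = path

    def num_pages(self) -> int:
        return os.path.getsize(self.path) // PAGE_SIZE

    def read_page(self, page_no: int) -> bytes:
        # One page read is one bulk I/O: seek to page_no * 64MB, read 64MB.
        with open(self.path, "rb") as f:
            f.seek(page_no * PAGE_SIZE)
            return f.read(PAGE_SIZE)

# listens.dbf from the example below: page 3 starts at byte offset 3 * 64MB.
# listens = DbFile("listens.dbf"); page3 = listens.read_page(3)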

Spotify Example:

Listens Table → listens.dbf (massive DbFile, 1000 pages = 64GB total)
├── Page 0: Contains 64MB of Listens data  
├── Page 1: Contains 64MB of Listens data
├── Page 2: Contains 64MB of Listens data  
├── Page 3: Contains 64MB of Listens data
└── ... (997 more pages)

Songs Table → songs.dbf (smaller DbFile, 10 pages = 640MB total)
├── Page 0: Contains 64MB of Songs data
├── Page 1: Contains 64MB of Songs data  
└── ... (8 more pages)

Why DbFiles Matter: Each 64MB page is the unit of I/O. When you query data, the database reads entire 64MB pages from disk to memory, not individual records.

Why 64MB pages? Recall I/O costs: I/O devices are slow, and every request pays a fixed startup cost (seek, rotational delay, request overhead) regardless of size. Bulk operations amortize that cost: one 64MB I/O instead of ~2,000 separate I/Os for 32KB pages.
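
A back-of-envelope sketch of that arithmetic (Python, with assumed device numbers: 5ms startup per random I/O, 200MB/s sequential transfer; the point is the amortization, not the exact figures):

SEEK_MS = 5.0                    # assumed fixed cost to start one random I/O
TRANSFER_MS_PER_MB = 1000 / 200  # 200 MB/s -> 5ms per MB transferred

def io_cost_ms(num_ios: int, mb_per_io: float) -> float:
    """Each I/O pays one fixed startup cost plus transfer time for its size."""
    return num_ios * (SEEK_MS + mb_per_io * TRANSFER_MS_PER_MB)

one_big   = io_cost_ms(1, 64)            # one 64MB read:  5 + 320 = 325ms
many_tiny = io_cost_ms(2048, 32 / 1024)  # 2048 x 32KB reads: ~10,560ms
print(one_big, many_tiny)                # the tiny reads are ~32x slower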

Inside a 64MB Page: The Storage Layout Question

Now that we understand DbFiles are made of 64MB pages, the critical question becomes: How do we organize data within each 64MB page?

The visualization below shows the two fundamental approaches for organizing data inside a single 64MB page from our Listens table:

One 64MB Listens Table Page: Two Storage Layouts

Row Storage (Traditional)

Row 1: [1, 1, 1, 4.5, "2024-08-30 14:35:00"]
Row 2: [2, 1, 2, 4.2, NULL]
Row 3: [3, 1, 6, 3.9, "2024-08-29 10:15:00"]
Row 4: [4, 2, 2, 4.7, NULL]
... 5 more rows ...

Page Metadata:
  • Record count: T(R) = 9
  • Free space: 63.9MB
  • Row pointers: [0, 35, 70, 105, ...]
  • Checksum: 0xABCD1234
  • Page version: 3

Access Pattern: Read Entire Rows
SELECT * FROM Listens WHERE user_id = 1
→ Reads 3 complete rows (all columns)

Row Storage Metrics: ✓ Fast single-record access ✓ Efficient for OLTP
Inefficiencies: ✗ Wastes I/O for analytics ✗ Poor compression

Columnar Storage (Analytics)

listen_id:   [1,2,3,4,5,6,7,8,9] (compressed: T(R)×4B → 12B)
user_id:     [1,1,1,2,2,2,3,3,3] (RLE: "1×3, 2×3, 3×3" → 6B)
song_id:     [1,2,6,2,7,8,1,2,6] (dictionary: {1,2,6,7,8} → 5B + T(R)×1B)
rating:      [4.5,4.2,3.9,4.7,4.6,3.9,2.9,4.9,NULL] (float compression: T(R)×4B → 18B)
listen_time: ["2024-08-30 14:35:00", NULL, "2024-08-29 10:15:00", ...] (timestamp compression: T(R)×20B → 60B)

Column Statistics:
  • Min/Max values per column
  • Null bitmaps
  • Compression: 101B total (vs. row layout's T(R)×48B = 432B)
  • Encodings: RLE, Dictionary, Delta

Access Pattern: Scan Columns
SELECT AVG(rating) FROM Listens WHERE user_id = 1
→ Reads only 2 columns (user_id, rating)

Columnar Benefits: ✓ 84% compression ratio ✓ Vectorized operations
Trade-offs: ✗ Slow single-record access ✗ Poor for updates
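
To make the two layouts concrete before worrying about bytes on disk, here is a minimal in-memory sketch (plain Python, data taken from the figure above) of the same records stored row-wise versus column-wise, and what each access pattern has to touch:

rows = [  # row storage: each record's fields sit together
    (1, 1, 1, 4.5, "2024-08-30 14:35:00"),
    (2, 1, 2, 4.2, None),   # None = NULL
    (3, 1, 6, 3.9, "2024-08-29 10:15:00"),
    (4, 2, 2, 4.7, None),
]

columns = {  # columnar storage: each attribute's values sit together
    "listen_id":   [1, 2, 3, 4],
    "user_id":     [1, 1, 1, 2],
    "song_id":     [1, 2, 6, 2],
    "rating":      [4.5, 4.2, 3.9, 4.7],
    "listen_time": ["2024-08-30 14:35:00", None, "2024-08-29 10:15:00", None],
}

# SELECT * WHERE user_id = 1: row layout hands back whole records in one touch.
session = [r for r in rows if r[1] == 1]

# SELECT AVG(rating) WHERE user_id = 1: columnar touches only 2 of 5 columns.
vals = [r for u, r in zip(columns["user_id"], columns["rating"])
        if u == 1 and r is not None]
avg_rating = sum(vals) / len(vals)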

Compression Magic for Columnar: The Scale Effect

Using our Spotify Listens table, the illustrative example below shows how compression ratios improve dramatically with scale:

Column       Small Scale (T(R) = 9)   Massive Scale (T(R) = 100M)   Compression Technique & Scaling
user_id      36B → 6B (83%)           400MB → 8MB (98%)             RLE: [1×3, 2×3, 3×3] → [1×500, 2×300, ...]
song_id      36B → 14B (61%)          400MB → 50MB (88%)            Dictionary: 5 unique → 10M unique; top 1K = 80% of listens
rating       36B → 18B (50%)          400MB → 40MB (90%)            Quantization: NULL bitmap → 0.1 precision = 1 byte
listen_time  180B → 60B (67%)         2GB → 200MB (90%)             Delta: random timestamps → binge sessions compress 10:1
Total        432B → 98B (77%)         4.8GB → 298MB (94%)           More data = more repetition patterns

The Scaling Law: Small datasets (77% compression) → Massive datasets (94%+ compression)

Why? Repetition patterns emerge at scale: popular songs are played millions of times, active users accumulate thousands of listens, and binge-listening sessions create predictable timestamp sequences. Rows compress too, but far less, because a row packs mixed data types side by side, leaving little repetition for the encoder to exploit.
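
A tiny run-length encoder makes the scale effect visible (illustrative Python; the data is made up, and note that RLE pays off only when the column is sorted or clustered by that value, as user_id is here):

from itertools import groupby

def rle(values):
    """Run-length encode: [1,1,1,2,2,2] -> [(1,3), (2,3)]."""
    return [(v, len(list(g))) for v, g in groupby(values)]

small = [1, 1, 1, 2, 2, 2, 3, 3, 3]        # 9 values  -> 3 (value, count) pairs
big   = [1] * 500 + [2] * 300 + [3] * 200  # 1000 values -> still 3 pairs
print(len(rle(small)), len(rle(big)))      # 3 3: runs grow, run count doesn't
# The longer a user's listening history, the more values each (value, count)
# pair absorbs, so the compression ratio improves as T(R) grows.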

This is why BigQuery, Snowflake, and Redshift can scan petabytes efficiently - they're reading highly compressed columnar data, not just columnar data.
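
Dictionary encoding, which the table above applies to song_id, is just as simple in sketch form (illustrative Python):

def dict_encode(values):
    # Store each distinct value once; replace every occurrence with a small code.
    dictionary = {v: code for code, v in enumerate(dict.fromkeys(values))}
    codes = [dictionary[v] for v in values]
    return list(dictionary), codes

song_ids = [1, 2, 6, 2, 7, 8, 1, 2, 6]
dictionary, codes = dict_encode(song_ids)
print(dictionary)  # [1, 2, 6, 7, 8]: 5 distinct songs stored once
print(codes)       # [0, 1, 2, 1, 3, 4, 0, 1, 2]: one small code per listen
# At 100M listens where the top ~1K songs cover 80% of plays, codes fit in
# 1-2 bytes while raw 4B song_ids do not: roughly the 400MB -> 50MB above.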


The Core Trade-off: Query Patterns Drive Everything

Different query patterns need different storage layouts. Here's the fundamental divide:

OLTP: Row Storage Wins

-- Get full user listening session
SELECT * FROM Listens 
WHERE user_id = 1 
  AND listen_time > '2024-08-30';

-- Update a single rating
UPDATE Listens 
SET rating = 5.0 
WHERE listen_id = 1;

-- Insert new listen
INSERT INTO Listens VALUES 
(10, 1, 3, 4.8, NOW());

Why Row Storage?

  • Need all columns together
  • Frequent updates/inserts
  • Single record operations
  • Examples: PostgreSQL, MySQL

Performance Reality:

Operation               Row Storage Performance
SELECT * WHERE id = 1   10μs ✓
UPDATE SET rating = 5   10μs ✓
INSERT single row       10μs ✓

Row storage dominates for transactional workloads

Analytics: Columnar Dominates

-- Average rating per genre
SELECT genre, AVG(rating)
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
GROUP BY genre;

-- Top songs last month
SELECT song_id, COUNT(*) as plays
FROM Listens
WHERE listen_time > '2024-07-01'
GROUP BY song_id
ORDER BY plays DESC;

Why Columnar?

  • Only need specific columns
  • Aggregations over many rows
  • Read-heavy, append-only
  • Examples: BigQuery, Snowflake

Performance Reality:

Operation            Columnar Storage Performance
SELECT AVG(rating)   5ms ✓
GROUP BY artist      100ms ✓
Bulk load 1M rows    1s ✓

Columnar storage dominates for analytical workloads
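
A rough bytes-touched comparison (illustrative Python, reusing the 48B/row and ~90% rating-compression figures from earlier) shows why each side wins its own workload:

ROW_BYTES = 48            # one full Listens record, per the page-layout example
T_R = 100_000_000         # 100M listens

# SELECT AVG(rating) on row storage: every full record crosses the I/O path.
row_scan = T_R * ROW_BYTES                 # 4.8GB scanned

# On columnar storage: only the rating column, compressed ~90% (table above).
col_scan = T_R * 4 * (1 - 0.90)            # ~40MB scanned

print(f"{row_scan / col_scan:.0f}x less data scanned")  # ~120x
# The asymmetry flips for OLTP: fetching one row from columnar storage means
# probing every column chunk to reassemble it, while row storage reads one page.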


Real-World Systems: From Single Machine to Distributed

Storage format and scale are independent choices. Let's see how they combine in practice:

Single Machine Systems

Row Storage: Single Machine

OLTP Systems
PostgreSQL, MySQL, SQLite

┌──────────────────────────────────────────┐
│             64MB Page in RAM             │
├──────────────────────────────────────────┤
│ Row 1: [listen_id=1, user_id=1, ...]     │
│ Row 2: [listen_id=2, user_id=1, ...]     │
│ Row 3: [listen_id=3, user_id=1, ...]     │
│ Row 4: [listen_id=4, user_id=2, ...]     │
│ Row 5: [listen_id=5, user_id=2, ...]     │
│        ... thousands more rows ...       │
└──────────────────────────────────────────┘

Use Cases:
- Transactions, < 1TB data
- Mobile apps (SQLite)
- Web applications
- Single-user workloads

Columnar Storage: Single Machine

Local Analytics
DuckDB, Parquet files

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Page 1    │  │   Page 2    │  │   Page 3    │
│   user_id   │  │   song_id   │  │   rating    │
├─────────────┤  ├─────────────┤  ├─────────────┤
│ [1,1,1,2,2, │  │ [1,2,6,2,7, │  │ [4.5,4.2,   │
│  2,3,3,3,1, │  │  8,1,2,6,3, │  │  3.9,4.7,   │
│  1,1,2,2,3] │  │  4,5,7,8,9] │  │  4.6,3.9,…] │
└─────────────┘  └─────────────┘  └─────────────┘
        ↓                ↓                ↓
   Compressed       Dictionary        Quantized

Use Cases:
- Data science on a laptop
- Parquet: standard file format
- Local analytics
- Research projects
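
As a concrete taste, a few lines with the pyarrow library (assuming it is installed; the file name is illustrative) show both the columnar write and the column pruning on read:

import pyarrow as pa
import pyarrow.parquet as pq

listens = pa.table({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "song_id": [1, 2, 6, 2, 7, 8, 1, 2, 6],
    "rating":  [4.5, 4.2, 3.9, 4.7, 4.6, 3.9, 2.9, 4.9, None],
})
pq.write_table(listens, "listens.parquet")  # column chunks, dictionary/RLE-encoded

# Column pruning: only the bytes for user_id and rating are read from disk.
subset = pq.read_table("listens.parquet", columns=["user_id", "rating"])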

Distributed Systems

Row Storage: Distributed

OLTP at Scale
Aurora, Google Spanner

    Machine 1                    Machine 2
┌─────────────────────┐    ┌─────────────────────┐
│    64MB Pages       │    │    64MB Pages       │
├─────────────────────┤    ├─────────────────────┤
│ Rows 1-50K          │    │ Rows 50K-100K       │
│ [1,1,1,4.5,...]     │    │ [2,2,2,4.7,...]     │
│ [2,1,2,4.2,...]     │    │ [3,2,7,3.9,...]     │
│ [3,1,6,3.9,...]     │    │ [4,3,8,4.1,...]     │
│ ... thousands ...   │    │ ... thousands ...   │
└─────────────────────┘    └─────────────────────┘
         │                           │
    ┌────┴────┐                 ┌────┴────┐
    │ Replica │                 │ Replica │
    └─────────┘                 └─────────┘

Use Cases:
- Global transactions
- 99.999% availability
- Banking, e-commerce
- Multi-region ACID
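
A minimal sketch of the row-assignment logic the diagram implies (illustrative Python; range partitioning with made-up constants):

ROWS_PER_MACHINE = 50_000   # per the diagram: machine 1 owns rows 1-50K

def machine_for(listen_id: int) -> int:
    # Range partitioning: each machine owns one contiguous slice of full rows,
    # so a single-record read or update touches exactly one machine (plus replicas).
    return (listen_id - 1) // ROWS_PER_MACHINE

print(machine_for(1), machine_for(50_000), machine_for(50_001))  # 0 0 1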

Columnar Storage: Distributed

OLAP at Scale
BigQuery, Snowflake, Redshift

    Machine 1                    Machine 2
┌─────────────────────┐    ┌─────────────────────┐
│   Column Chunks     │    │   Column Chunks     │
├─────────────────────┤    ├─────────────────────┤
│ user_id (100M vals) │    │ song_id (100M vals) │
│ [1,1,1,2,2,2,...]   │    │ [1,2,6,2,7,8,...]   │
│ Compressed: 2MB     │    │ Compressed: 15MB    │
├─────────────────────┤    ├─────────────────────┤
│ rating (100M vals)  │    │ timestamps (100M)   │
│ [4.5,4.2,3.9,...]   │    │ [2024-08-30,...]    │
│ Compressed: 10MB    │    │ Compressed: 20MB    │
└─────────────────────┘    └─────────────────────┘
         ↑                           ↑
   50GB → 47MB                  50GB → 35MB

Use Cases:
- PB-scale analytics
- Data warehousing
- Real-time reporting
- Machine learning
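
On the query side the pattern is scatter-gather: each machine aggregates its own compressed column chunks, and only tiny partial results cross the network. A minimal sketch (illustrative Python, made-up chunk data):

def partial_agg(ratings):
    """Each machine computes (sum, count) over its local rating chunk."""
    vals = [r for r in ratings if r is not None]
    return sum(vals), len(vals)

machine_chunks = [
    [4.5, 4.2, 3.9, 4.7],        # machine 1's rating chunk (decompressed on the fly)
    [4.6, 3.9, 2.9, 4.9, None],  # machine 2's rating chunk
]
partials = [partial_agg(chunk) for chunk in machine_chunks]
total, count = map(sum, zip(*partials))
print(total / count)  # global AVG(rating) from a few bytes of partial state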

The Pattern

Storage format and deployment scale are orthogonal choices: pick row vs. columnar based on your query pattern, then pick single machine vs. distributed based on your data volume and availability needs.

Key Takeaways

  1. Visual Layout Matters: How you organize data in 64MB pages determines everything

  2. Query Pattern = Storage Choice:
     - OLTP (many small updates) → Row storage (PostgreSQL, MySQL)
     - OLAP (few large scans) → Columnar storage (BigQuery, Snowflake)

  3. Compression Magic: Columnar storage climbs from 77% to 94%+ compression as data scales

  4. Performance Reality: Each format dominates its own use case (see the comparison tables)

  5. Scale Independently: Row/columnar and single-machine/distributed are separate choices

  6. Modern Analytics: Columnar + compression + distribution = scan petabytes in seconds