LSM Trees: Converting Random IO to Sequential
Problem: 100M song plays/hour = 100M random disk writes = 💀 disk death
Solution: Batch in memory → Flush sequentially → Merge on read
The Problem: Spotify's 100M Plays/Hour
```sql
-- The naive B+Tree approach kills your disk:
UPDATE song_stats SET play_count = play_count + 1 WHERE song_id = 75;
-- Problem: a random IO for EVERY play event!
-- 100M plays/hour = 100M random IOs = disk death
```
The LSM Solution: Batch → Flush → Merge
Instead of 100M random IOs: Batch updates in memory → Flush sequentially when full → Merge results on read
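A minimal sketch of the write path in Python (the names and the tiny flush threshold are mine, for illustration only): play events touch nothing but an in-memory dict, and a full memtable is dumped as one sorted, immutable file with a single sequential write.

```python
MEMTABLE_LIMIT = 4  # tiny for illustration; real engines flush at tens of MBs

memtable = {}       # song_id -> pending play count (in memory, 0 disk IOs)
sstable_paths = []  # immutable on-disk files, in flush order

def record_play(song_id: int) -> None:
    """The hot path: bump a counter in memory, no disk IO at all."""
    memtable[song_id] = memtable.get(song_id, 0) + 1
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush() -> None:
    """One sequential write: dump the memtable, sorted by key, as an SSTable."""
    path = f"sstable_{len(sstable_paths) + 1}.db"
    with open(path, "w") as f:
        for song_id, count in sorted(memtable.items()):
            f.write(f"{song_id},{count}\n")
    sstable_paths.append(path)
    memtable.clear()
```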
Example: Multiple SSTables After Hours of Listening
┌──────────────────────┬──────────────────────┬──────────────────────┐
│ sstable_1.db (10MB)  │ sstable_2.db (10MB)  │ sstable_3.db (10MB)  │
│ Flushed: 2hr ago     │ Flushed: 1hr ago     │ Flushed: 30min ago   │
├──────────────────────┼──────────────────────┼──────────────────────┤
│ song_id=75: 12,450   │ song_id=75: 3,890    │ song_id=75: 567      │
│ song_id=89: 8,234    │ song_id=101: 15,678  │ song_id=89: 2,345    │
└──────────────────────┴──────────────────────┴──────────────────────┘
⚠️ To get total plays for song_id=75, must check ALL files:
→ 12,450 + 3,890 + 567 = 16,907 total plays (that's the trade-off!)
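Continuing the sketch above, this is merge on read for the counter workload: a point lookup has to consult the memtable plus every SSTable and sum whatever it finds.

```python
def total_plays(song_id: int) -> int:
    """Merge on read: one logical read touches the memtable and N files."""
    total = memtable.get(song_id, 0)
    for path in sstable_paths:
        with open(path) as f:
            for line in f:
                key, count = line.strip().split(",")
                if int(key) == song_id:
                    total += int(count)
    return total

# With the three SSTables above: 12450 + 3890 + 567 -> 16907
```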
Core LSM Operations: Merge vs Compact
```
Write(key, value): memtable[key] = value         // 0 IOs - just memory!
Flush():           memtable → SSTable            // sequential write when full
Merge on Read:     combine values from N files   // SUM, UNION, or LATEST
Compact:           merge SSTables → fewer files  // background optimization
```
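Compaction, in the same sketch: fold N SSTables into one file with pre-merged values so later reads probe a single file. Real engines do this incrementally in background threads; this is the one-shot version.

```python
def compact() -> None:
    """Background optimization: merge all SSTables into one, pre-summing counts."""
    merged = {}
    for path in sstable_paths:
        with open(path) as f:
            for line in f:
                key, count = line.strip().split(",")
                merged[int(key)] = merged.get(int(key), 0) + int(count)
    new_path = "sstable_compacted.db"
    with open(new_path, "w") as f:
        for song_id, count in sorted(merged.items()):
            f.write(f"{song_id},{count}\n")
    sstable_paths[:] = [new_path]  # old files can be deleted; reads now check 1 file
```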
Real-World Patterns: How Merge Works
Counters (Spotify Plays)
- Write: batch in memory: memtable[song_id] += 1
- Flush: memtable → SSTable when full
- Merge: SUM(12,450, 3,890, 567) = 16,907
- Compact: 3 files → 1 file with the final sum
Time-Series (IoT Sensors)
- Write: append to memtable with timestamp
- Flush: sorted by time → SSTable
- Merge: CONCAT files, sort by timestamp
- Compact: keep recent, discard data older than 7 days (see the sketch below)
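A sketch of the time-series merge/compact step, assuming records shaped as (timestamp, reading) tuples: heapq.merge does the CONCAT-and-sort as a k-way merge over files that are already time-sorted, and the retention filter implements "discard old > 7 days".

```python
import heapq
import time

SEVEN_DAYS = 7 * 24 * 3600

def merge_sensor_sstables(sstables, now=None):
    """Each SSTable is a list of (timestamp, reading) tuples, already sorted
    by timestamp at flush time. Merging keeps global time order; compaction
    drops anything older than the 7-day retention window."""
    now = now if now is not None else time.time()
    merged = heapq.merge(*sstables)  # k-way merge of sorted runs
    return [(ts, r) for ts, r in merged if now - ts <= SEVEN_DAYS]
```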
Key Point: "Merge on Read" combines values from multiple SSTables based on your data type:
• Counters: Merge = SUM all values
• Logs: Merge = UNION all entries
• State: Merge = LATEST value wins
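The three strategies as pluggable merge functions (a sketch; the dict and its keys are mine). Each takes the values found for one key across N SSTables, oldest first, and collapses them into the answer the reader sees:

```python
merge_strategies = {
    "counter": lambda values: sum(values),                     # SUM all values
    "log":     lambda values: [e for v in values for e in v],  # UNION all entries
    "state":   lambda values: values[-1],                      # LATEST value wins
}

assert merge_strategies["counter"]([12450, 3890, 567]) == 16907
assert merge_strategies["state"](["queued", "playing", "paused"]) == "paused"
```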
Write Amplification vs Read Amplification
"Amplification" = Doing more work than necessary: Example numbers below
-
Write amplification (B+Tree): Update 1 row → rewrite entire 64MB page (1 million X amplification!)
-
Read amplification (LSM): Read 1 key → check 10 SSTables (10x amplification)
Pick your poison: Slow writes (B+Tree) or slow reads (LSM)
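The back-of-envelope arithmetic behind those numbers (sizes are assumptions for illustration, not measurements):

```python
# Write amplification (B+Tree): changing one row rewrites its whole page.
row_bytes, page_bytes = 100, 8 * 1024   # assumed: ~100 B row, 8 KB page
write_amp = page_bytes / row_bytes      # ~82x bytes written per byte changed

# Read amplification (LSM): one point read may probe every SSTable.
read_amp = 10                           # 10 files checked for 1 logical read

print(f"B+Tree write amplification ~= {write_amp:.0f}x")
print(f"LSM read amplification = {read_amp}x")
```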
The Fundamental Trade-off
| Operation | B+Tree | LSM Tree |
|---|---|---|
| Write 1 row | 1 random IO | 0 IO (memory) |
| Write 1M rows | 1M random IOs | 10 sequential IOs |
| Read 1 row | 3-4 IOs | Check N files |
| Good for | OLTP, Transactions | Analytics, Time-series |
Real-World LSM Systems
Cassandra / HBase
- Wide-column stores
- LSM for all storage
- Optimized for writes
RocksDB (Facebook)
- Embedded key-value store
- Powers many databases
- Tunable compaction
Time-Series Databases
- InfluxDB (note: TimescaleDB, by contrast, builds on PostgreSQL's B+Trees, not an LSM)
- Perfect for append-only data
- Compaction merges old data
When to Use LSM Trees
✅ Perfect For:
- Event logging: Append-only patterns
- Analytics: Write once, read occasionally
- Counters: High-frequency increments
- Time-series: Sensor data, metrics
- Message queues: Write-heavy, sequential reads
❌ Avoid For:
- OLTP: Need fast random reads
- Frequent updates: Same key modified repeatedly
- Strong consistency: Read-after-write requirements
- Low-latency reads: Sub-millisecond requirements
Key Takeaways
- LSM converts random IO to sequential - Batch in memory, flush when full
- Trade-off: Fast writes, slower reads - Check N files instead of 1 tree
- Perfect for write-heavy workloads - Analytics, logs, time-series, counters
- Real systems combine approaches - LSM for writes, B+Tree indexes for critical reads