LSM Trees

LSM Trees: Converting Random IO to Sequential

Problem: 100M song plays/hour = 100M random disk writes = 💀 disk death
Solution: Batch in memory → Flush sequentially → Merge on read

0 IOs
Writes to memory
Sequential
Flush to disk
N Files
Check on read

The Problem: Spotify's 100M Plays/Hour

-- The naive B+Tree approach kills your disk:
UPDATE song_stats SET play_count = play_count + 1 WHERE song_id = 75;
-- Problem: Random IO for EVERY play event!
-- 100M plays/hour = 100M random IOs = disk death 

The LSM Solution: Batch → Flush → Merge

Instead of 100M random IOs: Batch updates in memory → Flush sequentially when full → Merge results on read

Example: Multiple SSTables After Hours of Listening

┌────────────────────────┬────────────────────────┬────────────────────────┐
│ sstable_1.db (10MB)    │ sstable_2.db (10MB)    │ sstable_3.db (10MB)    │
│ Flushed: 2hr ago       │ Flushed: 1hr ago       │ Flushed: 30min ago     │
├────────────────────────┼────────────────────────┼────────────────────────┤
│ song_id=75: 12,450     │ song_id=75: 3,890      │ song_id=75: 567        │
│ song_id=89: 8,234      │ song_id=101: 15,678    │ song_id=89: 2,345      │
└────────────────────────┴────────────────────────┴────────────────────────┘

⚠️ To get total plays for song_id=75, must check ALL files:
→ 12,450 + 3,890 + 567 = 16,907 total plays (that's the trade-off!)

Visual: LSM Tree in Action

LSM Tree: Spotify Song Play Counter (song_id=75) Write Path (Fast) Incoming Listen Events song_id=75 +1, song_id=89 +1, song_id=75 +1... MemTable (In RAM) song_id=75: 892 song_id=89: 234 song_id=101: 567 song_id=120: 1,205 No disk IO! Super fast! When full → Flush Sequential write! SSTables on Disk sstable_1.db: song_id=75:12,450... 10MB sstable_2.db: song_id=75:3,890... 10MB sstable_3.db: song_id=75:567... 10MB Each file is sorted by song_id Read Path (Merge on Read) Query: GetPlayCount(song_id=75) Must check memory + all files! Merge Results 1. Check MemTable: song_id=75: 892 2. Check sstable_1: song_id=75: 12,450 3. Check sstable_2: song_id=75: 3,890 4. Check sstable_3: song_id=75: 567 Total: 75: 892 + 12,450 + 3,890 + 567 = 17,799 Slower than B+Tree! Background Compaction Before sstable_1: song_id=75:12,450 sstable_2: song_id=75:3,890 sstable_3: song_id=75:567 3 files to check Merge After merged.db: song_id=75:16,907 1 file to check! Compaction Benefits • Reduces number of files to check • Merges duplicate keys • Improves read performance B+Tree vs LSM Trade-offs B+Tree: ✓ Fast reads (3-4 IOs) ✗ Random IO for writes Best for: Transactions LSM Tree: ✓ Fast writes (memory only) ✗ Slower reads (check multiple files) Best for: Analytics, Logs, Counters

Core LSM Operations: Merge vs Compact

Write(key, value):  memtable[key] = value         // 0 IOs - just memory!
Flush():           memtable → SSTable            // Sequential write when full
Merge on Read:    combine values from N files   // SUM, UNION, or LATEST
Compact:          merge SSTables → fewer files  // Background optimization

Real-World Patterns: How Merge Works

Counters (Spotify Plays)

Write:     Batch in memtable[song_id] += 1
Flush:     memtable → SSTable when full
Merge:     SUM(12,450 + 3,890 + 567) = 16,907
Compact:   3 files → 1 file with final sum

Time-Series (IoT Sensors)

Write:     Append to memtable with timestamp
Flush:     Sorted by time → SSTable  
Merge:     CONCAT files, sort by timestamp
Compact:   Keep recent, discard old > 7 days

Key Point: "Merge on Read" combines values from multiple SSTables based on your data type:
Counters: Merge = SUM all values
Logs: Merge = UNION all entries
State: Merge = LATEST value wins


Write Amplification vs Read Amplification

"Amplification" = Doing more work than necessary: Example numbers below

Pick your poison: Slow writes (B+Tree) or slow reads (LSM)

The Fundamental Trade-off

Operation B+Tree LSM Tree
Write 1 row 1 random IO 0 IO (memory)
Write 1M rows 1M random IOs 10 sequential IOs
Read 1 row 3-4 IOs Check N files
Good for OLTP, Transactions Analytics, Time-series

Real-World LSM Systems

Cassandra / HBase

RocksDB (Facebook)

Time-Series Databases


When to Use LSM Trees

✅ Perfect For:

❌ Avoid For:


Key Takeaways