LSM Tree Indexes
Concept. An LSM tree converts random writes into sequential disk writes by buffering inserts in a RAM "MemTable", then flushing them as immutable SSTables that are merged in the background.
Intuition. When Spotify ingests 1 million listen events per second, B+Trees can't keep up — every insert is a random disk write. LSM trees absorb all writes into RAM first, then dump them to disk as a single sequential file. Reads check RAM first, then the on-disk SSTables.
Converting Random I/O to Sequential Throughput
LSM (Log-Structured Merge) Trees are the unsung heroes of high-volume write operations. They sidestep the inefficiencies of random disk updates by batching changes in memory and laying them down as sorted files.
At a glance:

- Write Path (RAM): O(1)
- Disk Storage: Sequential
- Logic for State: Merge-on-Read
- Primary Use Case: Analytics
Block 1: The Atomic Unit of Data
Let's start with the basics. In an LSM tree, every record is a straightforward (Key, Value) pair, often timestamped.
Use Case: Real-time "Song Play" Counting
Consider tracking the number of plays on viral songs.
- Key: Song ID (e.g., S_102)
- Value: Increment (e.g., +1 play)
- Timestamp: Time of the event (e.g., 13:42:05)

A single record therefore looks like (S_102, +1, 13:42:05): song 102 received one play at 1:42 PM.
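As a minimal sketch (the type and field names here are illustrative, not from any particular engine), such a record can be modeled as a plain tuple:

```python
from collections import namedtuple

# Hypothetical record type: (key, delta, timestamp).
Record = namedtuple("Record", ["key", "delta", "ts"])

# One play of song S_102 at 13:42:05.
event = Record(key="S_102", delta=1, ts="13:42:05")
```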
Block 2: The Production Path (Snapshot @ 1:44 PM)
Data moves from RAM to Disk. At 1:44 PM, the system reflects user actions, periodic flushes, and hourly compactions.
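The write path can be sketched as follows, assuming a dict-backed MemTable that is flushed to a sorted, immutable SSTable once it reaches a size threshold (all names and the tiny threshold are illustrative):

```python
MEMTABLE_LIMIT = 4  # tiny threshold for illustration

memtable = {}   # RAM buffer: key -> list of increments
sstables = []   # "disk" tier: each SSTable is a sorted list of (key, value)

def flush():
    """Convert the full MemTable into a sorted, immutable SSTable."""
    global memtable
    sstable = sorted((k, sum(vs)) for k, vs in memtable.items())
    sstables.append(sstable)  # one sequential append, no random seeks
    memtable = {}

def write(key, delta):
    """Buffer the write in RAM; flush sequentially when full."""
    memtable.setdefault(key, []).append(delta)
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

# Five play events; the fourth distinct key triggers a flush.
for song in ["S_102", "S_75", "S_102", "S_9", "S_33"]:
    write(song, 1)
```

After the loop, one SSTable holds the four merged keys and the MemTable is empty again, ready for the next burst of writes.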
Block 3: The Consumption Path (Merge on Read)
For Song 75, play counts are scattered across RAM and Disk, necessitating a cross-tier "Merge on Read".
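A sketch of Merge on Read for cumulative counts, assuming the fragments for a key live in a MemTable plus two on-disk levels, each represented here as a plain dict (the structures and counts are illustrative):

```python
# Fragments of the play count for song S_75, scattered across tiers.
memtable = {"S_75": 2}               # RAM: not yet flushed
level0 = {"S_75": 5, "S_102": 40}    # recent flush
level1 = {"S_75": 120, "S_9": 7}     # older, compacted data

def read_count(key):
    """The 'true' count is computed by merging every tier."""
    return sum(tier.get(key, 0) for tier in (memtable, level0, level1))

print(read_count("S_75"))  # 2 + 5 + 120 = 127
```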
Core Concept Definitions
The LSM tree's efficiency hinges on four key operations:
- Flush: Converts a full MemTable in RAM into a sorted, immutable SSTable on disk.
- Merge on Read: Combines results from all possible locations (RAM, Level 0, Level 1) to return the current state.
- Compaction: Merges multiple SSTables into a larger one, eliminating duplicates and outdated records.
- Policy: Dictates when and how compaction runs.
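Compaction for a counting workload can be sketched as a merge that sums duplicate keys across SSTables (each a sorted list of (key, value) pairs). A production engine would stream the merge rather than materialize a dict; this simplified version just shows the deduplication:

```python
def compact(*sstables):
    """Merge several SSTables into one, combining duplicate keys.

    For a cumulative workload the combine step is a sum; a
    'latest wins' workload would keep the newest value instead.
    """
    merged = {}
    for table in sstables:
        for key, value in table:
            merged[key] = merged.get(key, 0) + value
    return sorted(merged.items())  # one larger, still-sorted SSTable

old = [("S_102", 40), ("S_75", 5)]
new = [("S_102", 2), ("S_9", 1)]
print(compact(old, new))  # [('S_102', 42), ('S_75', 5), ('S_9', 1)]
```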
LSM State Functions: Defining "The Truth"
In LSM trees, the "State" isn't stored; it's computed. Choose your merge function based on your application:
| Pattern | Logic | Use Case Example |
|---|---|---|
| Cumulative (Counting) | $SUM(v_1, v_2, ..., v_n)$ | Song Play Counts: Summing +1 fragments across files. |
| Latest Wins (Upsert) | Keep value with $MAX(timestamp)$ | User Profiles: The most recent update "deletes" older values. |
| Statistical (Averages) | $SUM(sums) \, / \, SUM(counts)$ | Sensor Monitoring: Maintaining rolling averages of temps. |
| Retention (TTL) | $Filter(time > now - W)$ | Session Logs: Automatically dropping data older than 24h. |
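The four merge patterns above can be sketched as plain reducer functions over a list of (timestamp, value) fragments (timestamps as integers for simplicity; all function names are illustrative):

```python
def cumulative(fragments):
    """Counting: the state is the sum of all deltas."""
    return sum(v for _, v in fragments)

def latest_wins(fragments):
    """Upsert: the fragment with the newest timestamp is the state."""
    return max(fragments)[1]  # max by timestamp

def statistical(fragments):
    """Averages: fragments carry (sum, count) pairs; combine, then divide."""
    total = sum(s for _, (s, _) in fragments)
    count = sum(c for _, (_, c) in fragments)
    return total / count

def retention(fragments, now, window):
    """TTL: drop fragments older than the window before merging."""
    return [(t, v) for t, v in fragments if t > now - window]

plays = [(1, 1), (2, 1), (3, 1)]
assert cumulative(plays) == 3
assert latest_wins([(1, "old"), (5, "new")]) == "new"
```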
Decision Matrix
| Aspect | B+Tree Index | LSM Tree Index |
|---|---|---|
| Primary I/O Pattern | Random Disk Seek | Sequential Throughput |
| Write Cost | Expensive (Random) | Cheap (Buffered, Sequential) |
| Read Cost | Predictable (Fixed) | Variable (Must search across levels) |
| Fragmentation | Low (Internal) | High (External - requires cleaning) |
| Best For | Banking, ERP, OLTP | Analytics, Logs, Real-time Counters |
Key Takeaways
- Sequential is King: LSM transforms chaotic random writes into a streamlined sequence.
- State is Computed: In LSM trees, the "truth" is a calculated merge of current fragments.
- The Compaction Tax: Fast writes come at the cost of background merges to stabilize read performance.
- Real Systems: If you're using Spanner, Cassandra, RocksDB, or Bigtable, LSM trees are running the show.