LSM Tree Indexes
Converting Random I/O to Sequential Throughput
LSM (Log-Structured Merge) Trees are the unsung heroes of high-volume write operations. They sidestep the inefficiencies of random disk updates by batching changes in memory and laying them down as sorted files.
Block 1: The Atomic Unit of Data
Let's start with the basics. In an LSM tree, every record is a straightforward (Key, Value) pair, often timestamped.
Use Case: Real-time "Song Play" Counting
Consider tracking the number of plays on viral songs.
-
Key: Song ID (e.g.,
S_102) -
Value: Increment (e.g.,
+1play)
(S_102, +1, 13:42:05)
Value: +1 Play
Time: 1:42 PM
Block 2: The Production Path (Snapshot @ 1:44 PM)
Data moves from RAM to Disk. At 1:44 PM, the system reflects user actions, periodic flushes, and hourly compactions.
Block 3: The Consumption Path (Merge on Read)
For Song 75, play counts are scattered across RAM and Disk, necessitating a cross-tier "Merge on Read".
Core Concept Definitions
The LSM tree's efficiency hinges on four key operations:
-
Flush: Converts a full MemTable in RAM into a sorted, immutable SSTable on disk.
-
Merge on Read: Combines results from all possible locations (RAM, Level 0, Level 1) to return the current state.
-
Compaction: Merges multiple SSTables into a larger one, eliminating duplicates and outdated records.
-
Policy: Dictates when and how to compact.
LSM State Functions: Defining "The Truth"
In LSM trees, the "State" isn't stored; it's computed. Choose your merge function based on your application:
| Pattern | Logic | Use Case Example |
|---|---|---|
| Cumulative (Counting) | $SUM(v_1, v_2, ..., v_n)$ | Song Play Counts: Summing +1 fragments across files. |
| Latest Wins (Upsert) | $MAX(timestamp)$ | User Profiles: The most recent update "deletes" older values. |
| Statistical (Averages) | $Sum(sums), Sum(counts)$ | Sensor Monitoring: Maintaining rolling averages of temps. |
| Retention (TTL) | $Filter(time > now - W)$ | Session Logs: Automatically dropping data older than 24h. |
Decision Matrix
| Aspect | B+Tree Index | LSM Tree Index |
|---|---|---|
| Primary I/O Pattern | Random Disk Seek | Sequential Throughput |
| Write Cost | Expensive (Random) | Negligible (Buffered) |
| Read Cost | Predicted (Fixed) | Variable (Hunt across levels) |
| Fragmentation | Low (Internal) | High (External - requires cleaning) |
| Best For | Banking, ERP, OLTP | Analytics, Logs, Real-time Counters |
Key Takeaways
-
Sequential is King: LSM transforms chaotic random writes into a streamlined sequence.
-
State is Computed: In LSM trees, the "truth" is a calculated merge of current fragments.
-
The Compaction Tax: Fast writes come at the cost of background merges to stabilize read performance.
-
Real Systems: If you're using Spanner, Cassandra, RocksDB, or Bigtable, LSM trees are running the show.