LSM Tree Indexes
Converting Random I/O to Sequential Throughput
LSM (Log-Structured Merge) Trees are designed for massive write volumes. They avoid the overhead of random disk updates by batching changes in memory and flushing them as sorted files.
Block 1: The Atomic Unit of Data
Before looking at the system, let's look at the unit. Every record in an LSM tree is a simple (Key, Value) pair, often with a Timestamp.
Use Case: Real-time "Song Play" Counting
To make this concrete, let's track the number of plays on viral songs.
- Key: Song ID (e.g., `S_102`)
- Value: Increment (e.g., `+1` play)
- Timestamp: Time of the play (e.g., `13:42:05`)

Example record: `(S_102, +1, 13:42:05)`, meaning one play of song S_102 at 1:42 PM.
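The atomic unit above can be sketched as a tiny immutable structure (the class name and field names here are illustrative, not from any particular engine):

```python
# Minimal sketch of an LSM record: every entry is just (Key, Value, Timestamp).
# Records are immutable once written, so we mark the dataclass frozen.
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    key: str    # e.g. "S_102" (a song ID)
    value: int  # e.g. +1 (one play fragment)
    ts: str     # e.g. "13:42:05"

play = Record("S_102", 1, "13:42:05")
```

Because records are never updated in place, a new play of the same song produces a second `Record` rather than modifying the first.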
Block 2: The Production Path (Snapshot @ 1:44 PM)
Data flows from RAM to disk. In this snapshot, taken at 1:44 PM, we see the combined results of live user actions, 15-minute flushes, and hourly compactions.
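The write side of this path can be sketched in a few lines. This is a toy model, not a real engine: the MemTable is a plain dict, an SSTable is a sorted list of `(key, value)` pairs, and the flush threshold is deliberately tiny.

```python
# Hedged sketch: writes are buffered in a RAM MemTable; when it fills,
# it is sorted and appended to disk as one immutable SSTable.
MEMTABLE_LIMIT = 4     # tiny threshold for illustration

memtable = {}          # RAM buffer: key -> list of value fragments
sstables = []          # "disk": each SSTable is a sorted list of (key, value)

def write(key, value):
    """Buffer a write in RAM; flush sequentially when the buffer is full."""
    memtable.setdefault(key, []).append(value)
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    """Sort the full MemTable and emit it as a single immutable SSTable."""
    sstable = sorted((k, v) for k, vs in memtable.items() for v in vs)
    sstables.append(sstable)   # sequential append, never a random update
    memtable.clear()
```

Note that the only disk operation is an append of an already-sorted file, which is exactly how LSM trees turn random writes into sequential throughput.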
Block 3: The Consumption Path (Merge on Read)
Because play counts for Song 75 are scattered across RAM and Disk, a query must perform a cross-tier "Merge on Read".
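A merge-on-read query for the counting pattern can be sketched as follows. The function and data layout are illustrative: the MemTable is a dict of fragment lists, and each SSTable is a list of `(key, value)` pairs.

```python
# Sketch of "merge on read" for cumulative counts: the total for a key
# is the SUM of every fragment found in RAM and in each disk level.
def play_count(key, memtable, sstables):
    """Combine fragments from the MemTable and all SSTable levels."""
    total = sum(memtable.get(key, []))            # newest data: RAM
    for table in sstables:                         # then Level 0, Level 1, ...
        total += sum(v for k, v in table if k == key)
    return total
```

A real engine would binary-search each sorted SSTable (and consult Bloom filters to skip files entirely) rather than scan every entry, but the shape of the query, one probe per tier plus a combine step, is the same.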
Core Concept Definitions
The LSM tree's performance is governed by four primary operations:
- Flush: The process of taking a full MemTable in RAM, sorting it, and writing it as a single immutable SSTable on disk.
- Merge on Read: The logic of scanning all possible locations for a key (RAM, Level 0, Level 1) and combining the results to return the current state.
- Compaction: A background task that reads multiple SSTables and merges them into a single, larger SSTable, removing duplicates and deleted records along the way.
- Policy: The rules governing when and how to compact (e.g., size-tiered or leveled compaction).
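Compaction for the cumulative-count pattern can be sketched as a single merge pass. This is a simplified model (real engines stream sorted inputs with a k-way merge instead of materializing a dict):

```python
# Hedged sketch of compaction: several small SSTables are merged into one
# larger table, collapsing duplicate keys by summing their fragments.
from collections import defaultdict

def compact(tables):
    """Merge many SSTables into a single deduplicated, sorted SSTable."""
    merged = defaultdict(int)
    for table in tables:
        for key, value in table:
            merged[key] += value       # duplicate keys collapse to one entry
    return sorted(merged.items())      # output stays sorted and immutable
```

After compaction, a read for any key touches one file instead of many, which is the payoff that justifies the background I/O cost.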
LSM State Functions: Defining "The Truth"
The "State" of a record isn't stored; it is calculated. Depending on your application, you choose a different merge function:
| Pattern | Logic | Use Case Example |
|---|---|---|
| Cumulative (Counting) | $SUM(v_1, v_2, ..., v_n)$ | Song Play Counts: Summing +1 fragments across files. |
| Latest Wins (Upsert) | $MAX(timestamp)$ | User Profiles: The most recent update "deletes" older values. |
| Statistical (Averages) | $SUM(sums) / SUM(counts)$ | Sensor Monitoring: Maintaining rolling averages of temperatures. |
| Retention (TTL) | $Filter(time > now - W)$ | Session Logs: Automatically dropping data older than 24h. |
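Three of the merge functions in the table can be sketched directly; each takes a list of `(value, timestamp)` fragments for one key (function names are illustrative):

```python
# Sketches of per-key merge functions from the patterns above.
def merge_cumulative(fragments):
    """Counting: the state is the SUM of all value fragments."""
    return sum(v for v, _ts in fragments)

def merge_latest_wins(fragments):
    """Upsert: the fragment with the MAX timestamp shadows the rest."""
    return max(fragments, key=lambda f: f[1])[0]

def merge_retention(fragments, now, window):
    """TTL: keep only fragments newer than (now - window)."""
    return [f for f in fragments if f[1] > now - window]
```

The key design point is that the merge function is chosen per application: the storage engine only guarantees that all fragments for a key are presented together at read or compaction time.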
Decision Matrix
| Aspect | B+Tree Index | LSM Tree Index |
|---|---|---|
| Primary I/O Pattern | Random Disk Seek | Sequential Throughput |
| Write Cost | Expensive (Random) | Negligible (Buffered) |
| Read Cost | Predictable (Fixed) | Variable (Hunt across levels) |
| Fragmentation | Low (Internal) | High (External - requires cleaning) |
| Best For | Banking, ERP, OLTP | Analytics, Logs, Real-time Counters |
Key Takeaways
- Sequential is King: LSM converts chaotic random writes into a high-speed stream of sequential data.
- State is Computed: In an LSM tree, the "truth" isn't stored in one place; it is calculated by merging current fragments.
- The Compaction Tax: We pay for fast writes by performing background merges to keep read performance stable.
- Real Systems: If you use Spanner, Cassandra, RocksDB, or Bigtable, you are using an LSM tree under the hood.