LSM Tree Indexes

Converting Random I/O to Sequential Throughput

LSM (Log-Structured Merge) Trees are designed for massive write volumes. They avoid the overhead of random disk updates by batching changes in memory and flushing them as sorted files.

O(1)
Write Path (RAM)
Sequential
Disk Storage
Merge-on-Read
Logic for State
Analytics
Primary Use Case

Block 1: The Atomic Unit of Data

Before looking at the system, let's look at the unit. Every record in an LSM tree is a simple (Key, Value) pair, often with a Timestamp.

Use Case: Real-time "Song Play" Counting

To make this concrete, let's track the number of plays on viral songs.

(S_102, +1, 13:42:05)

Key: Song 102
Value: +1 Play
Time: 1:42 PM

Block 2: The Production Path (Snapshot @ 1:44 PM)

Data flows from RAM to Disk. In this example, it's 1:44 PM. We see the results of user actions, 15-minute flushes, and hourly compactions.

USER ACTION (LIVE) EVERY 15 MINS EVERY HOUR (ON THE HOUR) Incoming Plays MemTable (Live @ 1:44 PM) S_12:1, S_44:2, S_75:+2 , S_99:1, S_102:+1 , S_201:5 ... FLUSH (15m interval) Partial SSTables (Level 0) ... S_55:1, S_75:+10 S_88:2 ... 1:30 PM.db ... S_12:5, S_102:+5 S_111:1 ... 1:15 PM.db ... S_22:4, S_75:+8 S_99:2 ... 1:00 PM.db COMPACT (Hourly Merge) Main SSTable (Level 1+) merged_before_1:00_PM.db ... S_50:110, S_62:45, S_75: 5,420 Plays , S_88:12 ...

Block 3: The Consumption Path (Merge on Read)

Because play counts for Song 75 are scattered across RAM and Disk, a query must perform a cross-tier "Merge on Read".

Querying Total Plays for S_75 USER QUERY GET plays WHERE song = 'S_75' 1. MemTable (Live Updates @ 1:44 PM) +2 2. Partial SSTables (Flushed @ 1:15, 1:30) (Summing across level 0 files) +18 3. Main SSTable (History before 1:00 PM) +5,420 STATE @ 1:44 PM 5,440 SONG PLAYS

Core Concept Definitions

The LSM tree's performance is governed by four primary operations:


LSM State Functions: Defining "The Truth"

The "State" of a record isn't stored; it is calculated. Depending on your application, you choose a different merge function:

Pattern Logic Use Case Example
Cumulative (Counting) $SUM(v_1, v_2, ..., v_n)$ Song Play Counts: Summing +1 fragments across files.
Latest Wins (Upsert) $MAX(timestamp)$ User Profiles: The most recent update "deletes" older values.
Statistical (Averages) $Sum(sums), Sum(counts)$ Sensor Monitoring: Maintaining rolling averages of temps.
Retention (TTL) $Filter(time > now - W)$ Session Logs: Automatically dropping data older than 24h.

Decision Matrix

Aspect B+Tree Index LSM Tree Index
Primary I/O Pattern Random Disk Seek Sequential Throughput
Write Cost Expensive (Random) Negligible (Buffered)
Read Cost Predicted (Fixed) Variable (Hunt across levels)
Fragmentation Low (Internal) High (External - requires cleaning)
Best For Banking, ERP, OLTP Analytics, Logs, Real-time Counters

Key Takeaways

  1. Sequential is King: LSM converts chaotic random writes into a high-speed stream of sequential data.

  2. State is Computed: In an LSM tree, the "truth" isn't stored in one place; it is calculated by merging current fragments.

  3. The Compaction Tax: We pay for fast writes by performing background merges to keep read performance stable.

  4. Real Systems: If you use Spanner, Cassandra, RocksDB, or Bigtable, you are using an LSM tree under the hood.