LSM Tree Indexes

Converting Random I/O to Sequential Throughput

LSM (Log-Structured Merge) trees are the unsung heroes of high-volume write workloads. They sidestep the inefficiency of random disk updates by batching changes in memory and writing them to disk as sorted, immutable files.

At a glance:

  1. Write path (RAM): O(1)
  2. Disk storage: Sequential
  3. Logic for state: Merge-on-Read
  4. Primary use case: Analytics

Block 1: The Atomic Unit of Data

Let's start with the basics. In an LSM tree, every record is a straightforward (Key, Value) pair, often timestamped.

Use Case: Real-time "Song Play" Counting

Consider tracking the number of plays on viral songs.

(S_102, +1, 13:42:05)
Key: Song 102
Value: +1 Play
Time: 1:42 PM
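A record like this can be modeled as a plain tuple; a minimal sketch (the `PlayEvent` name and field names are illustrative, not from any particular engine):

```python
from collections import namedtuple

# A minimal LSM record: (Key, Value, Timestamp).
PlayEvent = namedtuple("PlayEvent", ["key", "value", "ts"])

# Song 102 receives +1 play at 13:42:05.
event = PlayEvent(key="S_102", value=1, ts="13:42:05")
print(event)  # PlayEvent(key='S_102', value=1, ts='13:42:05')
```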

Block 2: The Production Path (Snapshot @ 1:44 PM)

Data moves from RAM to Disk. At 1:44 PM, the system reflects user actions, periodic flushes, and hourly compactions.

  1. Incoming plays (user action, live): writes land in the MemTable. Live contents @ 1:44 PM: ... S_12:1, S_44:2, S_75:+2, S_99:1, S_102:+1, S_201:5 ...
  2. Flush (every 15 minutes): the MemTable is written out as a partial SSTable (Level 0). On disk: 1:30 PM.db (... S_55:1, S_75:+10, S_88:2 ...), 1:15 PM.db (... S_12:5, S_102:+5, S_111:1 ...), 1:00 PM.db (... S_22:4, S_75:+8, S_99:2 ...).
  3. Compact (hourly, on the hour): Level-0 files are merged into the main SSTable (Level 1+), merged_before_1:00_PM.db (... S_50:110, S_62:45, S_75: 5,420 plays, S_88:12 ...).
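The write path can be sketched in a few lines of Python. This is a minimal sketch: the `MemTable` class, its methods, and the file format are illustrative assumptions, not any specific engine's API.

```python
import json

class MemTable:
    """In-memory write buffer; flushed to disk as a sorted SSTable."""

    def __init__(self):
        self.data = {}  # key -> accumulated delta

    def put(self, key, delta):
        # O(1) in-memory update; no disk seek on the write path.
        self.data[key] = self.data.get(key, 0) + delta

    def flush(self, path):
        # Emit keys in sorted order so the on-disk file is a sorted,
        # immutable SSTable: one sequential write, no random I/O.
        with open(path, "w") as f:
            for key in sorted(self.data):
                f.write(json.dumps({"key": key, "value": self.data[key]}) + "\n")
        self.data.clear()

mt = MemTable()
for key, delta in [("S_102", 1), ("S_75", 2), ("S_12", 1)]:
    mt.put(key, delta)
mt.flush("level0_1_44.db")  # a Level-0 SSTable snapshot
```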

Block 3: The Consumption Path (Merge on Read)

For Song 75, play counts are scattered across RAM and Disk, necessitating a cross-tier "Merge on Read".

User query: GET plays WHERE song = 'S_75'

  1. MemTable (live updates @ 1:44 PM): +2
  2. Partial SSTables (Level 0 flushes): summing across files, +18
  3. Main SSTable (history before 1:00 PM): +5,420

State @ 1:44 PM: 5,440 song plays.
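For a cumulative (SUM) merge, the read path is a fold across tiers. A minimal sketch, using the 1:44 PM numbers from the snapshot above; the tier structures are illustrative:

```python
# "Truth" for S_75 only exists once all tiers are combined.
memtable      = {"S_75": 2}                  # live updates in RAM
level0_tables = [{"S_75": 10}, {"S_75": 8}]  # partial SSTable flushes
main_sstable  = {"S_75": 5420}               # compacted history

def get_plays(key):
    # Merge-on-read: sum the key's fragments across every tier.
    total = memtable.get(key, 0)
    for table in level0_tables:
        total += table.get(key, 0)
    total += main_sstable.get(key, 0)
    return total

print(get_plays("S_75"))  # 5440
```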

Core Concept Definitions

The LSM tree's efficiency hinges on four key operations:

  1. Write: append the record to the in-memory MemTable.
  2. Flush: periodically persist the MemTable to disk as a sorted, immutable SSTable (Level 0).
  3. Compact: merge smaller SSTables into larger, deduplicated ones (Level 1+).
  4. Merge-on-Read: combine a key's fragments across all tiers at query time to compute the current state.

LSM State Functions: Defining "The Truth"

In LSM trees, the "State" isn't stored; it's computed. Choose your merge function based on your application:

| Pattern | Logic | Use Case Example |
| --- | --- | --- |
| Cumulative (Counting) | $SUM(v_1, v_2, ..., v_n)$ | Song Play Counts: summing +1 fragments across files. |
| Latest Wins (Upsert) | $MAX(timestamp)$ | User Profiles: the most recent update "deletes" older values. |
| Statistical (Averages) | $SUM(sums) / SUM(counts)$ | Sensor Monitoring: maintaining rolling averages of temperatures. |
| Retention (TTL) | $Filter(time > now - W)$ | Session Logs: automatically dropping data older than 24h. |
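The four patterns in the table each reduce a list of fragments to one "state". A minimal sketch, where each fragment is an illustrative `(value, timestamp)` pair:

```python
def cumulative(frags):
    # Counting: SUM of all fragment values.
    return sum(v for v, _ in frags)

def latest_wins(frags):
    # Upsert: the value with the newest timestamp wins.
    return max(frags, key=lambda f: f[1])[0]

def statistical(frags):
    # Averages: in practice you keep (sum, count) pairs and divide on read.
    return sum(v for v, _ in frags) / len(frags)

def retention(frags, window, now):
    # TTL: keep only fragments newer than `now - window`.
    return [f for f in frags if f[1] > now - window]

frags = [(1, 100), (2, 200), (3, 300)]
print(cumulative(frags))   # 6
print(latest_wins(frags))  # 3
print(statistical(frags))  # 2.0
```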

Decision Matrix

| Aspect | B+Tree Index | LSM Tree Index |
| --- | --- | --- |
| Primary I/O Pattern | Random Disk Seek | Sequential Throughput |
| Write Cost | Expensive (Random) | Negligible (Buffered) |
| Read Cost | Predictable (Fixed) | Variable (hunt across levels) |
| Fragmentation | Low (Internal) | High (External; requires cleaning) |
| Best For | Banking, ERP, OLTP | Analytics, Logs, Real-time Counters |

Key Takeaways

  1. Sequential is King: LSM transforms chaotic random writes into a streamlined sequence.

  2. State is Computed: In LSM trees, the "truth" is a calculated merge of current fragments.

  3. The Compaction Tax: Fast writes come at the cost of background merges to stabilize read performance.

  4. Real Systems: If you're using Spanner, Cassandra, RocksDB, or Bigtable, LSM trees are running the show.