LSM Tree Indexes

Converting Random I/O to Sequential Throughput

LSM (Log-Structured Merge) trees are the unsung heroes of high-volume write workloads. They sidestep the inefficiency of random disk updates by batching changes in memory and writing them to disk as sorted, immutable files.

At a glance:

  1. Write path (RAM): O(1)
  2. Disk storage: Sequential
  3. Logic for state: Merge-on-Read
  4. Primary use case: Analytics

Block 1: The Atomic Unit of Data

Let's start with the basics. In an LSM tree, every record is a straightforward (Key, Value) pair, often timestamped.

Use Case: Real-time "Song Play" Counting

Consider tracking the number of plays on viral songs.

(S_102, +1, 13:42:05)
Key: Song 102
Value: +1 Play
Time: 1:42 PM
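A record like this can be modeled as a plain tuple; a minimal sketch (the `PlayEvent` name and field names are illustrative, not from any particular engine):

```python
from collections import namedtuple

# A minimal LSM record: (Key, Value, Timestamp).
PlayEvent = namedtuple("PlayEvent", ["key", "value", "ts"])

# Song 102 receives +1 play at 13:42:05.
event = PlayEvent(key="S_102", value=1, ts="13:42:05")
print(event)  # PlayEvent(key='S_102', value=1, ts='13:42:05')
```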

Block 2: The Production Path (Snapshot @ 1:44 PM)

Data moves from RAM to Disk. At 1:44 PM, the system reflects user actions, periodic flushes, and hourly compactions.

  1. Incoming plays (user action, live): writes land in the MemTable. Live contents @ 1:44 PM: ... S_12:1, S_44:2, S_75:+2, S_99:1, S_102:+1, S_201:5 ...
  2. Flush (every 15 minutes): the MemTable is written out as a partial SSTable (Level 0). On disk: 1:30 PM.db (... S_55:1, S_75:+10, S_88:2 ...), 1:15 PM.db (... S_12:5, S_102:+5, S_111:1 ...), 1:00 PM.db (... S_22:4, S_75:+8, S_99:2 ...).
  3. Compact (hourly, on the hour): Level-0 files are merged into the main SSTable (Level 1+), merged_before_1:00_PM.db (... S_50:110, S_62:45, S_75: 5,420 plays, S_88:12 ...).
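The write path can be sketched in a few lines of Python. This is a minimal sketch: the `MemTable` class, its methods, and the file format are illustrative assumptions, not any specific engine's API.

```python
import json

class MemTable:
    """In-memory write buffer; flushed to disk as a sorted SSTable."""

    def __init__(self):
        self.data = {}  # key -> accumulated delta

    def put(self, key, delta):
        # O(1) in-memory update; no disk seek on the write path.
        self.data[key] = self.data.get(key, 0) + delta

    def flush(self, path):
        # Emit keys in sorted order so the on-disk file is a sorted,
        # immutable SSTable: one sequential write, no random I/O.
        with open(path, "w") as f:
            for key in sorted(self.data):
                f.write(json.dumps({"key": key, "value": self.data[key]}) + "\n")
        self.data.clear()

mt = MemTable()
for key, delta in [("S_102", 1), ("S_75", 2), ("S_12", 1)]:
    mt.put(key, delta)
mt.flush("level0_1_44.db")  # a Level-0 SSTable snapshot
```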

Block 3: The Consumption Path (Merge on Read)

For Song 75, play counts are scattered across RAM and Disk, necessitating a cross-tier "Merge on Read".

User query: GET plays WHERE song = 'S_75'

  1. MemTable (live updates @ 1:44 PM): +2
  2. Partial SSTables (Level 0 flushes): summing across files, +18
  3. Main SSTable (history before 1:00 PM): +5,420

State @ 1:44 PM: 5,440 song plays.
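For a cumulative (SUM) merge, the read path is a fold across tiers. A minimal sketch, using the 1:44 PM numbers from the snapshot above; the tier structures are illustrative:

```python
# "Truth" for S_75 only exists once all tiers are combined.
memtable      = {"S_75": 2}                  # live updates in RAM
level0_tables = [{"S_75": 10}, {"S_75": 8}]  # partial SSTable flushes
main_sstable  = {"S_75": 5420}               # compacted history

def get_plays(key):
    # Merge-on-read: sum the key's fragments across every tier.
    total = memtable.get(key, 0)
    for table in level0_tables:
        total += table.get(key, 0)
    total += main_sstable.get(key, 0)
    return total

print(get_plays("S_75"))  # 5440
```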

Core Concept Definitions

The LSM tree's efficiency hinges on four key operations:

  1. Write: append the record to the in-memory MemTable.
  2. Flush: periodically persist the MemTable to disk as a sorted, immutable SSTable (Level 0).
  3. Compact: merge smaller SSTables into larger, deduplicated ones (Level 1+).
  4. Merge-on-Read: combine a key's fragments across all tiers at query time to compute the current state.

LSM State Functions: Defining "The Truth"

In LSM trees, the "State" isn't stored; it's computed. Choose your merge function based on your application:

| Pattern | Logic | Use Case Example |
| --- | --- | --- |
| Cumulative (Counting) | $SUM(v_1, v_2, ..., v_n)$ | Song Play Counts: summing +1 fragments across files. |
| Latest Wins (Upsert) | $MAX(timestamp)$ | User Profiles: the most recent update "deletes" older values. |
| Statistical (Averages) | $SUM(sums) / SUM(counts)$ | Sensor Monitoring: maintaining rolling averages of temperatures. |
| Retention (TTL) | $Filter(time > now - W)$ | Session Logs: automatically dropping data older than 24h. |
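The four patterns in the table each reduce a list of fragments to one "state". A minimal sketch, where each fragment is an illustrative `(value, timestamp)` pair:

```python
def cumulative(frags):
    # Counting: SUM of all fragment values.
    return sum(v for v, _ in frags)

def latest_wins(frags):
    # Upsert: the value with the newest timestamp wins.
    return max(frags, key=lambda f: f[1])[0]

def statistical(frags):
    # Averages: in practice you keep (sum, count) pairs and divide on read.
    return sum(v for v, _ in frags) / len(frags)

def retention(frags, window, now):
    # TTL: keep only fragments newer than `now - window`.
    return [f for f in frags if f[1] > now - window]

frags = [(1, 100), (2, 200), (3, 300)]
print(cumulative(frags))   # 6
print(latest_wins(frags))  # 3
print(statistical(frags))  # 2.0
```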

Decision Matrix

| Aspect | B+Tree Index | LSM Tree Index |
| --- | --- | --- |
| Primary I/O Pattern | Random Disk Seek | Sequential Throughput |
| Write Cost | Expensive (Random) | Negligible (Buffered) |
| Read Cost | Predictable (Fixed) | Variable (hunt across levels) |
| Fragmentation | Low (Internal) | High (External; requires cleaning) |
| Best For | Banking, ERP, OLTP | Analytics, Logs, Real-time Counters |

Key Takeaways

  1. Sequential is King: LSM transforms chaotic random writes into a streamlined sequence.

  2. State is Computed: In LSM trees, the "truth" is a calculated merge of current fragments.

  3. The Compaction Tax: Fast writes come at the cost of background merges to stabilize read performance.

  4. Real Systems: If you're using Spanner, Cassandra, RocksDB, or Bigtable, LSM trees are running the show.