LSM Tree Indexes

Converting Random I/O to Sequential Throughput

LSM (Log-Structured Merge) Trees are designed for massive write volumes. They avoid the overhead of random disk updates by batching changes in memory and flushing them as sorted files.

O(1)

Write Path (RAM)

Sequential

Disk Storage

Merge-on-Read

Logic for State

Analytics

Primary Use Case

Block 1: The Atomic Unit of Data

Before looking at the system, let's look at the unit. Every record in an LSM tree is a simple (Key, Value) pair, often with a Timestamp.

Use Case: Real-time "Song Play" Counting

To make this concrete, let's track the number of plays on viral songs.

Key: Song ID (e.g., S_102)
Value: Increment (e.g., +1 play)

(S_102, +1, 13:42:05)

Key: Song 102
Value: +1 Play
Time: 1:42 PM

Block 2: The Production Path (Snapshot @ 1:44 PM)

Data flows from RAM to Disk. In this example, it's 1:44 PM. We see the results of user actions, 15-minute flushes, and hourly compactions.

Block 3: The Consumption Path (Merge on Read)

Because play counts for Song 75 are scattered across RAM and Disk, a query must perform a cross-tier "Merge on Read".

Core Concept Definitions

The LSM tree's performance is governed by four primary operations:

Flush: The process of taking a full MemTable in RAM, sorting it, and writing it as a single immutable SSTable on disk.
Merge on Read: The logic of scanning all possible locations for a key (RAM, Level 0, Level 1) and combining the results to return the current state.
Compaction: A background task that reads multiple SSTables and merges them into a single, larger SSTable, effectively removing duplicates and deleted records.
Policy: The rules governing when and how to compact.

LSM State Functions: Defining "The Truth"

The "State" of a record isn't stored; it is calculated. Depending on your application, you choose a different merge function:

Pattern	Logic	Use Case Example
Cumulative (Counting)	$SUM(v_1, v_2, ..., v_n)$	Song Play Counts: Summing +1 fragments across files.
Latest Wins (Upsert)	$MAX(timestamp)$	User Profiles: The most recent update "deletes" older values.
Statistical (Averages)	$Sum(sums), Sum(counts)$	Sensor Monitoring: Maintaining rolling averages of temps.
Retention (TTL)	$Filter(time > now - W)$	Session Logs: Automatically dropping data older than 24h.

Decision Matrix

Aspect	B+Tree Index	LSM Tree Index
Primary I/O Pattern	Random Disk Seek	Sequential Throughput
Write Cost	Expensive (Random)	Negligible (Buffered)
Read Cost	Predicted (Fixed)	Variable (Hunt across levels)
Fragmentation	Low (Internal)	High (External - requires cleaning)
Best For	Banking, ERP, OLTP	Analytics, Logs, Real-time Counters

Key Takeaways

Sequential is King: LSM converts chaotic random writes into a high-speed stream of sequential data.
State is Computed: In an LSM tree, the "truth" isn't stored in one place; it is calculated by merging current fragments.
The Compaction Tax: We pay for fast writes by performing background merges to keep read performance stable.
Real Systems: If you use Spanner, Cassandra, RocksDB, or Bigtable, you are using an LSM tree under the hood.