LSM Tree Indexes

Concept. An LSM tree converts random writes into sequential disk writes by buffering inserts in a RAM MemTable, then flushing them as immutable SSTables that are merged in the background.

Intuition. When Spotify ingests 1 million listen events per second, a B+Tree can't keep up, because every insert is a random disk write. LSM trees absorb all writes into RAM first, then dump them to disk as a single sequential file. Reads check RAM first, then the on-disk SSTables.

Prerequisites. SSTables introduces the immutable file format LSM trees are built from.


What Gets Stored

Every record is a (key, value, timestamp) triple. For a song-play counter, that looks like:

(S_102, +1, 13:42:05)

Song ID is the key, the increment is the value, and wall-clock time is the version. No row is ever updated in place. A new (S_102, +1, …) is appended for every play, and the "current state" of S102 is whatever you get when you add all of its values together.


Writes: How Data Moves from RAM to Disk

Writes flow downward through three tiers. The caption below names each piece and walks the live S75 example.

LSM write path at 1:44 PM. Incoming plays accumulate in the red MemTable in RAM, flush every 15 minutes into green Level-0 SSTables (1:00, 1:15, and 1:30 .db) on disk, and merge hourly into a green compacted Level-1+ SSTable. Song S_75's purple play counts are spread across all three tiers.

Figure 1. LSM write path at 1:44 PM, RAM on top and disk below, time flowing downward. Incoming plays land in the in-RAM MemTable (fast, no seek, lost on crash unless replayed from the WAL); every 15 minutes the engine flushes it to an immutable Level-0 SSTable on disk (1:00, 1:15, 1:30), and hourly compaction merges those into one larger sorted, de-duplicated Level-1+ file. Total plays for S75 smear across tiers (+2 in the MemTable, +10 and +8 in Level-0, 5,420 in compacted history) and the read path sums them to 5,440 (merge-on-read). The merge operator is configurable: "add" for counters, "newest wins" for last-write-wins keys, "set union" for tags, "max" for high-water-marks. The merge section walks through more examples.

No write ever does a random seek. The MemTable is RAM, the flush is one sequential append, and compaction is a streaming merge. This is how an LSM tree sustains a million writes per second on commodity disks.


Reads: Merge on Read

The real state of a key isn't stored anywhere. It's computed by walking every tier that might hold the key and combining the results.

Merge-on-read. A query for total plays of S_75 gets +2 from the red MemTable, +18 from the three green Level-0 SSTables, and +5,420 from the green Level-1+ SSTable, summed in purple to 5,440 total plays at 1:44 PM.

Figure 2. Merge-on-read for S75. The query computes state at read time instead of storing it.

  • MemTable (RAM): contributes +2 from the live in-RAM buffer.
  • Level-0 SSTables (disk): contribute +10 · +5 · +3 = +18 across the three flushed files.
  • Level-1+ SSTable (disk, compacted): contributes +5,420 from the merged history.
  • Live trace, summed result: 2 + 18 + 5,420 = 5,440. The merge function here is "add" because plays are a counter.
  • Read-cost optimization. Each SSTable carries a Bloom filter, so the reader skips any tier that can't contain the key, with no disk read. Most reads are much cheaper than this worst case.

A naïve read would touch every SSTable. In practice each SSTable carries a bloom filter, so tiers that can't possibly contain the key are skipped without a disk read. Hot keys in recent SSTables answer fast, and cold keys may walk deep.


How "Merge" Is Defined

Because LSM reads compute state from fragments, the "merge" step is really an application-specific reduction. Four patterns cover most systems:

Pattern Merge logic Example
Counter sum(v₁, v₂, …) Song play counts. Sum the +1 fragments.
Latest-wins pick max(timestamp) User profile updates. Newest overwrites.
Statistical sum(sums), sum(counts) Rolling averages. Keep running totals.
TTL / retention filter(time > now − window) Session logs. Drop events older than 24 hours.

Next

IO Cost & Trade-offs → You now have all four index structures. The next page puts them on one cost model so you can decide which to pick for a given access pattern, before the case studies apply them to real systems.