LSM Tree Indexes
Concept. An LSM tree converts random writes into sequential disk writes by buffering inserts in a RAM MemTable, then flushing them as immutable SSTables that are merged in the background.
Intuition. When Spotify ingests 1 million listen events per second, a B+Tree can't keep up, because every insert is a random disk write. LSM trees absorb all writes into RAM first, then dump them to disk as a single sequential file. Reads check RAM first, then the on-disk SSTables.
Prerequisites. SSTables introduces the immutable file format LSM trees are built from.
What Gets Stored
Every record is a (key, value, timestamp) triple. For a song-play counter, that looks like:
(S_102, +1, 13:42:05)
Song ID is the key, the increment is the value, and wall-clock time is the version. No row is ever updated in place. A new (S_102, +1, …) is appended for every play, and the "current state" of S102 is whatever you get when you add all of its values together.
Writes: How Data Moves from RAM to Disk
Writes flow downward through three tiers. The caption below names each piece and walks the live S75 example.
Figure 1. LSM write path at 1:44 PM, RAM on top and disk below, time flowing downward. Incoming plays land in the in-RAM MemTable (fast, no seek, lost on crash unless replayed from the WAL); every 15 minutes the engine flushes it to an immutable Level-0 SSTable on disk (1:00, 1:15, 1:30), and hourly compaction merges those into one larger sorted, de-duplicated Level-1+ file. Total plays for S75 smear across tiers (+2 in the MemTable, +10 and +8 in Level-0, 5,420 in compacted history) and the read path sums them to 5,440 (merge-on-read). The merge operator is configurable: "add" for counters, "newest wins" for last-write-wins keys, "set union" for tags, "max" for high-water-marks. The merge section walks through more examples.
No write ever does a random seek. The MemTable is RAM, the flush is one sequential append, and compaction is a streaming merge. This is how an LSM tree sustains a million writes per second on commodity disks.
Reads: Merge on Read
The real state of a key isn't stored anywhere. It's computed by walking every tier that might hold the key and combining the results.
Figure 2. Merge-on-read for S75. The query computes state at read time instead of storing it.
- MemTable (RAM): contributes
+2from the live in-RAM buffer. - Level-0 SSTables (disk): contribute
+10 · +5 · +3 = +18across the three flushed files. - Level-1+ SSTable (disk, compacted): contributes
+5,420from the merged history. - Live trace, summed result:
2 + 18 + 5,420 = 5,440. The merge function here is "add" because plays are a counter. - Read-cost optimization. Each SSTable carries a Bloom filter, so the reader skips any tier that can't contain the key, with no disk read. Most reads are much cheaper than this worst case.
A naïve read would touch every SSTable. In practice each SSTable carries a bloom filter, so tiers that can't possibly contain the key are skipped without a disk read. Hot keys in recent SSTables answer fast, and cold keys may walk deep.
How "Merge" Is Defined
Because LSM reads compute state from fragments, the "merge" step is really an application-specific reduction. Four patterns cover most systems:
| Pattern | Merge logic | Example |
|---|---|---|
| Counter | sum(v₁, v₂, …) | Song play counts. Sum the +1 fragments. |
| Latest-wins | pick max(timestamp) | User profile updates. Newest overwrites. |
| Statistical | sum(sums), sum(counts) | Rolling averages. Keep running totals. |
| TTL / retention | filter(time > now − window) | Session logs. Drop events older than 24 hours. |
Next
IO Cost & Trade-offs → You now have all four index structures. The next page puts them on one cost model so you can decide which to pick for a given access pattern, before the case studies apply them to real systems.