Case Study 3.3: Tracking High-Volume Activity with LSM Trees

Reading Time: 8 mins

How does Spotify track billions of "Likes" and "Plays"?

Goal: Design data structures for high-volume incremental updates.

Every time a user "Likes" a song or plays a track, Spotify must record it immediately. For a global system with 700M+ users, this creates a massive write-volume bottleneck.

The Bottleneck: Random Writes in B+Trees

In a B+Tree, every update is a read-modify-write: the system must traverse the tree to find the correct leaf node, load that page from disk, and overwrite the value in place. With 100 million songs, updating the play count for a single song pays a random disk seek just to locate the data. Doing this millions of times per second globally is too slow for standard hardware.
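The contrast between the two write patterns can be sketched in a few lines. This is a toy illustration only (real B+Trees and LSM engines operate on disk pages and log files, not Python dicts and lists), but it shows where the "read-before-write" cost lives:

```python
# Toy contrast between the two write patterns (illustrative only).

def btree_style_update(index: dict, song_id: str) -> None:
    # Read-before-write: locate the current value, then overwrite it.
    current = index.get(song_id, 0)   # in a real B+Tree, this lookup is a disk seek
    index[song_id] = current + 1      # overwrite in place

def lsm_style_update(log: list, song_id: str) -> None:
    # "No-read" write: just append the increment; no lookup needed.
    log.append((song_id, 1))

index = {}
btree_style_update(index, "SongA")
btree_style_update(index, "SongA")
print(index)  # {'SongA': 2}

log = []
lsm_style_update(log, "SongA")
lsm_style_update(log, "SongA")
print(log)    # [('SongA', 1), ('SongA', 1)]
```

Note that the LSM path records two separate fragments rather than one total; reconciling them is deferred to a later merge.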


The Solution: The "No-Read" Write

Most big-data systems (e.g., Google's Bigtable and Spanner, Apache Cassandra) use LSM Trees to avoid the "read-before-write" penalty. In an LSM Tree, the system never reads existing data during a write; it only appends.

The Analogy: Encyclopedia vs. Sticky Notes

  • B+Tree (The Encyclopedia): To update a fact, you must find the exact volume and page before you can write. If a song gets a new Like, you have to hunt through the shelf, find the page, and edit it. The write is slow because it is gated by a search.
  • LSM (The Sticky Note): You simply write "SongA: +1" on a note and slap it on the wall. You don't care where the other likes are.
    • The Secret: When the wall is full, you take all the notes, sort them, and bind them into a thin booklet (SSTable).
    • Deferred Work: The data still gets organized, but in high-speed batches (merging multiple sorted SSTable files) long after the user's action is recorded.
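The "bind the notes into a booklet" step is just a merge of sorted runs, which can be done in a single sequential pass. A minimal sketch, assuming each SSTable is a sorted list of `(key, count)` pairs and that the merge policy for counters is summation:

```python
import heapq
from itertools import groupby

# Two "SSTables": sorted (key, count) runs, as produced by earlier flushes.
sstable_1 = [("SongA", 3), ("SongB", 1)]
sstable_2 = [("SongA", 2), ("SongC", 5)]

def compact(*sstables):
    """Merge several sorted runs into one, summing counts per key."""
    merged = heapq.merge(*sstables)          # one sequential pass, no random seeks
    return [(key, sum(count for _, count in group))
            for key, group in groupby(merged, key=lambda kv: kv[0])]

print(compact(sstable_1, sstable_2))
# [('SongA', 5), ('SongB', 1), ('SongC', 5)]
```

Because every input run is already sorted, the merge only ever reads each file front to back, which is exactly the access pattern disks are fastest at.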

How it Works: RAM, Flush, and Merge

  1. Buffered Writes (MemTable): Every "Like" is recorded in a high-speed RAM buffer ($O(1)$). We don't check the current total; we just record the increment.

  2. Sequential Flushes (SSTables): When the RAM fills up, the system sorts the data and writes it to disk as a Sorted String Table (SSTable). Because the data is already sorted in RAM, the disk write is sequential, which is 100x faster than random seeks.

  3. Merge on Read: To find the total likes, the system sums fragments from the MemTable and all SSTables on disk.
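The three steps above can be combined into a minimal in-memory sketch. This is not a production engine (real systems persist SSTables to disk, index them, and compact them in the background), but the MemTable / flush / merge-on-read flow is the same:

```python
from collections import Counter

class CounterLSM:
    """Minimal sketch of the MemTable / SSTable / merge-on-read pattern."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable = Counter()        # Step 1: high-speed RAM buffer
        self.sstables = []               # Step 2: list of sorted (key, count) runs
        self.memtable_limit = memtable_limit

    def like(self, song_id: str) -> None:
        # O(1) buffered write; we never check the current total.
        self.memtable[song_id] += 1
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Sort in RAM, then emit one sequential sorted run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = Counter()

    def total_likes(self, song_id: str) -> int:
        # Step 3: merge on read — sum fragments across all tiers.
        total = self.memtable[song_id]
        for run in self.sstables:
            total += sum(count for key, count in run if key == song_id)
        return total

db = CounterLSM(memtable_limit=2)
for song in ["SongA", "SongB", "SongA", "SongC", "SongA"]:
    db.like(song)
print(db.total_likes("SongA"))  # 3
```

Notice the asymmetry: writes touch only the MemTable, while reads must consult every tier. This is the trade the LSM design makes deliberately.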

Visual Reference: See the "Production Path" and "Consumption Path" diagrams in the LSM Tree Indexes section for a detailed view of these tiers.


Beyond Counters: Customizing Merge Behavior

LSM trees aren't limited to adding numbers. Because reads reconcile fragments at query time, you can plug in a different merge function per data type: sum increments for play counts, keep the maximum for a "last played" timestamp, or take a set union for playlist membership. (RocksDB exposes exactly this hook as a user-defined "merge operator.")
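A sketch of per-family merge functions, in the spirit of RocksDB's merge operator. The family names and the `merge` helper here are hypothetical, chosen for illustration:

```python
# Hypothetical per-family merge functions (illustrative names, not a real API).
MERGE_OPS = {
    "play_count":  lambda old, delta: old + delta,   # numeric increment
    "last_played": lambda old, ts: max(old, ts),     # keep the latest timestamp
    "liked_songs": lambda old, new: old | new,       # set union
}

def merge(family: str, fragments):
    """Fold all stored fragments for a key using the family's merge op."""
    it = iter(fragments)
    acc = next(it)
    for frag in it:
        acc = MERGE_OPS[family](acc, frag)
    return acc

print(merge("play_count", [3, 1, 2]))                    # 6
print(merge("last_played", [1700000000, 1700000500]))    # 1700000500
print(sorted(merge("liked_songs", [{"SongA"}, {"SongB"}])))  # ['SongA', 'SongB']
```

The key requirement is that the merge function is associative, so the engine is free to combine fragments in any grouping during compaction.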

Takeaway: By treating the database as a Log (append-only) rather than a Ledger (overwriting), LSM trees decouple write performance from data size. Whether you have 1,000 songs or 100 billion, write latency stays roughly constant; it is the reads that pay for merging the fragments.