Case Study 3.3: Tracking High-Volume Activity with LSM Trees
How does Spotify track billions of "Likes" and "Plays"?
Goal: Design data structures for high-volume incremental updates.
Every time a user "Likes" a song or plays a track, Spotify must record it immediately. For a global system with hundreds of millions of users, this creates a massive write-throughput bottleneck.
The Bottleneck: Random Writes in B+Trees
In a B+Tree, every update is a read-modify-write: the engine must traverse the tree to find the correct leaf page, load that page from disk, and overwrite the value in place. With 100 million songs, updating the play count for a single song pays that full lookup cost every time. Doing this millions of times per second across the globe is too slow for standard hardware.
The Solution: The "No-Read" Write
Many large-scale storage systems (e.g., Google's Bigtable, Apache Cassandra, and RocksDB) use LSM Trees to avoid this "read-before-write" penalty. In an LSM Tree, the system never searches the disk during a write.
The Analogy: Encyclopedia vs. Sticky Notes
- B+Tree (The Encyclopedia): To update a fact, you must find the exact volume and page before you can write. If a song gets a new Like, you have to hunt through the shelf, find the page, and edit it. The write is slow because it is gated by a search.
- LSM (The Sticky Note): You simply write "SongA: +1" on a note and slap it on the wall. You don't care where the other likes are.
- The Secret: When the wall is full, you take all the notes, sort them, and bind them into a thin booklet (SSTable).
- Deferred Work: We still organize the data, but we do it in high-speed batches (merging multiple already-sorted SSTable files, as sketched below) long after the user's action is recorded.
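To make the deferred, batched merge concrete, here is a minimal sketch of compaction as a k-way merge. It assumes SSTable runs are just sorted Python lists of (key, count) pairs and that duplicate keys are combined by summing (the counter merge rule); this is an illustration, not any particular engine's implementation.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def compact(runs):
    """Merge several sorted (key, count) runs into one sorted run.

    Each run is already sorted by key, so heapq.merge can stream them
    in order without loading everything into memory at once.
    Duplicate keys across runs are summed (the "counter" merge rule).
    """
    merged = heapq.merge(*runs, key=itemgetter(0))
    return [(key, sum(count for _, count in group))
            for key, group in groupby(merged, key=itemgetter(0))]

# Three small sorted runs ("booklets") of (song_id, like_increments):
run1 = [("songA", 3), ("songC", 1)]
run2 = [("songA", 2), ("songB", 5)]
run3 = [("songB", 1), ("songC", 4)]

print(compact([run1, run2, run3]))
# [('songA', 5), ('songB', 6), ('songC', 5)]
```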
How it Works: RAM, Flush, and Merge
- Buffered Writes (MemTable): Every "Like" is recorded in a high-speed RAM buffer ($O(1)$). We don't check the current total; we just record the increment.
- Sequential Flushes (SSTables): When the RAM buffer fills up, the system sorts the data and writes it to disk as a Sorted String Table (SSTable). Because the data is already sorted in RAM, the disk write is purely sequential, which can be roughly 100x faster than scattered random seeks.
- Merge on Read: To find the total likes for a song, the system sums the fragments from the MemTable and all SSTables on disk. A minimal sketch of this whole flow appears just below.
Visual Reference: See the "Production Path" and "Consumption Path" diagrams in the LSM Tree Indexes section for a detailed view of these tiers.
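Here is a minimal sketch of that write path and merge-on-read for a counter workload. It assumes the simplest possible stand-ins (a dict for the MemTable, in-memory sorted lists for SSTables) and omits the write-ahead log, bloom filters, and background compaction a real engine would add; the class and its names are hypothetical, not Spotify's code.

```python
class LsmCounter:
    """Toy LSM-style counter: O(1) buffered writes, merge-on-read totals."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # key -> pending increment (RAM buffer)
        self.sstables = []            # list of sorted [(key, count), ...] runs
        self.memtable_limit = memtable_limit

    def increment(self, key, delta=1):
        # No read, no disk seek: just record the increment in RAM.
        self.memtable[key] = self.memtable.get(key, 0) + delta
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort once in RAM; the on-disk write would then be purely sequential.
        run = sorted(self.memtable.items())
        self.sstables.append(run)
        self.memtable = {}

    def total(self, key):
        # Merge on read: sum the fragments from the MemTable and every SSTable.
        count = self.memtable.get(key, 0)
        for run in self.sstables:
            for k, delta in run:      # a real engine would binary-search the run
                if k == key:
                    count += delta
        return count

likes = LsmCounter(memtable_limit=2)
for song in ["songA", "songB", "songA", "songA", "songC", "songA"]:
    likes.increment(song)
print(likes.total("songA"))  # 4: the fragments live in the MemTable and two SSTables
```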
Beyond Counters: Customizing Merge Behavior
LSM trees aren't limited to just adding numbers. You can customize the Merge Behavior to handle different types of updates.
- Latest Wins (Profile Updates and Status): When a user updates their profile picture, or a Discord user changes their status from "Away" to "Online," we don't search the disk to overwrite the old value. We simply append the newest status to the MemTable. When a query comes in, the system scans from newest to oldest (MemTable → Temp SSTables → Main SSTable) and returns the first version it finds. The most recent update effectively "hides" the old ones.
- Messaging (WhatsApp and Slack): In a chat app, every message is an immutable event. Instead of updating a "conversation" object, the system simply appends new messages to a log. When you open a chat, the system performs a Merge on Read, fetching the latest fragments from RAM and disk to reconstruct your conversation in chronological order. This lets these apps handle millions of messages per second without locking database rows.
- Real-Time Tracking (Uber and Lyft): Ride-sharing apps track driver GPS coordinates every second. Overwriting a single "location" column for every driver would create massive contention. Instead, these systems use LSM-like structures to append every GPS ping as a new record. The passenger's app only cares about the latest ping (Latest Wins: the most recent coordinate), while the background system eventually compacts the history so the driver's route can be analyzed later.
- Deletes (The Tombstone Pattern): Because SSTables are immutable, you cannot delete data from a file already on disk. To delete a record, we write a Tombstone, a special marker indicating the key is removed. On read, if the system hits a tombstone, it returns "not found." During a later Compaction, the background process sees the tombstone and eventually drops both the marker and the older versions it hides. Both the latest-wins read and the tombstone rule are sketched just after this list.
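The latest-wins read path and the tombstone rule fit in a few lines. The sketch below is a toy model, assuming dict fragments ordered newest-first and a hypothetical TOMBSTONE sentinel; it shows how the newest version wins on read, how a tombstone hides older data, and how compaction eventually drops both.

```python
TOMBSTONE = object()   # sentinel marking "this key was deleted"

def read(key, fragments):
    """Latest-wins read: scan fragments newest-to-oldest, return first hit.

    `fragments` is ordered newest first (MemTable, then newer SSTables,
    then older ones). A tombstone hides every older version of the key.
    """
    for fragment in fragments:
        if key in fragment:
            value = fragment[key]
            return None if value is TOMBSTONE else value
    return None   # key was never written

def compact(fragments):
    """Keep only the newest version per key; drop tombstones and what they hide."""
    newest = {}
    for fragment in fragments:            # newest first, so the first value wins
        for key, value in fragment.items():
            newest.setdefault(key, value)
    return {k: v for k, v in newest.items() if v is not TOMBSTONE}

memtable  = {"alice": "online", "old_song": TOMBSTONE}   # newest
sstable_1 = {"alice": "away", "bob": "offline"}
sstable_2 = {"old_song": "liked", "bob": "online"}        # oldest

fragments = [memtable, sstable_1, sstable_2]
print(read("alice", fragments))     # 'online'  (newest version wins)
print(read("old_song", fragments))  # None      (tombstone hides the old value)
print(compact(fragments))           # {'alice': 'online', 'bob': 'offline'}
```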
Takeaway: By treating the database as a Log (append-only) rather than a Ledger (overwriting), LSM trees decouple write performance from data size. Whether you have 1,000 songs or 100 billion, the write speed remains constant.