Case Study 3.3: Tracking High-Volume Activity with LSM Trees
How does Spotify track billions of "Likes" and "Plays"?
Goal: Design data structures for high-volume incremental updates.
When a Spotify user interacts with a song, that action must be logged instantly. With a user base of 700 million, this produces a relentless barrage of write operations that would overwhelm traditional systems.
The Bottleneck: Random Writes in B+Trees
In a B+Tree, each update requires a disk seek to find the right leaf node. Imagine having to update the play count for any of 100 million songs. Each update means searching through the tree, loading the right page, and changing the value. Multiply this by millions of updates per second, and you hit a wall with standard hardware.
The Solution: The "No-Read" Write
Enter the LSM Tree, the workhorse behind systems like Google's BigTable and Apache Cassandra. These systems dodge the "Read-before-Write" penalty by eliminating the need for a disk search during a write.
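To make the contrast concrete, here is a minimal sketch, with a Python dict standing in for B+Tree pages and a list standing in for an append-only log (the function names are illustrative, not real database APIs):

```python
# B+Tree-style counter: every update is a read-modify-write.
# In a real B+Tree, the .get() is a tree descent costing disk seeks.
def btree_style_update(index: dict, song_id: str) -> None:
    count = index.get(song_id, 0)   # read before write: find the current value
    index[song_id] = count + 1      # overwrite it in place

# LSM-style counter: a "no-read" write that just logs the increment.
# In a real LSM tree, this lands in the RAM MemTable in O(1).
def lsm_style_update(log: list, song_id: str) -> None:
    log.append((song_id, 1))        # append-only; never looks up the old total
```

The LSM version stays O(1) per write no matter how large the dataset grows, because it never searches for the existing value.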
How it Works: RAM, Flush, and Merge
- Buffered Writes (MemTable): Each "Like" is captured in a high-speed RAM buffer ($O(1)$). No need to check existing totals; just log the increment.
- Sequential Flushes (SSTables): Once RAM fills up, data is sorted and written to disk as a Sorted String Table (SSTable). Sorting in RAM allows for sequential disk writes, which are much faster than random ones.
- Merge on Read: To get total likes, the system sums fragments from the MemTable and all SSTables on disk.
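The three steps above can be sketched as a toy counter, assuming in-memory lists stand in for on-disk SSTables (the class and method names are hypothetical, not taken from any real system):

```python
from collections import defaultdict

class LSMCounter:
    """Toy LSM counter: increments buffer in a RAM MemTable; when it fills,
    it is sorted and flushed as an immutable SSTable; reads merge all levels."""

    def __init__(self, memtable_limit=4):
        self.memtable = defaultdict(int)   # RAM buffer: song_id -> pending delta
        self.sstables = []                 # each entry: sorted list of (key, delta)
        self.memtable_limit = memtable_limit

    def increment(self, song_id):
        self.memtable[song_id] += 1        # O(1): no read of the stored total
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort in RAM, then write sequentially as an immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()

    def total(self, song_id):
        # Merge on read: sum fragments from the MemTable and every SSTable.
        count = self.memtable.get(song_id, 0)
        for table in self.sstables:
            for key, delta in table:
                if key == song_id:
                    count += delta
        return count
```

A real engine would binary-search each sorted SSTable rather than scan it, and a background compaction would merge small tables so reads touch fewer fragments.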
Visual Reference: Check out the "Production Path" and "Consumption Path" diagrams in the LSM Tree Indexes section for a detailed view.
Beyond Counters: Customizing Merge Behavior
LSM trees aren't limited to counting. They can be tailored for various update types.
- Latest Wins (Profile Updates and Status): For updates like profile pictures or status changes, there's no disk search to overwrite old data. Append the new status to the MemTable. When queried, the system scans from newest to oldest (MemTable → Temp SSTables → Main SSTable) and returns the first version it finds, effectively hiding older updates.
- Messaging (WhatsApp and Slack): In chat apps, each message is an immutable event. Instead of updating a "conversation" object, new messages are appended to a log. When you open a chat, the system performs a Merge on Read, fetching the latest fragments from RAM and disk to reconstruct the conversation in order. This design handles millions of messages per second without database locks.
- Real-Time Tracking (Uber and Lyft): Ride-sharing apps log driver GPS coordinates every second. Overwriting a single "location" column for each driver would create massive contention. Instead, these systems use LSM-like structures to append each GPS ping as a new record. The passenger's app only needs the most recent coordinate, while the system compacts the history later to analyze routes.
- Deletes (The Tombstone Pattern): SSTables are immutable, so you can't delete data from a file already on disk. To delete a record, write a Tombstone: a marker indicating the key is removed. On read, if the system hits a tombstone, it returns "not found." During the next Compaction, the background process identifies the tombstone and permanently deletes both the marker and the old data.
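The latest-wins and tombstone patterns above can be combined in one toy store, assuming Python dicts stand in for SSTable levels (the class and names are hypothetical sketches, not a real engine's API):

```python
TOMBSTONE = object()   # sentinel marking a deleted key

class LatestWinsStore:
    """Toy latest-wins LSM store: writes append to the MemTable; reads scan
    newest-to-oldest and return the first version found. Deletes append a
    tombstone; compaction drops tombstones and the data they shadow."""

    def __init__(self):
        self.memtable = {}    # newest level, in RAM
        self.sstables = []    # older immutable levels, newest last

    def put(self, key, value):
        self.memtable[key] = value        # no disk search, no in-place overwrite

    def delete(self, key):
        self.memtable[key] = TOMBSTONE    # write a marker instead of erasing

    def flush(self):
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def get(self, key):
        # Scan newest to oldest; the first version found wins.
        for level in [self.memtable] + self.sstables[::-1]:
            if key in level:
                value = level[key]
                return None if value is TOMBSTONE else value
        return None   # key was never written

    def compact(self):
        # Merge oldest-to-newest so newer versions overwrite older ones,
        # then drop tombstones together with the data they shadowed.
        merged = {}
        for level in self.sstables + [self.memtable]:
            merged.update(level)
        self.memtable = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = []
```

Note that `get` never needs to touch older levels once it finds any version of the key: the newest write, including a tombstone, shadows everything beneath it.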
Takeaway: By treating the database as a Log (append-only) rather than a Ledger (overwriting), LSM trees keep write performance steady, regardless of data size. Whether you manage 1,000 songs or 100 billion, the write speed stays constant.