Case Study 3.3: Tracking High-Volume Activity with LSM Trees
Concept. Spotify tracks billions of "like" and "play" events using LSM-tree-backed key-value stores, which sustain millions of writes per second by appending to an in-memory MemTable and flushing sequentially.
Intuition. When 700 million users like and play songs continuously, a relational table cannot absorb the write rate. Spotify routes the firehose into an LSM-backed Cassandra cluster: every event is buffered in RAM and periodically flushed to disk as an immutable SSTable. Reads aggregate across the MemTable and the SSTables.
How does Spotify track billions of "Likes" and "Plays"?
Goal: Design data structures for high-volume incremental updates.
When a Spotify user interacts with a song, the action is logged immediately. Serving 700 million concurrent clients demands sustained, massive append throughput, and under a traditional tree architecture, high-velocity random inserts quickly saturate disk bandwidth.
The Bottleneck: Random Writes in B+Trees
In a B+Tree, each update requires a disk seek to find the right leaf node. Imagine updating the play count for any of 100 million songs: each update means traversing the tree, loading the right page, and rewriting the value in place. Multiply this by millions of updates per second, and you hit a wall on standard hardware.
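A rough back-of-envelope makes the wall concrete. The throughput figures below are illustrative assumptions (typical spinning-disk ballparks), not measurements of Spotify's hardware:

```python
# Illustrative cost model: random B+Tree leaf updates are bounded by seeks,
# while append-only logging is bounded by sequential bandwidth.
# All numbers are assumed for the sake of the estimate.

RANDOM_IOPS = 200          # ~random seeks/sec on a spinning disk (assumed)
SEQ_BANDWIDTH = 500e6      # ~bytes/sec sequential write throughput (assumed)
EVENT_SIZE = 64            # bytes per "like"/"play" record (assumed)

updates_per_sec_btree = RANDOM_IOPS               # one seek per in-place update
appends_per_sec_log = SEQ_BANDWIDTH / EVENT_SIZE  # pure sequential appends

print(f"B+Tree random updates/sec: {updates_per_sec_btree:,}")
print(f"Log appends/sec:           {appends_per_sec_log:,.0f}")
```

Even with generous assumptions for the B+Tree, the append-only path wins by four to five orders of magnitude, which is why the next section moves structural work off the write path entirely.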
The Solution: RAM-Bounded Writes
Log-Structured Merge (LSM) Trees (powering BigTable and Cassandra) avoid the traditional B+Tree read-before-write penalty by isolating structural mutations to in-memory buffers.
How it Works: RAM, Flush, and Merge
- Buffered Writes (MemTable): Each "Like" is captured in a high-speed RAM buffer ($O(1)$). No need to check existing totals; just log the increment.
- Sequential Flushes (SSTables): Once RAM fills up, data is sorted and written to disk as a Sorted String Table (SSTable). Sorting in RAM allows for sequential disk writes, which are much faster than random ones.
- Merge on Read: To get total likes, the system sums fragments from the MemTable and all SSTables on disk.
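The three steps can be sketched in a few lines of Python. This is a hypothetical toy (the class name, buffer limit, and flush trigger are invented for illustration), not Cassandra's actual implementation:

```python
# Toy LSM counter: writes land in a RAM buffer (MemTable), full buffers are
# flushed as immutable sorted fragments (SSTables), and reads merge all of them.

class TinyLSMCounter:
    def __init__(self, memtable_limit=2):
        self.memtable = {}              # RAM buffer: key -> partial count
        self.sstables = []              # immutable sorted fragments on "disk"
        self.memtable_limit = memtable_limit

    def increment(self, key, delta=1):
        # O(1) buffered write: no read of the existing total is needed
        self.memtable[key] = self.memtable.get(key, 0) + delta
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort in RAM, then write sequentially as an immutable SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def total(self, key):
        # Merge on read: sum partial counts from MemTable + every SSTable
        count = self.memtable.get(key, 0)
        for table in self.sstables:
            count += dict(table).get(key, 0)
        return count

store = TinyLSMCounter(memtable_limit=2)
for _ in range(5):
    store.increment("song:42")
    store.increment("song:99")
print(store.total("song:42"))  # 5
```

Note that `total` touches every fragment; real engines bound that cost with background compaction and Bloom filters, which are out of scope for this sketch.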
Visual Reference: Check out the "Production Path" and "Consumption Path" diagrams in the LSM Tree Indexes section for a detailed view.
Beyond Counters: Customizing Merge Behavior
LSM trees aren't limited to counting. They can be tailored for various update types.
- Latest Wins: The "Last Played Position". When you pause a 3-hour podcast, Spotify must save your exact timestamp. With millions pausing and resuming constantly, random B+Tree disk updates are fatal. Instead, Spotify appends (user_id, podcast_id, position) to the MemTable. When you resume, it scans newest to oldest (MemTable → SSTables), grabs the first version it hits, and ignores the millions of older timestamps physically buried in older SSTable files.
- The Activity Firehose: Friend Feed. Spotify's "Friend Activity" feed tracks what millions of users are listening to right now. These are immutable events. Instead of locking and updating a "user_profile" row every few minutes, every play is just a fast memory append to the log. The dashboard performs a Merge on Read, grabbing the latest fragments across RAM and disk to render the real-time feed without ever causing write-locks.
- Deletes: The Tombstone Pattern (Un-liking a Song). SSTables are immutable, so you can't erase data from a file already written to disk. If you "Unlike" a previously liked song, Spotify issues a Tombstone: a special marker written to the MemTable indicating the key is removed. On read, if the system hits a tombstone first, it immediately returns "not found" (even if the old "Like" exists in a deeper SSTable). During the next background Compaction, the system matches the tombstone against the old entry and permanently drops both from the new file.
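Latest-wins reads, tombstones, and compaction can be sketched together. Again, this is a hypothetical toy (names and structures invented for illustration), not a real storage engine:

```python
# Toy LSM store: fragments are scanned newest-to-oldest (MemTable first, then
# SSTables in reverse flush order); the first version found wins, and a
# tombstone marker makes a key read as deleted without touching older files.

TOMBSTONE = object()  # sentinel marking a deleted key

class TinyLSMStore:
    def __init__(self):
        self.memtable = {}
        self.sstables = []   # oldest first; each is an immutable fragment

    def put(self, key, value):
        self.memtable[key] = value       # newest version shadows older ones

    def delete(self, key):
        self.memtable[key] = TOMBSTONE   # append a tombstone; no in-place erase

    def flush(self):
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def get(self, key):
        # Latest wins: scan newest to oldest, return the first version found
        for fragment in [self.memtable] + self.sstables[::-1]:
            if key in fragment:
                value = fragment[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Keep only the newest version per key; drop tombstoned keys entirely
        merged = {}
        for fragment in self.sstables:   # oldest to newest, newest overwrites
            merged.update(fragment)
        self.sstables = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

store = TinyLSMStore()
store.put(("u1", "podcast:7"), 5400)   # saved position, in seconds
store.flush()
store.put(("u1", "podcast:7"), 5460)   # newer position shadows the flushed one
print(store.get(("u1", "podcast:7")))  # 5460
store.delete(("u1", "podcast:7"))
print(store.get(("u1", "podcast:7")))  # None: the tombstone is hit first
```

The old value 5400 still sits physically in the flushed fragment after the delete; only `compact` actually reclaims it, which mirrors why real LSM engines need background compaction to bound disk usage.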
Takeaway: By treating the database as a Log (append-only) rather than a Ledger (overwriting), LSM trees keep write performance steady regardless of data size. Whether you manage 1,000 songs or 100 billion, the write speed stays effectively constant.