Case Study 3.4: The Engineering of "Spotify Wrapped"
How does Spotify survive its largest annual traffic spike?
Goal: Optimize data access using Hash Indexes, B+ Trees, and Physical Clustering.
Spotify Wrapped isn't one feature. It's two different engineering problems. First, an Offline Phase that aggregates trillions of listens, and second, an Online Phase that delivers results to 700M users at once.
Phase 1: The Offline Aggregation (Building the Magic)
Before December, Spotify must scan the Listens Table to calculate every user's top songs. Suppose you have 1,000 listens in the year.
Option 1: Unclustered
The table is physically stored in the order the songs were played (chronological).
- The Aggregation Bottleneck: Even with an index on
user_id, the engine would have to fetch your 1,000 plays from (likely) 1,000 different pages across the disk. This creates massive random I/O. Calculating Wrapped for everyone would take weeks of IOs.
Option 2: Clustered
Spotify re-sorts the trillion-row table by (user_id, listen_time). This is a Clustered Index for the Wrapped query on these two columns.
-
The I/O Win: Your 1,000 plays are now on the same physical page (64 MBs) after sorting. The engine performs a Sequential Scan. It reads your page, finds your data, and the next few records are guaranteed to be yours.
-
Why have a B+ Tree Index here?: You need it for Date-Specific Lookups. If you want to see exactly what you played on "July 12th at 9 PM," the B+ Tree lets you jump instantly to that page. Sorting provides the proximity (i.e,, clusters the needed rows together); the Index helps jumps to the right page.
Phase 2: The Online Delivery (The Launch)
Once the summaries are computed, they are stored in a Wrapped table.
1. The Wrapped Card: Hash Index
When you tap your personal Wrapped card, the system finds your precomputed summary.
-
The Choice: Hash Index on
user_id. -
The Logic: A Hash Index provides $O(1)$ performance. It is built for handling simultaneous requests from millions of users hitting the app.
2. The "Deep Dive": B+ Tree Index
Some users want to go deeper and see their full "Recently Played" history from the raw Listens table.
-
The Problem: A Hash Index cannot find a range of records (e.g., "See all songs I played in December").
-
The Choice: B+ Tree Index on the Clustered Listens table.
-
The Win: The B+ Tree finds the "start" of December and scans through the leaf nodes. Since the data is Clustered, this scroll is as smooth as reading a single file.
The Tradeoff: The One-Dimension Rule
A table can only be clustered (sorted) by one dimension at a time. Because clustering is a physical sort, you cannot have the same data sit in two different orders simultaneously.
-
Sorted by User (Wrapped Mode): Data is stored as
(user_id, listen_time). This makes Wrapped calculations and your personal history deep-dives instant ($O(N)$ sequential scan). -
Sorted by Artist (Discography Mode): If Spotify wanted to calculate the "Top 10 Fans of Taylor Swift," it would be better to cluster by
(artist_id, user_id). -
The Conflict: If you cluster by Artist, computing a single user's Wrapped is Unclustered (and slow).
The Solution: At Spotify's scale, they often maintain multiple copies of the same table, each clustered differently to support different features!
Final Takeaway: "Clustering" is for the heavy lifting of batch processing. "Hashing" is for the instant search. "B+ Trees" are for navigating through time and cities.