Case Study 3.4: The Engineering of "Spotify Wrapped"
Concept. Spotify Wrapped pre-aggregates each user's full year of listening into a small per-user record overnight, so the December 1 traffic spike reads from indexed rollups instead of scanning a year of raw plays.
Intuition. When 700 million users hit "show my Wrapped" on December 1, you cannot scan a year of 365 billion play events in real time. Instead, every night a batch job aggregates each user's listens into a single row keyed by user_id, indexed for instant lookup.
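The nightly batch job can be sketched in a few lines. This is a minimal Python model (the names `plays` and `build_wrapped_rollups` are hypothetical, not Spotify's actual pipeline): collapse a year of raw play events into one small record per user, keyed by user_id.

```python
from collections import Counter, defaultdict

# Hypothetical raw play events: (user_id, track, date) tuples.
plays = [
    ("u1", "song_a", "2024-01-05"),
    ("u1", "song_a", "2024-03-10"),
    ("u1", "song_b", "2024-07-12"),
    ("u2", "song_c", "2024-02-01"),
]

def build_wrapped_rollups(plays, top_n=2):
    """Batch-job sketch: collapse a year of plays into one row per user."""
    per_user = defaultdict(Counter)
    for user_id, track, _date in plays:
        per_user[user_id][track] += 1
    # One small record per user_id, ready for indexed point lookups.
    return {u: counts.most_common(top_n) for u, counts in per_user.items()}

rollups = build_wrapped_rollups(plays)
print(rollups["u1"])  # [('song_a', 2), ('song_b', 1)]
```

On December 1, the serving tier reads only these tiny rollup rows; the 365-billion-row raw table is never touched online.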
How does Spotify survive its largest annual traffic spike?
Goal: Optimize data access using Hash Indexes, B+ Trees, and Physical Clustering.
Spotify Wrapped is built in two phases with separate infrastructure. The Batch Phase aggregates planetary-scale tables offline; the Online Phase serves small, precomputed payloads to 700 million clients.
Phase 1: The Offline Aggregation
Before December, Spotify sifts through the Listens Table to tally each user's top tracks. Say you logged 1,000 listens this year.
Option 1: Unclustered
The table is stored chronologically, as songs were played.
- The Aggregation Bottleneck: Even with an index on user_id, the engine would likely need to retrieve your 1,000 plays from 1,000 different pages scattered across the disk. This means massive random I/O, and calculating Wrapped for everyone would drag on for weeks.
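The cost difference is about how many distinct pages a query touches. A toy Python model (the `pages_touched` helper and the row positions are illustrative assumptions, not real storage parameters) makes the gap concrete:

```python
import random

def pages_touched(row_positions, rows_per_page=100):
    """Count distinct disk pages a query must read for the given row slots."""
    return len({pos // rows_per_page for pos in row_positions})

random.seed(0)
# Unclustered: your 1,000 plays land wherever they were appended in time,
# scattered across a huge chronological table -> close to 1,000 page reads.
scattered = random.sample(range(100_000_000), 1000)

# Clustered: after sorting by user, your plays occupy one contiguous run.
contiguous = range(5_000_000, 5_001_000)

print(pages_touched(scattered))                 # close to 1,000 random reads
print(pages_touched(contiguous))                # a handful of sequential reads
```

Same 1,000 rows either way; only the physical layout changes the I/O bill.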
Option 2: Clustered
Spotify reorders the trillion-row table by (user_id, listen_time), creating a Clustered Index for the Wrapped query on these columns.
- The I/O Win: After sorting, your 1,000 plays are consolidated into the same physical block (64 MB in many distributed storage systems). The engine performs a Sequential Scan: it reads your block, finds your data, and the next few records are guaranteed to be yours.
- Why a B+ Tree Index here?: It's essential for date-specific lookups. If you want to pinpoint what you played on "July 12th at 9 PM," the B+ Tree lets you jump directly to that page. Sorting clusters the data; the index gets you to the right spot.
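A B+ tree point lookup over sorted data behaves like a binary search. Here is a minimal Python sketch (using `bisect` as a stand-in for tree descent; the rows and the `lookup` helper are hypothetical) of jumping straight to one (user_id, listen_time) slot:

```python
import bisect

# Clustered rows: physically sorted by (user_id, listen_time).
rows = sorted([
    ("u1", "2024-07-12T21:00", "song_b"),
    ("u1", "2024-01-05T20:00", "song_a"),
    ("u2", "2024-02-01T08:00", "song_c"),
    ("u1", "2024-03-10T18:00", "song_a"),
])

def lookup(rows, user_id, listen_time):
    """B+-tree-style point lookup: descend straight to the (user, time) slot."""
    i = bisect.bisect_left(rows, (user_id, listen_time))
    if i < len(rows) and rows[i][0] == user_id and rows[i][1] == listen_time:
        return rows[i][2]
    return None

print(lookup(rows, "u1", "2024-07-12T21:00"))  # song_b
```

A real B+ tree replaces the single sorted array with pages of fan-out hundreds wide, but the search logic is the same: compare, descend, land on the leaf.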
Phase 2: The Online Delivery
Once summaries are computed, they're stored in a Wrapped table.
1. The Wrapped Card: Hash Index
When you tap your personal Wrapped card, the system retrieves your precomputed summary.
- The Choice: Hash Index on user_id.
- The Logic: A Hash Index offers $O(1)$ average-case lookups, exactly what's needed for simultaneous point requests from millions of users opening the app.
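At its core, the hash index is a key-to-record map. A minimal sketch, using a Python dict (itself a hash table) as the index; `wrapped_index`, `store`, and `fetch` are illustrative names, not a real API:

```python
# Hash index sketch: user_id -> precomputed Wrapped summary.
wrapped_index = {}

def store(user_id, summary):
    wrapped_index[user_id] = summary      # O(1) average-case insert

def fetch(user_id):
    return wrapped_index.get(user_id)     # O(1) average-case point lookup

store("u1", {"top_track": "song_a", "minutes": 48_210})
print(fetch("u1")["top_track"])  # song_a
```

No ordering is maintained, which is fine: the Wrapped card is always an exact-match lookup on a single user_id.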
2. The "Internals": B+ Tree Index
Some users want to delve deeper and see their full "Recently Played" history from the raw Listens table.
- The Problem: A Hash Index can't handle a range of records (e.g., "See all songs I played in December").
- The Choice: B+ Tree Index on the Clustered Listens table.
- The Win: The B+ Tree locates the "start" of December and scans forward through the leaf nodes. Since the data is Clustered, this scroll is sequential, like reading a single file.
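The range-scan pattern described above, find the start with a binary search, then walk forward, can be sketched in Python (again with `bisect` standing in for the B+ tree descent; the rows and `range_scan` helper are hypothetical):

```python
import bisect

# Clustered by (user_id, listen_time): one user's year is a contiguous run.
rows = [
    ("u1", "2024-11-30", "song_a"),
    ("u1", "2024-12-03", "song_b"),
    ("u1", "2024-12-19", "song_c"),
    ("u2", "2024-12-05", "song_d"),
]

def range_scan(rows, user_id, start, end):
    """Locate the start leaf with a binary search, then scan sequentially."""
    i = bisect.bisect_left(rows, (user_id, start))
    out = []
    while i < len(rows) and rows[i][0] == user_id and rows[i][1] < end:
        out.append(rows[i][2])
        i += 1
    return out

print(range_scan(rows, "u1", "2024-12-01", "2025-01-01"))  # ['song_b', 'song_c']
```

The binary search costs $O(\log N)$; everything after it is a sequential read of exactly the rows the user asked for.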
The Tradeoff: The One-Dimension Rule
A table can only be clustered (sorted) by one dimension at a time. Clustering is a physical sort, so data can't sit in two different orders simultaneously.
- Sorted by User (Wrapped Mode): Data is stored as (user_id, listen_time). Wrapped calculations and personal-history deep-dives become a fast $O(N)$ sequential scan over each user's contiguous rows.
- Sorted by Artist (Discography Mode): If Spotify wanted to identify the "Top 10 Fans of Taylor Swift," clustering by (artist_id, user_id) would be optimal.
- The Conflict: If you cluster by Artist, computing a single user's Wrapped becomes Unclustered (and slow) again.
The Solution: At Spotify's scale, they often maintain multiple copies of the same table, each clustered differently to support various features.
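The multiple-copies idea is simple to model: same logical rows, two physical sort orders. A minimal Python sketch (the `plays` data and copy names are illustrative):

```python
plays = [
    ("u1", "taylor_swift", "2024-12-01"),
    ("u2", "taylor_swift", "2024-06-11"),
    ("u1", "other_artist", "2024-03-02"),
]

# Two physical copies of the same logical table, each clustered for one workload.
by_user   = sorted(plays, key=lambda r: (r[0], r[2]))  # Wrapped mode
by_artist = sorted(plays, key=lambda r: (r[1], r[0]))  # Discography mode
```

The price is storage and write amplification: every new play must land in both copies, which is acceptable for an append-heavy, read-critical workload like this one.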
Final Takeaway: "Clustering" handles the heavy lifting of batch processing. "Hashing" delivers instant point lookups. "B+ Trees" navigate ranges over time-ordered data.