Case Study 3.4: The Engineering of "Spotify Wrapped"
Concept. Spotify Wrapped pre-aggregates each user's full year of listening into a small per-user record overnight, so the December 1 traffic spike reads from indexed rollups instead of scanning a year of raw plays.
Intuition. When 700 million users hit "show my Wrapped" on December 1, you cannot scan a year of 365 billion play events in real time. Instead, every night a batch job aggregates each user's listens into a single row keyed by user_id, indexed for instant lookup.
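The nightly batch job can be sketched in a few lines. This is a minimal Python model (the names `plays` and `build_wrapped_rollups` are hypothetical, not Spotify's actual pipeline): collapse a year of raw play events into one small record per user, keyed by user_id.

```python
from collections import Counter, defaultdict

# Hypothetical raw play events: (user_id, track, date) tuples.
plays = [
    ("u1", "song_a", "2024-01-05"),
    ("u1", "song_a", "2024-03-10"),
    ("u1", "song_b", "2024-07-12"),
    ("u2", "song_c", "2024-02-01"),
]

def build_wrapped_rollups(plays, top_n=2):
    """Batch-job sketch: collapse a year of plays into one row per user."""
    per_user = defaultdict(Counter)
    for user_id, track, _date in plays:
        per_user[user_id][track] += 1
    # One small record per user_id, ready for indexed point lookups.
    return {u: counts.most_common(top_n) for u, counts in per_user.items()}

rollups = build_wrapped_rollups(plays)
print(rollups["u1"])  # [('song_a', 2), ('song_b', 1)]
```

On December 1, the serving tier reads only these tiny rollup rows; the 365-billion-row raw table is never touched online.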
How does Spotify survive its largest annual traffic spike?
Goal: Optimize data access using Hash Indexes, B+ Trees, and Physical Clustering.
Spotify Wrapped is built in two phases with separate infrastructure. The Batch Phase aggregates planetary-scale tables offline; the Online Phase serves small, precomputed payloads to 700 million clients.
Phase 1: The Offline Aggregation
Before December, Spotify sifts through the Listens Table to tally each user's top tracks. Say you logged 1,000 listens this year.
Option 1: Unclustered
The table is stored chronologically, as songs were played.
- The Aggregation Bottleneck: Even with an index on user_id, the engine would likely need to retrieve your 1,000 plays from 1,000 different pages scattered across the disk. This means massive random I/O, and calculating Wrapped for everyone would drag on for weeks.
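The cost difference is about how many distinct pages a query touches. A toy Python model (the `pages_touched` helper and the row positions are illustrative assumptions, not real storage parameters) makes the gap concrete:

```python
import random

def pages_touched(row_positions, rows_per_page=100):
    """Count distinct disk pages a query must read for the given row slots."""
    return len({pos // rows_per_page for pos in row_positions})

random.seed(0)
# Unclustered: your 1,000 plays land wherever they were appended in time,
# scattered across a huge chronological table -> close to 1,000 page reads.
scattered = random.sample(range(100_000_000), 1000)

# Clustered: after sorting by user, your plays occupy one contiguous run.
contiguous = range(5_000_000, 5_001_000)

print(pages_touched(scattered))                 # close to 1,000 random reads
print(pages_touched(contiguous))                # a handful of sequential reads
```

Same 1,000 rows either way; only the physical layout changes the I/O bill.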
Option 2: Clustered
Spotify reorders the trillion-row table by (user_id, listen_time), creating a Clustered Index for the Wrapped query on these columns.
- The I/O Win: After sorting, your 1,000 plays are consolidated into the same physical block (64 MB in many distributed storage systems). The engine performs a Sequential Scan: it reads your block, finds your data, and the next few records are guaranteed to be yours.
- Why a B+ Tree Index here?: It's essential for date-specific lookups. If you want to pinpoint what you played on "July 12th at 9 PM," the B+ Tree lets you jump directly to that page. Sorting clusters the data; the index gets you to the right spot.
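A B+ tree point lookup over sorted data behaves like a binary search. Here is a minimal Python sketch (using `bisect` as a stand-in for tree descent; the rows and the `lookup` helper are hypothetical) of jumping straight to one (user_id, listen_time) slot:

```python
import bisect

# Clustered rows: physically sorted by (user_id, listen_time).
rows = sorted([
    ("u1", "2024-07-12T21:00", "song_b"),
    ("u1", "2024-01-05T20:00", "song_a"),
    ("u2", "2024-02-01T08:00", "song_c"),
    ("u1", "2024-03-10T18:00", "song_a"),
])

def lookup(rows, user_id, listen_time):
    """B+-tree-style point lookup: descend straight to the (user, time) slot."""
    i = bisect.bisect_left(rows, (user_id, listen_time))
    if i < len(rows) and rows[i][0] == user_id and rows[i][1] == listen_time:
        return rows[i][2]
    return None

print(lookup(rows, "u1", "2024-07-12T21:00"))  # song_b
```

A real B+ tree replaces the single sorted array with pages of fan-out hundreds wide, but the search logic is the same: compare, descend, land on the leaf.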
Phase 2: The Online Delivery
Once summaries are computed, they're stored in a Wrapped table.
1. The Wrapped Card: Hash Index
When you tap your personal Wrapped card, the system retrieves your precomputed summary.
- The Choice: Hash Index on user_id.
- The Logic: A Hash Index offers $O(1)$ average-case lookups, exactly what's needed for simultaneous point requests from millions of users opening the app.
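At its core, the hash index is a key-to-record map. A minimal sketch, using a Python dict (itself a hash table) as the index; `wrapped_index`, `store`, and `fetch` are illustrative names, not a real API:

```python
# Hash index sketch: user_id -> precomputed Wrapped summary.
wrapped_index = {}

def store(user_id, summary):
    wrapped_index[user_id] = summary      # O(1) average-case insert

def fetch(user_id):
    return wrapped_index.get(user_id)     # O(1) average-case point lookup

store("u1", {"top_track": "song_a", "minutes": 48_210})
print(fetch("u1")["top_track"])  # song_a
```

No ordering is maintained, which is fine: the Wrapped card is always an exact-match lookup on a single user_id.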
2. The "Internals": B+ Tree Index
Some users want to delve deeper and see their full "Recently Played" history from the raw Listens table.
- The Problem: A Hash Index can't handle a range of records (e.g., "See all songs I played in December").
- The Choice: B+ Tree Index on the Clustered Listens table.
- The Win: The B+ Tree locates the "start" of December and scans forward through the leaf nodes. Since the data is Clustered, this scroll is sequential, like reading a single file.
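The range-scan pattern described above, find the start with a binary search, then walk forward, can be sketched in Python (again with `bisect` standing in for the B+ tree descent; the rows and `range_scan` helper are hypothetical):

```python
import bisect

# Clustered by (user_id, listen_time): one user's year is a contiguous run.
rows = [
    ("u1", "2024-11-30", "song_a"),
    ("u1", "2024-12-03", "song_b"),
    ("u1", "2024-12-19", "song_c"),
    ("u2", "2024-12-05", "song_d"),
]

def range_scan(rows, user_id, start, end):
    """Locate the start leaf with a binary search, then scan sequentially."""
    i = bisect.bisect_left(rows, (user_id, start))
    out = []
    while i < len(rows) and rows[i][0] == user_id and rows[i][1] < end:
        out.append(rows[i][2])
        i += 1
    return out

print(range_scan(rows, "u1", "2024-12-01", "2025-01-01"))  # ['song_b', 'song_c']
```

The binary search costs $O(\log N)$; everything after it is a sequential read of exactly the rows the user asked for.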
The Tradeoff: The One-Dimension Rule
A table can only be clustered (sorted) by one dimension at a time. Clustering is a physical sort, so data can't sit in two different orders simultaneously.
- Sorted by User (Wrapped Mode): Data is stored as (user_id, listen_time). Wrapped calculations and personal-history deep-dives become a fast $O(N)$ sequential scan over each user's contiguous rows.
- Sorted by Artist (Discography Mode): If Spotify wanted to identify the "Top 10 Fans of Taylor Swift," clustering by (artist_id, user_id) would be optimal.
- The Conflict: If you cluster by Artist, computing a single user's Wrapped becomes Unclustered (and slow) again.
The Solution: At Spotify's scale, they often maintain multiple copies of the same table, each clustered differently to support various features.
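The multiple-copies idea is simple to model: same logical rows, two physical sort orders. A minimal Python sketch (the `plays` data and copy names are illustrative):

```python
plays = [
    ("u1", "taylor_swift", "2024-12-01"),
    ("u2", "taylor_swift", "2024-06-11"),
    ("u1", "other_artist", "2024-03-02"),
]

# Two physical copies of the same logical table, each clustered for one workload.
by_user   = sorted(plays, key=lambda r: (r[0], r[2]))  # Wrapped mode
by_artist = sorted(plays, key=lambda r: (r[1], r[0]))  # Discography mode
```

The price is storage and write amplification: every new play must land in both copies, which is acceptable for an append-heavy, read-critical workload like this one.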
Final Takeaway: "Clustering" handles the heavy lifting of batch processing. "Hashing" delivers instant point lookups. "B+ Trees" navigate ranges over time-ordered data.