Case Study 3.4: The Engineering of "Spotify Wrapped"

Concept. Spotify Wrapped pre-aggregates each user's full year of listening into a small per-user record overnight, so the December 1 traffic spike reads from indexed rollups instead of scanning a year of raw plays.

Intuition. When 700 million users hit "show my Wrapped" on December 1, you cannot scan a year of 365 billion play events in real time. Instead, every night a batch job aggregates each user's listens into a single row keyed by user_id, indexed for instant lookup.

Case Study 3.4 Reading Time: 7 mins

How does Spotify survive its largest annual traffic spike?

Goal: Optimize data access using Hash Indexes, B+ Trees, and Physical Clustering.

Spotify Wrapped requires discrete infrastructure provisioning. The Batch Phase aggregates planetary-scale tables, while the Online Phase propagates isolated static payloads to 700 million clients.


Phase 1: The Offline Aggregation

Before December, Spotify sifts through the Listens Table to tally each user's top tracks. Say you logged 1,000 listens this year.

Option 1: Unclustered

The table is stored chronologically, as songs were played.

  • The Aggregation Bottleneck: Even with an index on user_id, the engine would likely need to retrieve your 1,000 plays from 1,000 different pages across the disk. This leads to massive random I/O. Calculating Wrapped for everyone would drag on for weeks.

Option 2: Clustered

Spotify reorders the trillion-row table by (user_id, listen_time), creating a Clustered Index for the Wrapped query on these columns.

  • The I/O Win: Your 1,000 plays are now consolidated on the same physical page (64 MBs) after sorting. The engine performs a Sequential Scan. It reads your page, finds your data, and the next few records are guaranteed to be yours.

  • Why a B+ Tree Index here?: It's essential for Date-Specific Lookups. If you want to pinpoint what you played on "July 12th at 9 PM," the B+ Tree lets you jump directly to that page. Sorting clusters the data; the Index gets you to the right spot.


Phase 2: The Online Delivery

Once summaries are computed, they're stored in a Wrapped table.

1. The Wrapped Card: Hash Index

When you tap your personal Wrapped card, the system retrieves your precomputed summary.

  • The Choice: Hash Index on user_id.

  • The Logic: A Hash Index offers $O(1)$ performance. It's designed for handling simultaneous requests from millions of users accessing the app.

2. The "Internals": B+ Tree Index

Some users want to delve deeper and see their full "Recently Played" history from the raw Listens table.

  • The Problem: A Hash Index can't handle a range of records (e.g., "See all songs I played in December").

  • The Choice: B+ Tree Index on the Clustered Listens table.

  • The Win: The B+ Tree locates the "start" of December and scans through the leaf nodes. Since the data is Clustered, this scroll is seamless, like reading a single file.


The Tradeoff: The One-Dimension Rule

A table can only be clustered (sorted) by one dimension at a time. Clustering is a physical sort, so data can't sit in two different orders simultaneously.

  • Sorted by User (Wrapped Mode): Data is stored as (user_id, listen_time). This makes Wrapped calculations and personal history deep-dives instant ($O(N)$ sequential scan).

  • Sorted by Artist (Discography Mode): If Spotify wanted to identify the "Top 10 Fans of Taylor Swift," clustering by (artist_id, user_id) would be optimal.

  • The Conflict: If you cluster by Artist, computing a single user's Wrapped becomes Unclustered (and slow).

The Solution: At Spotify's scale, they often maintain multiple copies of the same table, each clustered differently to support various features.


Final Takeaway: "Clustering" handles the heavy lifting of batch processing. "Hashing" is for instant search. "B+ Trees" navigate through time and data landscapes.