Spotify End-to-End: How Everything Connects
Concept. A single click on Spotify's play button is recorded, reshaped, and analyzed by three different storage strategies in turn: a write-optimized store for the click, a batch pipeline that reshapes it, and a scan-optimized warehouse that answers dashboards. Each stage uses the index and storage built for its workload.
Intuition. Mickey taps play. The click appends to an LSM events store in about 8 ms and returns. Overnight, a Spark ETL job reshapes a day of 8.6 billion events into a columnar warehouse. The next morning the "Top Artists" dashboard answers from that warehouse in about 2 seconds. Same event, three systems, three very different latency budgets.
The Journey
Figure 1. A single tap on play fans out into three systems in sequence. Act 1 is the OLTP write that appends the play to an LSM tree; Act 2 is the ETL pipeline where Spark hash-partitions and BigSorts the events, a broadcast join enriches them, and Spark writes the result columnar; Act 3 is the OLAP query that scans the columnar star schema with a broadcast join to answer the dashboard in about two seconds. Every block in this path is a primitive built earlier in the course, now wired end to end.
Act 1 · The Click (OLTP)
You tap play. In ~8 ms, this happens:
INSERT INTO listens (user_id, song_id, timestamp)
VALUES (123456, 789, NOW());
The write lands in an in-memory MemTable, returns instantly, and is flushed to disk later as part of an SSTable. No random disk I/O on the hot path. That's why Spotify can sustain 100 K writes/sec per node.
→ Algorithm: LSM Tree (lsm-indexes.html)
8 ms response · 100K writes/sec · 8.6B events/day
Act 2 · The Transform (ETL)
At 2 AM, Spark kicks in. 8.6 billion events from yesterday need processing into the analytics warehouse.
# 1. Hash Partition → spread across 1000 nodes
listens.repartition(1000, "user_id")
# 2. BigSort → sort within each partition
.sortWithinPartitions("timestamp")
# 3. Broadcast Join → small dim tables to every node
.join(broadcast(songs), "song_id")
# 4. Write Columnar → compress 100 TB → 10 TB
.write.parquet("s3://warehouse/")
Each step is one of the algorithms from M3:
| Step | Algorithm | Page |
|---|---|---|
Repartition by user_id |
Hash Partitioning | hash-partitioning.html |
| Sort within partitions | BigSort | big-sort.html |
| Join dim tables | Broadcast Join | distributed-query.html |
| Write Parquet | Columnar storage | storage-layout.html |
3 hours total · 1000-node Spark cluster · 100 TB processed per day
Act 3 · The Dashboard (OLAP)
CEO opens the mobile app. The "Top Artists" dashboard renders in 2 seconds:
-- "Top Artists by Country This Month"
SELECT artist, country, COUNT(*) AS plays
FROM fact_plays f
JOIN dim_artists a ON f.artist_id = a.artist_id -- broadcast (10 MB)
WHERE f.date >= '2024-01-01'
GROUP BY artist, country
ORDER BY plays DESC
LIMIT 100;
The 10 MB dim_artists is broadcast to all 1000 nodes (no shuffle). The 1 PB fact_plays is scanned columnar. Only artist_id, country, date are read out of 20 columns.
→ Algorithms: Broadcast Join + Columnar Scan (distributed-query.html, storage-layout.html)
2-second query · 3 of 20 columns scanned · 1 PB warehouse
The Pattern
One pipeline, opposite optimizations at each end. Act 1 needs blazing writes, so the click lands in an LSM tree with no random I/O and returns in under 10 ms. Act 3 needs blazing scans, so the dashboard reads compressed columnar storage and answers in about 2 seconds. Act 2 is the 3-hour nightly ETL that converts the first format into the second. No single structure wins everywhere; each stage uses the one built for its workload.
Next: Module 3 Capstone takes the same "right tool for the job" mindset and applies it to seven open-ended Spotify-style design problems.