Spotify End-to-End: How Everything Connects

Concept. A single click on Spotify's play button is recorded, reshaped, and analyzed by three different storage strategies in turn: a write-optimized store for the click, a batch pipeline that reshapes it, and a scan-optimized warehouse that answers dashboards. Each stage uses the index and storage built for its workload.

Intuition. Mickey taps play. The click appends to an LSM events store in about 8 ms and returns. Overnight, a Spark ETL job reshapes a day of 8.6 billion events into a columnar warehouse. The next morning the "Top Artists" dashboard answers from that warehouse in about 2 seconds. Same event, three systems, three very different latency budgets.

The Journey

Spotify end-to-end architecture: one click triggers three systems. Act 1 OLTP (LSM tree write), Act 2 ETL (Spark hash partition, BigSort, broadcast join, columnar write), Act 3 OLAP (star schema query with broadcast and columnar scan).

Figure 1. A single tap on play fans out into three systems in sequence. Act 1 is the OLTP write that appends the play to an LSM tree; Act 2 is the ETL pipeline where Spark hash-partitions and BigSorts the events, a broadcast join enriches them, and Spark writes the result columnar; Act 3 is the OLAP query that scans the columnar star schema with a broadcast join to answer the dashboard in about two seconds. Every block in this path is a primitive built earlier in the course, now wired end to end.


Act 1 · The Click (OLTP)

You tap play. In ~8 ms, this happens:

INSERT INTO listens (user_id, song_id, timestamp)
VALUES (123456, 789, NOW());

The write lands in an in-memory MemTable, returns instantly, and is flushed to disk later as part of an SSTable. No random disk I/O on the hot path. That's why Spotify can sustain 100 K writes/sec per node.

→ Algorithm: LSM Tree (lsm-indexes.html)

8 ms response · 100K writes/sec · 8.6B events/day


Act 2 · The Transform (ETL)

At 2 AM, Spark kicks in. 8.6 billion events from yesterday need processing into the analytics warehouse.

# 1. Hash Partition  →  spread across 1000 nodes
listens.repartition(1000, "user_id")

# 2. BigSort         →  sort within each partition
       .sortWithinPartitions("timestamp")

# 3. Broadcast Join  →  small dim tables to every node
       .join(broadcast(songs), "song_id")

# 4. Write Columnar  →  compress 100 TB → 10 TB
       .write.parquet("s3://warehouse/")

Each step is one of the algorithms from M3:

Step Algorithm Page
Repartition by user_id Hash Partitioning hash-partitioning.html
Sort within partitions BigSort big-sort.html
Join dim tables Broadcast Join distributed-query.html
Write Parquet Columnar storage storage-layout.html

3 hours total · 1000-node Spark cluster · 100 TB processed per day


Act 3 · The Dashboard (OLAP)

CEO opens the mobile app. The "Top Artists" dashboard renders in 2 seconds:

-- "Top Artists by Country This Month"
SELECT artist, country, COUNT(*) AS plays
FROM fact_plays f
JOIN dim_artists a ON f.artist_id = a.artist_id   -- broadcast (10 MB)
WHERE f.date >= '2024-01-01'
GROUP BY artist, country
ORDER BY plays DESC
LIMIT 100;

The 10 MB dim_artists is broadcast to all 1000 nodes (no shuffle). The 1 PB fact_plays is scanned columnar. Only artist_id, country, date are read out of 20 columns.

→ Algorithms: Broadcast Join + Columnar Scan (distributed-query.html, storage-layout.html)

2-second query · 3 of 20 columns scanned · 1 PB warehouse


The Pattern

One pipeline, opposite optimizations at each end. Act 1 needs blazing writes, so the click lands in an LSM tree with no random I/O and returns in under 10 ms. Act 3 needs blazing scans, so the dashboard reads compressed columnar storage and answers in about 2 seconds. Act 2 is the 3-hour nightly ETL that converts the first format into the second. No single structure wins everywhere; each stage uses the one built for its workload.

Next: Module 3 Capstone takes the same "right tool for the job" mindset and applies it to seven open-ended Spotify-style design problems.