Spotify End-to-End: How Everything Connects

Concept. A single click on Spotify's play button triggers a coordinated cascade across multiple subsystems — auth, recommendation, search, playback — each backed by a different storage and indexing strategy chosen for its specific workload.

Intuition. When Mickey hits play, Postgres authenticates him, an LSM-backed events store records the click, a vector search returns recommended next songs, a B+Tree-indexed catalog fetches the audio file, and a CDN streams it — all within 2 seconds.

One Play Button, Every Algorithm

When you hit play on "Anti-Hero," that single click sets off a chain reaction through every algorithm we've covered.

  • 1 click: the user action

  • 3 systems: OLTP → ETL → OLAP

  • 8 algorithms: from this course

  • 2 seconds: to the CEO dashboard
The Journey

[Diagram: One Click, Three Systems, Every Algorithm. Act 1, The Click (OLTP): an LSM-tree write (MemTable → flush to SSTable → background compaction) records the play in real time at 100K writes/sec, <10ms, 8.6B events/day (from lsm-indexes.md). Act 2, The Transform (ETL): a nightly Spark pipeline hash-partitions by user_id, BigSorts by timestamp, broadcast-joins dimension tables, and writes columnar output; 3 hours, 100TB processed (from hash-partitioning, big-sort). Act 3, The Dashboard (OLAP): star-schema queries using broadcast joins, columnar scans, and two-phase aggregation serve the Top Artists dashboard in 2-30 seconds over a 1PB warehouse (from storage-layout, query-optimizer).]

Act 1: The Click (OLTP)

You tap play. In 8 milliseconds, this happens:

INSERT INTO listens (user_id, song_id, timestamp) 
VALUES (123456, 789, NOW());

Algorithm: LSM Tree (lsm-indexes.md)

  • Write goes to MemTable (in-memory B+Tree)

  • No disk I/O = instant response

  • Background flush to SSTables when full

Why an LSM tree and not a B+Tree? Spotify handles 100K writes per second per node. Updating a B+Tree in place demands random disk writes; an LSM tree turns every write into an in-memory insert followed by a sequential append.
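The write path above can be sketched in a few lines of Python. This is a toy model to show the shape of the idea, not Spotify's code: writes land in an in-memory table, and when it fills up, the whole thing is dumped as one sorted, immutable segment (an "SSTable"). All names here are hypothetical.

```python
class TinyLSM:
    """Toy LSM tree: in-memory MemTable plus sorted, immutable SSTable runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # recent writes, key -> value
        self.sstables = []            # flushed sorted segments, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # no disk I/O on the write path
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # One sequential write: the whole MemTable becomes a sorted run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check segments newest-first; real systems use Bloom filters
        # and binary search here, and compaction merges old runs.
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

A real engine adds a write-ahead log for durability and background compaction to merge SSTables, but the key property is already visible: `put` never touches disk randomly.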

  • < 8ms: response time

  • 100K/sec: writes per node

  • 8.6B: events per day


Act 2: The Transform (ETL)

At 2 AM, Spark kicks in. 8.6 billion events from yesterday need processing.

Algorithms: Every One We Learned

from pyspark.sql.functions import broadcast

daily = (
    listens
    # 1. Hash Partition (hash-partitioning.md): 1000 partitions by user_id
    .repartition(1000, "user_id")
    # 2. BigSort (big-sort.md): external merge sort within each partition
    .sortWithinPartitions("timestamp")
    # 3. Broadcast Join (distributed-query.md): small table sent to all nodes
    .join(broadcast(songs), "song_id")
)

# 4. Write Columnar (storage-layout.md): analytics-optimized format
daily.write.parquet("s3://warehouse/")

Why These Algorithms?

  • Hash Partition: Spread 100TB across 1000 machines

  • BigSort: Each partition sorts its 100GB chunk

  • Broadcast Join: 100MB song table copied to all nodes

  • Columnar: Compress 100TB → 10TB for analytics
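Steps 1 and 2 above can be sketched without Spark at all. This is a minimal plain-Python model, with a made-up 4-partition cluster and hypothetical event tuples, showing why hashing by user_id keeps each user's history on one machine and why sorting then happens independently per partition:

```python
NUM_PARTITIONS = 4  # Spark would use 1000

def partition_and_sort(events):
    """Hash-partition (user_id, song_id, timestamp) events, then sort each
    partition by timestamp. Spark's version spills to disk (external sort)."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for user_id, song_id, ts in events:
        # Same user always hashes to the same partition, so all of a
        # user's listens end up on one machine.
        partitions[hash(user_id) % NUM_PARTITIONS].append((user_id, song_id, ts))
    for p in partitions:
        p.sort(key=lambda e: e[2])  # within-partition sort by timestamp
    return partitions
```

At 100TB scale each partition's sort no longer fits in memory, which is exactly where BigSort's external merge sort takes over from `list.sort`.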

  • 3 hours: total pipeline time

  • 1000 nodes: Spark cluster size

  • 100TB: daily data processed


Act 3: The Dashboard (OLAP)

The CEO opens the mobile app and the "Top Artists" dashboard loads in about two seconds, querying data refreshed by last night's ETL run.

Algorithm: Broadcast Join + Columnar Scan (storage-layout.md, distributed-query.md)

-- Executive Query: "Top Artists by Country This Month"
SELECT a.artist, f.country, COUNT(*) AS plays
FROM fact_plays f
JOIN dim_artists a ON f.artist_id = a.artist_id  -- Broadcast
WHERE f.date >= '2024-01-01'
GROUP BY a.artist, f.country
ORDER BY plays DESC
LIMIT 100;

Why Broadcast Join? Artist table is 10MB, fits in memory on all 1000 nodes. No shuffle needed!
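A broadcast hash join is simple enough to sketch directly. In this illustrative model (hypothetical row dictionaries, not Spotify's schema), the small dimension table is turned into a hash map once and every node probes it locally, so the huge fact table never moves across the network:

```python
def broadcast_join(fact_rows, dim_artists):
    """Inner join fact rows to a small, broadcast dimension table."""
    # Build phase: the ~10MB artist table becomes an in-memory hash map.
    # In Spark, this map is shipped to every worker once.
    lookup = {a["artist_id"]: a["name"] for a in dim_artists}
    # Probe phase: each node scans only its local slice of the fact table.
    for row in fact_rows:
        name = lookup.get(row["artist_id"])
        if name is not None:           # inner-join semantics: drop misses
            yield {**row, "artist": name}
```

Contrast this with a shuffle join, which would repartition both tables by `artist_id` and move fact rows between nodes; broadcasting the small side skips that network step entirely.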

Why Columnar? Only scan 3 columns (artist_id, country, date) instead of all 20 columns.
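The columnar advantage can be shown with a toy table, stored as per-column arrays (the column names mirror the query above; the data and the 20-column count are illustrative). The aggregation touches exactly the three columns it names and never materializes full rows:

```python
# Columnar layout: one array per column, not one record per row.
columns = {
    "artist_id": [7, 7, 9],
    "country":   ["US", "US", "SE"],
    "date":      ["2024-01-02", "2024-01-03", "2024-01-05"],
    # ...17 more columns the query never reads
}

def count_plays(cols, since):
    """Aggregate plays by (artist_id, country), scanning only 3 columns."""
    counts = {}
    for artist, country, date in zip(
        cols["artist_id"], cols["country"], cols["date"]
    ):
        if date >= since:              # ISO dates compare correctly as strings
            key = (artist, country)
            counts[key] = counts.get(key, 0) + 1
    return counts
```

A row store would have to read all 20 fields of every record just to look at 3 of them; the column layout also compresses far better, since each array holds values of one type.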

  • 2 seconds: query time

  • 3/20 columns: columnar efficiency

  • 1PB: warehouse size