# Spotify End-to-End: How Everything Connects

## One Play Button, Every Algorithm
When you tap play on "Anti-Hero," that single click uses every algorithm we learned in this course.
## The Journey

### Act 1: The Click (OLTP)
You tap play. In 8 milliseconds, this happens:

```sql
INSERT INTO listens (user_id, song_id, timestamp)
VALUES (123456, 789, NOW());
```
**Algorithm:** LSM Tree (`lsm-indexes.md`)

- The write lands in the MemTable, an in-memory sorted structure (typically a skiplist or balanced tree)
- The only disk I/O is a sequential append to the write-ahead log, so the response is nearly instant
- A background job flushes the MemTable to immutable SSTables when it fills up

**Why LSM, not B+Tree?** Spotify takes 100K writes/second. B+Trees require random disk writes; an LSM tree does only sequential ones.
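To make the write path concrete, here is a toy LSM sketch in Python. `TinyLSM`, `MEMTABLE_LIMIT`, and the file layout are all hypothetical; the point is only the MemTable + WAL + SSTable-flush pattern described above.

```python
import os
import time

MEMTABLE_LIMIT = 4  # tiny so the sketch flushes quickly; real systems use MBs

class TinyLSM:
    """Minimal LSM write path: WAL append + in-memory table + SSTable flush."""

    def __init__(self, data_dir: str = "lsm_data"):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")
        self.memtable: dict[str, str] = {}
        self.sstable_count = 0

    def put(self, key: str, value: str) -> None:
        # 1. Sequential append to the write-ahead log for durability.
        self.wal.write(f"{key}\t{value}\n")
        self.wal.flush()
        # 2. In-memory insert: no random disk I/O on the write path.
        self.memtable[key] = value
        # 3. Flush when full (done inline here; real systems do it in the background).
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self) -> None:
        # Write keys in sorted order so the SSTable supports binary search and merging.
        path = os.path.join(self.data_dir, f"sstable_{self.sstable_count:04d}.txt")
        with open(path, "w") as f:
            for key in sorted(self.memtable):
                f.write(f"{key}\t{self.memtable[key]}\n")
        self.sstable_count += 1
        self.memtable.clear()

db = TinyLSM()
db.put("user:123456:song:789", str(time.time()))  # the "Anti-Hero" play event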
### Act 2: The Transform (ETL)
At 2 AM, Spark wakes up. 8.6 billion events from yesterday need processing.
**Algorithms: Every One We Learned**
```python
from pyspark.sql.functions import broadcast

(listens
    # 1. Hash Partition (hash-partitioning.md): 1000 partitions by user_id
    .repartition(1000, "user_id")
    # 2. BigSort (big-sort.md): external merge sort within each partition
    .sortWithinPartitions("timestamp")
    # 3. Broadcast Join (distributed-query.md): ship the small songs table to every node
    .join(broadcast(songs), "song_id")
    # 4. Write Columnar (storage-layout.md): analytics-optimized Parquet
    .write.parquet("s3://warehouse/"))
```
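Under the hood, `repartition` assigns each row to a partition by hashing the key. A toy model of that assignment (Spark actually uses Murmur3 hashing; Python's built-in `hash()` is a stand-in here):

```python
# Rows with the same user_id always land in the same partition, so later
# per-user work needs no cross-node traffic.
NUM_PARTITIONS = 1000

def partition_of(user_id: int) -> int:
    return hash(user_id) % NUM_PARTITIONS

assert partition_of(123456) == partition_of(123456)  # deterministic placement
```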
**Why These Algorithms?**

- **Hash Partition:** distribute 100TB of events across 1000 machines
- **BigSort:** each partition sorts its ~100GB chunk with an external merge sort (see the sketch after this list)
- **Broadcast Join:** the 100MB song table is copied to every node
- **Columnar:** compress 100TB → 10TB for analytics
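The BigSort step deserves a closer look. Below is a toy external merge sort in Python showing the pattern Spark uses inside each partition; `big_sort`, `RUN_SIZE`, and the spill-file layout are illustrative assumptions:

```python
# External merge sort: sort runs that fit in memory, spill them to disk,
# then k-way merge the runs as streams, never loading everything at once.
import heapq
import os
import tempfile

RUN_SIZE = 4  # tiny so the demo spills; real jobs size this to available RAM

def _spill(sorted_run):
    """Write one sorted run to a temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{rec}\n" for rec in sorted_run)
    return path

def big_sort(records, out_path):
    run_paths = []
    # Phase 1: cut the input into sorted runs that fit in memory.
    run = []
    for rec in records:
        run.append(rec)
        if len(run) == RUN_SIZE:
            run_paths.append(_spill(sorted(run)))
            run = []
    if run:
        run_paths.append(_spill(sorted(run)))
    # Phase 2: k-way merge the runs with a heap, streaming line by line.
    files = [open(p) for p in run_paths]
    with open(out_path, "w") as out:
        for line in heapq.merge(*files):
            out.write(line)
    for f in files:
        f.close()
        os.remove(f.name)

# Zero-padded keys so string order matches numeric order in this toy.
big_sort((f"{k:06d}" for k in [5, 3, 9, 1, 7, 2, 8, 4, 6, 0]), "sorted.txt")
```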
### Act 3: The Dashboard (OLAP)

The CEO opens the mobile app and sees the real-time "Top Artists" dashboard.

**Algorithm:** Broadcast Join + Columnar Scan (`storage-layout.md`, `distributed-query.md`)
```sql
-- Executive query: "Top Artists by Country This Month"
SELECT a.artist, f.country, COUNT(*) AS plays
FROM fact_plays f
JOIN dim_artists a ON f.artist_id = a.artist_id  -- broadcast side
WHERE f.date >= '2024-01-01'
GROUP BY a.artist, f.country
ORDER BY plays DESC
LIMIT 100;
```
**Why Broadcast Join?** The artist dimension table is only 10MB; it fits in memory on all 1000 nodes, so no shuffle is needed.
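A minimal sketch of the idea, with hypothetical stand-in data: ship the small dimension table to every worker as a plain dict, and let each worker probe it locally while scanning its own slice of the fact table.

```python
from collections import Counter

# The "broadcast" copy: one small dict per worker.
artist_by_id = dict([(1, "Taylor Swift"), (2, "Bad Bunny")])
# This worker's slice of the fact table: (artist_id, country) rows.
fact_slice = [(1, "US"), (2, "US"), (1, "MX")]

# Local hash probe on every row; the big table never moves across the network.
plays = Counter((artist_by_id[aid], country) for aid, country in fact_slice)
print(plays.most_common(100))  # top (artist, country) pairs on this worker
```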
**Why Columnar?** The query scans only 3 of the table's 20 columns (artist_id, country, date), so a column store reads roughly 15% of the bytes a row store would.
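Column pruning is easy to see with Parquet. This sketch uses PyArrow (an assumption; any Parquet reader with a `columns` argument behaves the same), and the warehouse path is hypothetical:

```python
import pyarrow.parquet as pq

# Read only the 3 columns the dashboard query needs; the other 17 column
# chunks in the Parquet files are never touched.
table = pq.read_table(
    "s3://warehouse/fact_plays/",              # hypothetical path
    columns=["artist_id", "country", "date"],  # column pruning
)
print(table.num_rows, table.schema.names)
```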