Spotify End-to-End: How Everything Connects

One Play Button, Every Algorithm

When you tap play on "Anti-Hero," that single click uses every algorithm we learned in this course.

- 1 click (the user action)
- 3 systems: OLTP → ETL → OLAP
- 8 algorithms from this course
- 2 seconds to the CEO dashboard

The Journey

Overview: One Click, Three Systems, Every Algorithm

- Act 1: The Click (OLTP). User plays "Anti-Hero". LSM tree write: MemTable (in-memory) → flush to SSTable → background compaction. 100K writes/sec at <10ms; 8.6B events/day. (From: lsm-indexes.md)
- Act 2: The Transform (ETL). Nightly batch processes the day's data. Spark pipeline: hash partition by user_id → BigSort by timestamp → broadcast join dimensions → write to columnar format. 3 hours on 1000 nodes; 100TB processed. (From: hash-partitioning.md, big-sort.md)
- Act 3: The Dashboard (OLAP). CEO sees results in near real time. Star schema query: broadcast small tables → columnar scan of the fact table → two-phase aggregation. 2-30 second queries over a 1PB warehouse. (From: storage-layout.md, query-optimizer.md)

Act 1: The Click (OLTP)

You tap play. In 8 milliseconds, this happens:

INSERT INTO listens (user_id, song_id, timestamp) 
VALUES (123456, 789, NOW());

Algorithm: LSM Tree (lsm-indexes.md)

Why LSM and not a B+Tree? Spotify handles 100K writes per second per node. B+Tree inserts require random disk writes; an LSM tree turns them into sequential ones by buffering writes in the MemTable and flushing sorted runs to disk.
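
To make the write path concrete, here is a minimal, illustrative LSM sketch in Python. This is toy code, not Spotify's implementation; the tiny memtable_limit and the in-memory "SSTable" list are assumptions for demonstration.

# Minimal LSM write-path sketch (toy code, not Spotify's implementation)
class LSMStore:
    def __init__(self, memtable_limit=4):          # tiny limit for demonstration
        self.memtable = {}                         # in-memory writes, O(1)
        self.sstables = []                         # immutable sorted segments (simulated)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                 # no disk I/O on the hot path
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Sort once, append sequentially: the opposite of a B+Tree's random writes
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

store = LSMStore()
store.put((123456, 789), "2024-01-15T10:32:00Z")   # (user_id, song_id) -> timestamp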

- < 8ms response time
- 100K writes/sec per node
- 8.6B events per day

Act 2: The Transform (ETL)

At 2 AM, the nightly Spark job wakes up: 8.6 billion events from yesterday need processing.

Algorithms: Every One We Learned

# ETL pipeline (PySpark): four steps, four course algorithms
from pyspark.sql.functions import broadcast

daily = (
    listens
    .repartition(1000, "user_id")        # 1. Hash partition (hash-partitioning.md)
    .sortWithinPartitions("timestamp")   # 2. BigSort: external merge sort (big-sort.md)
    .join(broadcast(songs), "song_id")   # 3. Broadcast join: small table to all nodes (distributed-query.md)
)
daily.write.parquet("s3://warehouse/")   # 4. Columnar write: analytics-optimized (storage-layout.md)
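
Step 1, the repartition, routes each row by a hash of user_id so that all of a user's events land in the same partition. A sketch of the idea (Spark actually uses Murmur3 hashing; Python's built-in hash stands in here):

# Hash partitioning sketch: same user_id -> same partition, every time
N_PARTITIONS = 1000

def partition_for(user_id, n=N_PARTITIONS):
    return hash(user_id) % n               # stand-in for Spark's Murmur3

partitions = [[] for _ in range(N_PARTITIONS)]
for event in [{"user_id": 123456, "song_id": 789},
              {"user_id": 987654, "song_id": 42}]:
    partitions[partition_for(event["user_id"])].append(event)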

Why These Algorithms?

Each step targets a different bottleneck: hash partitioning spreads 8.6B events evenly across the cluster, external sort handles partitions far larger than memory, broadcast joins avoid shuffling huge fact data for small dimension tables, and columnar output sets up fast analytics.

- 3 hours total pipeline time
- 1000-node Spark cluster
- 100TB processed daily
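
Step 2, sortWithinPartitions, is where BigSort earns its name: a partition can be far larger than RAM, so the sort spills memory-sized sorted runs to disk and then k-way merges them. A minimal sketch of external merge sort (illustrative; run_size and the line-based record format are assumptions):

# External merge sort sketch: sort runs that fit in memory, spill, then merge
import heapq
import os
import tempfile

def _spill(sorted_run):
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    f.writelines(sorted_run)               # one sequential write per run
    f.close()
    return f.name

def external_sort(lines, run_size=100_000):
    run_paths, run = [], []
    for line in lines:                     # lines are newline-terminated strings
        run.append(line)
        if len(run) >= run_size:           # memory budget reached:
            run_paths.append(_spill(sorted(run)))
            run = []
    if run:
        run_paths.append(_spill(sorted(run)))
    files = [open(p) for p in run_paths]
    try:
        yield from heapq.merge(*files)     # k-way merge of the sorted runs
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)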

Act 3: The Dashboard (OLAP)

The CEO opens the mobile app and sees the real-time "Top Artists" dashboard.

Algorithm: Broadcast Join + Columnar Scan (storage-layout.md, distributed-query.md)

-- Executive query: "Top Artists by Country This Month"
SELECT a.artist, f.country, COUNT(*) AS plays
FROM fact_plays f
JOIN dim_artists a ON f.artist_id = a.artist_id   -- broadcast side
WHERE f.date >= '2024-01-01'
GROUP BY a.artist, f.country
ORDER BY plays DESC
LIMIT 100;

Why Broadcast Join? Artist table is 10MB, fits in memory on all 1000 nodes. No shuffle needed!
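
A broadcast hash join is easy to picture: copy the small table to every node, build a hash map, and probe it while streaming the local fact rows. A toy sketch (the table contents are made up):

# dim_artists (10MB) is copied to every node and kept in memory
dim_artists = {789: "Taylor Swift", 42: "ABBA"}     # artist_id -> name (toy data)

def broadcast_join(fact_rows, artists=dim_artists):
    for row in fact_rows:                           # fact rows never leave the node
        name = artists.get(row["artist_id"])        # O(1) local probe, no shuffle
        if name is not None:
            yield {**row, "artist": name}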

Why Columnar? Only scan 3 columns (artist_id, country, date) instead of all 20 columns.
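
Column pruning is visible directly in the reader API: with Parquet, requesting three columns means only those column chunks are read from storage. A sketch using pyarrow (the path is the warehouse location assumed from Act 2):

import pyarrow.parquet as pq

# Reads only 3 of the 20 column chunks from storage
facts = pq.read_table("s3://warehouse/fact_plays/",
                      columns=["artist_id", "country", "date"])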

- 2-second query time
- 3 of 20 columns scanned
- 1PB warehouse
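
The GROUP BY itself runs as a two-phase aggregation: each node counts its own partition, and only the small partial results cross the network to be merged. A sketch:

# Two-phase aggregation sketch
from collections import Counter

def local_counts(rows):
    # Phase 1: each node aggregates its own partition
    return Counter((r["artist_id"], r["country"]) for r in rows)

def merge_counts(partials):
    # Phase 2: only the small partial counters are shipped and merged
    total = Counter()
    for p in partials:
        total += p
    return total

node_a = local_counts([{"artist_id": 789, "country": "US"}] * 2)
node_b = local_counts([{"artist_id": 789, "country": "US"},
                       {"artist_id": 42, "country": "SE"}])
print(merge_counts([node_a, node_b]).most_common())
# [((789, 'US'), 3), ((42, 'SE'), 1)]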

Algorithm Scorecard: Every Concept From This Course

See how one click uses every algorithm we learned:

| Algorithm | OLTP | ETL | OLAP | Course Page |
|---|---|---|---|---|
| LSM Tree | ✅ Write to MemTable | ⚫ Not used | ⚫ Not used | lsm-indexes.md |
| Hash Partitioning | ⚫ Single node | ✅ Distribute by user_id | ⚫ Pre-partitioned | hash-partitioning.md |
| External Sort | ⚫ In-memory only | ✅ BigSort 100GB chunks | ⚫ Already sorted | big-sort.md |
| Broadcast Join | ⚫ Single node | ✅ Song metadata | ✅ Artist dimensions | distributed-query.md |
| Columnar Storage | ⚫ Row-oriented | ⚫ Temporary files | ✅ Parquet compression | storage-layout.md |
| B+Tree Index | ✅ Query acceleration | ⚫ Full scans | ✅ Partition indexes | btree-indexes.md |
| Two-Phase Agg | ⚫ Single node | ✅ Local then global | ✅ Partition then merge | distributed-query.md |
| Cost Optimization | ✅ Index selection | ✅ Join order | ✅ Query plans | query-optimizer.md |

The Pattern: Each system optimizes for different bottlenecks using the same fundamental algorithms!


System Comparison: Three Specialized Systems

| Aspect | OLTP | ETL | OLAP |
|---|---|---|---|
| Purpose | Handle clicks | Transform data | Answer questions |
| Storage | Row-oriented LSM | Distributed files | Columnar (Parquet) |
| Key Algorithm | MemTable writes | Hash partition + sort | Broadcast join |
| Scale | 100K writes/sec | 100TB/day | 1PB warehouse |
| Latency | < 10ms | 3 hours | 2-30 seconds |
| Optimization | Write throughput | Parallel processing | Query performance |

Data Flow: Click → OLTP (8.6B/day) → ETL (100TB nightly) → OLAP (1PB warehouse) → Dashboard (2s queries)


Beyond Spotify: The Universal Pattern

Every major tech company uses this same three-system pattern:

Netflix: View events → Cassandra → Spark → Druid
Uber: Trip events → Schemaless → Spark → Presto
Airbnb: Bookings → MySQL → Airflow → Druid


What You've Learned

You now understand modern data architecture! This isn't just Spotify - it's how every major tech company handles data at scale.

Key Insights:

  1. One click → Three systems - OLTP for speed, ETL for scale, OLAP for insights

  2. Same algorithms everywhere - Hash partitioning, external sort, broadcast joins used across all systems

  3. Storage follows access - Row store for writes, columnar for analytics

  4. Cost models drive decisions - The I/O cost formulas you learned predict real system performance (see the worked example below)
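
As a back-of-envelope example of insight 4, assuming roughly uniform column sizes: a columnar scan of 3 of 20 columns over the 1PB fact table reads about (3/20) × 1 PB = 150 TB, versus the full 1 PB for a row-store scan, a ~6.7x reduction before compression even helps.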

Course Pages Connected:

lsm-indexes.md, btree-indexes.md, hash-partitioning.md, big-sort.md, distributed-query.md, storage-layout.md, query-optimizer.md

This is the capstone: every algorithm we studied powers the modern data stack!