# Spotify End-to-End: How Everything Connects

## One Play Button, Every Algorithm
When you tap play on "Anti-Hero," that single click uses every algorithm we learned in this course.
## The Journey

### Act 1: The Click (OLTP)

You tap play. In 8 milliseconds, this happens:

```sql
INSERT INTO listens (user_id, song_id, timestamp)
VALUES (123456, 789, NOW());
```

**Algorithm:** LSM Tree (lsm-indexes.md)
- Write goes to the MemTable (an in-memory sorted structure)
- No random disk I/O (just a sequential log append) = instant response
- Background flush to SSTables when the MemTable fills

**Why LSM, not B+Tree?** Spotify handles 100K writes/second. B+Trees require random disk writes; LSM trees do sequential writes only.
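The MemTable-plus-flush write path can be sketched in a few lines. This is a toy illustration, not a real storage engine: `TinyLSM`, its size limit, and the linear-scan reads are all invented for the example, and a production LSM would also append to a write-ahead log before acknowledging the write.

```python
# Toy LSM write path: buffer writes in memory, flush sorted runs.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}           # in-memory buffer, sorted at flush time
        self.sstables = []           # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # pure in-memory write, no random disk I/O
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One sequential write of the whole buffer as a sorted run
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:     # freshest data first
            return self.memtable[key]
        for run in self.sstables:    # then newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

lsm = TinyLSM()
for i in range(6):
    lsm.put(f"user:{i}", f"song:{i}")
```

After six writes with a limit of four, the first four keys live in one flushed SSTable and the last two are still in the MemTable, which is exactly the "instant response now, disk work later" trade the text describes.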
### Act 2: The Transform (ETL)

At 2 AM, Spark wakes up. 8.6 billion events from yesterday need processing.

**Algorithms: Every One We Learned**

```python
(listens
    # 1. Hash Partition (hash-partitioning.md)
    .repartition(1000, "user_id")        # 1000 partitions
    # 2. BigSort (big-sort.md)
    .sortWithinPartitions("timestamp")   # External merge sort
    # 3. Broadcast Join (distributed-query.md)
    .join(broadcast(songs), "song_id")   # Small table to all nodes
    # 4. Write Columnar (storage-layout.md)
    .write.parquet("s3://warehouse/"))   # Analytics-optimized
```
**Why These Algorithms?**

- **Hash Partition:** Distribute 100TB across 1000 machines
- **BigSort:** Each partition sorts its 100GB chunk
- **Broadcast Join:** 100MB song table copied to all nodes
- **Columnar:** Compress 100TB → 10TB for analytics
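Hash partitioning is the idea behind `repartition(1000, "user_id")`: a stable hash of the key decides which machine owns each record, so every event for one user lands on one partition and per-user work (sorting, sessionization) never needs a shuffle. A minimal sketch, with made-up events and a small partition count for readability:

```python
# Hedged sketch of hash partitioning by user_id. The events and the
# partition count (8 instead of 1000) are invented for illustration.
import hashlib

def partition_for(user_id: int, num_partitions: int) -> int:
    # A stable hash (not Python's salted hash()) gives the same answer
    # on every machine and every run.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_partitions

events = [{"user_id": u, "song_id": s} for u, s in [(1, 10), (2, 20), (1, 30)]]
partitions = {}
for e in events:
    partitions.setdefault(partition_for(e["user_id"], 8), []).append(e)
```

Both of user 1's plays land in the same bucket, which is the property the nightly sort and sessionization jobs rely on.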
### Act 3: The Dashboard (OLAP)

The CEO opens the mobile app and sees the real-time "Top Artists" dashboard.

**Algorithm:** Broadcast Join + Columnar Scan (storage-layout.md, distributed-query.md)

```sql
-- Executive Query: "Top Artists by Country This Month"
SELECT artist, country, COUNT(*) AS plays
FROM fact_plays f
JOIN dim_artists a ON f.artist_id = a.artist_id  -- Broadcast
WHERE f.date >= '2024-01-01'
GROUP BY artist, country
ORDER BY plays DESC LIMIT 100;
```
**Why Broadcast Join?** The artist table is 10MB and fits in memory on all 1000 nodes. No shuffle needed!

**Why Columnar?** Only scan 3 columns (artist_id, country, date) instead of all 20.
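The columnar saving is simple arithmetic. A back-of-envelope calculation, assuming (purely for illustration) 20 equal-width columns and the 10x compression ratio quoted for the ETL stage:

```python
# Back-of-envelope cost of column pruning. All numbers are assumptions
# for illustration, not measurements: equal-width columns, 10x compression.

TABLE_BYTES = 1_000_000_000_000_000   # the 1 PB warehouse from the text
TOTAL_COLS = 20
SCANNED_COLS = 3                      # artist_id, country, date
COMPRESSION = 10                      # assumed columnar compression ratio

row_store_scan = TABLE_BYTES          # a row store must read every column
columnar_scan = TABLE_BYTES * SCANNED_COLS / TOTAL_COLS / COMPRESSION

print(f"row store reads {row_store_scan / 1e12:.0f} TB")
print(f"columnar reads  {columnar_scan / 1e12:.0f} TB")
```

Reading 15 TB instead of 1,000 TB is the difference between a 2-second dashboard and an overnight batch job.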
## Algorithm Scorecard: Every Concept From This Course

See how one click uses every algorithm we learned:
| Algorithm | OLTP (Blue) | ETL (Orange) | OLAP (Green) | Course Page |
|---|---|---|---|---|
| LSM Tree | ✅ Write to MemTable | ⚫ Not used | ⚫ Not used | lsm-indexes.md |
| Hash Partitioning | ⚫ Single node | ✅ Distribute by user_id | ⚫ Pre-partitioned | hash-partitioning.md |
| External Sort | ⚫ In-memory only | ✅ BigSort 100GB chunks | ⚫ Already sorted | big-sort.md |
| Broadcast Join | ⚫ Single node | ✅ Song metadata | ✅ Artist dimensions | distributed-query.md |
| Columnar Storage | ⚫ Row-oriented | ⚫ Temporary files | ✅ Parquet compression | storage-layout.md |
| B+Tree Index | ✅ Query acceleration | ⚫ Full scans | ✅ Partition indexes | btree-indexes.md |
| Two-Phase Agg | ⚫ Single node | ✅ Local then global | ✅ Partition then merge | distributed-query.md |
| Cost Optimization | ✅ Index selection | ✅ Join order | ✅ Query plans | query-optimizer.md |
The Pattern: Each system optimizes for different bottlenecks using the same fundamental algorithms!
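The "local then global" row in the scorecard is two-phase aggregation, and it is why the Top Artists query never ships raw plays across the network: each node counts its own partition, and only the tiny per-node count tables get merged. A sketch with made-up data:

```python
# Sketch of two-phase aggregation ("local then global").
# The two partitions and their plays are invented for illustration.
from collections import Counter

# Phase 1: each partition aggregates locally, in parallel.
partition_a = ["Taylor Swift", "Drake", "Taylor Swift"]
partition_b = ["Drake", "Taylor Swift"]
local_counts = [Counter(partition_a), Counter(partition_b)]

# Phase 2: merge only the small partial results on one node.
global_counts = Counter()
for c in local_counts:
    global_counts.update(c)

print(global_counts.most_common(1))  # [('Taylor Swift', 3)]
```

The network cost is proportional to the number of distinct artists per node, not the 8.6 billion raw events.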
## System Comparison: Three Specialized Systems
| Aspect | OLTP (Blue) | ETL (Orange) | OLAP (Green) |
|---|---|---|---|
| Purpose | Handle clicks | Transform data | Answer questions |
| Storage | Row-oriented LSM | Distributed files | Columnar (Parquet) |
| Key Algorithm | MemTable writes | Hash partition + Sort | Broadcast join |
| Scale | 100K writes/sec | 100TB/day | 1PB warehouse |
| Latency | < 10ms | 3 hours | 2-30 seconds |
| Optimization | Write throughput | Parallel processing | Query performance |
Data Flow: Click → OLTP (8.6B/day) → ETL (100TB nightly) → OLAP (1PB warehouse) → Dashboard (2s queries)
## Beyond Spotify: The Universal Pattern

Every major tech company uses this same three-system pattern:

- **Netflix:** View events → Cassandra → Spark → Druid
- **Uber:** Trip events → Schemaless → Spark → Presto
- **Airbnb:** Bookings → MySQL → Airflow → Druid
## What You've Learned

You now understand modern data architecture! This isn't just Spotify; it's how every major tech company handles data at scale.

**Key Insights:**

- **One click → Three systems:** OLTP for speed, ETL for scale, OLAP for insights
- **Same algorithms everywhere:** Hash partitioning, external sort, and broadcast joins are used across all three systems
- **Storage follows access:** Row store for writes, columnar for analytics
- **Cost models drive decisions:** The I/O cost formulas you learned predict real system performance
**Course Pages Connected:**

- **LSM Trees** → OLTP write performance
- **Hash Partitioning** → ETL data distribution
- **Storage Layout** → OLAP columnar efficiency
- **Query Optimizer** → All three systems use cost-based decisions
This is the capstone - every algorithm we studied powers the modern data stack!