CS145 Course Content
CS145: Your Data Journey
Everyone masters SQL + Systems → Choose your specialization → Reach your destination
Core Curriculum (1)
• Intermediate SQL + AI: subqueries, CTEs, window functions
• Systems: how DBs work internally, query optimizers, performance & cost

Project 1: SQL Mastery
• BigQuery on 10s of GBs

Core Curriculum (2)
• Basic transactions: ACID properties
• Distributed systems: scaling & the CAP theorem
• How to use DBs, and how to cost them

Project 2 (choose your specialization):
• SQL Path: advanced SQL on BigQuery
• Columnar DB Path: platform design; build database systems
Analyst/Scientist Path
• Python, Pandas, R
• Dashboards (Tableau)
• ML: scikit-learn
• Snowflake, Redshift
• Statistical Analysis
Engineer Path
• C++/Rust/Go
• Kubernetes, Docker
• Database Internals
• Consensus Protocols
• Distributed Systems
Career Destinations
• Data Analyst: Spotify, Netflix
• Data Scientist: Meta, Google
• Founder/CTO: [Your Startup]
• Backend Engineer: Stripe, Uber
• Platform Engineer: Google, Amazon
💡 Key Insight: Core Curriculum (1) → Project 1 → Core Curriculum (2) → Project 2. Your Project 2 choice determines your specialization → skills beyond CS145 → career destination. The entire CS145 journey prepares you for your chosen path.
Sample FAANG Interview Questions
Common questions you'll be ready to tackle after CS145:
SQL & Analytics Questions
1. Amazon - Top Products Query
Write a query to find products that generated >$1M revenue in Q4 but <$100K in Q1
Follow-up: Now optimize it for a 10TB orders table. What indexes would you create?
Hints: Tables: orders(order_id, product_id, amount, date), products(product_id, name) • Date filtering with EXTRACT/QUARTER • GROUP BY with HAVING
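A sketch of one answer, using the hinted schema and assuming Q1 and Q4 of the same calendar year:

WITH product_quarters AS (
    SELECT
        o.product_id,
        SUM(CASE WHEN EXTRACT(QUARTER FROM o.date) = 4 THEN o.amount ELSE 0 END) AS q4_revenue,
        SUM(CASE WHEN EXTRACT(QUARTER FROM o.date) = 1 THEN o.amount ELSE 0 END) AS q1_revenue
    FROM orders o
    WHERE EXTRACT(YEAR FROM o.date) = 2024   -- pick the year of interest
    GROUP BY o.product_id
)
SELECT p.name, pq.q4_revenue, pq.q1_revenue
FROM product_quarters pq
JOIN products p ON pq.product_id = p.product_id
WHERE pq.q4_revenue > 1000000
  AND pq.q1_revenue < 100000;

For the 10TB follow-up, partitioning orders by date (or a composite index on (date, product_id)) lets the scan skip everything outside Q1 and Q4.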
2. Meta - Friend Recommendations
Find users who are friends-of-friends but not direct friends, ranked by mutual friend count
Follow-up: The query takes 30 seconds on 1M users. How do you optimize for 500M users?
Hints: friendships(user1_id, user2_id) bidirectional • Self-joins • COUNT DISTINCT • LEFT JOIN with NULL check
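One possible shape, shown for a single user (id 123, illustrative) and assuming every friendship is stored in both directions:

SELECT
    fof.user2_id               AS suggested_friend,
    COUNT(DISTINCT f.user2_id) AS mutual_friends
FROM friendships f                                    -- me -> my friend
JOIN friendships fof ON fof.user1_id = f.user2_id     -- my friend -> their friend
LEFT JOIN friendships direct
       ON direct.user1_id = f.user1_id
      AND direct.user2_id = fof.user2_id              -- existing friendship?
WHERE f.user1_id = 123
  AND fof.user2_id <> f.user1_id                      -- not yourself
  AND direct.user1_id IS NULL                         -- not already friends
GROUP BY fof.user2_id
ORDER BY mutual_friends DESC;

At 500M users you would not run this as a global self-join; sharding by user_id and precomputing candidate lists is the usual direction.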
3. Netflix - Binge Watching Detection
Identify users who watched >3 episodes of the same show within any 24-hour window
Follow-up: How would you handle episodes watched across midnight? What about different time zones?
Hints: views(user_id, show_id, episode_id, timestamp) • Window functions (LAG/LEAD) • Timestamp differences
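A sketch using LAG (interval syntax varies by dialect): if the view three rows back in the same (user, show) sequence is within 24 hours, the current view is the 4th in that window, i.e., more than 3 episodes:

WITH ordered AS (
    SELECT
        user_id,
        show_id,
        timestamp,
        LAG(timestamp, 3) OVER (
            PARTITION BY user_id, show_id ORDER BY timestamp
        ) AS third_prior
    FROM views
)
SELECT DISTINCT user_id, show_id
FROM ordered
WHERE timestamp - third_prior <= INTERVAL '24' HOUR;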
Systems Design Questions
4. Google - URL Deduplication
Design a system to check if a URL has been crawled before. Handle 100B URLs.
Follow-up: How do you handle URL normalization? (www vs non-www, http vs https, trailing slash)
Hints: 100B URLs × 100 bytes = 10TB (won't fit in memory) • Bloom filters (1% false positive ok) • Distributed hash tables
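The hints pin down the sizing. With the standard Bloom filter formula m = -n·ln(p) / (ln 2)², n = 100B and p = 1% give about 9.6 bits per URL: roughly 120 GB of bits (versus 10TB of raw URLs), using k ≈ 7 hash functions. That fits across a handful of machines, or in one large one.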
5. Uber - Distributed Event Counter
Design a real-time counter for rides across 5 data centers. Requirements: <1s lag, 99.99% uptime
Follow-up: Would you choose accuracy or availability during a network partition? Why?
Hints: Network partitions between DCs • Eventually consistent counters • CRDTs • Write-through cache
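One sketch of the trade-off (an assumption, not the only answer): give each data center its own slot in a G-counter CRDT. Each DC increments only its own slot, the global total is the sum of slots, and replicas merge by taking the per-slot maximum, so counts survive partitions without double-counting. Example: {us: 40, eu: 25, asia: 12} totals 77; merging with a stale copy {us: 38, eu: 25, asia: 12} takes the max per slot and still yields {us: 40, eu: 25, asia: 12}. You give up exactness during the partition (choosing availability) in exchange for convergence afterwards.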
You'll read, debug, and write queries like...
Example: Finding Power Users
📋 Complex Production Query
WITH user_stats AS (
    -- Per-user listening totals, joined to songs for genre info
    SELECT
        user_id,
        COUNT(DISTINCT l.song_id) AS unique_songs,
        COUNT(DISTINCT genre)     AS unique_genres,
        SUM(play_duration)        AS total_minutes
    FROM listens l
    JOIN songs s ON l.song_id = s.song_id
    GROUP BY user_id
),
power_users AS (
    -- Rank multi-genre listeners into deciles by total listening time
    SELECT
        user_id,
        unique_songs,
        unique_genres,
        total_minutes,
        NTILE(10) OVER (ORDER BY total_minutes DESC) AS decile
    FROM user_stats
    WHERE unique_genres >= 3
)
SELECT
    u.name,
    p.unique_songs,
    p.unique_genres,
    ROUND(p.total_minutes / 60.0, 1) AS total_hours
FROM power_users p
JOIN users u ON p.user_id = u.user_id
WHERE p.decile = 1   -- Top 10% only
ORDER BY p.total_minutes DESC;
Where This Journey Leads: Real-World Architectures
Enterprise Scale: Spotify's Data Pipeline
See how one play event travels through OLTP → ETL → OLAP using the algorithms you'll learn in this course:
Spotify: From Click to Dashboard
OLTP (Real-time)
User App (Spotify Mobile): user clicks play on "Anti-Hero"
→ API Server: 100K req/sec, serves reads like GET /recent
→ MemTable: fast in-memory writes
→ L0 SSTables: sorted, 64MB each
→ L1 (compacted): sorted, 640MB
→ L2 (compacted): 6.4GB files
OLTP Write Path (LSM Tree)
INSERT INTO listens (user_id, song_id, timestamp)
VALUES (123, 789, '2024-01-15 09:30:00');
1. Write to MemTable (in-memory B-tree)
2. Flush to SSTable when full (64MB)
3. Background compaction merges levels
OLTP Metrics
• Writes: 100K/sec
• Latency: < 10ms
• Storage: Row format
Why LSM?
• Write-optimized
• Sequential writes only
• Handles bursts well
Daily Volume: 8.6B events
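(Sanity check on that number: 100K writes/sec × 86,400 sec/day ≈ 8.6B events/day.)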
ETL (Nightly Batch)
Extract (read SSTables) → Hash Partition (by user_id) → BigSort (by timestamp) → Hash Join (with dimensions) → Load (write Parquet)
Spark ETL Pipeline
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val listens = spark.read.parquet("s3://oltp-export/listens")
val songs   = spark.read.parquet("s3://dims/songs")

listens.repartition($"user_id")               // Hash partition
  .sortWithinPartitions($"timestamp")         // External sort (BigSort)
  .join(broadcast(songs), "song_id")          // Broadcast hash join
  .write.partitionBy("date").parquet("s3://warehouse")
ETL Metrics
• Duration: 3 hours
• Cluster: 100 nodes
• Process: 100TB/day
Algorithms Used
• Hash Partitioning
• External Sort (BigSort)
• Broadcast Hash Join
OLAP (Analytics)
Star Schema
• Fact: Plays (10B rows): user_id, song_id, timestamp
• Dim: Songs (100K rows)
• Dim: Users (500M rows)
• Dim: Artists (10K rows)
Query: Top Artists by Country
1. Broadcast small dimensions (Artists, Countries)
2. Scan fact table partitions (columnar, compressed)
3. Local aggregation per node
4. Global merge and sort
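As a sketch, the query behind that plan might look like this (table and column names assumed, including a small countries dimension keyed from the fact table):

SELECT a.artist_name, c.country_name, COUNT(*) AS plays
FROM plays f
JOIN songs s     ON f.song_id = s.song_id
JOIN artists a   ON s.artist_id = a.artist_id
JOIN countries c ON f.country_id = c.country_id
GROUP BY a.artist_name, c.country_name
ORDER BY plays DESC;

The engine broadcasts the tiny artists and countries tables to every node, scans only the fact columns it needs from columnar storage, aggregates locally, then merges globally.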
Executive Dashboard
Drake: 2.1B plays ████████████
Taylor: 1.8B plays ██████████
Bad Bunny: 1.5B plays ████████
OLAP Metrics
• Query time: 2-30s
• Storage: Columnar
• Compression: 10:1
Why Columnar?
• Scan only needed cols
• Better compression
• SIMD operations
Warehouse: 1PB total (real-time writes on the OLTP side; nightly batch load into OLAP)
OLTP: Transactional, row-store, LSM for writes
ETL: Batch processing with Spark, our algorithms at scale
OLAP: Analytics, columnar store, star schema
Startup Journey: From MVP to Scale
See how a food delivery startup grows from PostgreSQL → Distributed Systems using the concepts you'll master:
FoodRush: Your Food Delivery Startup
Real architecture decisions at each growth stage
Phase 1: MVP (0-10K users)
"Just Ship It": $25/month
• Postgres on Supabase/Vercel free tier: 500MB, 2 CPUs, pooled connections included
• 500 orders/day
Simple Schema
CREATE TABLE orders (
    id            SERIAL PRIMARY KEY,
    user_id       INT,
    restaurant_id INT,
    items         JSONB,
    status        VARCHAR,
    created_at    TIMESTAMP
);
B-tree Indexes
• user_id (point queries)
• restaurant_id
• created_at (ranges)
Simple Queries
• INSERT order
• UPDATE status
• SELECT by user
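A minimal sketch of those indexes and queries in PostgreSQL (index names and values illustrative):

CREATE INDEX orders_user_id_idx       ON orders (user_id);       -- point queries
CREATE INDEX orders_restaurant_id_idx ON orders (restaurant_id);
CREATE INDEX orders_created_at_idx    ON orders (created_at);    -- range scans

INSERT INTO orders (user_id, restaurant_id, items, status)
VALUES (42, 7, '[{"item": "pad thai", "qty": 2}]', 'placed');

UPDATE orders SET status = 'delivered' WHERE id = 1001;

SELECT * FROM orders WHERE user_id = 42 ORDER BY created_at DESC;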
MVP Metrics
Queries: 100/sec | Latency: 50ms | Storage: 500MB | Uptime: 99%
Cost: Supabase $25/month (or Vercel/Neon free tier)
Why This Works
✓ Dead simple - one database, ACID guarantees
✓ B-trees handle both point lookups and range scans
✓ PostgreSQL can handle 10K TPS on good hardware
Phase 2: Growth (10K-100K users)
"We need to scale!" - $500/month
Load Balancer → API Server 1 / API Server 2
• Redis Cache: hot data (2GB)
• Postgres Primary: r5.xlarge (32GB), with Read Replica 1 and Read Replica 2
• MongoDB: menus, reviews
What's New
• Read replicas: Distribute read load (5000 orders/day)
• Redis: Cache restaurant data (menus change rarely)
• MongoDB: Flexible schema for restaurant content
• Connection pooling: PgBouncer for 1000+ connections
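The PgBouncer piece might look roughly like this pgbouncer.ini fragment (host and limits illustrative):

[databases]
foodrush = host=10.0.0.5 port=5432 dbname=foodrush

[pgbouncer]
pool_mode = transaction    ; reuse server connections between transactions
max_client_conn = 2000     ; client connections accepted from the app tier
default_pool_size = 50     ; actual Postgres connections per database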
Growth Metrics
Queries: 1000/sec | Latency: 15ms | Storage: 50GB | Uptime: 99.9%
Cost: $200 RDS + $100 Replicas + $50 Redis + $100 MongoDB = $450/month
Algorithms From Our Course
B-trees (Postgres) | LRU eviction (Redis) | Hash indexes (MongoDB) | Read-after-write consistency
Phase 3: Scale (100K-1M users)
"Global Scale" - $5,000/month
• Google Spanner: globally distributed SQL; auto-sharding, 99.999% SLA; 1M users across 3 regions
• BigQuery: analytics warehouse; serverless, pay per query
• Dataflow (optional): streaming ETL from Spanner → BigQuery
Why Spanner?
• No manual sharding needed
• Strong consistency globally
Spanner Handles For You
✓ Automatic sharding and rebalancing
✓ Synchronous replication across regions
✓ ACID transactions globally
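A taste of Spanner's DDL (schema assumed): interleaving stores each user's orders physically next to the user row, so "orders for this user" stays local even as the table is sharded:

CREATE TABLE users (
    user_id INT64 NOT NULL,
    name    STRING(MAX)
) PRIMARY KEY (user_id);

CREATE TABLE orders (
    user_id  INT64 NOT NULL,
    order_id INT64 NOT NULL,
    status   STRING(64)
) PRIMARY KEY (user_id, order_id),
  INTERLEAVE IN PARENT users ON DELETE CASCADE;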
Scale Metrics
Orders: 50K/day | Peak: 5000/sec | Storage: 1TB | Uptime: 99.999%
Cost: $3K Spanner + $1K BigQuery + $500 Dataflow = $4.5K/month
Team: 5 engineers (no DBAs needed!)
Complexity grows with scale: 10K users (simple) → 100K users (moderate) → 1M users (complex, distributed)
Optional Reading: Real-World Data Architecture Decisions
Learn from engineering teams at scale: how they chose and evolved their data systems.
From SQL to Distributed Systems
How Uber Serves Over 40 Million Reads Per Second
Uber's journey from MySQL → BigQuery, Spanner, and PrestoSQL. See how they handle massive scale with integrated caching.
Spotify's Data Platform Explained
Deep dive into Spotify's MySQL and BigQuery infrastructure. Learn how they process billions of events daily.
Building and Scaling Notion's Data Lake
How Notion scaled PostgreSQL to handle their collaborative workspace data. Real-world PostgreSQL optimization at scale.
The NoSQL Reality Check
HBase Deprecation at Pinterest
Why Pinterest moved away from NoSQL back to SQL systems. A case study in choosing the right tool for the job.
Why NoSQL Deployments Are Failing at Scale
Industry analysis on NoSQL limitations and why companies are returning to SQL-based systems.