CS145: Your SQL Career Journey

Everyone masters SQL + Systems → Choose your specialization → Reach your destination

Course Map

START HERE → Core Curriculum (1) → Project 1 → Core Curriculum (2) → Project 2 → Specialization → Career

Core Curriculum (1)
Intermediate SQL + AI: subqueries, CTEs, window functions
Systems: how databases work internally
Query optimizers: performance and cost

Project 1: SQL Mastery (BigQuery on 10s of GBs)

Core Curriculum (2)
Basic transactions: ACID properties
Distributed systems: scaling and the CAP theorem
How to use databases and estimate their cost

Project 2 (SQL Path): Advanced SQL on the BigQuery platform; design and build databases
Project 2 (Columnar DB Path): Build Systems; implement a columnar database

Specializations
Analyst/Scientist Path: Python, Pandas, R • dashboards (Tableau) • ML with scikit-learn • Snowflake, Redshift • statistical analysis
Engineer Path: C++/Rust/Go • Kubernetes, Docker • database internals • consensus protocols • distributed systems

Career Destinations
Data Analyst (Spotify, Netflix) • Data Scientist (Meta, Google) • Founder/CTO (your startup) • Backend Engineer (Stripe, Uber) • Platform Engineer (Google, Amazon)

💡 Key Insight: Core Curriculum (1) → Project 1 → Core Curriculum (2) → your Project 2 choice determines your specialization → skills beyond CS145 → career destination. The entire CS145 journey prepares you for your chosen path.

Projects


After completing CS145 ...

Sample FAANG Interview Questions

Common questions you'll be ready to tackle after CS145:

SQL & Analytics Questions

1. Amazon - Top Products Query

Write a query to find products that generated >$1M revenue in Q4 but <$100K in Q1
Follow-up: Now optimize it for a 10TB orders table. What indexes would you create?
Hints: Tables: orders(order_id, product_id, amount, date), products(product_id, name) • Date filtering with EXTRACT/QUARTER • GROUP BY with HAVING
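
A sketch of one possible answer, assuming the hinted schema and that both quarters fall in the same calendar year (the 2024 filter is just illustrative):

SELECT
  p.product_id,
  p.name,
  SUM(CASE WHEN EXTRACT(QUARTER FROM o.date) = 4 THEN o.amount ELSE 0 END) AS q4_revenue,
  SUM(CASE WHEN EXTRACT(QUARTER FROM o.date) = 1 THEN o.amount ELSE 0 END) AS q1_revenue
FROM orders o
JOIN products p ON o.product_id = p.product_id
WHERE EXTRACT(YEAR FROM o.date) = 2024
GROUP BY p.product_id, p.name
HAVING SUM(CASE WHEN EXTRACT(QUARTER FROM o.date) = 4 THEN o.amount ELSE 0 END) > 1000000
   AND SUM(CASE WHEN EXTRACT(QUARTER FROM o.date) = 1 THEN o.amount ELSE 0 END) < 100000;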

2. Meta - Friend Recommendations

Find users who are friends-of-friends but not direct friends, ranked by mutual friend count
Follow-up: The query takes 30 seconds on 1M users. How do you optimize for 500M users?
Hints: friendships(user1_id, user2_id) bidirectional • Self-joins • COUNT DISTINCT • LEFT JOIN with NULL check
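
A sketch, assuming each friendship is stored in both directions (per the hint), so one self-join direction covers all pairs:

SELECT
  f1.user1_id AS user_id,
  f2.user2_id AS suggested_friend,
  COUNT(DISTINCT f1.user2_id) AS mutual_friends
FROM friendships f1
JOIN friendships f2
  ON f1.user2_id = f2.user1_id      -- a friend of my friend
 AND f2.user2_id <> f1.user1_id     -- but not me
LEFT JOIN friendships direct
  ON direct.user1_id = f1.user1_id
 AND direct.user2_id = f2.user2_id
WHERE direct.user1_id IS NULL       -- and not already a direct friend
GROUP BY f1.user1_id, f2.user2_id
ORDER BY user_id, mutual_friends DESC;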

3. Netflix - Binge Watching Detection

Identify users who watched >3 episodes of the same show within any 24-hour window
Follow-up: How would you handle episodes watched across midnight? What about different time zones?
Hints: views(user_id, show_id, episode_id, timestamp) • Window functions (LAG/LEAD) • Timestamp differences
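
A Postgres-flavored sketch: LAG(timestamp, 3) fetches the view three rows earlier for the same user and show, so if the gap is under 24 hours there are at least four views (more than three) inside one window. Deduplicating repeat views of the same episode is left as a refinement:

WITH ordered_views AS (
  SELECT
    user_id,
    show_id,
    timestamp,
    LAG(timestamp, 3) OVER (
      PARTITION BY user_id, show_id
      ORDER BY timestamp
    ) AS third_prior_view
  FROM views
)
SELECT DISTINCT user_id, show_id
FROM ordered_views
WHERE third_prior_view IS NOT NULL
  AND timestamp - third_prior_view <= INTERVAL '24 hours';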

Systems Design Questions

4. Google - URL Deduplication

Design a system to check if a URL has been crawled before. Handle 100B URLs.
Follow-up: How do you handle URL normalization? (www vs non-www, http vs https, trailing slash)
Hints: 100B URLs × 100 bytes = 10TB (won't fit in memory) • Bloom filters (1% false positive ok) • Distributed hash tables
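
Back-of-envelope sizing for the Bloom-filter hint, using the standard false-positive formulas (numbers rounded):

\[
\frac{m}{n} = \frac{-\ln p}{(\ln 2)^2} = \frac{\ln 100}{(\ln 2)^2} \approx 9.6 \ \text{bits per URL} \quad (p = 0.01)
\]
\[
m \approx 10^{11} \times 9.6 \ \text{bits} \approx 120\ \text{GB}, \qquad k = \frac{m}{n}\,\ln 2 \approx 7 \ \text{hash functions}
\]

So a Bloom filter accepting 1% false positives fits in memory on a handful of machines, versus 10TB for the raw URLs.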

5. Uber - Distributed Event Counter

Design a real-time counter for rides across 5 data centers. Requirements: <1s lag, 99.99% uptime
Follow-up: Would you choose accuracy or availability during a network partition? Why?
Hints: Network partitions between DCs • Eventually consistent counters • CRDTs • Write-through cache
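
One way to make the CRDT hint concrete is a grow-only counter (G-counter) modeled relationally. This is a Postgres-flavored sketch with hypothetical table and column names: each data center increments only its own row, merges take the per-replica maximum, and the global value is the sum.

-- One row per data center (replica); each DC increments only its own row.
CREATE TABLE ride_counter (
  datacenter_id VARCHAR PRIMARY KEY,
  ride_count    BIGINT NOT NULL DEFAULT 0   -- monotonically increasing local count
);

-- Merge state received from a peer DC (staging table remote_counter):
-- keep the larger count per replica, so merges are idempotent and order-free.
-- (Replicas not yet present locally would need an INSERT, omitted here.)
UPDATE ride_counter AS rc
SET ride_count = GREATEST(rc.ride_count, peer.ride_count)
FROM remote_counter AS peer
WHERE peer.datacenter_id = rc.datacenter_id;

-- Global total = sum over all replicas.
SELECT SUM(ride_count) AS total_rides FROM ride_counter;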

You'll read, debug, and write queries like...

Example: Finding Power Users

📋 Complex Production Query

WITH user_stats AS (
  SELECT 
    user_id,
    COUNT(DISTINCT l.song_id) as unique_songs,
    COUNT(DISTINCT s.genre) as unique_genres,
    SUM(l.play_duration) as total_minutes
  FROM listens l
  JOIN songs s ON l.song_id = s.song_id
  GROUP BY user_id
),
power_users AS (
  SELECT 
    user_id,
    unique_songs,
    unique_genres,
    total_minutes,
    NTILE(10) OVER (ORDER BY total_minutes DESC) as decile
  FROM user_stats
  WHERE unique_genres >= 3
)
SELECT 
  u.name,
  p.unique_songs,
  p.unique_genres,
  ROUND(p.total_minutes / 60.0, 1) as total_hours
FROM power_users p
JOIN users u ON p.user_id = u.user_id
WHERE p.decile = 1  -- Top 10% only
ORDER BY p.total_minutes DESC;

Where This Journey Leads: Real-World Architectures

Enterprise Scale: Spotify's Data Pipeline

See how one play event travels through OLTP → ETL → OLAP using the algorithms you'll learn in this course:

Spotify: From Click to Dashboard

OLTP (Real-time)
A user taps play on "Anti-Hero" in the Spotify mobile app. The API server (100K req/sec, e.g. GET /recent) records the event through an LSM-tree write path: MemTable (O(1) in-memory writes) → L0 SSTables (sorted, 64MB each) → L1 compacted files (640MB) → L2 compacted files (6.4GB).

OLTP write path (LSM tree):

INSERT INTO listens (user_id, song_id, timestamp)
VALUES (123, 789, '2024-01-15 09:30:00');

1. Write to the MemTable (in-memory B-tree)
2. Flush to an SSTable when full (64MB)
3. Background compaction merges levels

OLTP metrics: 100K writes/sec • latency < 10ms • row-format storage • 8.6B events/day
Why LSM? Write-optimized, sequential writes only, handles bursts well.

ETL (Nightly Batch)
Extract (read SSTables) → hash partition by user_id → BigSort by timestamp → hash join with dimension tables → load (write Parquet).

Spark ETL pipeline:

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val listens = spark.read.parquet("s3://oltp-export/listens")
val songs   = spark.read.parquet("s3://dims/songs")

listens
  .repartition($"user_id")              // Hash partition
  .sortWithinPartitions($"timestamp")   // BigSort
  .join(broadcast(songs), "song_id")    // Broadcast hash join
  .write.partitionBy("date").parquet("s3://warehouse")

ETL metrics: 3-hour run • 100-node cluster • 100TB/day processed
Algorithms used: hash partitioning, external sort (BigSort), broadcast hash join.

OLAP (Analytics)
Star schema: a Plays fact table (10B rows: user_id, song_id, timestamp) joined to dimension tables Songs (100K rows), Users (500M rows), and Artists (10K rows), loaded into a 1PB warehouse.

Query plan for "Top Artists by Country":
1. Broadcast the small dimensions (Artists, Countries)
2. Scan the fact-table partitions (columnar, compressed)
3. Aggregate locally on each node
4. Merge and sort globally

Executive dashboard: Drake 2.1B plays, Taylor Swift 1.8B plays, Bad Bunny 1.5B plays.
OLAP metrics: query time 2-30s • columnar storage • ~10:1 compression
Why columnar? Scan only the needed columns, better compression, SIMD operations.

In short: OLTP is transactional (row store, LSM tree for writes); ETL is batch processing with Spark, running our algorithms at scale; OLAP is analytics (columnar store, star schema).
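
Written against that star schema, the "Top Artists by Country" query might look like this sketch (the exact table and column names, e.g. plays, artists.name, users.country, are assumptions based on the diagram):

SELECT
  u.country,
  a.name AS artist,
  COUNT(*) AS play_count
FROM plays p                                -- fact table, 10B rows
JOIN songs s   ON p.song_id = s.song_id     -- small dimension, broadcast to every node
JOIN artists a ON s.artist_id = a.artist_id
JOIN users u   ON p.user_id = u.user_id     -- supplies the country
GROUP BY u.country, a.name
ORDER BY u.country, play_count DESC;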

Startup Journey: From MVP to Scale

See how a food delivery startup grows from PostgreSQL → Distributed Systems using the concepts you'll master:

FoodRush: Your Food Delivery Startup

Real architecture decisions at each growth stage.

Phase 1: MVP (0-10K users), "Just Ship It", $25/month
Stack: a single Postgres database on Supabase or Vercel (free tier: 500MB, 2 CPU, pooled connections included), handling about 500 orders/day.

Simple schema:

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  user_id INT,
  restaurant_id INT,
  items JSONB,
  status VARCHAR,
  created_at TIMESTAMP
);

B-tree indexes: user_id (point queries), restaurant_id, created_at (range scans); a sketch of the index DDL follows this section.
Simple queries: INSERT order, UPDATE status, SELECT by user.
MVP metrics: 100 queries/sec | 50ms latency | 500MB storage | 99% uptime
Cost: Supabase $25/month (or Vercel/Neon free tier)
Why this works:
✓ Dead simple: one database, ACID guarantees
✓ B-trees handle both point lookups and range scans
✓ PostgreSQL can handle 10K TPS on good hardware

Phase 2: Growth (10K-100K users), "We need to scale!", $500/month
Architecture: load balancer → two API servers → Redis cache (2GB of hot data) and a Postgres primary (r5.xlarge, 32GB) with two read replicas, plus MongoDB for menus and reviews.
What's new:
• Read replicas distribute the read load (5,000 orders/day)
• Redis caches restaurant data (menus change rarely)
• MongoDB provides a flexible schema for restaurant content
• Connection pooling with PgBouncer supports 1,000+ connections
Growth metrics: 1,000 queries/sec | 15ms latency | 50GB storage | 99.9% uptime
Cost: $200 RDS + $100 replicas + $50 Redis + $100 MongoDB = $450/month
Algorithms from our course: B-trees (Postgres), LRU eviction (Redis), hash indexes (MongoDB), read-after-write consistency.

Phase 3: Scale (100K-1M users), "Global Scale", $5,000/month
Stack: Google Spanner (globally distributed SQL, auto-sharding, 99.999% SLA) serving 1M users across 3 regions; BigQuery as the analytics warehouse (serverless, pay per query); optionally Dataflow for streaming ETL from Spanner into BigQuery.
Why Spanner? No manual sharding needed, strong consistency globally.
Spanner handles for you:
✓ Automatic sharding and rebalancing
✓ Synchronous replication across regions
✓ ACID transactions globally
Scale metrics: 50K orders/day | 5,000/sec peak | 1TB storage | 99.999% uptime
Cost: $3K Spanner + $1K BigQuery + $500 Dataflow = $4.5K/month. Team: 5 engineers (no DBAs needed!)

Complexity grows from simple (Phase 1, ~10K users) to moderate (Phase 2, ~100K users) to complex and distributed (Phase 3).
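
The Phase 1 B-tree indexes above might be created as follows (a minimal Postgres sketch; the index names and the sample query are illustrative assumptions, not part of the FoodRush design):

CREATE INDEX idx_orders_user_id       ON orders (user_id);        -- point lookups per customer
CREATE INDEX idx_orders_restaurant_id ON orders (restaurant_id);  -- restaurant dashboards
CREATE INDEX idx_orders_created_at    ON orders (created_at);     -- time-range scans

-- A typical query served by these indexes:
SELECT id, status, created_at
FROM orders
WHERE user_id = 42
  AND created_at >= NOW() - INTERVAL '30 days'
ORDER BY created_at DESC;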

Optional Reading: Quickly Scan Real-World Data Architecture Decisions

Learn how engineering teams operating at scale chose and evolved their data systems:

From SQL to Distributed Systems

How Uber Serves Over 40 Million Reads Per Second
Uber's journey from MySQL → BigQuery, Spanner, and PrestoSQL. See how they handle massive scale with integrated caching.

Spotify's Data Platform Explained
Deep dive into Spotify's MySQL and BigQuery infrastructure. Learn how they process billions of events daily.

Building and Scaling Notion's Data Lake
How Notion scaled PostgreSQL to handle their collaborative workspace data. Real-world PostgreSQL optimization at scale.

The NoSQL Reality Check

HBase Deprecation at Pinterest
Why Pinterest moved away from NoSQL back to SQL systems. A case study in choosing the right tool for the job.

Why NoSQL Deployments Are Failing at Scale
Industry analysis on NoSQL limitations and why companies are returning to SQL-based systems.