CS 145

Fall 2026 • Intro to Big Data Systems

Prep For

Curated journeys through the existing course material. Pick the path that matches your goal — the exam, a SQL interview next week, a systems-design loop, or going deeper after the term ends. Steps are ordered: walk them top to bottom.

Path 1 · Minimum Viable Exam Prep · ≈ 10–14 hrs

The straight line from "haven't studied" to "ready to sit the exam." Start with quizzes to find your gaps, move to P-sets for the mechanics, follow the CA section walkthroughs to see solved examples, then re-read the concept summaries.

Path 2 · Data / SQL Interview Prep · ≈ 3–5 hrs

For SQL screens and analytics-engineer interviews. Start with the patterns interviewers actually test (Round 1 = single pattern, Round 2 = composition), then drill on the playground, then read how Postgres scales in the wild for the inevitable "how would you scale this?" follow-up.

Path 3 · Systems Design Interview Prep · ≈ 6–10 hrs

For data-infra, backend, and platform-engineering interviews. Start with the IO cost model (the substrate every systems answer rests on), then index selection, then a real-world transaction disaster, then distributed primitives. Each step is a concrete pattern you can name in an interview.

Path 4 · Going Deeper: Reading List · post-class

For when class is over and you want to keep going. Three columns: the foundational papers that built the field, the in-course case studies grouped by topic, and the modern systems shaping where things go next.

Foundational papers & systems
M5
Case Study: GFS
sharding · replication · the paper that started "big data"
M5
MapReduce — Google (2004)
hash partitioning · distributed sort · fault tolerance
M5
Apache Spark — RDDs & in-memory compute
MapReduce evolved · why disk I/O was the bottleneck
M5
Apache Kafka — real-time event streaming
partitioning · producers/consumers · delivery semantics
M5
Distributed file systems
HDFS · Colossus · Google Cloud Storage architecture
M5
Consensus — Paxos, Raft, ZooKeeper
how systems agree on a single truth
M5
CAP theorem — Dynamo, Cassandra
consistent hashing · the eternal tradeoff
Classic papers (external)
Paper
Dremel — Google (2010)
column-store query engine · the BigQuery ancestor
Paper
Spanner — Google (2012)
globally distributed transactions · TrueTime · external consistency
Paper
ClickHouse — VLDB (2024)
columnar OLAP · single-machine performance that embarrassed warehouses
Paper
Bigtable — Google (2006)
sparse distributed storage · the LSM lineage
Paper
Snowflake Elastic Data Warehouse — SIGMOD (2016)
separated storage + compute · the cloud DW playbook
Industry case studies
M0
Case Study 0.1: Small Memory
Pandas, Polars, the RAM ceiling, and why we reach for SQL
M0
Case Study 0.2: Memory for UberEats
three-sided marketplace · transactional consistency · query patterns at scale
M1B
Case Study 1.2: OpenAI & Postgres
scaling reads · indexing strategy
M2
Case Study 2.1: How Uber and Maps use Geo-hashing
Redis · S2 geometry · spatial indexing
M2
Case Study 2.2: How OpenAI uses Vector DB
pgvector first · FAISS on GPU · Pinecone managed · when to reach for a specialized store
M2
Case Study 2.3: How Chrome Safe Browsing uses Bloom Filters
Cassandra · Bigtable · HBase · Kafka
M3
Case Study 3.1: BigQuery Scaling
distributed query execution · columnar storage
M3B
Case Study 3.2: Spotify Search
Lucene · inverted indexes · search engine internals
M3B
Case Study 3.3: Spotify Activity
RocksDB · LevelDB · Cassandra memtable/SSTable
M3B
Case Study 3.4: Spotify Wrapped
combining index types for different access patterns
M3B
Problem Solving: $550k Query Disaster
real query plan optimization failures
M4
Case Study 4.1: ACID at Stripe
isolation levels · deadlocks · MVCC
Data systems & modern stack
M0
Case Study 0.3: Memory in Your Pocket
SQLite · single-file embedded DB · the database in every app on your phone
M0
Case Study 0.4: Memory for AI Agents
embedded SQLite · context across sessions · how agents avoid rediscovery
M6
6.1 · DB Design for a Startup
small team · move fast · what to optimize for
M6
6.2 · SQL's Complex Types
JSONB, vectors, geo, semi-structured: one DB, four worlds
M6
6.3 · DB Design for Big Tech Scale
multi-team · multi-region · regulated
M6
Case Study 6.4: SQL vs NoSQL
the scale dilemma · ACID vs horizontal scale · modern distributed SQL
M6
Case Study 6.5: Key-Value Stores
Memcached · Redis · DynamoDB
M6
Case Study 6.6: Privacy
Google COVID reports · differential privacy · re-identification risk
M6
Modern Big Data Economics
lakehouse · Iceberg · Delta · the economics of scale
Industry blogs (external)
Blog
Notion — building & scaling the data lake
how a doc app ended up with a petabyte-scale lake
Blog
Pinterest — deprecating HBase
a decade-long migration · what they replaced it with and why
Blog
Uber — 40M reads/sec with an integrated cache
cache + DB as one system · the read-throughput playbook
Blog
Spotify — the data platform explained
end-to-end view from event ingestion to analytics
Blog
Why NoSQL deployments are failing at scale
post-mortem patterns · the limits of "schemaless"