CS 145

Fall 2026 • Intro to Big Data Systems

Prep For

Curated journeys through the existing course material. Pick the path that matches your goal — the exam, a SQL interview next week, a systems-design loop, or going deeper after the term ends. Steps are ordered: walk them top to bottom.

Path 1 · Exam Prep · ≈ 10–14 hrs

The straight line from "haven't studied" to "ready to sit the exam." Quizzes first to find your gaps, then P-sets for the mechanics, then CA section walkthroughs to see solved examples, then re-read the concept summaries.

Path 2 · Data / SQL Interview Prep · ≈ 3–5 hrs

For SQL screens and analytics-engineer interviews. Start with the patterns interviewers actually test (Round 1 = single pattern, Round 2 = composition), then drill on the playground, then read how Postgres scales in the wild for the inevitable "how would you scale this?" follow-up.

Path 3 · Systems Design Interview Prep · ≈ 6–10 hrs

For data-infra, backend, and platform-engineering interviews. Start with the IO cost model (the substrate every systems answer rests on), then index selection, then a real-world transaction disaster, then distributed primitives. Each step is a concrete pattern you can name in an interview.

Path 4 · Going Deeper: Reading List · post-class

For when class is over and you want to keep going. Three columns: the foundational papers that built the field, the in-course case studies grouped by topic, and the modern systems shaping where things go next.

Foundational papers & systems
M5
The Google File System (2003)
sharding · replication · the paper that started "big data"
M5
MapReduce — Google (2004)
hash partitioning · distributed sort · fault tolerance
M5
Apache Spark — RDDs & in-memory compute
MapReduce evolved · why disk I/O was the bottleneck
M5
Apache Kafka — real-time event streaming
partitioning · producers/consumers · delivery semantics
M5
Distributed file systems
HDFS · Colossus · Google Cloud Storage architecture
M5
Consensus: Paxos, Raft, ZooKeeper
how systems agree on a single truth
M5
CAP theorem — Dynamo, Cassandra
consistent hashing · the eternal tradeoff
Classic papers (external)
Paper
Dremel — Google (2010)
column-store query engine · the BigQuery ancestor
Paper
Spanner — Google (2012)
globally distributed transactions · TrueTime · external consistency
Paper
ClickHouse — VLDB (2024)
columnar OLAP · single-machine performance that embarrassed warehouses
Paper
Bigtable — Google (2006)
sparse distributed storage · the LSM lineage
Paper
Snowflake Elastic Data Warehouse — SIGMOD (2016)
separated storage + compute · the cloud DW playbook
Industry case studies
M1
OpenAI on Postgres — 800M users
scaling reads · indexing strategy
M1
UberEats schema design
real-world schema · query patterns at scale
M2
Bloom filters in production
Cassandra · Bigtable · HBase · Kafka
M2
Geohashing for location services
Redis · S2 geometry · spatial indexing
M6
PostgreSQL with four different data types
geographic · genomic · JSON documents · AI vectors — one DB, four worlds
M2
Vector DBs — pgvector, FAISS & Pinecone
pgvector first · FAISS on GPU · Pinecone managed · when to reach for a specialized store
M3
BigQuery — column-oriented at scale
distributed query execution · columnar storage
M3B
Spotify search — Lucene & inverted indexes
LSM trees · search engine internals
M3B
LSM trees in Spotify
RocksDB · LevelDB · Cassandra memtable/SSTable
M3B
Query disaster — when plans go wrong
real query plan optimization failures
M3B
Wrapped — hybrid indexing strategy
combining index types for different access patterns
M4
Real-world transaction patterns
isolation levels · deadlocks · MVCC
Data systems & modern stack
M6
SQLite — for mobile
embedded DB · single-file · the database in every app on your phone
M6
Polars / Pandas / DataFrames
in-memory analytics · vectorized ops · the Rust-powered Pandas successor
M6
NoSQL design patterns
MongoDB · DynamoDB · Redis
M6
Key-value stores
Memcached · Redis · DynamoDB
M6
Data privacy & compliance
GDPR · encryption · PII handling
M6
SQL trends — where things go next
lakehouse · Iceberg · Delta · modern features
M6
SQL beyond tables
JSON · nested types · time-series · semi-structured
M6
Database design — startup edition
small team · move fast · what to optimize for
M6
Database design — big tech edition
multi-team · multi-region · regulated
Industry blogs (external)
Blog
Notion — building & scaling the data lake
how a doc app ended up with a petabyte-scale lake
Blog
Pinterest — deprecating HBase
a decade-long migration · what they replaced it with and why
Blog
Uber — 40M reads/sec with an integrated cache
cache + DB as one system · the read-throughput playbook
Blog
Spotify — the data platform explained
end-to-end view from event ingestion to analytics
Blog
Why NoSQL deployments are failing at scale
post-mortem patterns · the limits of "schemaless"