Stanford · CS 145 · Fall 2026

CS 145

Fall 2026 • Intro to Big Data Systems
← Back to Course Home

Search

Key concepts and definitions across all sections.

Act I: Foundations

M1 How to write basic SQL queries

Spotify Schema: Spotify Schema
Basic SQL: SELECTFROMWHERE
NULL Handling: NULLThree-Valued LogicIS NULL
Joins: INNER JOINLEFT JOINCROSS JOINSelf Join
Aggregations: GROUP BYHAVINGCOUNTSUM
Subqueries: Uncorrelated SubqueryCorrelated SubqueryEXISTSINNULL in NOT IN

M1B How to read, write, and debug intermediate queries

Common Table Expressions: CTE (Common Table Expression)WITH Clause
Window Functions: Window FunctionPARTITION BYROW_NUMBERRANKDENSE_RANK
Debug Tables: Debug Tables
Query Equivalence: Query EquivalenceOutput Equivalence
LLMs and SQL: LLMs and Semantics
Reading Queries: Reading Queries
Writing Queries: The FunnelThe LadderThe TimelineThe ComparisonThe Exception Finder
Parallel Execution: Parallel Execution

M2 How does physical hardware limit software?

IO: Storage HierarchyIO Cost
Pages & Blocks: Sequential vs Random IO

M2 How do we manipulate data efficiently?

Hashing: Locality Sensitive Hashing (LSH)H3 (Geospatial Index)Vector/Embedding Hashing
Bloom Filters: Bloom FilterFalse Positive Rate
Compression: Run-Length Encoding (RLE)Dictionary EncodingDelta EncodingBit Packing

Act II: Building Data Systems

M3 How is data physically organized on disk?

Storage: Row StorageColumnar StoragePage

M3 How do databases physically execute queries?

Algorithms: Hash PartitioningBigSortBlock Nested Loop Join (BNLJ)Hash Partition Join (HPJ)Sort-Merge Join (SMJ)

M3B How do databases find data quickly?

Indexing: B+ TreeLSM TreeSSTableCompaction

M3C How does the database choose the fastest path?

Query Optimization: Query OptimizerQuery PlanSelectivityCardinalityIO Cost Summary (Plans)

M4 What guarantees does a database make?

Motivation: TransactionScaleComplexity
ACID Properties: ACIDAtomicityConsistencyIsolationDurability

M4 How do we handle concurrent users?

Building Transactions: Serial ScheduleConcurrent ScheduleTransaction Manager
Transaction Scheduling: Reordering RulesInterleaved Schedule
Concurrency Primer: ConcurrencyParallelismOptimistic Concurrency Control (OCC)Pessimistic Concurrency Control (PCC)Multi-Version Concurrency Control (MVCC)
Lock Foundations: Shared Lock (S-Lock)Exclusive Lock (X-Lock)Lock Compatibility Matrix
Two-Phase Locking (2PL): Two-Phase Locking (2PL)Strict Two-Phase Locking (S2PL)Cascading AbortGrowing PhaseShrinking Phase
Microschedules: MicroscheduleWaitsFor GraphDeadlockCycle

M4 Proving correctness.

Correctness: ConflictRead-Write ConflictWrite-Read ConflictWrite-Write ConflictConflict SerializabilityConflict GraphMacroschedulesTopological Sort
Connecting Together: 2PL Theorem

M4B What happens when the power goes out?

Recovery Concepts: UNDOREDOCOMMITABORT
Write-Ahead Logging (WAL): Write-Ahead Logging (WAL)Trace TablesWAL Protocol Rules

M4C What happens when the power goes out?

Post-Crash Recovery: Analysis PhaseRedo PhaseUndo PhaseCheckpoint

Act III: Designing Data Systems

M5 How do we scale to 1,000 machines?

Scalability: Sharding StrategiesReplication Patterns
Algorithms: Consistent Hashing

M5 What happens when machines inevitably fail?

Compute Frameworks: MapReduce
Spark: Apache SparkRDDsSpark SQLLineage

M5 How do we handle real-time data?

Kafka: Kafka ArchitectureTopics & PartitionsDelivery SemanticsOrdering Guarantees

M5 What are the physical limits of scaling?

Theory: CAP Theorem

M6 How does Big Tech scale data pipelines?

Data Architecture: Data LakeOLTPOLAPETLELT

M6 When is SQL the wrong tool?

Python & Pandas: PandasPolars
SQL vs NoSQL: NoSQLJSONB

M6 How do we protect user data at scale?

Data Privacy: Differential PrivacySynthetic Data