Search
Key concepts and definitions across all sections.
Act I: Foundations
M1 How to write basic SQL queries
Spotify Schema:
Spotify Schema
NULL Handling:
NULL • Three-Valued Logic • IS NULL
Joins:
INNER JOIN • LEFT JOIN • CROSS JOIN • Self Join
Subqueries:
Uncorrelated Subquery • Correlated Subquery • EXISTS • IN • NULL in NOT IN
M1B How to read, write, and debug intermediate queries
Common Table Expressions:
CTE (Common Table Expression) • WITH Clause
Window Functions:
Window Function • PARTITION BY • ROW_NUMBER • RANK • DENSE_RANK
Debug Tables:
Debug Tables
Query Equivalence:
Query Equivalence • Output Equivalence
LLMs and SQL:
LLMs and Semantics
Reading Queries:
Reading Queries
Writing Queries:
The Funnel • The Ladder • The Timeline • The Comparison • The Exception Finder
Parallel Execution:
Parallel Execution
M2 How does physical hardware limit software?
IO:
Storage Hierarchy • IO Cost
Pages & Blocks:
Sequential vs Random IO
M2 How do we manipulate data efficiently?
Bloom Filters:
Bloom Filter • False Positive Rate
Compression:
Run-Length Encoding (RLE) • Dictionary Encoding • Delta Encoding • Bit Packing
Act II: Building Data Systems
M3 How is data physically organized on disk?
Storage:
Row Storage • Columnar Storage • Page
M3 How do databases physically execute queries?
Algorithms:
Hash Partitioning • BigSort • Block Nested Loop Join (BNLJ) • Hash Partition Join (HPJ) • Sort-Merge Join (SMJ)
M3B How do databases find data quickly?
Indexing:
B+ Tree • LSM Tree • SSTable • Compaction
M3C How does the database choose the fastest path?
Query Optimization:
Query Optimizer • Query Plan • Selectivity • Cardinality • IO Cost Summary (Plans)
M4 What guarantees does a database make?
Motivation:
Transaction • Scale • Complexity
ACID Properties:
ACID • Atomicity • Consistency • Isolation • Durability
M4 How do we handle concurrent users?
Building Transactions:
Serial Schedule • Concurrent Schedule • Transaction Manager
Transaction Scheduling:
Reordering Rules • Interleaved Schedule
Concurrency Primer:
Concurrency • Parallelism • Optimistic Concurrency Control (OCC) • Pessimistic Concurrency Control (PCC) • Multi-Version Concurrency Control (MVCC)
Lock Foundations:
Shared Lock (S-Lock) • Exclusive Lock (X-Lock) • Lock Compatibility Matrix
Two-Phase Locking (2PL):
Two-Phase Locking (2PL) • Strict Two-Phase Locking (S2PL) • Cascading Abort • Growing Phase • Shrinking Phase
Microschedules:
Microschedule • WaitsFor Graph • Deadlock • Cycle
M4 Proving correctness.
Correctness:
Conflict • Read-Write Conflict • Write-Read Conflict • Write-Write Conflict • Conflict Serializability • Conflict Graph • Macroschedules • Topological Sort
Connecting Together:
2PL Theorem
M4B What happens when the power goes out?
Write-Ahead Logging (WAL):
Write-Ahead Logging (WAL) • Trace Tables • WAL Protocol Rules
M4C What happens when the power goes out?
Post-Crash Recovery:
Analysis Phase • Redo Phase • Undo Phase • Checkpoint
Act III: Designing Data Systems
M5 How do we scale to 1,000 machines?
Scalability:
Sharding Strategies • Replication Patterns
Algorithms:
Consistent Hashing
M5 What happens when machines inevitably fail?
Compute Frameworks:
MapReduce
Spark:
Apache Spark • RDDs • Spark SQL • Lineage
M5 How do we handle real-time data?
M5 What are the physical limits of scaling?
Theory:
CAP Theorem
M6 How does Big Tech scale data pipelines?
M6 When is SQL the wrong tool?
M6 How do we protect user data at scale?
Data Privacy:
Differential Privacy • Synthetic Data