Search

Key concepts and definitions across all sections.

Act I: Foundations

M1 How to write basic SQL queries

Spotify Schema: Spotify Schema

Basic SQL: SELECT • FROM • WHERE

NULL Handling: NULL • Three-Valued Logic • IS NULL

Joins: INNER JOIN • LEFT JOIN • CROSS JOIN • Self Join

Aggregations: GROUP BY • HAVING • COUNT • SUM

Subqueries: Uncorrelated Subquery • Correlated Subquery • EXISTS • IN • NULL in NOT IN

M1B How to read, write, and debug intermediate queries

Common Table Expressions: CTE (Common Table Expression) • WITH Clause

Window Functions: Window Function • PARTITION BY • ROW_NUMBER • RANK • DENSE_RANK

Debug Tables: Debug Tables

Query Equivalence: Query Equivalence • Output Equivalence

LLMs and SQL: LLMs and Semantics

Reading Queries: Reading Queries

Writing Queries: The Funnel • The Ladder • The Timeline • The Comparison • The Exception Finder

Parallel Execution: Parallel Execution

M2 How does physical hardware limit software?

IO: Storage Hierarchy • IO Cost

Pages & Blocks: Sequential vs Random IO

M2 How do we manipulate data efficiently?

Hashing: Locality Sensitive Hashing (LSH) • H3 (Geospatial Index) • Vector/Embedding Hashing

Bloom Filters: Bloom Filter • False Positive Rate

Compression: Run-Length Encoding (RLE) • Dictionary Encoding • Delta Encoding • Bit Packing

Act II: Building Data Systems

M3 How is data physically organized on disk?

Storage: Row Storage • Columnar Storage • Page

M3 How do databases physically execute queries?

Algorithms: Hash Partitioning • BigSort • Block Nested Loop Join (BNLJ) • Hash Partition Join (HPJ) • Sort-Merge Join (SMJ)

M3B How do databases find data quickly?

Indexing: B+ Tree • LSM Tree • SSTable • Compaction

M3C How does the database choose the fastest path?

Query Optimization: Query Optimizer • Query Plan • Selectivity • Cardinality • IO Cost Summary (Plans)

M4 What guarantees does a database make?

Motivation: Transaction • Scale • Complexity

ACID Properties: ACID • Atomicity • Consistency • Isolation • Durability

M4 How do we handle concurrent users?

Building Transactions: Serial Schedule • Concurrent Schedule • Transaction Manager

Transaction Scheduling: Reordering Rules • Interleaved Schedule

Concurrency Primer: Concurrency • Parallelism • Optimistic Concurrency Control (OCC) • Pessimistic Concurrency Control (PCC) • Multi-Version Concurrency Control (MVCC)

Lock Foundations: Shared Lock (S-Lock) • Exclusive Lock (X-Lock) • Lock Compatibility Matrix

Two-Phase Locking (2PL): Two-Phase Locking (2PL) • Strict Two-Phase Locking (S2PL) • Cascading Abort • Growing Phase • Shrinking Phase

Microschedules: Microschedule • WaitsFor Graph • Deadlock • Cycle

M4 Proving correctness.

Correctness: Conflict • Read-Write Conflict • Write-Read Conflict • Write-Write Conflict • Conflict Serializability • Conflict Graph • Macroschedules • Topological Sort

Connecting Together: 2PL Theorem

M4B What happens when the power goes out?

Recovery Concepts: UNDO • REDO • COMMIT • ABORT

Write-Ahead Logging (WAL): Write-Ahead Logging (WAL) • Trace Tables • WAL Protocol Rules

M4C What happens when the power goes out?

Post-Crash Recovery: Analysis Phase • Redo Phase • Undo Phase • Checkpoint

Act III: Designing Data Systems

M5 How do we scale to 1,000 machines?

Scalability: Sharding Strategies • Replication Patterns

Algorithms: Consistent Hashing

M5 What happens when machines inevitably fail?

Compute Frameworks: MapReduce

Spark: Apache Spark • RDDs • Spark SQL • Lineage

M5 How do we handle real-time data?

Kafka: Kafka Architecture • Topics & Partitions • Delivery Semantics • Ordering Guarantees

M5 What are the physical limits of scaling?

Theory: CAP Theorem

M6 How does Big Tech scale data pipelines?

Data Architecture: Data Lake • OLTP • OLAP • ETL • ELT

M6 When is SQL the wrong tool?

Python & Pandas: Pandas • Polars

SQL vs NoSQL: NoSQL • JSONB

M6 How do we protect user data at scale?

Data Privacy: Differential Privacy • Synthetic Data