← Back to Course Home

Search

Key concepts and definitions across all sections.

M1 Module 1: SQL

How do I query relational data?

Basic SQL: SELECTFROMWHERE
Joins: INNER JOINLEFT JOINCROSS JOINSelf Join
Aggregations: GROUP BYHAVINGCOUNTSUM
Common Table Expressions: CTE (Common Table Expression)WITH Clause
Window Functions: Window FunctionPARTITION BYROW_NUMBERRANKDENSE_RANK
Subqueries: Uncorrelated SubqueryCorrelated SubqueryEXISTSINNULL in NOT IN
NULL Handling: NULLThree-Valued LogicIS NULL

How do I know my queries are correct?

LLMs and SQL: LLMs and Semantics
Query Equivalence: Query EquivalenceOutput Equivalence
Debug Tables: Debug Tables

How do I solve intermediate-level SQL questions?

SQL Patterns: The FunnelThe LadderThe TimelineThe ComparisonThe Exception Finder

M2 Module 2: Systems

How does physical hardware limit software?

IO: Storage HierarchyIO Cost
Pages & Blocks: Sequential vs Random IO

How do we manipulate data efficiently?

Hashing: Locality Sensitive Hashing (LSH)H3 (Geospatial Index)Vector/Embedding Hashing
Bloom Filters: Bloom FilterFalse Positive Rate
Compression: Run-Length Encoding (RLE)Dictionary EncodingDelta EncodingBit Packing

M3 Module 3: nanoDB

How is data physically organized on disk?

Storage: Row StorageColumnar StoragePage

How do databases physically execute queries?

Algorithms: Hash PartitioningBigSortBlock Nested Loop Join (BNLJ)Hash Partition Join (HPJ)Sort-Merge Join (SMJ)

How do databases find data quickly?

Indexing: B+ TreeLSM TreeSSTableCompaction

How does the database choose the fastest path?

Query Optimization: Query OptimizerQuery PlanSelectivityCardinalityIO Cost Summary (Plans)

M4 Module 4: Transactions

What guarantees does a database make?

Motivation: TransactionScaleComplexity
ACID Properties: ACIDAtomicityConsistencyIsolationDurability

How do we handle concurrent users?

Building Transactions: Serial ScheduleConcurrent ScheduleTransaction Manager
Transaction Scheduling: Reordering RulesInterleaved Schedule
Concurrency Primer: ConcurrencyParallelismOptimistic Concurrency Control (OCC)Pessimistic Concurrency Control (PCC)Multi-Version Concurrency Control (MVCC)
Lock Foundations: Shared Lock (S-Lock)Exclusive Lock (X-Lock)Lock Compatibility Matrix
Two-Phase Locking (2PL): Two-Phase Locking (2PL)Strict Two-Phase Locking (S2PL)Cascading AbortGrowing PhaseShrinking Phase
Microschedules: MicroscheduleWaitsFor GraphDeadlockCycle

Proving correctness.

Correctness: ConflictRead-Write ConflictWrite-Read ConflictWrite-Write ConflictConflict SerializabilityConflict GraphMacroschedulesTopological Sort
Connecting Together: 2PL Theorem

What happens when the power goes out?

Recovery Concepts: UNDOREDOCOMMITABORT
Write-Ahead Logging (WAL): Write-Ahead Logging (WAL)Trace TablesWAL Protocol Rules
Post-Crash Recovery: Analysis PhaseRedo PhaseUndo PhaseCheckpoint

M5 Module 5: Distributed

How do we scale to 1,000 machines?

Scalability: Sharding StrategiesReplication Patterns
Algorithms: Consistent Hashing

What happens when machines inevitably fail?

Compute Frameworks: MapReduce
Spark: Apache SparkRDDsSpark SQLLineage

How do we handle real-time data?

Kafka: Kafka ArchitectureTopics & PartitionsDelivery SemanticsOrdering Guarantees

What are the physical limits of scaling?

Theory: CAP Theorem

M6 Module 6: Data Systems

How does Big Tech scale data pipelines?

Data Architecture: Data LakeOLTPOLAPETLELT

When is SQL the wrong tool?

Python & Pandas: PandasPolars
SQL vs NoSQL: NoSQLJSONB

How do we protect user data at scale?

Data Privacy: Differential PrivacySynthetic Data