Recovery Post-Crash 💥

🔄 Bringing the Database Back to Life

How databases recover from crashes, using the write-ahead log (WAL) to restore consistency

The moment of truth: UNDO incomplete transactions, REDO committed ones


The Crash Scenario 🚨

When a database crashes, it leaves behind:

The recovery process must:

  1. Analyze the WAL to understand what happened

  2. REDO finished (committed) transactions to ensure durability

  3. UNDO incomplete transactions to ensure atomicity

[Optional]: Detailed Notes on Steps 2 and 3

In Step 2, we replay every action recorded in the WAL from top to bottom, that is, everything logged before the DB crashed. Cases to consider:

In Step 3, any transaction with no COMMIT or ABORT record in the WAL is treated as an ABORT: we UNDO that transaction's changes.

Potential optimizations: some DBs optimize the REDO-then-UNDO sequence for ABORT transactions, which takes careful algorithms to do efficiently. For example, for a long-running transaction that touches 1 million variables, a smarter approach avoids blindly REDOing and then UNDOing every change, which would waste I/Os.


When Machine (or disk or network) Crashes

Just Before Crash

[Figure: DB just before crash]

Just After Crash

[Figure: DB just after crash]

What we see after crash:

🔍 Analysis Phase

Goal: Figure out which transactions were active at the time of the crash
Method: Scan WAL
Result: List of committed vs uncommitted transactions
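
The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular database's implementation: the `Rec` record layout and the `analyze` function are hypothetical names, and the WAL is modeled as an in-memory list. Any transaction without a COMMIT record is classified as unfinished, whether it was explicitly aborted or simply cut off by the crash.

```python
from collections import namedtuple

# Hypothetical record layout; var/old_value/new_value are only set on UPDATEs.
Rec = namedtuple("Rec", "txn kind var old_value new_value",
                 defaults=(None, None, None))

def analyze(wal):
    """Scan the WAL once and split transactions into committed vs. unfinished."""
    seen, committed = set(), set()
    for rec in wal:
        seen.add(rec.txn)
        if rec.kind == "COMMIT":
            committed.add(rec.txn)
    # Anything that never logged a COMMIT is treated as unfinished.
    return committed, seen - committed

wal = [
    Rec("T1", "UPDATE", "A", 10, 20),
    Rec("T2", "UPDATE", "B", 5, 7),
    Rec("T1", "COMMIT"),
]
committed, unfinished = analyze(wal)
print(committed, unfinished)
```

Here T1 lands in the committed set and T2 in the unfinished set, because the log ends before T2 records a COMMIT.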

The Recovery Process

The recovery algorithm in action:

✅ REDO Phase

Purpose: Ensure all finished work survives
Action: Apply changes from WAL using new_values
Order: Forward through the log (oldest to newest)
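
A minimal sketch of the forward pass, assuming the same hypothetical `Rec` layout and a database modeled as a plain dict. It follows the detailed note on Step 2 and replays every logged update, not only those of committed transactions; unfinished work is cleaned up afterwards by UNDO.

```python
from collections import namedtuple

# Hypothetical record layout; field names mirror the old/new values in the WAL.
Rec = namedtuple("Rec", "txn kind var old_value new_value",
                 defaults=(None, None, None))

def redo(wal, db):
    """Replay every logged update, oldest first, writing new_value."""
    for rec in wal:                      # forward: oldest to newest
        if rec.kind == "UPDATE":
            db[rec.var] = rec.new_value
    return db

wal = [
    Rec("T1", "UPDATE", "A", 10, 20),
    Rec("T2", "UPDATE", "B", 5, 7),
    Rec("T1", "COMMIT"),
]
# Assumed on-disk state at the crash: A's update never reached disk, B's did.
db = redo(wal, {"A": 10, "B": 7})
print(db)
```

Because each record stores the absolute new value, replaying an update that already reached disk (B here) is harmless.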

❌ UNDO Phase

Purpose: Remove effects of 'unfinished' transactions
Action: Reverse changes using old_values from WAL
Order: Backward through the log (newest to oldest)
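
The backward pass can be sketched the same way, again with a hypothetical `Rec` layout and a dict standing in for the database. The example gives the unfinished transaction two updates to the same variable to show why the order matters: walking newest to oldest ends on the earliest old_value, which is the value from before the transaction started.

```python
from collections import namedtuple

# Hypothetical record layout; var/old_value/new_value are only set on UPDATEs.
Rec = namedtuple("Rec", "txn kind var old_value new_value",
                 defaults=(None, None, None))

def undo(wal, db, unfinished):
    """Roll back updates of unfinished transactions, newest first."""
    for rec in reversed(wal):            # backward: newest to oldest
        if rec.kind == "UPDATE" and rec.txn in unfinished:
            db[rec.var] = rec.old_value
    return db

wal = [
    Rec("T1", "UPDATE", "A", 10, 20),
    Rec("T2", "UPDATE", "B", 5, 7),
    Rec("T1", "COMMIT"),
    Rec("T2", "UPDATE", "B", 7, 9),
]
# Assumed state after the REDO pass: every logged update has been applied.
db = undo(wal, {"A": 20, "B": 9}, unfinished={"T2"})
print(db)
```

Walking the log forward instead would leave B at 7, a value that only ever existed inside the unfinished transaction.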

After Recovery Complete

[Figure: Database state after recovery]

The final state:

💡 Why This Works

The WAL is the source of truth!

  • It has the complete history of all changes
  • It has both old and new values for every change
  • It's written durably before any actual data changes
  • Even if the crash leaves the database in an inconsistent state, the WAL can rebuild a consistent one
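
The "written durably before any actual data changes" rule can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `TinyWAL` class, the JSON-lines file format, and the dict standing in for data pages are hypothetical. The one point being shown is the ordering: the log record is flushed and fsynced before the data is touched.

```python
import json
import os
import tempfile

class TinyWAL:
    """Hypothetical append-only log: one JSON record per line."""
    def __init__(self, path):
        self.path = path
        self.log = open(path, "a", encoding="utf-8")

    def append(self, record):
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()                 # push Python's buffer to the OS
        os.fsync(self.log.fileno())      # ask the OS to persist it to disk

    def close(self):
        self.log.close()

def update(db, wal, txn, var, new_value):
    # 1) Log old and new values durably first ...
    wal.append({"txn": txn, "kind": "UPDATE", "var": var,
                "old_value": db.get(var), "new_value": new_value})
    # 2) ... and only then change the data itself.
    db[var] = new_value

wal = TinyWAL(os.path.join(tempfile.mkdtemp(), "wal.log"))
db = {"A": 10}
update(db, wal, "T1", "A", 20)
wal.close()

with open(wal.path, encoding="utf-8") as f:
    logged = [json.loads(line) for line in f]
print(logged)
```

If the process died between steps 1 and 2, the log would still hold both the old and the new value, which is exactly what the REDO and UNDO passes need.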

Recovery Guarantees 🛡️

What Recovery Ensures

  • Atomicity: Transactions are all-or-nothing
  • Durability: Committed work survives crashes
  • Consistency: Database rules are preserved
  • Repeatability: Recovery is deterministic
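
These guarantees can be checked on a toy example. The sketch below strings the three phases together in a hypothetical `recover` function (the `Rec` layout and the dict-based database are illustrative assumptions) and then runs recovery again on its own output, which is roughly what happens if the system crashes partway through recovery and restarts.

```python
from collections import namedtuple

# Hypothetical record layout; var/old_value/new_value are only set on UPDATEs.
Rec = namedtuple("Rec", "txn kind var old_value new_value",
                 defaults=(None, None, None))

def recover(wal, disk):
    db = dict(disk)                                        # work on a copy
    committed = {r.txn for r in wal if r.kind == "COMMIT"} # analysis
    for r in wal:                                          # REDO, forward
        if r.kind == "UPDATE":
            db[r.var] = r.new_value
    for r in reversed(wal):                                # UNDO, backward
        if r.kind == "UPDATE" and r.txn not in committed:
            db[r.var] = r.old_value
    return db

wal = [
    Rec("T1", "UPDATE", "A", 10, 20),
    Rec("T2", "UPDATE", "B", 5, 7),
    Rec("T1", "COMMIT"),
]
once = recover(wal, {"A": 10, "B": 7})     # assumed disk state at the crash
twice = recover(wal, once)                 # recovery restarted from scratch
after_redo_only = recover(wal, {"A": 20, "B": 7})  # crashed between phases
print(once, twice, after_redo_only)
```

All three runs end in the same state: committed T1's write to A survives (durability) and unfinished T2's write to B is gone (atomicity). The repeatability in this sketch depends on the records storing absolute old and new values rather than deltas; production systems need additional bookkeeping that is not shown here.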

What Recovery Costs

  • Time: Can take minutes to hours for large databases
  • I/O: Must read entire WAL since last checkpoint
  • Availability: Database offline during recovery
  • Complexity: Requires careful implementation

Optimizations 🚀

⚡ Checkpointing

Idea: Periodically flush all dirty pages to disk
Benefit: Limits how far back recovery must go
Trade-off: Checkpoint overhead vs recovery speed
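
A rough sketch of how a checkpoint shortens the REDO pass. It assumes a hypothetical CHECKPOINT record that is written only after all dirty pages have been flushed, and that no transaction was still running at that moment; `redo_start` and `redo_from_checkpoint` are illustrative names. If a transaction were active at the checkpoint, UNDO could still need records from before it.

```python
from collections import namedtuple

# Hypothetical record layout; a CHECKPOINT record carries no transaction.
Rec = namedtuple("Rec", "txn kind var old_value new_value",
                 defaults=(None, None, None))

def redo_start(wal):
    """Index of the first record after the most recent CHECKPOINT (0 if none)."""
    for i in range(len(wal) - 1, -1, -1):
        if wal[i].kind == "CHECKPOINT":
            return i + 1
    return 0

def redo_from_checkpoint(wal, db):
    start = redo_start(wal)
    for rec in wal[start:]:              # earlier records are already on disk
        if rec.kind == "UPDATE":
            db[rec.var] = rec.new_value
    return db, len(wal) - start          # also report how many records were read

wal = [
    Rec("T1", "UPDATE", "A", 10, 20),
    Rec("T1", "COMMIT"),
    Rec(None, "CHECKPOINT"),
    Rec("T2", "UPDATE", "B", 5, 7),
    Rec("T2", "COMMIT"),
]
# Assumed disk state at the crash: the checkpoint flushed A, B's page was lost.
db, records_read = redo_from_checkpoint(wal, {"A": 20, "B": 5})
print(db, records_read)
```

Recovery reads two records instead of five. More frequent checkpoints shrink that tail further, at the cost of more flushing during normal operation.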

Practice Questions 🤔

💭 Test Your Understanding

Q1: Why do we REDO before UNDO?
Q2: What if the system crashes during recovery?
Q3: How do we know which transactions were active at crash time?