Recovery Post-Crash

Bringing the Database Back to Life

How databases recover from crashes by using the write-ahead log (WAL) to restore consistency

The moment of truth: UNDO incomplete transactions, REDO committed ones


The Crash Scenario

A database crash leaves us with a few cold facts:

The recovery process needs to:

  1. Analyze the WAL to figure out what happened before the crash.

  2. REDO committed transactions to keep promises.

  3. UNDO incomplete transactions to maintain integrity.
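The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements recovery. The record layout `(kind, txn, var, old_value, new_value)`, the function name `recover`, and the in-memory `disk` dictionary are all assumptions made for the example, and the log is assumed to contain only UPDATE and COMMIT records.

```python
# Hypothetical WAL record: (kind, txn, var, old_value, new_value).
# COMMIT records carry no variable, so those fields are None.

def recover(wal, disk):
    """Return the recovered state, given the WAL and the on-disk state at crash."""
    db = dict(disk)

    # Step 1 (Analyze): find the transactions that reached COMMIT.
    committed = {txn for kind, txn, *_ in wal if kind == "COMMIT"}

    # Step 2 (REDO): replay every update, oldest to newest.
    for kind, txn, var, old, new in wal:
        if kind == "UPDATE":
            db[var] = new

    # Step 3 (UNDO): reverse updates of uncommitted transactions, newest to oldest.
    for kind, txn, var, old, new in reversed(wal):
        if kind == "UPDATE" and txn not in committed:
            db[var] = old

    return db


wal = [
    ("UPDATE", "T1", "A", 10, 20),
    ("UPDATE", "T2", "B", 5, 7),
    ("COMMIT", "T1", None, None, None),
    ("UPDATE", "T2", "C", 1, 2),
]
# Assumed state at crash: T2's write to B reached disk, T1's write to A did not.
disk = {"A": 10, "B": 7, "C": 1}

recovered = recover(wal, disk)
print(recovered)  # {'A': 20, 'B': 5, 'C': 1}
```

T1 committed, so its write to A is restored even though it never reached disk. T2 never committed, so its write to B is rolled back even though it did reach disk.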

Detailed Notes on Steps 2 and 3

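In Step 2: Replay every logged action, from the start of the WAL up to the crash. This includes actions of transactions that never committed; Step 3 reverses those. Consider these scenarios: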

In Step 3: Transactions with no COMMIT or ABORT record in the WAL are treated as aborted, and their changes are undone.

Potential Optimizations: Some databases avoid the redundant REDO-then-UNDO work for aborted transactions. Skipping it saves I/O, especially for long transactions that touch millions of variables.


When the Machine (or Disk or Network) Crashes

Just Before Crash

DB just before crash

Just After Crash

Just after crash

Post-crash reality:

Analysis Phase

Goal: Identify active transactions at crash
Method: Scan WAL
Result: List of committed vs uncommitted transactions
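As a rough sketch of the analysis scan, assuming a hypothetical log where each record is a tuple starting with `(kind, txn, ...)` and where finished transactions leave a COMMIT or ABORT record, one forward pass is enough to classify every transaction:

```python
def analyze(wal):
    """Scan the log once and classify each transaction it mentions."""
    seen, committed, aborted = set(), set(), set()
    for kind, txn, *_ in wal:
        seen.add(txn)
        if kind == "COMMIT":
            committed.add(txn)
        elif kind == "ABORT":
            aborted.add(txn)
    # Anything with neither COMMIT nor ABORT was still active at the crash.
    active = seen - committed - aborted
    return committed, aborted, active


wal = [
    ("UPDATE", "T1", "A", 10, 20),
    ("UPDATE", "T2", "B", 5, 7),
    ("COMMIT", "T1"),
    ("UPDATE", "T3", "C", 1, 2),
    ("ABORT", "T3"),
    ("UPDATE", "T2", "D", 0, 9),
]

committed, aborted, active = analyze(wal)
print(committed, aborted, active)  # {'T1'} {'T3'} {'T2'}
```

T2 is the only transaction with updates in the log but no terminating record, so it is the one recovery treats as in flight.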

The Recovery Process

The recovery algorithm in action:

✅ REDO Phase

Purpose: Ensure all finished work survives
Action: Apply changes from WAL using new_values
Order: Forward through the log (oldest to newest)
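A small sketch of the REDO pass may help here. The record fields (`kind`, `txn`, `var`, `old_value`, `new_value`) and the function name `redo` are assumptions for illustration only. The example also shows that replaying the same log twice gives the same state, which is what makes it safe to restart the pass from the beginning.

```python
def redo(wal, disk):
    """Replay every logged update, oldest to newest, onto a copy of the disk state."""
    db = dict(disk)
    for rec in wal:
        if rec["kind"] == "UPDATE":
            # Writing new_value is safe whether or not the change already reached disk.
            db[rec["var"]] = rec["new_value"]
    return db


wal = [
    {"kind": "UPDATE", "txn": "T1", "var": "A", "old_value": 10, "new_value": 20},
    {"kind": "UPDATE", "txn": "T1", "var": "A", "old_value": 20, "new_value": 30},
    {"kind": "COMMIT", "txn": "T1"},
    {"kind": "UPDATE", "txn": "T2", "var": "B", "old_value": 5, "new_value": 7},
]
# Assumed state at crash: only T1's first write to A had been flushed.
disk = {"A": 20, "B": 5}

once = redo(wal, disk)
twice = redo(wal, once)
print(once)           # {'A': 30, 'B': 7}
print(once == twice)  # True
```

Forward order matters: replaying the two writes to A newest-first would leave A at 20 instead of 30. T2's uncommitted write to B is replayed too in this sketch; removing it is the job of the UNDO phase.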

❌ UNDO Phase

Purpose: Remove effects of 'unfinished' transactions
Action: Reverse changes using old_values from WAL
Order: Backward through the log (newest to oldest)
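The UNDO pass can be sketched the same way. The record fields and the `losers` parameter (the set of unfinished transactions) are hypothetical names chosen for this example. The log below has one transaction writing the same variable twice, which is the case where backward order matters.

```python
def undo(wal, db, losers):
    """Reverse updates of unfinished transactions, newest to oldest."""
    db = dict(db)
    for rec in reversed(wal):
        if rec["kind"] == "UPDATE" and rec["txn"] in losers:
            db[rec["var"]] = rec["old_value"]
    return db


wal = [
    {"kind": "UPDATE", "txn": "T1", "var": "A", "old_value": 10, "new_value": 20},
    {"kind": "UPDATE", "txn": "T1", "var": "A", "old_value": 20, "new_value": 30},
]
after_redo = {"A": 30}

restored = undo(wal, after_redo, losers={"T1"})
print(restored)  # {'A': 10}
```

Walking backward restores 20 first and then 10, so A ends at its value from before T1 started. Walking forward would apply the old values in the wrong order and leave A at 20.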

After Recovery Complete

Database state after recovery

The final state:

Why This Works

The WAL is the source of truth!

  • It has the complete history of all changes
  • It has both old and new values for every change
  • Each log record is written durably before the corresponding data change reaches disk
  • Even if the crash leaves the data on disk in an inconsistent state, replaying the WAL can rebuild a consistent one

Recovery Guarantees

What Recovery Ensures

What Recovery Costs


Optimizations

Checkpointing

Idea: Periodically flush all dirty pages to disk
Benefit: Limits how far back recovery must go
Trade-off: Checkpoint overhead vs recovery speed
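To illustrate the benefit, here is a sketch that assumes a hypothetical `("CHECKPOINT",)` record written to the WAL only after all dirty pages have been flushed. Under that assumption, every update logged before the last checkpoint is already on disk, so the REDO pass can start right after it. The record layout and function names are invented for the example.

```python
def redo_start(wal):
    """Index of the first record the REDO pass still has to replay."""
    for i in range(len(wal) - 1, -1, -1):
        if wal[i][0] == "CHECKPOINT":
            return i + 1
    return 0  # no checkpoint found: replay from the beginning


def recover_with_checkpoint(wal, disk):
    db = dict(disk)
    committed = {rec[1] for rec in wal if rec[0] == "COMMIT"}
    # REDO only the tail of the log that follows the last checkpoint.
    for rec in wal[redo_start(wal):]:
        if rec[0] == "UPDATE":
            db[rec[2]] = rec[4]
    # UNDO still walks the whole log: an unfinished transaction may have
    # written before the checkpoint, and that write was flushed with it.
    for rec in reversed(wal):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            db[rec[2]] = rec[3]
    return db


wal = [
    ("UPDATE", "T1", "A", 10, 20),
    ("COMMIT", "T1"),
    ("UPDATE", "T2", "B", 5, 7),
    ("CHECKPOINT",),
    ("UPDATE", "T3", "C", 1, 2),
    ("COMMIT", "T3"),
    ("UPDATE", "T2", "D", 0, 9),
]
# Assumed state at crash: everything before the checkpoint is on disk, nothing after it.
disk = {"A": 20, "B": 7, "C": 1, "D": 0}

print(redo_start(wal))                     # 4
print(recover_with_checkpoint(wal, disk))  # {'A': 20, 'B': 5, 'C': 2, 'D': 0}
```

Only three of the seven records are replayed. Bounding the UNDO scan as well would need extra bookkeeping, such as recording the transactions active at checkpoint time, which this sketch leaves out.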

Practice Questions

Test Your Understanding

Q1: Why do we REDO before UNDO?
Q2: What if the system crashes during recovery?
Q3: How do we know which transactions were active at crash time?