Recovery Post-Crash

Concept. Post-crash recovery runs three passes over the WAL — Analysis (figure out which transactions were in flight at the crash), REDO (replay every committed change to bring the on-disk data back up to the log), and UNDO (reverse the changes of any transaction that didn't COMMIT) — leaving the database fully consistent and ready for new traffic.

Intuition. When the server crashes mid-rush, the log shows Mickey, Minnie, and Daffy were all mid-Premium-purchase. Analysis scans the log and labels them: Mickey COMMITted, Minnie COMMITted, Daffy did not. REDO replays Mickey's and Minnie's writes in case their account-table pages never flushed. UNDO reverses Daffy's charge — his transaction never finished. After three passes, Mickey and Minnie are Premium, Daffy is refunded, and the database is back online.

The Crash Scenario

A database crash leaves us with a few cold facts:

  • Committed transactions might not be fully on disk.

  • Partially complete transactions were left hanging.

  • WAL log holds the full history of changes.

The recovery process needs to:

  1. Analyze the WAL log to figure out what went down.

  2. REDO committed transactions to keep promises.

  3. UNDO incomplete transactions to maintain integrity.

A subtle case: an ABORTed transaction in the WAL still gets REDO'd first, then UNDO'd — the algorithm is uniform. Some databases optimize this REDO-then-UNDO away for ABORTs to save I/O on long transactions.


When Machine (or disk or network) Crashes

Just Before Crash

DB just before crash

Just After Crash

Just after crash

Post-crash reality:

  • Some changes made it to the database (thanks to buffer flushes).

  • Some committed work is missing (lost in RAM).

  • Some incomplete work is on disk but not committed.

  • WAL log is the key to sorting this out.


The Recovery Process

The recovery algorithm in action:

  • Phase 1 - Analysis: Read WAL to build transaction table.

  • Phase 2 - REDO: Replay all changes to reconstruct state.

  • Phase 3 - UNDO: Roll back unfinished transactions.

+ REDO Phase

Purpose: Ensure all finished work survives

Action: Apply changes from WAL using new_values

Order: Forward through the log (oldest to newest)

− UNDO Phase

Purpose: Remove effects of 'unfinished' transactions

Action: Reverse changes using old_values from WAL

Order: Backward through the log (newest to oldest)


After Recovery Complete

Database state after recovery

The final state:

    • All committed transactions are fully applied.
  • − All uncommitted transactions are completely removed.

  • Database is consistent and ready for new transactions.

Why This Works

The WAL is the source of truth!

  • It has the complete history of all changes
  • It has both old and new values for every change
  • It's written durably before any actual data changes
  • Even if the database is corrupted, WAL can rebuild everything

Recovery Guarantees

What Recovery Ensures

  • Atomicity: Transactions are all-or-nothing.

  • Durability: Committed work survives crashes.

  • Consistency: Database rules are preserved.

  • Repeatability: Recovery is deterministic.

What Recovery Costs

  • Time: Can take minutes to hours for large databases.

  • I/O: Must read entire WAL since last checkpoint.

  • Availability: Database offline during recovery.

  • Complexity: Requires careful implementation.


Optimizations

Checkpointing

Idea: Periodically flush all dirty pages to disk

Benefit: Limits how far back recovery must go

Trade-off: Checkpoint overhead vs recovery speed


Practice Questions

Test Your Understanding

Q1: Why do we REDO before UNDO?

Q2: What if the system crashes during recovery?

Q3: How do we know which transactions were active at crash time?