Recovery Post-Crash

Bringing the Database Back to Life

How databases use the write-ahead log (WAL) to restore consistency after a crash

The moment of truth: UNDO incomplete transactions, REDO committed ones

The Crash Scenario

A database crash leaves us with a few cold facts:

Committed transactions might not be fully on disk.
Partially complete transactions were left hanging.
The WAL holds the full history of changes.

The recovery process needs to:

Analyze the WAL to figure out what went down.
REDO committed transactions to keep promises.
UNDO incomplete transactions to maintain integrity.

Detailed Notes on Steps 2 and 3

In Step 2: Replay every action in the WAL from start to crash. Consider these scenarios:

If the WAL shows a COMMIT, redo the transaction.
If the WAL shows an ABORT, redo the transaction then undo it.

In Step 3: Transactions without COMMIT/ABORT are treated as ABORTs, and we undo their changes.
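The rules for Steps 2 and 3 can be sketched end to end in a few lines. This is a minimal illustration, not how any particular database implements recovery: the `LogRecord` class, its field names, and the `recover` function are hypothetical, the log is an in-memory list, and the database is a plain dict.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    txn: str
    kind: str                          # "UPDATE", "COMMIT", or "ABORT"
    key: Optional[str] = None
    old_value: Optional[int] = None
    new_value: Optional[int] = None

def recover(wal, db):
    """Replay the whole log, then undo every transaction that did not commit."""
    committed = {r.txn for r in wal if r.kind == "COMMIT"}

    # Step 2: replay every update, oldest to newest.
    for r in wal:
        if r.kind == "UPDATE":
            db[r.key] = r.new_value

    # Step 3: undo aborted and unfinished transactions, newest to oldest.
    for r in reversed(wal):
        if r.kind == "UPDATE" and r.txn not in committed:
            db[r.key] = r.old_value
    return db

wal = [
    LogRecord("T1", "UPDATE", "A", 10, 20),
    LogRecord("T2", "UPDATE", "B", 5, 7),
    LogRecord("T1", "COMMIT"),
    LogRecord("T3", "UPDATE", "C", 1, 2),
    LogRecord("T2", "ABORT"),
    # crash here: T3 has neither COMMIT nor ABORT
]

# On-disk state at the crash: T1's change never reached disk, T2's did.
disk = {"A": 10, "B": 7, "C": 1}
recovered = recover(wal, dict(disk))
print(recovered)
```

T1 committed, so its update to A is kept. T2 aborted and T3 never finished, so both are rolled back to their old values.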

Potential Optimizations: Some databases avoid the REDO-then-UNDO round trip for ABORTs. Efficient algorithms can save on I/O, especially for long transactions that touch millions of values.

When the Machine (or Disk or Network) Crashes

Just Before Crash

Figure: DB just before crash

Just After Crash

Figure: DB just after crash

Post-crash reality:

Some changes made it to the database (thanks to buffer flushes).
Some committed work is missing from disk (the modified pages were still in RAM).
Some uncommitted work is already on disk.
The WAL is the key to sorting this out.

Analysis Phase

Goal: Identify which transactions were still active at the time of the crash

Method: Scan WAL

Result: List of committed vs uncommitted transactions
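A rough sketch of the analysis scan, assuming each log record is reduced to a hypothetical `(txn, kind)` pair with kinds `BEGIN`, `UPDATE`, `COMMIT`, and `ABORT`. The `analyze` function and the status labels are illustrative names, not part of any real system.

```python
def analyze(wal):
    """Scan the log oldest to newest and record each transaction's final status."""
    table = {}
    for txn, kind in wal:
        if kind in ("COMMIT", "ABORT"):
            table[txn] = kind
        else:
            # BEGIN or UPDATE: the transaction is in flight unless it already ended.
            table.setdefault(txn, "ACTIVE")
    return table

wal = [
    ("T1", "BEGIN"), ("T2", "BEGIN"),
    ("T1", "UPDATE"), ("T2", "UPDATE"),
    ("T1", "COMMIT"),
    ("T3", "BEGIN"), ("T3", "UPDATE"),
    ("T2", "ABORT"),
    # crash here
]

table = analyze(wal)
committed = sorted(t for t, s in table.items() if s == "COMMIT")
uncommitted = sorted(t for t, s in table.items() if s != "COMMIT")
print(table, committed, uncommitted)
```

A transaction still marked `ACTIVE` when the scan ends was in flight at crash time. Together with the aborted ones, it goes on the uncommitted list.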

The Recovery Process

The recovery algorithm in action:

Phase 1 - Analysis: Read WAL to build transaction table.
Phase 2 - REDO: Replay all changes to reconstruct state.
Phase 3 - UNDO: Roll back unfinished transactions.

✅ REDO Phase

Purpose: Ensure all committed work survives, even if it never reached disk

Action: Apply changes from WAL using new_values

Order: Forward through the log (oldest to newest)
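One property worth seeing in code: when log records store absolute new values, a forward replay lands on the same state no matter which changes had already reached disk, and no matter how often it is repeated. The sketch below assumes hypothetical `(key, old_value, new_value)` records and a dict standing in for the data pages.

```python
def redo(wal, db):
    """Reapply every logged change in log order, using the new value."""
    for key, old_value, new_value in wal:
        db[key] = new_value
    return db

wal = [("A", 10, 20), ("B", 5, 7), ("A", 20, 30)]

nothing_flushed = redo(wal, {"A": 10, "B": 5})   # no change reached disk
some_flushed = redo(wal, {"A": 20, "B": 5})      # first update was already on disk
replayed_twice = redo(wal, redo(wal, {"A": 10, "B": 5}))

print(nothing_flushed, some_flushed, replayed_twice)
```

All three calls end at the same state. This would not hold if the records stored deltas such as "add 10" instead of the resulting value.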

❌ UNDO Phase

Purpose: Remove effects of 'unfinished' transactions

Action: Reverse changes using old_values from WAL

Order: Backward through the log (newest to oldest)
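The backward order matters when one transaction writes the same item more than once. The following sketch, using hypothetical `(txn, key, old_value, new_value)` records and an illustrative `undo` helper, compares both scan directions.

```python
def undo(wal, db, unfinished, newest_first=True):
    """Restore old values for every update made by an unfinished transaction."""
    records = reversed(wal) if newest_first else wal
    for txn, key, old_value, new_value in records:
        if txn in unfinished:
            db[key] = old_value
    return db

# T1 changed A twice (10 -> 20 -> 30) and never committed.
wal = [("T1", "A", 10, 20), ("T1", "A", 20, 30)]

backward = undo(wal, {"A": 30}, {"T1"})                      # newest to oldest
forward = undo(wal, {"A": 30}, {"T1"}, newest_first=False)   # wrong order

print(backward, forward)
```

Scanning newest to oldest ends on the value from before T1 started (10). Scanning forward leaves the intermediate value 20 behind.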

After Recovery Complete

Figure: Database state after recovery

The final state:

✅ All committed transactions are fully applied.
❌ All uncommitted transactions are completely removed.
Database is consistent and ready for new transactions.

Why This Works

The WAL is the source of truth!

It has the complete history of all changes
It has both old and new values for every change
It's written durably before any actual data changes
Even if the crash leaves the data files in an inconsistent state, the WAL can rebuild a consistent one, provided the log itself survives intact
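The write-ahead rule itself is small enough to sketch. In this illustration the `TinyStore` class, its JSON record layout, and the file name are assumptions made for the example. `os.fsync` is the standard-library call that asks the operating system to push the log bytes to stable storage before the data is touched.

```python
import json
import os
import tempfile

class TinyStore:
    def __init__(self, wal_path):
        self.wal = open(wal_path, "a", encoding="utf-8")
        self.pages = {}                      # stands in for buffered data pages

    def write(self, txn, key, new_value):
        record = {"txn": txn, "key": key,
                  "old": self.pages.get(key), "new": new_value}
        self.wal.write(json.dumps(record) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())          # the log record is made durable first...
        self.pages[key] = new_value          # ...only then is the data changed

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(wal_path)
store.write("T1", "balance", 100)
store.write("T1", "balance", 70)

with open(wal_path, encoding="utf-8") as f:
    logged = [json.loads(line) for line in f]
print(logged)
```

Each record carries both the old and the new value, which is what lets recovery go in either direction: forward with `new` for REDO, backward with `old` for UNDO.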

Recovery Guarantees

What Recovery Ensures

Atomicity: Transactions are all-or-nothing.
Durability: Committed work survives crashes.
Consistency: Database rules are preserved.
Repeatability: Recovery is deterministic.

What Recovery Costs

Time: Can take minutes to hours when there is a lot of log to replay.
I/O: Must read entire WAL since last checkpoint.
Availability: Database offline during recovery.
Complexity: Requires careful implementation.

Optimizations

Checkpointing

Idea: Periodically flush all dirty pages to disk

Benefit: Limits how far back recovery must go

Trade-off: Checkpoint overhead vs recovery speed
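How a checkpoint shortens recovery can be shown with a small sketch. It assumes the simplest kind of checkpoint: every dirty page is flushed and no transaction is in flight when the hypothetical `CHECKPOINT` record is written. Real systems relax this, and then UNDO may need records from before the checkpoint. The record layout and function names are illustrative.

```python
def redo_start(wal):
    """Index of the first record to replay: just after the last CHECKPOINT, else 0."""
    start = 0
    for i, record in enumerate(wal):
        if record[0] == "CHECKPOINT":
            start = i + 1
    return start

def redo_from_checkpoint(wal, db):
    """Replay only the updates logged after the last checkpoint."""
    replayed = 0
    for record in wal[redo_start(wal):]:
        if record[0] == "UPDATE":
            _, key, old_value, new_value = record
            db[key] = new_value
            replayed += 1
    return db, replayed

wal = [
    ("UPDATE", "A", 10, 20),
    ("UPDATE", "B", 5, 7),
    ("CHECKPOINT",),             # everything above is already on disk
    ("UPDATE", "A", 20, 30),
]

disk_at_crash = {"A": 20, "B": 7}    # state flushed by the checkpoint
state, replayed = redo_from_checkpoint(wal, dict(disk_at_crash))
print(state, replayed)
```

Only one of the three updates is replayed. More frequent checkpoints shrink that tail further, at the cost of more flushing during normal operation.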

Practice Questions

Test Your Understanding

Q1: Why do we REDO before UNDO?

Q2: What if the system crashes during recovery?

Q3: How do we know which transactions were active at crash time?