Recovery Post-Crash
Bringing the Database Back to Life
How databases recover from crashes by using the write-ahead log (WAL) to restore consistency
The moment of truth: UNDO incomplete transactions, REDO committed ones
The Crash Scenario
A database crash leaves us with a few cold facts:
- Committed transactions might not be fully on disk.
- Partially complete transactions were left hanging.
- The WAL holds the full history of changes.
The recovery process needs to:
- Step 1: Analyze the WAL to figure out what went down.
- Step 2: REDO committed transactions to keep promises.
- Step 3: UNDO incomplete transactions to maintain integrity.
Detailed Notes on Steps 2 and 3
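In Step 2: Replay every action in the WAL in log order, from the start of the log (or from the last checkpoint, where one exists) up to the crash. This includes actions from transactions that later aborted. Consider these scenarios:
- If the WAL shows a COMMIT, redo the transaction and keep its changes.
- If the WAL shows an ABORT, redo the transaction, then undo it.
In Step 3: Transactions with neither a COMMIT nor an ABORT record are treated as ABORTs, and we undo their changes.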
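The rules for Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements recovery. The `LogRecord` type, the `recover` function, and the dictionary standing in for the database are assumptions made for the example. It also assumes that every update record stores both the old and the new value, and that locking prevents another transaction from writing a key while an unfinished transaction still holds it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    txn: str
    kind: str                  # "UPDATE", "COMMIT", or "ABORT"
    key: Optional[str] = None
    old: Optional[int] = None  # value before the update (used by UNDO)
    new: Optional[int] = None  # value after the update (used by REDO)


def recover(wal, disk):
    """Replay the log (Step 2), then roll back unfinished transactions (Step 3)."""
    db = dict(disk)
    pending = {}  # txn -> its UPDATE records with no COMMIT/ABORT seen yet

    # Step 2: replay every update in log order.
    for rec in wal:
        if rec.kind == "UPDATE":
            db[rec.key] = rec.new
            pending.setdefault(rec.txn, []).append(rec)
        elif rec.kind == "COMMIT":
            pending.pop(rec.txn, None)  # redone changes stand
        elif rec.kind == "ABORT":
            # Redo-then-undo: reverse this transaction's updates, newest first.
            for upd in reversed(pending.pop(rec.txn, [])):
                db[upd.key] = upd.old

    # Step 3: anything still pending never finished; treat it as an ABORT.
    losers = [r for r in wal if r.kind == "UPDATE" and r.txn in pending]
    for upd in reversed(losers):
        db[upd.key] = upd.old
    return db


wal = [
    LogRecord("T1", "UPDATE", "A", old=100, new=90),
    LogRecord("T1", "UPDATE", "B", old=200, new=210),
    LogRecord("T1", "COMMIT"),
    LogRecord("T2", "UPDATE", "A", old=90, new=50),
    LogRecord("T2", "ABORT"),
    LogRecord("T3", "UPDATE", "B", old=210, new=300),  # crash before COMMIT
]
disk = {"A": 100, "B": 300}  # T1's write to A was lost; T3's write to B was flushed

result = recover(wal, disk)
print(result)  # {'A': 90, 'B': 210}
```

On this log, T1's committed changes are restored, T2's change is redone and then undone at its ABORT record, and T3, which has neither a COMMIT nor an ABORT, is rolled back at the end.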
Potential Optimizations: Some databases avoid the full REDO-then-UNDO cycle for ABORTs. More efficient algorithms can save on I/O, especially for long transactions that touch millions of variables.
When the Machine (or Disk or Network) Crashes
Just Before Crash

Just After Crash

Post-crash reality:
- Some changes made it to the database (thanks to buffer flushes).
- Some committed work is missing from disk (it was still in RAM and got lost).
- Some incomplete work is on disk but not committed.
- The WAL is the key to sorting this out.
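To make the post-crash state concrete, here is a small simulation. It is only a sketch: the `buffer`, `disk`, and `wal` structures and the `update`, `commit`, and `flush` helpers are hypothetical names for this example, and log appends are assumed to be durable immediately, which is a simplification.

```python
disk = {"A": 100, "B": 200, "C": 300}  # durable data pages
buffer = dict(disk)                    # in-memory copy, lost on a crash
wal = []                               # log records (assumed durable once appended)


def update(txn, key, value):
    # Log the old and new value first, then change the in-memory copy.
    wal.append((txn, "UPDATE", key, buffer[key], value))
    buffer[key] = value


def commit(txn):
    wal.append((txn, "COMMIT"))


def flush(key):
    # The buffer manager writes one value back to disk.
    disk[key] = buffer[key]


update("T1", "A", 90)
update("T1", "B", 210)
commit("T1")
update("T2", "C", 0)   # T2 never commits

flush("A")             # a committed change reaches disk
flush("C")             # an uncommitted change reaches disk too

buffer = None          # crash: everything in RAM is gone

print(disk)  # {'A': 90, 'B': 200, 'C': 0}
```

After the simulated crash, `A` holds committed work that made it to disk, `B` is missing T1's committed update, and `C` holds T2's uncommitted update. The log still records all three updates and T1's COMMIT.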
Analysis Phase
The Recovery Process
The recovery algorithm in action:
- Phase 1 - Analysis: Read the WAL to build the transaction table.
- Phase 2 - REDO: Replay all changes to reconstruct state.
- Phase 3 - UNDO: Roll back unfinished transactions.
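As a rough sketch of Phase 1, the function below scans a log once and records the last known status of each transaction. The tuple-based record layout, the `analyze` function, and the status strings are assumptions chosen for this example. Real systems track more, such as log positions and dirty pages.

```python
def analyze(wal):
    """Build a transaction table: txn id -> 'committed', 'aborted', or 'incomplete'."""
    table = {}
    for txn, kind, *_ in wal:
        if kind == "COMMIT":
            table[txn] = "committed"
        elif kind == "ABORT":
            table[txn] = "aborted"
        else:
            # First sighting of a transaction: assume incomplete until proven otherwise.
            table.setdefault(txn, "incomplete")
    return table


wal = [
    ("T1", "UPDATE", "A", 100, 90),
    ("T2", "UPDATE", "C", 300, 0),
    ("T1", "COMMIT"),
    ("T3", "UPDATE", "B", 200, 5),
    ("T3", "ABORT"),
]

table = analyze(wal)
print(table)  # {'T1': 'committed', 'T2': 'incomplete', 'T3': 'aborted'}

# Only committed transactions keep their changes; the rest must be rolled back.
to_undo = [txn for txn, status in table.items() if status != "committed"]
print(to_undo)  # ['T2', 'T3']
```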
✅ REDO Phase
❌ UNDO Phase
After Recovery Complete

The final state:
- ✅ All committed transactions are fully applied.
- ❌ All uncommitted transactions are completely removed.
- Database is consistent and ready for new transactions.
Why This Works
The WAL is the source of truth!
- It has the complete history of all changes.
- It has both the old and new values for every change.
- It's written durably before the corresponding data changes reach disk.
- Even if a crash leaves the data files in an inconsistent state, the WAL has what recovery needs to rebuild a consistent one.
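Two of the properties above, old and new values in each record and log-before-data ordering, can be illustrated with a toy store. This is a sketch under stated assumptions: `Update`, `MiniStore`, and the `flushed_lsn` and `page_lsn` bookkeeping are hypothetical names, log sequence numbers (LSNs) are simply list positions, and "flushing" only moves a counter instead of doing real I/O.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    lsn: int     # log sequence number: position of the record in the log
    txn: str
    key: str
    old: object  # value before the change (enables UNDO)
    new: object  # value after the change (enables REDO)


class MiniStore:
    def __init__(self, initial):
        self.disk = dict(initial)   # durable values
        self.pages = dict(initial)  # in-memory values
        self.page_lsn = {}          # key -> LSN of the last update to that key
        self.log = []               # all log records, in LSN order
        self.flushed_lsn = 0        # records with lsn <= flushed_lsn are durable

    def update(self, txn, key, new):
        rec = Update(len(self.log) + 1, txn, key, self.pages.get(key), new)
        self.log.append(rec)        # the log record is created first
        self.pages[key] = new       # then the in-memory value changes
        self.page_lsn[key] = rec.lsn
        return rec

    def flush_log(self, up_to):
        self.flushed_lsn = max(self.flushed_lsn, up_to)

    def flush_page(self, key):
        # Write-ahead rule: make the log durable up to this key's last
        # update before the changed value itself is written to disk.
        if key in self.page_lsn:
            self.flush_log(self.page_lsn[key])
        self.disk[key] = self.pages[key]


store = MiniStore({"A": 100, "B": 200})
r1 = store.update("T1", "A", 90)
r2 = store.update("T1", "B", 210)
store.flush_page("A")

print(r1)                 # Update(lsn=1, txn='T1', key='A', old=100, new=90)
print(store.flushed_lsn)  # 1
print(store.disk)         # {'A': 90, 'B': 200}
```

Because `flush_page` forces the log out first, a value can never be on disk without a durable record that says how to undo it.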
Recovery Guarantees
What Recovery Ensures
- Atomicity: Transactions are all-or-nothing.
- Durability: Committed work survives crashes.
- Consistency: Database rules are preserved.
- Repeatability: Recovery is deterministic.
What Recovery Costs
- Time: Can take minutes to hours for large databases.
- I/O: Must read the entire WAL written since the last checkpoint.
- Availability: The database is offline during recovery.
- Complexity: Requires careful implementation.
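The I/O cost depends on how much log sits after the last checkpoint. The sketch below shows the idea of starting the scan there. It assumes a simple "quiescent" checkpoint, taken when no transaction is active and all changed data has been flushed, so nothing before it is needed. The `CHECKPOINT` record kind and the `records_after_last_checkpoint` helper are hypothetical names for this example.

```python
def records_after_last_checkpoint(wal):
    """Return the part of the log that recovery still has to read."""
    last = -1
    for i, rec in enumerate(wal):
        if rec[1] == "CHECKPOINT":
            last = i
    # With no checkpoint at all, last stays -1 and the whole log is returned.
    return wal[last + 1:]


wal = [
    ("T1", "UPDATE", "A", 100, 90),
    ("T1", "COMMIT"),
    ("-", "CHECKPOINT"),            # assumed quiescent: nothing active, all data flushed
    ("T2", "UPDATE", "B", 200, 5),
    ("T2", "COMMIT"),
]

tail = records_after_last_checkpoint(wal)
print(len(tail))  # 2
print(tail[0])    # ('T2', 'UPDATE', 'B', 200, 5)
```

Checkpoints taken while transactions are still running need extra bookkeeping, such as the list of active transactions, which this sketch leaves out.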