Recovery: Why We Need It

Concept. Recovery is the database's ability to restore a consistent state after a crash, ensuring atomicity (all-or-nothing) and durability (committed work survives) even when the system fails mid-transaction.

Intuition. When Mickey clicks "Buy Premium," the database charges his card, flips is_premium = true, and starts a renewal cycle: three writes that must all happen, or none at all. If the server crashes after the charge but before the flag flips, recovery has to decide whether Mickey's transaction committed (finish the upgrade) or did not (refund the charge). Without recovery he could end up charged but not Premium, or Premium but not charged.

Quick Recall: Transaction Outcomes

+ COMMIT

Transaction completed successfully → All changes become permanent

− ABORT

Transaction failed or was cancelled (by the user or by application logic). If a transaction has not COMMITted before a crash (machine reboot, disk or network failure), it is also treated as an ABORT.

→ All changes must be undone

The Three Key Problems

Every database must tackle these core issues:

The Atomicity Problem

If a payment transaction fails after debiting account A but before crediting account B, the system risks a partial update where money vanishes.

We must be able to UNDO partial changes when a transaction ABORTs to ensure no data is corrupted.
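
A minimal sketch of the UNDO idea, assuming a toy in-memory dictionary of balances and a hypothetical `transfer` helper (neither comes from a real database API). Each write first records the old value, so a failure between the debit and the credit can be rolled back:

```python
# Toy model: account balances held in a dict (illustrative, not a real DB API).
def transfer(accounts, src, dst, amount, fail_after_debit=False):
    undo_log = []  # (key, old_value) pairs, recorded before each write

    def write(key, new_value):
        undo_log.append((key, accounts[key]))  # remember the old value first
        accounts[key] = new_value

    try:
        write(src, accounts[src] - amount)  # debit A
        if fail_after_debit:
            raise RuntimeError("simulated failure between debit and credit")
        write(dst, accounts[dst] + amount)  # credit B
        return "COMMIT"
    except RuntimeError:
        # ABORT: restore the old values in reverse order
        for key, old_value in reversed(undo_log):
            accounts[key] = old_value
        return "ABORT"


accounts = {"A": 100, "B": 50}
outcome = transfer(accounts, "A", "B", 30, fail_after_debit=True)
print(outcome, accounts)  # ABORT {'A': 100, 'B': 50}
```

The debit is reversed, so no money vanishes. A real system keeps the undo information in its log rather than in a local list, since a crash would wipe the list too.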

The Durability Problem

If a user pays for a ticket and receives confirmation, but the server crashes immediately afterwards, the system must still remember that the transaction occurred.

We must ensure COMMITted data survives crashes. If committed changes existed only in memory and were lost, the system must be able to REDO them on reboot.
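
A minimal sketch of the REDO idea, assuming a log that survives the crash (a plain list stands in for durable storage) and a made-up record format of `(txn, "WRITE", key, value)` and `(txn, "COMMIT")`. The transaction names and ticket keys are hypothetical:

```python
# Toy model: `log` stands in for durable storage; in-memory state is lost on crash.
log = [
    ("T1", "WRITE", "ticket_17", "sold_to_alice"),
    ("T1", "COMMIT"),
    ("T2", "WRITE", "ticket_18", "sold_to_bob"),
    # crash happens here: T2 never logged a COMMIT
]


def redo(log):
    """Rebuild state by replaying the writes of committed transactions only."""
    committed = {rec[0] for rec in log if rec[1] == "COMMIT"}
    state = {}
    for rec in log:
        if rec[1] == "WRITE" and rec[0] in committed:
            _txn, _kind, key, value = rec
            state[key] = value
    return state


memory = redo(log)  # run once after reboot
print(memory)  # {'ticket_17': 'sold_to_alice'}
```

T1's confirmed sale is restored, while T2 is treated as an ABORT because it never committed.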

The Performance Problem

A database with billions of rows may have to track changes for millions of transactions, many of them running concurrently. A naive approach of copying the whole database before each transaction would bring the system to a halt.

The system must track every operation and log every change precisely, without imposing a large performance overhead.

Example: DB State after Single Transaction (T56)

Consider a single transaction (T56) that reassigns 3 seats to new buyers: seat 42 ab12 → zx198, seat 44 cd34 → pq342, seat 51 ef56 → st567. The database starts in some Pre state, T56 runs, and the DB ends in either the Post state (COMMIT) or back at the Pre state (ABORT).

Figure (state machine): Pre State → T56 → Post State on COMMIT, or back to Pre State on ABORT.

What T56 does: rewrites buyer_id on three seats. 42: ab12→zx198, 44: cd34→pq342, 51: ef56→st567. COMMIT makes those new values durable; ABORT throws them away and the DB stays at Pre.
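
The two outcomes of T56 can be sketched as follows. The seat numbers and buyer IDs come from the example above; the `(seat, old, new)` record layout and the `run` helper are illustrative assumptions, not a prescribed format:

```python
# Pre state: seat number -> buyer_id (values from the T56 example).
pre_state = {42: "ab12", 44: "cd34", 51: "ef56"}

# T56's change records: (seat, old buyer_id, new buyer_id).
t56_changes = [
    (42, "ab12", "zx198"),
    (44, "cd34", "pq342"),
    (51, "ef56", "st567"),
]


def run(state, changes, outcome):
    """Apply the changes, then keep them (COMMIT) or reverse them (ABORT)."""
    db = dict(state)  # copied only so this demo can compare against pre_state
    for seat, _old, new in changes:
        db[seat] = new
    if outcome == "ABORT":
        for seat, old, _new in reversed(changes):
            db[seat] = old
    return db


post_commit = run(pre_state, t56_changes, "COMMIT")
post_abort = run(pre_state, t56_changes, "ABORT")
print(post_commit)              # {42: 'zx198', 44: 'pq342', 51: 'st567'}
print(post_abort == pre_state)  # True
```

Three small records are enough to move the database forward to Post or back to Pre.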

The Recovery Solution Preview

Modern databases address all three problems with two ideas:

Change Tracking (not copying)

Insight: Track what changed, not entire state

Efficiency: Log 10 changes vs. copy 10 million rows

Write-Ahead Logging

Insight: Write each change to a fast, sequential log before writing it to slow, random-access storage

Guarantee: After a crash, the database state can be reconstructed from the logged changes
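
A rough sketch of the write-ahead ordering rule, under loud assumptions: `ToyWAL` is a hypothetical class, lists and dicts stand in for the log file and the data pages, and the sequence number attached to each record (often called an LSN) is a name introduced here for illustration. The only point is that a data write is refused until the matching log record has been flushed:

```python
class ToyWAL:
    """Illustrative model only: lists/dicts stand in for the log and data pages."""

    def __init__(self):
        self.log_buffer = []     # log records still in memory
        self.log_on_disk = []    # log records that survived a flush
        self.pages_on_disk = {}  # the "slow random storage"
        self.page_lsn = {}       # sequence number of the last change per page

    def log_change(self, key, old, new):
        lsn = len(self.log_on_disk) + len(self.log_buffer)  # next sequence number
        self.log_buffer.append((lsn, key, old, new))
        self.page_lsn[key] = lsn
        return lsn

    def flush_log(self):
        self.log_on_disk.extend(self.log_buffer)  # one sequential append
        self.log_buffer.clear()

    def write_page(self, key, value):
        flushed_lsn = len(self.log_on_disk) - 1
        # Write-ahead rule: the change's log record must be durable first.
        if self.page_lsn[key] > flushed_lsn:
            raise RuntimeError("log record not flushed yet")
        self.pages_on_disk[key] = value


wal = ToyWAL()
wal.log_change("seat_42", "ab12", "zx198")
try:
    wal.write_page("seat_42", "zx198")  # too early: log not flushed
    early_write_allowed = True
except RuntimeError:
    early_write_allowed = False

wal.flush_log()
wal.write_page("seat_42", "zx198")      # allowed now
print(early_write_allowed, wal.pages_on_disk)  # False {'seat_42': 'zx198'}
```

Because the log record (with both old and new values) always reaches disk before the page does, a crash at any point leaves enough information to undo or redo the change.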