Recovery: Why We Need It

Quick Recall: Transaction Outcomes

✅ COMMIT

Transaction completed successfully → All changes become permanent

❌ ABORT

Transaction failed or was cancelled (by the user or by application logic). If a transaction doesn't COMMIT before a crash (machine reboot, disk or network failure), it's also treated as an ABORT.

→ All changes must be undone

The Three Key Problems

Every database must tackle these core issues:

The Atomicity Problem

If a payment transaction fails after debiting account A but before crediting account B, the system risks a partial update where money vanishes.
We must be able to UNDO partial changes when a transaction ABORTs to ensure no data is corrupted.

The Durability Problem

If a user pays for a ticket and receives confirmation, but the server crashes immediately afterward, the system must still remember that the transaction occurred.
We must ensure COMMITted data survives crashes. If changes are lost from memory, the system must be able to REDO them on reboot.

The Performance Problem

A database with billions of rows must track changes for millions of concurrent transactions. A naive approach of copying the database before each transaction would cause system paralysis.
The system must be able to meticulously track operations and log changes without imposing massive performance overheads.

Recovery is the database's ability to restore itself to a consistent state. It's like having a time machine.


Real-World Examples

Banking Transfer Gone Wrong

Problem: A $500 transfer from Alice to Bob failed after debiting Alice but before crediting Bob. Alice lost $500, Bob got nothing. Money vanished!

Recovery solution: Must UNDO the first UPDATE to restore Alice's balance.
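
A minimal sketch of the UNDO idea, assuming a hypothetical in-memory table of balances and a failure simulated with an exception. The names `balances`, `transfer_with_undo`, and `crash_after_debit` are illustrative only. In a real crash the process dies, so the old values would have to sit on durable storage rather than in a Python list.

```python
# Hypothetical in-memory "database": account name -> balance.
balances = {"alice": 1000, "bob": 200}

def transfer_with_undo(db, src, dst, amount, crash_after_debit=False):
    """Move `amount` from src to dst; on failure, restore the saved old values."""
    undo_log = []  # (key, old_value) pairs, recorded before each change
    try:
        undo_log.append((src, db[src]))
        db[src] -= amount                  # first UPDATE: debit
        if crash_after_debit:
            raise RuntimeError("simulated failure between the two updates")
        undo_log.append((dst, db[dst]))
        db[dst] += amount                  # second UPDATE: credit
        return "COMMIT"
    except RuntimeError:
        # ABORT: walk the undo log newest-first and put the old values back.
        for key, old_value in reversed(undo_log):
            db[key] = old_value
        return "ABORT"

outcome = transfer_with_undo(balances, "alice", "bob", 500, crash_after_debit=True)
```

After the simulated failure, `outcome` is "ABORT" and Alice's balance is back to its original value, so no money has vanished.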

Concert Ticket Purchase

Problem: The user paid and has a receipt, but after the restart the ticket shows as available again.

Recovery solution: Must REDO all changes to honor the committed purchase.
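
A rough sketch of REDO on restart, assuming a hypothetical log of dictionaries that survived the crash and an on-disk ticket table that never received the changes. The record layout (`type`, `txn`, `key`, `new`) and the function name `recover` are assumptions for illustration, not a real database's format.

```python
def recover(disk_state, log):
    """Reapply logged changes, but only for transactions that committed."""
    committed = {rec["txn"] for rec in log if rec["type"] == "COMMIT"}
    state = dict(disk_state)  # start from what was actually on disk
    for rec in log:
        if rec["type"] == "UPDATE" and rec["txn"] in committed:
            state[rec["key"]] = rec["new"]  # REDO the committed change
    return state

# State on disk at crash time: neither purchase had reached the table yet.
disk_state = {"seat_12A": "available", "seat_12B": "available"}

durable_log = [
    {"type": "UPDATE", "txn": "T1", "key": "seat_12A", "new": "sold"},
    {"type": "COMMIT", "txn": "T1"},
    {"type": "UPDATE", "txn": "T2", "key": "seat_12B", "new": "sold"},
    # T2 has no COMMIT record: the crash came first, so it counts as an ABORT.
]

recovered = recover(disk_state, durable_log)
```

Here the committed purchase of `seat_12A` is honored after the restart, while the uncommitted change to `seat_12B` is ignored.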


Example: DB State after Single Transaction (T56)

Consider a single transaction (T56) in which we resell tickets from 3 users. After T56 is done, the database can only be in one of two states: either all of T56's changes are applied (COMMIT) or none of them are (ABORT). The Transaction Manager coordinates everything here.

Figure: Transaction T56 lifecycle, showing the commit/abort paths and the resulting database states

Transaction Goals

Atomicity: All changes happen, or none do
Durability: Committed changes survive crashes
Two Paths: COMMIT (make permanent) or ABORT (undo everything)

Why Traditional Approaches Fail

"Copy Everything" Approach

Idea: Make complete database copy before each transaction
Reality: 1TB database × 1000 transactions/sec = 1000TB/sec of copying
Verdict: Impossible at scale

"Save After Every Change" Approach

Idea: Write every change immediately to permanent storage (e.g., disk)
Reality: Disk I/O is slow. RAM access takes about 100ns, while a random disk access takes roughly 5ms (50,000x slower!)
Verdict: Performance death sentence
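
A quick back-of-the-envelope check, reading the figures above as RAM at about 100ns and disk at about 50,000x that. The one-write-at-a-time model is a simplifying assumption; real devices overlap and batch requests.

```python
ram_access_ns = 100          # figure quoted above
disk_slowdown = 50_000       # figure quoted above

disk_access_ns = ram_access_ns * disk_slowdown        # 5,000,000 ns
disk_access_ms = disk_access_ns / 1_000_000           # 5.0 ms

# If every change waits for its own disk write, one after another:
max_sync_writes_per_sec = 1_000_000_000 // disk_access_ns   # 200
```

Under these assumptions a system that writes every change synchronously tops out at about 200 changes per second, far below the thousands of transactions per second mentioned earlier.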

The Recovery Solution Preview

Modern databases solve this with an elegant approach:

Change Tracking (not copying)

Insight: Track what changed, not entire state
Efficiency: Log 10 changes vs. copy 10 million rows
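
To put rough numbers on that comparison, here is a small calculation. The 100-byte row size and the choice to log both an old and a new value per change are assumptions made for illustration.

```python
ROW_BYTES = 100                    # assumed average row size
rows_in_table = 10_000_000
changes_in_txn = 10

# "Copy everything": duplicate every row before the transaction runs.
copy_bytes = rows_in_table * ROW_BYTES                 # 1,000,000,000 bytes

# "Track changes": log an old and a new value for each changed row.
log_bytes = changes_in_txn * ROW_BYTES * 2             # 2,000 bytes

savings_factor = copy_bytes // log_bytes               # 500,000
```

With these assumed sizes, logging writes about 2KB where copying writes about 1GB.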

Write-Ahead Logging

Insight: Write changes to a fast, sequential log before updating the slow, randomly accessed data storage
Guarantee: The committed state can be reconstructed from the logged changes
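
The ordering rule can be sketched with a toy log file. This is a minimal illustration, not how any particular database implements it: the class `TinyWAL`, its JSON record format, and the file name `wal.log` are all hypothetical, and the sketch ignores transactions, COMMIT records, and log truncation.

```python
import json
import os
import tempfile

class TinyWAL:
    """Toy write-ahead log: every change is appended to a log file first."""

    def __init__(self, path):
        self.path = path
        self.data = {}  # stands in for the slow, randomly accessed data pages

    def write(self, key, value):
        record = json.dumps({"key": key, "value": value})
        with open(self.path, "a") as log:
            log.write(record + "\n")   # 1) sequential append to the log
            log.flush()
            os.fsync(log.fileno())     # 2) force the record to stable storage
        self.data[key] = value         # 3) only now change the data itself

    def replay(self):
        """Rebuild state from the log alone, as recovery would after a crash."""
        state = {}
        with open(self.path) as log:
            for line in log:
                rec = json.loads(line)
                state[rec["key"]] = rec["value"]
        return state

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyWAL(log_path)
db.write("seat_12A", "sold")
db.write("seat_12B", "held")

db.data.clear()          # simulate losing the in-memory state in a crash
rebuilt = db.replay()
```

Because each record reaches the log before the data is touched, `rebuilt` contains both writes even though the in-memory copy was wiped.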