Case Study 4.3: The Axess Waitlist Crash (Recovery)


The Problem: The Midnight Batch

At 1:00 AM, Stanford's Axess system kicks off its nightly Waitlist Processing Batch. It's a routine operation with a glaring flaw: it runs as one long transaction, so a crash can catch it half finished.

The Setup:

Transaction 1 (T1 - The Batch): This automated script is a workhorse, updating thousands of rows by reallocating seats, managing waitlists, and recalculating tuition. Its Achilles' heel? It takes its time and doesn't commit instantly.
Transaction 2 (T2 - The Student): Meanwhile, a student, likely fueled by caffeine, decides to drop a class at 1:05:00 AM. This transaction commits immediately.
The Failure: A second later, the primary database server loses power. The in-memory buffer pool, which holds the modified data pages, hasn't been fully flushed to disk.

The State of the System

By 1:15 AM, the server is back up, but the disk data is a train wreck.

T1's changes are in limbo, partially written before the crash.
T2's changes vanished into thin air, despite the student's "Success" notification.

If the database simply resumed operations from this on-disk state, without any recovery step:

Durability Failure: T2's changes are nowhere to be found. The student is still enrolled.
Atomicity Failure: T1 is incomplete. Students face the risk of being charged tuition without enrollment. The data is a mess.

The Solution: The Write-Ahead Log (WAL)

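Stanford's database employs Write-Ahead Logging, a safeguard ensuring that no change reaches the data files on disk before the log record describing it has been written to disk, and that a transaction is reported as committed only after its COMMIT record is safely in the log.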
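The write-ahead rule can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Axess or any particular database is implemented: the class TinyWAL, its method names, and the key "enroll:student_42" are all hypothetical, and plain lists and dicts stand in for the log file, the buffer pool, and the data files.

```python
class TinyWAL:
    """Toy write-ahead log. All names here are hypothetical, not Axess internals."""

    def __init__(self):
        self.log_buffer = []       # log records still in RAM
        self.durable_log = []      # stands in for the log file on disk
        self.pages_in_memory = {}  # buffer pool (lost on a crash)
        self.pages_on_disk = {}    # data files (survive a crash)

    def flush_log(self):
        # A real system would write and fsync the log file here.
        self.durable_log.extend(self.log_buffer)
        self.log_buffer.clear()

    def begin(self, txn):
        self.log_buffer.append(("BEGIN", txn))

    def update(self, txn, key, new_value):
        before = self.pages_in_memory.get(key, self.pages_on_disk.get(key))
        # Log the before and after images, then change the page in memory.
        self.log_buffer.append(("UPDATE", txn, key, before, new_value))
        self.pages_in_memory[key] = new_value

    def commit(self, txn):
        self.log_buffer.append(("COMMIT", txn))
        self.flush_log()  # "Success" is reported only after this returns

    def flush_page(self, key):
        self.flush_log()  # WAL rule: log records reach disk before the page
        self.pages_on_disk[key] = self.pages_in_memory[key]

    def crash(self):
        # Power loss: everything held in RAM is gone.
        self.log_buffer.clear()
        self.pages_in_memory.clear()


# T2 (the student's drop) commits, then the server loses power.
db = TinyWAL()
db.pages_on_disk["enroll:student_42"] = "enrolled"
db.begin("T2")
db.update("T2", "enroll:student_42", "dropped")
db.commit("T2")
db.crash()
```

After the simulated crash, the data file still says "enrolled", but the durable log holds T2's UPDATE and COMMIT records, which is exactly what recovery needs.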

As Axess restarts, the Transaction Manager enters Recovery Mode, scanning the WAL from the last checkpoint onward.

The Recovery Process

The database reviews the logs and categorizes the transactions:

Transaction	Status in WAL	Required Action
`T2`	Found `BEGIN` ... `COMMIT`	Must be REDO'd.
`T1`	Found `BEGIN` ... (No Commit)	Must be UNDO'd.
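The categorization in the table can be sketched as a single forward scan of the log. The record layout (tuples starting with a record type and a transaction name) and the function name classify are assumptions made for illustration; real log formats are considerably richer.

```python
def classify(log):
    """Map each transaction seen in the log to the action recovery must take."""
    started = []       # transactions in the order their BEGIN appears
    committed = set()  # transactions whose COMMIT record reached the log
    for record in log:
        kind, txn = record[0], record[1]
        if kind == "BEGIN":
            started.append(txn)
        elif kind == "COMMIT":
            committed.add(txn)
    # BEGIN ... COMMIT means REDO; BEGIN with no COMMIT means UNDO.
    return {txn: ("REDO" if txn in committed else "UNDO") for txn in started}


# Hypothetical log contents at the moment of the crash.
log = [
    ("BEGIN", "T1"),
    ("UPDATE", "T1", "tuition:alice", 0, 1500),
    ("BEGIN", "T2"),
    ("UPDATE", "T2", "enroll:student_42", "enrolled", "dropped"),
    ("COMMIT", "T2"),
]
actions = classify(log)
```

Here actions maps T1 to UNDO and T2 to REDO, matching the two rows of the table.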

Phase 1: REDO (Ensuring Durability) The database moves forward through the log, reapplying T2's changes from the 'After Images' stored in the log records. The student's class drop is restored, updating the disk with the new data from the log.
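A minimal sketch of the REDO pass, following the simplified description above in which only committed transactions are replayed. (ARIES-style recovery instead replays every logged update and leaves the cleanup to UNDO.) The tuple layout (type, transaction, key, before image, after image), the function name redo, and the sample keys are hypothetical.

```python
def redo(log, disk, committed):
    """Scan the log forward and reapply the after image of committed updates."""
    for kind, txn, *rest in log:
        if kind == "UPDATE" and txn in committed:
            key, _before, after = rest
            disk[key] = after  # idempotent: reapplying gives the same result
    return disk


# Hypothetical log and on-disk state right after the restart.
log = [
    ("BEGIN", "T1"),
    ("UPDATE", "T1", "tuition:alice", 0, 1500),
    ("BEGIN", "T2"),
    ("UPDATE", "T2", "enroll:student_42", "enrolled", "dropped"),
    ("COMMIT", "T2"),
]
disk = {"tuition:alice": 1500, "enroll:student_42": "enrolled"}
redo(log, disk, committed={"T2"})
```

After the call, the student's row reads "dropped" while T1's partial tuition change is left untouched for the UNDO phase to handle.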

Phase 2: UNDO (Ensuring Atomicity) The database moves backward through the log, undoing every change T1 made. It uses 'Before Images' in the log to reset the rows to their state at 12:59:59 AM. The waitlist batch is erased as if it never happened.
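The UNDO pass can be sketched the same way. The backward scan matters when an uncommitted transaction touched the same row more than once: restoring before images in reverse order ends at the oldest value. As before, the tuple layout, the function name undo, and the sample data are assumptions for illustration only.

```python
def undo(log, disk, committed):
    """Scan the log backward and restore the before image of uncommitted updates."""
    for kind, txn, *rest in reversed(log):
        if kind == "UPDATE" and txn not in committed:
            key, before, _after = rest
            disk[key] = before
    return disk


# Hypothetical state: T1 changed the same row twice before the crash,
# and its latest value had already reached the disk.
log = [
    ("BEGIN", "T1"),
    ("UPDATE", "T1", "tuition:alice", 0, 1500),
    ("UPDATE", "T1", "tuition:alice", 1500, 3000),
    ("BEGIN", "T2"),
    ("UPDATE", "T2", "enroll:student_42", "enrolled", "dropped"),
    ("COMMIT", "T2"),
]
disk = {"tuition:alice": 3000, "enroll:student_42": "dropped"}
undo(log, disk, committed={"T2"})
```

The scan first resets the tuition row to 1500 and then to 0, its value before T1 began, while T2's committed drop is left alone.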

Summary: The Dual Mandate of Recovery

Crash recovery is about more than just salvaging data. It's about enforcing ACID guarantees when hardware fails.

REDO ensures that committed changes lost from RAM are reapplied from the log to disk, securing Durability.
UNDO removes the effects of uncommitted transactions from disk, upholding Atomicity.

Final Note: By rolling back T1, the database maintains system integrity. Once recovery is complete, the Registrar reruns the Waitlist Batch. No data corruption, no tuition errors.