Case Study 4.3: The Axess Waitlist Crash (Recovery)


The Problem: The Midnight Batch

At 1:00 AM, Stanford's Axess system kicks off its nightly Waitlist Processing Batch. It's a routine operation with a glaring flaw.

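The Setup:

  1. T1 (the Waitlist Processing Batch): still running, with no COMMIT logged, when the server crashes. Some of its changes have already reached disk.

  2. T2 (a student dropping a class): committed before the crash, but its changes never reached disk.
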

The State of the System

By 1:15 AM, the server is back up, but the disk data is a train wreck.

If the database simply resumed operations from this on-disk state, it would break two guarantees:

  1. Durability Failure: T2 committed, yet its changes are nowhere to be found on disk. The student who dropped the class is still enrolled.

  2. Atomicity Failure: T1 is incomplete, so only part of its work is on disk. Students face the risk of being charged tuition without enrollment.


The Solution: The Write-Ahead Log (WAL)

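Stanford's database employs Write-Ahead Logging, a safeguard ensuring that no change reaches the data files on disk until the log record describing it is on disk, and that no transaction is reported as committed until its COMMIT record is on disk.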
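
The ordering rule behind WAL can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Axess is actually built: the `WriteAheadLog` and `PageStore` classes, the JSON record layout, and the key `student_42:CS145` are all hypothetical names invented for this example.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Hypothetical append-only log; every record is forced to disk on append."""

    def __init__(self, path):
        self.file = open(path, "a", encoding="utf-8")

    def append(self, record):
        self.file.write(json.dumps(record) + "\n")
        self.file.flush()              # push Python's buffer to the OS
        os.fsync(self.file.fileno())   # ask the OS to push it to stable storage


class PageStore:
    """Hypothetical stand-in for the data files."""

    def __init__(self, wal):
        self.wal = wal
        self.disk = {}

    def write(self, txn, key, new_value):
        # WAL rule: log the change, with before and after images, first ...
        self.wal.append({"txn": txn, "op": "UPDATE", "key": key,
                         "before": self.disk.get(key), "after": new_value})
        # ... and only then touch the data itself.
        self.disk[key] = new_value


log_path = os.path.join(tempfile.mkdtemp(), "axess.wal")
wal = WriteAheadLog(log_path)
store = PageStore(wal)

wal.append({"txn": "T2", "op": "BEGIN"})
store.write("T2", "student_42:CS145", "DROPPED")
wal.append({"txn": "T2", "op": "COMMIT"})
wal.file.close()

# Read the log back, as a recovery pass would after a restart.
with open(log_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

Because the log record is forced to disk before the data is touched, a crash at any point leaves the log knowing at least as much as the data files do.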

As Axess restarts, the Transaction Manager enters Recovery Mode, scanning the WAL from the last checkpoint onward.
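
Finding the starting point of that scan might look like the sketch below. It assumes a simplified "quiescent" checkpoint, meaning no transaction was active when the checkpoint was written; real systems usually allow active transactions at a checkpoint and must then look further back. The record layout, the `CHECKPOINT` marker, and transaction `T0` are hypothetical.

```python
def records_since_last_checkpoint(log):
    """Return the log records written after the most recent checkpoint."""
    start = 0
    for i, rec in enumerate(log):
        if rec["op"] == "CHECKPOINT":
            start = i + 1  # recovery begins just after the checkpoint record
    return log[start:]


log = [
    {"txn": "T0", "op": "BEGIN"},
    {"txn": "T0", "op": "COMMIT"},
    {"op": "CHECKPOINT"},          # everything above is already safely on disk
    {"txn": "T1", "op": "BEGIN"},
    {"txn": "T2", "op": "BEGIN"},
    {"txn": "T2", "op": "COMMIT"},
]

tail = records_since_last_checkpoint(log)
```

Here `tail` holds only the three records for T1 and T2, so recovery never has to revisit T0.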

The Recovery Process

The database reviews the logs and categorizes the transactions:

Transaction | Status in WAL               | Required Action
------------|-----------------------------|-----------------
T2          | Found BEGIN ... COMMIT      | Must be REDO'd.
T1          | Found BEGIN ... (No Commit) | Must be UNDO'd.
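
The categorization above can be sketched as a single pass over the log. The record layout and the `classify` function are assumptions made for this illustration, not the database's actual format.

```python
def classify(log):
    """Split transactions into those to REDO (committed) and those to UNDO."""
    began, committed = [], set()
    for rec in log:
        if rec["op"] == "BEGIN":
            began.append(rec["txn"])
        elif rec["op"] == "COMMIT":
            committed.add(rec["txn"])
    redo = [t for t in began if t in committed]
    undo = [t for t in began if t not in committed]
    return redo, undo


log = [
    {"txn": "T1", "op": "BEGIN"},
    {"txn": "T1", "op": "UPDATE"},
    {"txn": "T2", "op": "BEGIN"},
    {"txn": "T2", "op": "UPDATE"},
    {"txn": "T2", "op": "COMMIT"},
    {"txn": "T1", "op": "UPDATE"},   # T1 never logs a COMMIT before the crash
]

redo, undo = classify(log)
```

With this log, `redo` contains only T2 and `undo` contains only T1, matching the table.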

Phase 1: REDO (Ensuring Durability)

The database moves forward through the log, reapplying T2's changes. The student's class drop is reinstated, updating the disk with the new data from the log.
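
A forward REDO pass might look like the following sketch. It follows the simplified scheme described here, reapplying only committed transactions; some real recovery algorithms instead replay every logged change and leave cleanup to the UNDO phase. The dictionary standing in for the disk, the record layout, and the keys are hypothetical.

```python
def redo_pass(log, disk, committed):
    """Walk the log oldest to newest, reapplying committed updates."""
    for rec in log:
        if rec["op"] == "UPDATE" and rec["txn"] in committed:
            disk[rec["key"]] = rec["after"]   # write the after image
    return disk


# On-disk state at restart: T2's class drop never made it to disk.
disk = {"student_42:CS145": "ENROLLED"}

log = [
    {"txn": "T2", "op": "BEGIN"},
    {"txn": "T2", "op": "UPDATE", "key": "student_42:CS145",
     "before": "ENROLLED", "after": "DROPPED"},
    {"txn": "T2", "op": "COMMIT"},
    {"txn": "T1", "op": "BEGIN"},
    {"txn": "T1", "op": "UPDATE", "key": "CS145:open_seats",
     "before": 2, "after": 1},
]

recovered = redo_pass(log, disk, committed={"T2"})
```

Writing an after image is idempotent, so running the pass a second time (for example, after a crash during recovery) leaves the same result.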

Phase 2: UNDO (Ensuring Atomicity)

The database moves backward through the log, undoing every change T1 made. It uses 'Before Images' in the log to reset the rows to their state at 12:59:59 AM. The waitlist batch is erased as if it never happened.
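
A backward UNDO pass can be sketched as below. The example shows why direction matters: T1 changes the same key twice, and only a newest-to-oldest walk ends on the original value. The record layout, the keys, and the values are hypothetical, and a before image of `None` is assumed to mean the row did not exist.

```python
def undo_pass(log, disk, losers):
    """Walk the log newest to oldest, restoring before images."""
    for rec in reversed(log):
        if rec["op"] == "UPDATE" and rec["txn"] in losers:
            if rec["before"] is None:
                disk.pop(rec["key"], None)        # row did not exist before T1
            else:
                disk[rec["key"]] = rec["before"]  # restore the before image
    return disk


# On-disk state at restart: T1's partial work has leaked to disk.
disk = {
    "CS145:open_seats": 0,
    "student_7:status": "ENROLLED",
    "student_7:tuition_charge": "PENDING",
}

log = [
    {"txn": "T1", "op": "BEGIN"},
    {"txn": "T1", "op": "UPDATE", "key": "CS145:open_seats",
     "before": 2, "after": 1},
    {"txn": "T1", "op": "UPDATE", "key": "student_7:status",
     "before": "WAITLISTED", "after": "ENROLLED"},
    {"txn": "T1", "op": "UPDATE", "key": "CS145:open_seats",
     "before": 1, "after": 0},
    {"txn": "T1", "op": "UPDATE", "key": "student_7:tuition_charge",
     "before": None, "after": "PENDING"},
]

restored = undo_pass(log, disk, losers={"T1"})
```

After the pass, the seat count is back to 2, the student is back on the waitlist, and the tuition charge row is gone.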


Summary: The Dual Mandate of Recovery

Crash recovery is about more than just salvaging data. It's about enforcing ACID guarantees when hardware fails.

Final Note: By rolling back T1, the database maintains system integrity. Once recovery is complete, the Registrar reruns the Waitlist Batch. No data corruption, no tuition errors.