Case Study 4.3: The Axess Waitlist Crash (Recovery)
The Problem: The Midnight Batch
At 1:00 AM, Stanford's Axess system kicks off its nightly Waitlist Processing Batch. It's a routine operation with a glaring flaw.
The Setup:
- Transaction 1 (T1 - The Batch): This automated script is a workhorse, updating thousands of rows by reallocating seats, managing waitlists, and recalculating tuition. Its Achilles' heel? It takes its time and doesn't commit instantly.
- Transaction 2 (T2 - The Student): Meanwhile, a student, likely fueled by caffeine, decides to drop a class at 1:05:00 AM. This transaction commits immediately.
- The Failure: A second later, the primary database server loses power. The memory buffer hasn't been flushed to disk.
The State of the System
By 1:15 AM, the server is back up, but the disk data is a train wreck.
- T1's changes are in limbo, partially written before the crash.
- T2's changes vanished into thin air, despite the student's "Success" notification.
If the database resumes operations:
- Durability Failure: T2's changes are nowhere to be found. The student is still enrolled.
- Atomicity Failure: T1 is incomplete. Students face the risk of being charged tuition without enrollment. The data is a mess.
The Solution: The Write-Ahead Log (WAL)
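Stanford's database employs Write-Ahead Logging, a safeguard ensuring that no change reaches the data files on disk until its log record is safely on disk, and that a transaction is reported as committed only after its COMMIT record has been flushed to the log.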
As Axess restarts, the Transaction Manager enters Recovery Mode, scanning the WAL from the last checkpoint onward.
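To make the logging rule concrete, here is a minimal sketch of T2's drop in a toy database. The `ToyDatabase` class, the key `seat:dropping_student`, and the tuple layout of the log records are hypothetical names chosen for illustration, not anything from Axess. The sketch also assumes every log append is flushed to disk immediately, whereas real systems typically buffer log records and force them out at commit.

```python
class ToyDatabase:
    """Toy model: a durable log and data files, plus a volatile buffer."""

    def __init__(self, disk):
        self.disk = dict(disk)  # data files: survive a crash
        self.wal = []           # log on disk: survives a crash (assumed flushed on every append)
        self.buffer = {}        # in-memory pages: lost on a crash

    def begin(self, txn):
        self.wal.append((txn, "BEGIN"))

    def update(self, txn, key, new_value):
        # Record the before and after images first, then change the page in memory.
        before = self.buffer.get(key, self.disk.get(key))
        self.wal.append((txn, "UPDATE", key, before, new_value))
        self.buffer[key] = new_value

    def commit(self, txn):
        # The COMMIT record is in the log before "Success" is reported.
        self.wal.append((txn, "COMMIT"))
        return "Success"

    def crash(self):
        # Power loss: everything held only in memory disappears.
        self.buffer.clear()


db = ToyDatabase({"seat:dropping_student": "enrolled"})
db.begin("T2")
db.update("T2", "seat:dropping_student", "dropped")
status = db.commit("T2")
db.crash()

print(status)      # the student saw "Success"
print(db.disk)     # the data files still say "enrolled"
print(db.wal[-1])  # but the log holds T2's COMMIT record
```

After the simulated crash the data files are stale, yet the log still contains everything needed to reconstruct T2's drop. That gap is exactly what recovery has to close.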
The Recovery Process
The database reviews the logs and categorizes the transactions:
| Transaction | Status in WAL | Required Action |
|---|---|---|
| T2 | Found BEGIN ... COMMIT | Must be REDO'd. |
| T1 | Found BEGIN ... (No Commit) | Must be UNDO'd. |
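The categorization in the table can be sketched as a single pass over the log. This is an illustration under assumptions: the log is modeled as a list of tuples `(txn, kind, key, before, after)`, and the function name `classify` and the row keys are hypothetical.

```python
def classify(log):
    """Map each transaction seen in the log to the action recovery must take."""
    began, committed = [], set()
    for txn, kind, *_ in log:
        if kind == "BEGIN":
            began.append(txn)
        elif kind == "COMMIT":
            committed.add(txn)
    # A COMMIT record means the work must survive; no COMMIT means roll it back.
    return {txn: ("REDO" if txn in committed else "UNDO") for txn in began}


# Hypothetical log contents matching the scenario: T1 never committed, T2 did.
log = [
    ("T1", "BEGIN"),
    ("T1", "UPDATE", "seat:waitlisted_student", "waitlisted", "enrolled"),
    ("T2", "BEGIN"),
    ("T2", "UPDATE", "seat:dropping_student", "enrolled", "dropped"),
    ("T2", "COMMIT"),
]

actions = classify(log)
print(actions)
```

The only question asked of each transaction is whether a COMMIT record made it into the log. What the data files currently contain plays no part in the decision.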
Phase 1: REDO (Ensuring Durability)
The database moves forward through the log, reapplying T2's changes. The student's class drop is restored by writing the new values recorded in the log (the 'After Images') to disk.
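A minimal sketch of this forward pass, following the simplified model in this case study (only committed transactions are reapplied). The tuple layout `(txn, kind, key, before, after)`, the function name `redo`, and the row keys are assumptions made for illustration.

```python
def redo(log, disk):
    """Forward pass: reapply the logged new values of committed transactions."""
    committed = {txn for txn, kind, *_ in log if kind == "COMMIT"}
    for txn, kind, *rest in log:  # oldest record first
        if kind == "UPDATE" and txn in committed:
            key, _before, after = rest
            disk[key] = after
    return disk


log = [
    ("T1", "BEGIN"),
    ("T1", "UPDATE", "seat:waitlisted_student", "waitlisted", "enrolled"),
    ("T2", "BEGIN"),
    ("T2", "UPDATE", "seat:dropping_student", "enrolled", "dropped"),
    ("T2", "COMMIT"),
]

# Hypothetical disk state after the crash: T2's drop is missing,
# while one of T1's uncommitted writes did reach the data files.
disk = {
    "seat:dropping_student": "enrolled",
    "seat:waitlisted_student": "enrolled",
}

redo(log, disk)
print(disk["seat:dropping_student"])
```

Because each step writes a fixed value from the log, running `redo` a second time leaves the data unchanged. That property matters if the server crashes again in the middle of recovery.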
Phase 2: UNDO (Ensuring Atomicity)
The database moves backward through the log, undoing every change T1 made. It uses 'Before Images' in the log to reset the rows to their state at 12:59:59 AM. The waitlist batch is erased as if it never happened.
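The backward pass can be sketched the same way. As before, the tuple layout `(txn, kind, key, before, after)`, the function name `undo`, the row keys, and the intermediate value `"offered"` are hypothetical. T1 updates the same row twice here to show why the scan runs newest-first.

```python
def undo(log, disk):
    """Backward pass: restore the before images of uncommitted transactions."""
    committed = {txn for txn, kind, *_ in log if kind == "COMMIT"}
    for txn, kind, *rest in reversed(log):  # newest record first
        if kind == "UPDATE" and txn not in committed:
            key, before, _after = rest
            disk[key] = before
    return disk


log = [
    ("T1", "BEGIN"),
    ("T1", "UPDATE", "seat:waitlisted_student", "waitlisted", "offered"),
    ("T1", "UPDATE", "seat:waitlisted_student", "offered", "enrolled"),
    ("T2", "BEGIN"),
    ("T2", "UPDATE", "seat:dropping_student", "enrolled", "dropped"),
    ("T2", "COMMIT"),
]

# Hypothetical disk state once REDO has finished: T2's drop is in place,
# and T1's partial work is still sitting in the data files.
disk = {
    "seat:waitlisted_student": "enrolled",
    "seat:dropping_student": "dropped",
}

undo(log, disk)
print(disk)
```

Scanning in reverse means the last before image applied to a row is the oldest one, which is the value the row held before T1 started. A forward scan would leave the row at `"offered"`, a state that only ever existed inside the failed batch.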
Summary: The Dual Mandate of Recovery
Crash recovery is about more than just salvaging data. It's about enforcing ACID guarantees when hardware fails.
- REDO ensures that committed changes recorded in the log reach the data files on disk, even though the copies in RAM were lost, securing Durability.
- UNDO shields the disk from incomplete operations, upholding Atomicity.
Final Note: By rolling back T1, the database maintains system integrity. Once recovery is complete, the Registrar reruns the Waitlist Batch. No data corruption, no tuition errors.