WAL & ARIES Recovery
Learn the mechanisms of Write-Ahead Logging (WAL) and the ARIES algorithm used to restore database durability and transactional integrity after system crashes.
The Concept
Modern databases prioritize durability: once a transaction commits, its changes must survive power loss, operating system crashes, or hardware failure. However, writing every modified page to its home location on disk at commit time is too slow. Those writes are scattered, random I/O, and forcing them on every commit would bottleneck database throughput.
To achieve fast transaction processing alongside data safety, databases use a hybrid model:
- They keep active data pages in volatile RAM inside a buffer pool.
- They append transaction records sequentially to an append-only file on disk called the Write-Ahead Log (WAL).
Because sequential appends are much cheaper than random writes to data pages (dramatically so on spinning disks), the database can safely return a success response to the client once the transaction's log records, including its commit record, are flushed to disk. The modified pages in RAM (called dirty pages) are written back to disk asynchronously in the background.
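The commit path described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine does it: the `WriteAheadLog` class, the JSON-lines record format, and the dict standing in for the buffer pool are all assumptions made for the example.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Toy append-only log. The JSON-lines format is an assumption."""

    def __init__(self, path):
        # O_APPEND makes every write land at the end of the file.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.next_lsn = 1

    def append(self, record):
        """Write a record (it may still sit in the OS cache) and return its LSN."""
        lsn = self.next_lsn
        self.next_lsn += 1
        line = json.dumps({"lsn": lsn, **record}) + "\n"
        os.write(self.fd, line.encode("utf-8"))
        return lsn

    def flush(self):
        """Force everything appended so far to stable storage."""
        os.fsync(self.fd)

    def close(self):
        os.close(self.fd)


def commit(wal, buffer_pool, txn_id, updates):
    """Apply updates in RAM, log them, and flush only the log."""
    for page_id, value in updates:
        wal.append({"txn": txn_id, "type": "UPDATE", "page": page_id, "value": value})
        buffer_pool[page_id] = value  # the dirty page stays in memory
    commit_lsn = wal.append({"txn": txn_id, "type": "COMMIT"})
    wal.flush()  # durability point: one sequential flush, no data page writes
    return commit_lsn  # only now is it safe to acknowledge the client


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(log_path)
pool = {}
commit_lsn = commit(wal, pool, txn_id=1, updates=[("P1", "a"), ("P2", "b")])
wal.close()
```

Note that `commit` never writes a data page. The only synchronous I/O is the single `fsync` on the log file.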
If the system crashes before dirty pages are flushed to disk, the database uses the WAL to rebuild the in-memory state during startup. The industry standard protocol for coordinating this process is the ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) recovery method.
The WAL Protocol: Steal and No-Force
To balance performance and crash recovery, databases choose specific buffer pool management policies:
- Steal vs. No-Steal: A steal policy allows the buffer pool manager to evict a dirty page containing uncommitted transaction modifications to make room for other queries. A no-steal policy forbids this. A steal policy yields better memory utilization but requires a mechanism to undo uncommitted changes if the system crashes.
- Force vs. No-Force: A no-force policy allows a transaction to commit without forcing its modified pages to be flushed to disk immediately. A force policy requires flushing all modified pages before returning success. A no-force policy increases write performance but requires a redo mechanism to recover committed updates that were only stored in volatile RAM.
Most production databases choose a Steal/No-Force engine layout because it maximizes performance and memory utilization. The recovery engine relies on the WAL to handle the complexity:
- Redo: Replays committed updates that were lost because of the No-Force policy.
- Undo: Reverses uncommitted changes that were written to disk because of the Steal policy.
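The relationship between the two policies and the recovery work they imply can be written down as a small function. The name `recovery_requirements` is hypothetical and exists only for this illustration.

```python
def recovery_requirements(steal: bool, force: bool) -> set:
    """Return which recovery mechanisms a buffer policy combination needs."""
    needs = set()
    if steal:
        # Uncommitted changes may already be on disk, so they must be reversible.
        needs.add("undo")
    if not force:
        # Committed changes may exist only in RAM, so they must be replayable.
        needs.add("redo")
    return needs


steal_no_force = recovery_requirements(steal=True, force=False)
no_steal_force = recovery_requirements(steal=False, force=True)
```

Steal/No-Force needs both mechanisms, while No-Steal/Force needs neither. That is the trade: more recovery machinery in exchange for cheaper commits and freer eviction.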
The Rule of Write-Ahead Logging
For a Steal/No-Force system to remain safe, it must enforce the fundamental WAL invariant:
The database must flush the log record representing a page update to disk before the actual modified data page is written to disk.
If this invariant is violated, and the database writes the modified page to disk but crashes before writing the corresponding log record, the database cannot undo the uncommitted change during recovery, violating atomicity.
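One common way to enforce this invariant is to stamp each page with the LSN of the last log record that modified it, and to check that stamp before the page leaves the buffer pool. The sketch below assumes hypothetical `Log` and `Page` classes and a dict standing in for the disk. A real engine would issue an actual `fsync` where this code only updates a counter.

```python
class Log:
    def __init__(self):
        self.records = []      # log tail held in memory
        self.flushed_lsn = 0   # highest LSN known to be durable

    def append(self, payload):
        self.records.append(payload)
        return len(self.records)  # LSN is the 1-based position in the log

    def flush_up_to(self, lsn):
        # Stand-in for writing and fsyncing the log tail up to `lsn`.
        self.flushed_lsn = max(self.flushed_lsn, lsn)


class Page:
    def __init__(self, page_id):
        self.page_id = page_id
        self.data = None
        self.page_lsn = 0  # LSN of the latest update applied to this page


def write_page_to_disk(page, log, disk):
    """Flush a dirty page, forcing the log first if it lags behind the page."""
    if page.page_lsn > log.flushed_lsn:
        log.flush_up_to(page.page_lsn)  # WAL rule: log record before data page
    disk[page.page_id] = (page.data, page.page_lsn)


log, disk = Log(), {}
page = Page("P1")
page.data = "new value"
page.page_lsn = log.append({"txn": 1, "page": "P1", "before": None, "after": "new value"})
flushed_before = log.flushed_lsn
write_page_to_disk(page, log, disk)
```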
ARIES Crash Recovery
When a database using ARIES restarts after a crash, it reads the log file to reconstruct the state of the database at the moment of the crash. The ARIES protocol does this in three successive phases.
1. The Analysis Phase
The analysis phase reconstructs the recovery bookkeeping as of the crash: which transactions were in flight and which pages might have been dirty. It scans the log forward starting from the last checkpoint record. During this scan, it builds two key tables:
- Transaction Table: Lists all transactions that were active when the system crashed.
- Dirty Page Table (DPT): Lists all pages in RAM that were modified but not flushed to disk at the time of the crash. For each page, it tracks the RecLSN (Recovery LSN): the log sequence number of the oldest update to that page that may not have reached disk.
At the end of this phase, transactions still marked as active are classified as loser transactions because they did not write a commit record before the crash. These transactions will be rolled back in the Undo phase.
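The forward scan can be sketched as follows. The log is modeled as a list of dicts, and the record types (`CHECKPOINT`, `UPDATE`, `CLR`, `COMMIT`, `END`) and field names are assumptions chosen for this example rather than a real on-disk format.

```python
def analysis(log, checkpoint_index=0):
    """Rebuild the Transaction Table and Dirty Page Table from the log."""
    txn_table = {}  # txn_id -> {"status": ..., "last_lsn": ...}
    dpt = {}        # page_id -> RecLSN
    for rec in log[checkpoint_index:]:
        kind = rec["type"]
        if kind == "CHECKPOINT":
            # Start from the tables that were snapshotted in the checkpoint.
            txn_table = {t: dict(e) for t, e in rec["txn_table"].items()}
            dpt = dict(rec["dpt"])
            continue
        txn = rec["txn"]
        if kind == "END":
            txn_table.pop(txn, None)  # transaction is completely finished
            continue
        entry = txn_table.setdefault(txn, {"status": "active", "last_lsn": 0})
        entry["last_lsn"] = rec["lsn"]
        if kind == "COMMIT":
            entry["status"] = "committed"
        elif kind in ("UPDATE", "CLR"):
            # Only the first record to dirty a page sets its RecLSN.
            dpt.setdefault(rec["page"], rec["lsn"])
    losers = sorted(t for t, e in txn_table.items() if e["status"] == "active")
    return txn_table, dpt, losers


log = [
    {"lsn": 1, "type": "CHECKPOINT", "txn_table": {}, "dpt": {}},
    {"lsn": 2, "type": "UPDATE", "txn": "T1", "page": "P1"},
    {"lsn": 3, "type": "UPDATE", "txn": "T2", "page": "P2"},
    {"lsn": 4, "type": "UPDATE", "txn": "T1", "page": "P2"},
    {"lsn": 5, "type": "COMMIT", "txn": "T1"},
    {"lsn": 6, "type": "END", "txn": "T1"},
    {"lsn": 7, "type": "UPDATE", "txn": "T3", "page": "P3"},
]
txn_table, dpt, losers = analysis(log)
```

In this run T1 committed and ended, so it drops out of the table. T2 and T3 never committed and come back as losers. Page P2 keeps RecLSN 3 even though LSN 4 touched it later, because the DPT records the oldest possibly unflushed update.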
2. The Redo Phase (Repeating History)
The redo phase restores the database state to the exact moment of the crash. Starting at the oldest unwritten modification identified in the Analysis phase (the minimum RecLSN in the DPT), ARIES scans the log forward and replays all updates.
This is called repeating history because the engine replays updates for all transactions, including those that were eventually aborted or uncommitted at the crash point. This ensures that any page splits, allocation steps, or changes are restored to their physical state before rollback operations begin, simplifying recovery logic.
To avoid redundant work, ARIES skips redoing a log record if any of the following holds:
- The page is not listed in the Dirty Page Table.
- The page is in the DPT, but the log record LSN is less than the page's RecLSN.
- The actual page LSN retrieved from disk matches or exceeds the log record LSN, indicating the change was already written.
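The three skip conditions above translate directly into a predicate. The sketch below is illustrative only: `should_redo` and `redo` are hypothetical names, log records are plain dicts, and the on-disk page LSNs are passed in as a dict instead of being read from real pages.

```python
def should_redo(record, dpt, read_page_lsn):
    """Return True only if none of the three skip conditions applies."""
    page = record["page"]
    if page not in dpt:
        return False  # page was not dirty at the crash
    if record["lsn"] < dpt[page]:
        return False  # update is older than the page's RecLSN
    if read_page_lsn(page) >= record["lsn"]:
        return False  # the on-disk page already contains this update
    return True


def redo(log, dpt, disk_page_lsn):
    """Scan forward from the minimum RecLSN and return the LSNs reapplied."""
    if not dpt:
        return []
    start = min(dpt.values())
    page_lsn = dict(disk_page_lsn)  # page LSNs as recovery brings pages forward
    applied = []
    for rec in log:
        if rec["lsn"] < start or rec["type"] not in ("UPDATE", "CLR"):
            continue
        if should_redo(rec, dpt, lambda p: page_lsn.get(p, 0)):
            page_lsn[rec["page"]] = rec["lsn"]  # reapply and stamp the page
            applied.append(rec["lsn"])
    return applied


log = [
    {"lsn": 2, "type": "UPDATE", "page": "P1"},
    {"lsn": 3, "type": "UPDATE", "page": "P2"},
    {"lsn": 4, "type": "UPDATE", "page": "P2"},
    {"lsn": 7, "type": "UPDATE", "page": "P3"},
    {"lsn": 8, "type": "UPDATE", "page": "P4"},
]
dpt = {"P1": 2, "P2": 4, "P3": 7}
disk_page_lsn = {"P1": 2, "P2": 3, "P3": 0}
applied = redo(log, dpt, disk_page_lsn)
```

Here LSN 2 is skipped because the page on disk already carries it, LSN 3 because it precedes P2's RecLSN, and LSN 8 because P4 is not in the DPT. The first two checks need no disk read, which is why they come first.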
3. The Undo Phase (Rolling Back Losers)
Once the database state is restored to the crash point, ARIES reverses the changes made by loser transactions. The engine scans the log backward from the crash point, processing records for all uncommitted transactions.
As ARIES rolls back changes, it performs the inverse of each update operation and writes a Compensation Log Record (CLR) to the log. A CLR contains:
- The details of the undo action that was applied, so that it can be replayed by a later Redo phase.
- An UndoNextLSN field pointing to the next LSN that needs to be undone for the transaction. This pointer is copied from the undone record's PrevLSN.
The UndoNextLSN field prevents duplicate rollback work. If the database crashes again during the recovery process, the next recovery run will read the CLRs in the Redo phase, and the Undo phase will use the UndoNextLSN pointers to resume rollback from where it left off, avoiding repeating previous undo operations.
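The resume behavior can be seen in a small simulation. This sketch assumes a list-of-dicts log where each record's LSN equals its 1-based position, and it uses hypothetical field names (`prev_lsn`, `undo_next_lsn`). It tracks only the log bookkeeping and does not modify any pages.

```python
def undo(log, losers_last_lsn):
    """Roll back loser transactions, appending CLRs and END records to `log`."""
    by_lsn = {rec["lsn"]: rec for rec in log}
    last_lsn = dict(losers_last_lsn)  # newest record per transaction
    to_undo = dict(losers_last_lsn)   # next record to examine per transaction
    undone = []

    def append(rec):
        rec["lsn"] = len(log) + 1
        rec["prev_lsn"] = last_lsn[rec["txn"]]
        last_lsn[rec["txn"]] = rec["lsn"]
        log.append(rec)
        by_lsn[rec["lsn"]] = rec

    while to_undo:
        # Always handle the largest pending LSN: one backward pass over the log.
        txn, lsn = max(to_undo.items(), key=lambda item: item[1])
        rec = by_lsn[lsn]
        if rec["type"] == "CLR":
            next_lsn = rec["undo_next_lsn"]  # jump over work already undone
        elif rec["type"] == "UPDATE":
            append({"type": "CLR", "txn": txn, "page": rec["page"],
                    "undo_next_lsn": rec["prev_lsn"]})
            undone.append(lsn)
            next_lsn = rec["prev_lsn"]
        else:
            next_lsn = rec["prev_lsn"]
        if next_lsn is None:
            append({"type": "END", "txn": txn})  # rollback is complete
            del to_undo[txn]
        else:
            to_undo[txn] = next_lsn
    return undone


# T1 made three updates. An earlier recovery undid LSN 3, logged the CLR at
# LSN 4, and then crashed again before finishing.
log = [
    {"lsn": 1, "type": "UPDATE", "txn": "T1", "page": "P1", "prev_lsn": None},
    {"lsn": 2, "type": "UPDATE", "txn": "T1", "page": "P2", "prev_lsn": 1},
    {"lsn": 3, "type": "UPDATE", "txn": "T1", "page": "P3", "prev_lsn": 2},
    {"lsn": 4, "type": "CLR", "txn": "T1", "page": "P3", "prev_lsn": 3,
     "undo_next_lsn": 2},
]
undone = undo(log, {"T1": 4})
```

The second recovery run undoes only LSN 2 and LSN 1. The CLR at LSN 4 points straight to LSN 2, so the update at LSN 3 is never undone a second time.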
Recovery Optimizations
Fuzzy Checkpointing
Parsing a log file from start to finish is slow on large databases. To limit recovery time, databases write periodic checkpoints. A naive checkpoint pauses all transactions, flushes all dirty pages to disk, and writes a checkpoint record. This degrades transaction throughput.
To avoid write spikes, ARIES uses fuzzy checkpointing:
- A checkpoint record is written containing the current Transaction Table and Dirty Page Table, but the database does not force dirty pages to disk.
- The database continues processing active transactions.
- The background writer thread continues flushing dirty pages to disk over time.
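During crash recovery, the Analysis phase starts at the most recent checkpoint record, and the Redo phase only needs to go back to the oldest RecLSN listed in the Dirty Page Table. Any updates before this point are guaranteed to be flushed to disk, limiting the log segment that must be replayed. The Undo phase can still reach further back if a loser transaction started before that point.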
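A fuzzy checkpoint is essentially a snapshot of the two tables appended to the log, and the redo starting point falls out of the snapshotted Dirty Page Table. The sketch below uses hypothetical helper names and a list-of-dicts log, and it collapses the checkpoint into a single record for brevity.

```python
import copy


def take_fuzzy_checkpoint(log, txn_table, dpt):
    """Append a checkpoint record without flushing any dirty pages."""
    record = {
        "lsn": len(log) + 1,
        "type": "CHECKPOINT",
        "txn_table": copy.deepcopy(txn_table),  # snapshot, not a live reference
        "dpt": dict(dpt),
    }
    log.append(record)
    return record["lsn"]


def redo_start_lsn(checkpoint):
    """Earliest LSN that Redo may need, judged from the checkpoint alone."""
    dpt = checkpoint["dpt"]
    # With no dirty pages at checkpoint time, nothing older than the
    # checkpoint itself can be missing from disk.
    return min(dpt.values()) if dpt else checkpoint["lsn"]


log = [{"lsn": i, "type": "UPDATE"} for i in range(1, 11)]
txn_table = {"T7": {"status": "active", "last_lsn": 9}}
dpt = {"P1": 4, "P7": 9}
ckpt_lsn = take_fuzzy_checkpoint(log, txn_table, dpt)
start = redo_start_lsn(log[-1])
```

In this example Redo would begin at LSN 4, so LSNs 1 through 3 never need to be read. As the background writer flushes P1, later checkpoints record a higher minimum RecLSN and the replay window shrinks.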
Further Reading
- The ARIES Recovery Method Paper — C. Mohan's seminal research paper introducing the ARIES recovery algorithm.
- Database System Concepts: Crash Recovery — Textbook chapter detailing WAL protocols, checkpoints, and ARIES recovery steps.
- PostgreSQL WAL Internals — The official PostgreSQL guide explaining the design of write-ahead logging.
Core Literature References
ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging
by C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz — ACM Transactions on Database Systems (TODS), pp. 94-162
Database Management Systems
by Raghu Ramakrishnan and Johannes Gehrke — Chapter 18: Crash Recovery, pp. 579-610