Distributed Transactions & Sagas

When transactions span multiple database boundaries, the Saga pattern coordinates a sequence of local transactions and compensating rollbacks to guarantee eventual consistency.

AdvancedReliability & ScaleChapter: Reliability & Scalability15 min read

The Challenge of Distributed Transactions

In a monolithic application, maintaining transaction safety is straightforward, the system interacts with a single database, and you can wrap operations in a standard SQL transaction block (BEGIN / COMMIT). If anything fails, the database automatically rolls back all changes, upholding ACID guarantees.

In a microservices architecture, this simplicity disappears. Business transactions are distributed across multiple independent services, each with its own local database. For example, checking out a shopping cart might require:

Order Service to create an order record.
Payment Service to charge the user's credit card.
Inventory Service to reserve the physical items.

Because these services run on separate nodes and databases, standard local SQL transactions cannot coordinate them. If the payment succeeds but the inventory check fails, the system is left in an inconsistent state: a customer has been charged for items that are out of stock.

The Two-Phase Commit (2PC) Protocol

Historically, distributed systems used Two-Phase Commit (2PC) to enforce ACID properties across databases. As the name implies, 2PC operates in two phases, managed by a centralized coordinator:

Phase 1: Prepare Phase

The coordinator asks all participating nodes if they are ready to commit their work. Each participant performs safety checks, acquires necessary local database locks on the affected rows, and replies with a vote (Yes or No).

Phase 2: Commit Phase

If all participants voted Yes, the coordinator broadcasts a commit instruction, and the participants make their changes permanent. If any participant voted No, or failed to respond within a timeout, the coordinator broadcasts a rollback instruction, and all participants abort their local transactions.

If the coordinator crashes midway through Phase 2, participants are left in a state of limbo. They hold database locks and cannot commit or abort because they do not know what the coordinator decided. They must wait for the coordinator to recover.

Limitations of Two-Phase Commit

While 2PC guarantees strict data consistency, it is rarely used in modern, high-throughput microservices due to major trade-offs:

Blocking Nature: Nodes must hold database locks on records from the start of the Prepare phase until they receive the Commit/Abort instruction. This reduces query throughput and increases lock contention.
Latency Overhead: 2PC requires multiple synchronous network round-trips between the coordinator and all participants, amplifying latency.
Single Point of Failure: If the coordinator crashes while participants are in the prepared state, those database resources remain locked indefinitely, impacting system availability.

For these reasons, modern distributed systems trade strict ACID properties for eventual consistency, moving from 2PC to the Saga pattern.

Diagram: Two-Phase Commit vs Saga Rollback

The following diagram contrasts the blocking, lock-holding nature of 2PC with the non-blocking, compensating rollback flow of a Saga:

The Saga Pattern: Eventual Consistency

A Saga is a design pattern that models a distributed transaction as a series of independent local transactions. Each service executes its own local database transaction and publishes an event or message. Subsequent services listen to these messages and trigger their own local transactions.

Because each service commits its local transaction immediately, no global database locks are held. This increases performance but means the system only guarantees eventual consistency (BASE model, standing for Basically Available, Soft state, Eventual consistency) rather than strict immediate consistency.

If all local transactions succeed, the saga completes. If any local transaction fails, the saga must execute a series of compensating transactions to revert the changes made by preceding steps, rolling back state in reverse order (LIFO).

Saga Orchestration vs Saga Choreography

There are two primary ways to coordinate a saga:

Orchestration

An Orchestration Saga relies on a centralized coordinator service (the Orchestrator). The orchestrator acts as a brain, explicitly telling each service which local transaction to execute next, parsing responses, and coordinating compensating rollbacks if a step fails.

Pros: Simple to reason about, centralizes control flow, avoids cyclic service dependencies.
Cons: Introduces a single service coordinate point, can concentrate business logic.

Choreography

A Choreography Saga uses event-driven coordination. Each service executes its local transaction, and publishes an event (e.g. OrderCreated). Other services subscribe to these events and execute their logic in response (e.g. the Payment Service charges the card when it detects OrderCreated, then emits PaymentSucceeded).

Pros: Highly decoupled, no single coordinator bottleneck.
Cons: Difficult to trace the execution path, potential for cyclic dependencies, complex to reason about under error conditions.

Compensating Transactions: Reverting State

Unlike 2PC, which aborts transactions in memory, a Saga commits data to the database at every step. Therefore, a rollback cannot rely on a database-level undo. Instead, developers must explicitly design a compensating transaction for every forward step:

If the forward step is ChargeCard($100), the compensating transaction is RefundCard($100).
If the forward step is ReserveInventory(ItemA), the compensating transaction is ReleaseInventory(ItemA).

The Lack of Isolation

Because intermediate steps commit their data immediately, other concurrent requests can read the intermediate state. This violates the Isolation property of ACID, leading to potential anomalies:

Dirty Reads: Another transaction might see and act on the order created by step 1, even if the saga eventually fails and rolls back.
Lost Updates: An intermediate step changes a database record, which is overwritten by another concurrent task before the compensation runs.

To mitigate these, developers apply application-level defense patterns, such as using an Outbox Pattern to safely send events, writing idempotent event handlers, and utilizing reconciliation loops to locate and resolve mismatched states asynchronously.

Prerequisites

ACID & Isolation Levels Message Queues

Code Examples

Core Literature References

Sagas

by Hector Garcia-Molina and Kenneth Salem — Section 2: Saga Definition, pp. 9-15

View source

Continue learning

ACID & Isolation Levels

Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.

API Gateways

Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.

API Security & OAuth 2.0

Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.