Distributed Consensus & Raft

The Raft consensus algorithm enables a cluster of machines to operate as a single, highly available state machine by electing a leader and replicating a consistent log of events.

AdvancedReliability & ScaleChapter: Reliability & Scalability20 min read

The Consensus Problem in Distributed Systems

In a single-node system, preserving data consistency is straightforward, the server decides what happens and when. In a distributed system consisting of multiple independent nodes, this becomes a fundamental challenge. Networks are unreliable, packets are delayed or dropped, nodes crash without warning, and temporary network partitions isolate subsets of nodes.

The consensus problem is the task of getting a cluster of independent nodes to agree on a sequence of state changes. If the nodes cannot agree, different replicas will drift, resulting in data corruption and split-brain scenarios where separate parts of the cluster accept conflicting writes.

To resolve this, systems rely on State Machine Replication (SMR). The core principle is that if all nodes start in the identical initial state, and apply the exact same sequence of input commands in the same order, they will arrive at the same final state. Consensus algorithms like Raft and Paxos exist to ensure that all replicas agree on the exact contents and ordering of this replicated log.

Cluster Quorums and Fault Tolerance

Distributed consensus algorithms do not require all nodes in a cluster to be online to progress. Instead, they rely on a quorum, which is a strict majority of the nodes.

If a cluster has N members, a quorum is defined as:

Quorum = (N / 2) + 1

To survive F node failures, the cluster must contain at least 2F + 1 members.

A cluster of 3 nodes can tolerate 1 failure, since a quorum of 2 nodes remains active.
A cluster of 5 nodes can tolerate 2 failures, since a quorum of 3 nodes remains active.

Using a majority quorum ensures that even during a network partition, only one partition can contain a majority of nodes. The partition containing the minority will fail to form a quorum and refuse to accept updates, preventing conflicting states from being written to different halves of the network.

Raft Node States and Term Epochs

Raft divides time into terms of arbitrary length, which act as logical clocks. Terms are numbered with consecutive integers. Each term begins with a leader election. If a leader wins, it coordinates the cluster for the remainder of that term. If an election results in a split vote, the term ends, and a new term begins immediately with another election.

At any given moment, a Raft node exists in one of three states:

Follower: Passive nodes that only respond to incoming requests from other nodes. If followers receive no communication within an election timeout, they transition to Candidates.
Candidate: Active nodes trying to win an election. They increment the term counter, vote for themselves, and broadcast vote requests to all other nodes.
Leader: The single active coordinator of the cluster. The leader handles all client writes, replicates log entries, and sends periodic heartbeats to assert authority.

Diagram: Raft State Transitions

The following state machine details how nodes transition between Follower, Candidate, and Leader roles:

The Raft Protocol Phases

Raft organizes consensus into two distinct, sequential phases:

Phase 1: Leader Election

Nodes begin as Followers. Each node runs a randomized election timeout (typically between 150ms and 300ms). If a follower hears nothing from a leader before its timer expires:

It transition to Candidate.
It increments its term counter and votes for itself.
It sends a RequestVote RPC to its peers.
If it receives votes from a majority of nodes in the cluster, it becomes the Leader.

The timeout randomization is crucial. It ensures that nodes rarely time out at the exact same instant, avoiding split votes where no candidate receives a majority.

Phase 2: Log Replication

Once a leader is elected, it begins serving clients:

A client sends a write command to the leader.
The leader appends the command to its local log as an uncommitted entry.
The leader sends an AppendEntries RPC to all followers.
Followers write the entry to their local logs and acknowledge receipt.
Once a majority of followers acknowledge, the leader commits the entry, applies it to its local state machine, and returns success to the client.
The leader notifies followers of the commit status in subsequent heartbeats, prompting them to apply the command to their state machines.

Core Safety Invariants

To guarantee absolute correctness, Raft enforces five core safety properties:

Election Safety: At most one leader can be elected per term.
Leader Append-Only: A leader never overwrites or truncates its own log entries, it only appends new ones.
Log Matching: If two logs contain an entry with the identical index and term, they are guaranteed to be identical up to that index.
Leader Completeness: If a log entry is committed in a given term, that entry will be present in the logs of the leaders for all higher-numbered terms. Candidates whose logs are less up-to-date than a follower's log will be denied that follower's vote.
State Machine Safety: If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

Cluster Membership Changes and Log Compaction

Real-world clusters must occasionally add or remove nodes and prevent logs from growing indefinitely:

Joint Consensus: During configuration changes (adding or removing nodes), Raft uses a two-phase joint consensus configuration mechanism to prevent split-brain states where two independent majorities could be formed under the old and new configurations.
Log Compaction: Raft prevents log bloat through snapshotting. Each node periodically saves its current state to disk and discards the log entries leading up to that state. If a follower lags too far behind, the leader sends the raw snapshot via an InstallSnapshot RPC instead of incremental log entries.

Prerequisites

Vector Clocks

Code Examples

Core Literature References

In Search of an Understandable Consensus Algorithm

by Diego Ongaro and John Ousterhout — Section 5: The Raft Consensus Algorithm, pp. 305-320

View source

Continue learning

ACID & Isolation Levels

Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.

API Gateways

Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.

API Security & OAuth 2.0

Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.