Distributed Locking

Understand how distributed locks prevent race conditions across multi-node services, and how fencing tokens prevent out-of-order writes during garbage collection pauses.

AdvancedReliabilityChapter: Reliability & Scalability15 min read

The Purpose of Distributed Locks

In a single-process application, multiple threads coordinate access to shared memory using mutual exclusion structures like mutexes or semaphores managed directly by the operating system kernel. However, modern backends run across multiple independent server instances. When these instances need to coordinate access to a shared resource that does not natively support transactions, standard in-memory locks are useless.

A distributed lock is a lease mechanism shared across multiple computer nodes. Its primary purpose is to ensure that only one node at a time can perform a specific operational action. Common use cases include:

  • Preventing multiple instances of a cron service from running the same end-of-month billing job.
  • Ensuring that only one worker node compiles a resource-heavy cache file.
  • Safeguarding non-relational storage blocks from concurrent, overlapping modifications.

If multiple nodes execute these critical actions simultaneously, it leads to split-brain scenarios, data duplication, or resource corruption.


Single-Node Lock Structures

The simplest way to implement a distributed lock is to store a key-value record in a fast, centralized database like Redis.

A client attempts to acquire the lock by writing a unique key with a short time-to-live (TTL) expiration using a conditional write:

bash
# Set key if it does not exist, with an expiration time of 10000 milliseconds
SET resource_lock client_identifier NX PX 10000

If the command succeeds, the client holds the lock. Once finished, the client deletes the key. The TTL ensures that if the client crashes or becomes disconnected, the lock will eventually expire, preventing a permanent deadlock.

Failover and Crash Vulnerabilities

While fast, this single-node model is fragile. In production, Redis is typically configured with a primary node and one or more replica nodes. If the primary node crashes after confirming the lock write but before replicating the key to the standby replica, a failover will promote the replica to primary.

Since the replica does not contain the key, another client can immediately acquire the lock, violating mutual exclusion.

text
[Client 1] -----> (Primary: Lock Set) -- [Crashes before Sync] --> (Replica)
[Client 2] ------------------------------------------------------> (Replica Promoted to Primary: Lock Set!)

The Redlock Algorithm

To solve the single-node failover problem, Redis developers proposed the Redlock algorithm. Instead of a single primary instance, Redlock uses multiple fully independent Redis master nodes (typically five).

To acquire the lock, a client performs the following steps:

  1. Records the current timestamp.
  2. Attempts to acquire the lock key in all five instances sequentially, using the same key name and a unique random value, with a lock acquisition timeout that is small compared to the lock's auto-release time.
  3. Computes the elapsed time to acquire all locks. If the client successfully acquires the lock from a majority of nodes (at least three out of five) and the elapsed time is less than the lock validity time, the lock is considered acquired.
  4. If acquired, the lock validity time is defined as the initial validity time minus the time spent acquiring it.
  5. If the client fails to acquire the majority or the validity time becomes negative, it immediately sends an unlock script (which deletes the key if the value matches) to all instances.

Academic Correctness Critiques

Distributed systems researchers, most notably Martin Kleppmann, have criticized Redlock. They argue that Redlock is unsafe because it relies on a synchronous system model with assumptions about physical clock drift, which does not hold true in real-world networks.

If a node experiences a sudden clock jump (due to an NTP sync update) or a long network partition, the lease duration can expire on one node earlier than others, allowing a second client to claim the lock.


Consensus-Backed Locking

For strict safety guarantees, engineers prefer consensus-backed engines like etcd or Consul. These systems utilize consensus protocols like Raft to manage state replication.

text
Client ----[Create Lease with TTL]----> etcd Leader
Leader ----[Consensus replicated]-----> Followers
Client <---[Lease ID returned]--------- Leader

In etcd, locking is tied to a lease structure:

  • Leases: A client requests a lease with a specific Time-to-Live (TTL). The lease is replicated to follower nodes via the Raft consensus log.
  • Keep-Alive: The client must continuously send heartbeat requests (keep-alive pings) to renew the lease before the TTL expires.
  • Auto-Teardown: If the client crashes, the heartbeat stops, the lease expires, and etcd automatically deletes all keys bound to that lease, releasing the lock.

This ensures that the lock state is durably replicated and verified by a quorum of nodes before the client receives confirmation.


The Out-of-Order Write Problem

Even with a consensus-backed lock, a major safety issue remains: the client itself can experience a pause after acquiring the lock but before writing to storage.

This is most commonly caused by a stop-the-world garbage collection (GC) pause, virtual machine migration, or network congestion.

The Fencing Token Solution

To prevent out-of-order writes during GC pauses, you must use fencing tokens. A fencing token is a monotonic, increasing number returned by the lock service every time a lock is acquired.

When writing to the target storage system, the client must include this fencing token. The storage engine tracks the highest token it has processed and rejects any write requests containing a lower token number.

Fencing Tokens in Distributed Locking Lock Service e.g. etcd / ZooKeeper Client 1 Garbage Collection Pause Client 2 Runs normally Storage Service Last Token Verified: 34 1. Acquire (Token=33) 2. Acquire (Token=34) 3. Write (Token=34) - OK 4. Delayed Write (Token=33) X REJECTED (33 < 34)

In this diagram, Client 1 is suspended due to a GC pause immediately after obtaining Token 33. The lease expires, and Client 2 claims the lock, receiving Token 34. Client 2 writes its update to the storage node. When Client 1 wakes up and attempts to submit its write with Token 33, the storage engine compares 33 against its recorded high watermark of 34 and safely rejects it.


Lease Renewal and Lock Classification

A primary challenge of leasing is choosing an appropriate TTL. If the TTL is too short, a long-running transaction might lose the lock mid-execution. If the TTL is too long, a crashed client will keep the resource locked for an excessive duration, stalling operations.

To balance this, modern client libraries run a background lease-renewal thread (sometimes called a watchdog or lease-keep-alive loop). The client acquires a lock with a short TTL (e.g. 5 seconds) and, while processing the task, sends periodic heartbeat RPCs to extend the TTL. If the process crashes or gets partitioned, the heartbeats stop, and the lock is freed within the 5-second window.

Shared vs Exclusive Locks

Depending on the operational workload, you can configure locks with different accessibility rules:

  • Shared Locks (Reader): Multiple nodes can hold the lock concurrently as long as they are only performing read operations.
  • Exclusive Locks (Writer): Only a single node can hold the lock, blocking all other readers and writers.

This classification maximizes read performance while preserving strict consistency during mutations.


Further Reading

Code Examples

Core Literature References

How to Do Distributed Locking

by Martin Kleppmann — Self-published, pp. N/A

View source

Distributed Locks with Redis

by Redis Documentation — Redis Patterns, pp. N/A

View source