Vector Clocks

Vector clocks give each node in a distributed system a logical timestamp that captures causal relationships between events, enabling conflict detection without a global clock.

AdvancedReliabilityChapter: Reliability & Scalability12 min read

The problem: no global clock

In a single-machine system, time is easy. The OS gives you a monotonically increasing clock and every event has an unambiguous timestamp.

In a distributed system, you have N machines each with their own hardware clock. Those clocks drift. NTP corrects them, but only approximately, and a correction can even move the clock backward. You cannot trust wall-clock time to tell you which of two writes on different nodes happened first.

This is not an academic concern. When two users edit the same record on two different replicas, and both edits are applied without knowing the order, you get silent data corruption.

Lamport timestamps: the first step

Leslie Lamport's 1978 solution introduces a logical clock: a simple counter that does not measure real time, but instead captures a consistent ordering of events.

The rules are:

  1. Before each event, increment your local counter.
  2. When you send a message, attach your current counter value.
  3. When you receive a message, set your counter to max(local, received) + 1.

This guarantees: if event A causally precedes event B (A's result was used to produce B), then timestamp(A) < timestamp(B).

The limitation: Lamport timestamps are a total order but they cannot distinguish causal ordering from coincidence. If two events have timestamps 5 and 7, you know the one with 5 was assigned its timestamp before the one with 7, but you cannot tell whether 5 actually influenced 7 or whether they happened completely independently on different nodes. You need a richer data structure to answer that question.

Vector clocks: one counter per node

A vector clock extends Lamport's idea by keeping one counter per node. Each node maintains a vector like [A:2, B:1, C:0] meaning: "I have seen 2 events from A, 1 from B, and 0 from C (or C is unknown to me)."

The update rules are:

  1. Before each local event on node X, increment clock[X].
  2. When node X sends a message, attach its full vector.
  3. When node X receives a message with vector V, merge by taking the element-wise max of V and the local vector, then increment clock[X].

The vector now encodes what the node knows about every other node's history. If your vector says B:1, you know you have received (directly or indirectly) exactly one event from node B.

Visualising causality and concurrency

Node A Node B Node C a1 [A:1,B:0,C:0] b1 [A:1,B:1,C:0] a2 [A:2,B:0,C:0] b2 [A:1,B:2,C:0] concurrent (conflict!) c1 [A:1,B:2,C:1] Event Message carries clock Concurrent writes

In the diagram, a2 on node A and b2 on node B are concurrent: neither has received a message from the other before writing. Their vector clocks ([A:2,B:0,C:0] and [A:1,B:2,C:0]) are incomparable: neither is element-wise less than or equal to the other. That incomparability is the signal of a conflict.

Happens-before vs concurrent

Formally, event X happens-before Y (written X -> Y) if:

  • Every component of X's vector clock is less than or equal to Y's corresponding component, and
  • At least one component is strictly less than Y's.

If neither X -> Y nor Y -> X holds, the events are concurrent. They may have produced conflicting writes to the same key.

Conflict detection in real systems

Amazon DynamoDB (original design, documented in the 2007 Dynamo paper) used vector clocks to detect concurrent writes to the same key. When two concurrent versions of a record were detected, the system stored both and returned them as siblings to the application. The application code was responsible for merging them. In practice this was hard to get right, so DynamoDB later switched to last-write-wins with a physical timestamp as the default.

Riak (a distributed key-value store descended from the Dynamo design) still exposes siblings when siblings are enabled. Applications can implement custom merge logic, which is the correct approach for collaborative document editing or shopping carts where both writes may contain valid data.

Modern alternative: CRDTs

Conflict-free Replicated Data Types (CRDTs) sidestep the conflict problem by choosing data structures that always produce a deterministic merge result, regardless of the order in which concurrent writes are applied.

Examples:

  • G-Counter (Grow-only counter): each node maintains its own counter and the total is the sum. Incrementing on any node is always safe.
  • LWW-Register (Last-Write-Wins Register): whichever write has the highest timestamp wins. Simple, but loses data.
  • OR-Set (Observed-Remove Set): add and remove operations are tagged with unique identifiers. The set can be merged without ambiguity.

CRDTs are used in collaborative text editors (Figma, Notion), distributed counters (CDN edge caches), and eventually-consistent databases. The tradeoff is that not every data structure has a CRDT equivalent, and some CRDT semantics are surprising (e.g. a deletion may reappear after a merge).

Scalability limit: dotted version vectors

Plain vector clocks grow linearly with the number of nodes. In a cluster of 500 machines, every version vector is 500 integers. This becomes expensive in memory and on the wire.

Dotted version vectors are a practical compression used by Riak and other systems. They separate the causal context (what you know about the whole cluster) from the specific version dot (which exact event produced this value), allowing compaction without losing causality information.

Code example

python
from typing import Dict

VectorClock = Dict[str, int]

def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Return a new clock with the element-wise maximum of a and b."""
    keys = set(a) | set(b)
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in keys}

def happens_before(a: VectorClock, b: VectorClock) -> bool:
    """True if a causally precedes b (a -> b)."""
    keys = set(a) | set(b)
    all_leq = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    at_least_one_lt = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return all_leq and at_least_one_lt

def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if neither a -> b nor b -> a (potential conflict)."""
    return not happens_before(a, b) and not happens_before(b, a)

# Two nodes write independently
write_a = {"A": 2, "B": 1, "C": 0}
write_b = {"A": 1, "B": 2, "C": 0}

print(happens_before(write_a, write_b))  # False
print(happens_before(write_b, write_a))  # False
print(concurrent(write_a, write_b))       # True — conflict!

merged = merge(write_a, write_b)
print(merged)  # {"A": 2, "B": 2, "C": 0}

merge takes the element-wise max to produce the most up-to-date combined view of the system. happens_before implements the formal definition: all components must be <=, with at least one strictly <. When concurrent returns True, the application must resolve the conflict explicitly.

Prerequisites

Code Examples