CPU Cache Coherency

Explore how multi-core CPUs keep cache memories in sync, coordinate state transitions, and avoid concurrency bottlenecks like false sharing.

AdvancedFoundationsChapter: Foundations15 min read

The CPU Cache Hierarchy

Modern CPUs execute instructions in less than a nanosecond, but retrieving data from main RAM takes about 50 to 100 nanoseconds. If a CPU core had to wait for RAM on every instruction, it would spend most of its cycles idling.

To bridge this latency gap, processors use a hierarchy of small, high-speed memory buffers called caches built directly into the silicon:

  • L1 Cache: The fastest cache, dedicated to a single CPU core. It is divided into an L1 instruction cache and an L1 data cache. Access latency is typically 1 to 4 clock cycles.
  • L2 Cache: Slightly larger than L1, also dedicated to a single core. Access latency is typically 10 to 15 clock cycles.
  • L3 Cache: A much larger cache shared among all cores on a single CPU die. Access latency is typically 40 to 75 clock cycles.
  • Main RAM: The system memory, shared by all cores. Access latency is 100 to 200+ cycles.
text
                    ┌──────────────────────────────────┐
                    │            L3 Cache              │
                    │         (Shared, 40-75c)         │
                    └───────────┬──────────────┬───────┘
                                │              │
                  ┌─────────────┴─┐      ┌─────┴─────────┐
                  │   L2 Cache    │      │   L2 Cache    │
                  │ (Private, 12c)│      │ (Private, 12c)│
                  └──────┬────────┘      └─────┬─────────┘
                         │                     │
                  ┌──────┴────────┐      ┌─────┴─────────┐
                  │   L1 Cache    │      │   L1 Cache    │
                  │  (Private, 4c)│      │  (Private, 4c)│
                  └──────┬────────┘      └─────┬─────────┘
                         ▼                     ▼
                     CPU Core 1            CPU Core 2

Cache Lines

Caches do not load memory byte-by-byte. Instead, they organize data transfers into fixed-size blocks called cache lines. On almost all modern x86 and ARM processors, a cache line is 64 bytes in size.

When you read a single 8-byte integer from memory, the CPU loads the entire 64-byte block containing that integer into the L1 cache. This design relies on the principle of spatial locality: the assumption that if you access one memory address, you are likely to access nearby addresses soon after (such as when iterating through an array).


Coherency Coordination: The MESI Protocol

Because each CPU core has its own private L1 and L2 caches, multi-core systems face a synchronization challenge. If Core 1 edits a variable cached in its private L1 store, and Core 2 attempts to read that same address, Core 2 will receive stale data unless the caches coordinate.

To keep these caches in sync, hardware processors use a cache coherency protocol. The standard protocol on modern processors is MESI (Modified, Exclusive, Shared, Invalid), named after the four states a cache line can occupy:

  • Modified (M): The cache line is present only in the current core's cache, and its data has been edited (it is dirty relative to main memory). The current core must write the data back to memory or forward it to another core before the line can be evicted.
  • Exclusive (E): The cache line is present only in the current core's cache and matches the data in main memory.
  • Shared (S): The cache line is present in multiple cores' caches and matches the data in main memory. The line is read-only in this state; a core must transition the line to the Modified state before writing to it.
  • Invalid (I): The cache line does not contain valid data. Accessing this line triggers a cache miss, requiring a read from a shared cache or main memory.
MESI Protocol: Cache Line State Transitions CPU Core 1 L1 Cache (64-Byte Line) State: Modified (M) Value dirty, Core 1 owns exclusivity Invalidate Line CPU Core 2 L1 Cache (64-Byte Line) State: Invalid (I) Must read from memory on next access Shared L3 Cache / Main Memory Data block address: 0x7ffd98 Core 1's edit must be flushed here before Core 2 reads it

When a core modifies a shared cache line, it broadcasts an invalidation signal across the hardware interconnect bus. Other cores caching that line must set their local copies to the Invalid state. Only when these cores acknowledge the invalidation can the writing core modify the data and mark the line as Modified.


False Sharing

The MESI protocol operates on whole cache lines, not individual variables. This design quirk introduces a performance issue known as false sharing.

False sharing occurs when two threads running on different cores modify independent variables that happen to reside in the same 64-byte cache line:

text
            ┌──────────────────────────────────────────────┐
            │            Shared 64-byte Cache Line         │
            │  [ Variable X (Core 1) ] [ Variable Y (Core 2) ]  │
            └──────────────────────┬───────────────────────┘

                    (Both cores edit concurrently)

            ┌──────────────────────────────────────────────┐
            │   MESI Protocol forces continuous bounces    │
            │   and invalidation loops between L1 caches   │
            └──────────────────────────────────────────────┘

Even though the threads do not share data, the hardware treats the entire cache line as shared.

When Core 1 writes to variable X, it invalidates the entire cache line on Core 2. When Core 2 writes to variable Y, it invalidates the cache line on Core 1. The cache line bounces between the cores' private caches, degrading performance.


Memory Barriers and Store Buffers

To keep write operations fast, CPUs do not wait for invalidation acknowledgments to complete. Instead, they use hardware optimizations:

  • Store Buffers: When a core writes data, it writes to a store buffer and immediately continues executing instructions. The data is flushed to the cache once invalidation acknowledgments arrive.
  • Invalidation Queues: Cores queue incoming invalidation requests in an invalidation queue, acknowledging them immediately and processing them asynchronously.

While these structures improve performance, they can lead to out-of-order execution, where writes become visible to other cores in a different order than they were written.

To prevent this, engineers use memory barriers (or fences) in multi-threaded code. A memory barrier forces the CPU to flush its store buffer and process its invalidation queue, ensuring memory operations are visible in the expected order across all cores.


Mitigating False Sharing with Data Alignment

You can prevent false sharing by ensuring that independent variables accessed by different threads are allocated on separate cache lines.

In C++11 and later, you can use the alignas specifier to align variables to cache line boundaries:

cpp
struct OptimizedData {
    alignas(64) uint64_t thread1_count;
    alignas(64) uint64_t thread2_count;
};

This instructs the compiler to insert padding bytes so that each variable starts on a new 64-byte boundary, isolating them on separate cache lines. While this uses slightly more memory, it eliminates false sharing and prevents invalidation loops.


Further Reading

Code Examples

Core Literature References

Computer Architecture: A Quantitative Approach

by John L. Hennessy and David A. Patterson — Chapter 5: Thread-Level Parallelism, pp. 348-392

View source

Is Parallel Programming Hard, And, If So, What Can You Do About It?

by Paul E. McKenney — Appendix C: Why Memory Barriers Are Required, pp. 389-410

View source