Why Your CPU's Memory Model Is Secretly Destroying Performance

I once spent three weeks hunting a bug that only appeared on specific ARM hardware, never on our development x86 machines. The symptom was simple: a counter that occasionally showed values that were mathematically impossible. Three weeks. The culprit? I had assumed that because one thread wrote a value and another thread read it moments later, the read would see the written value.

That assumption was, to put it diplomatically, completely wrong.

The reality is that modern CPUs are not the simple sequential machines that programming languages pretend they are. They reorder operations, execute instructions out of order, maintain multiple views of memory in different cores' caches, and do everything they can to keep their execution units fed with work. Understanding why they do this and how they do it is not academic navel-gazing—it directly affects whether your concurrent code works at all, and whether it performs when it does.

Let me drag you through the ugly truth.

The Lie We Tell Ourselves

When you write this C++ code:

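std::atomic<int> ready{0};
int data = 42;

void writer() {
    data = 1337;                                  // Store 1
    ready.store(1, std::memory_order_relaxed);    // Store 2
}

void reader() {
    if (ready.load(std::memory_order_relaxed)) {  // Load A
        assert(data == 1337);                     // Load B
    }
}

You probably expect that if reader() sees ready as 1, then data must be 1337. The store to data happened before the store to ready, so logically data should have the new value when the load from ready returns true.

Except on a weakly ordered CPU like ARM or RISC-V, this code can fail. With relaxed ordering (or a plain int flag), nothing ties the two stores together: the CPU is allowed to make Store 2 visible before Store 1, or to let the read of data happen before the read of ready. The compiler is allowed to do the same. You added a dependency (Store 2 signals "data is ready"), but the CPU doesn't care about your logical dependencies. It cares about keeping its pipelines full. Formally, C++ calls the unsynchronized access to data a data race, which is undefined behavior. Had we written ready.store(1) and ready.load() with the default memory_order_seq_cst, the assertion would hold; the trouble starts when you weaken the ordering without knowing what you gave up.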

This is not a bug in your compiler or your CPU. It's a deliberate design trade-off. Strong memory ordering (what x86 provides) makes hardware harder to design and typically slower for some workloads. Weak ordering (what ARM provides) gives architects more freedom to build faster processors.

The question is: do you know which memory model your code runs on?

The Cache Coherency Problem Nobody Talks About

To understand memory ordering, you need to understand what actually happens when you write to memory.

When a CPU core writes to a variable, it doesn't immediately write to main RAM. Main RAM is agonizingly slow: an access that misses every cache and goes out to DDR4-3200 typically takes around 60-90 nanoseconds. A core clocked at 3-4 GHz completes a cycle in roughly a third of a nanosecond. Writing to RAM would stall the pipeline for hundreds of cycles.

So CPUs have caches. L1 cache is fast (4-6 cycles), L2 is slower (12-20 cycles), L3 is shared across cores and slower still. When you write a value, you write it to your core's cache. The value lives there. Other cores don't see it.

Here's where it gets fun: every core has its own view of memory, cached in its own caches. If core 0 writes to address 0x1000 and core 1 reads from 0x1000, core 1's cache doesn't automatically update. Core 1 might have an old stale value in its cache.

Your CPU solves this with cache coherency protocols—MESI (Modified, Exclusive, Shared, Invalid) and its descendants. When core 0 writes to 0x1000, it broadcasts an invalidation to all other cores. If core 1 has that line cached, it gets invalidated. Core 1's next read will miss cache and fetch the new value from core 0.

But this invalidation propagation is not instantaneous. It's fast, but it's not zero time. And within that window, things can go wrong.
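
To make the state transitions described above concrete, here is a toy model of a single cache line under MESI. It is a sketch for intuition only, not how any real coherence controller is built; the names Mesi, LineModel, and kCores are hypothetical, and details such as write-back and bus arbitration are ignored.

```cpp
#include <array>
#include <cassert>

// Toy model (not real hardware): the MESI state of ONE cache line as seen
// by each of kCores private caches. All names here are hypothetical.
enum class Mesi { Modified, Exclusive, Shared, Invalid };
constexpr int kCores = 2;

struct LineModel {
    std::array<Mesi, kCores> state;

    LineModel() { state.fill(Mesi::Invalid); }

    // Core `core` reads the line.
    void read(int core) {
        assert(core >= 0 && core < kCores);
        if (state[core] != Mesi::Invalid) return;  // cache hit, no traffic
        bool held_elsewhere = false;
        for (int c = 0; c < kCores; ++c) {
            if (c != core && state[c] != Mesi::Invalid) {
                state[c] = Mesi::Shared;  // current holder downgrades to Shared
                held_elsewhere = true;
            }
        }
        state[core] = held_elsewhere ? Mesi::Shared : Mesi::Exclusive;
    }

    // Core `core` writes the line: every other copy is invalidated.
    void write(int core) {
        assert(core >= 0 && core < kCores);
        for (int c = 0; c < kCores; ++c) {
            if (c != core) state[c] = Mesi::Invalid;
        }
        state[core] = Mesi::Modified;
    }
};
```

Calling read(0), read(1), write(0) on a fresh LineModel leaves core 0 in Modified and core 1 in Invalid, which is the invalidation described above. The model changes state instantly; the real protocol takes time, and that delay is the window the rest of this article is about.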

Store Buffers: The Silent Order Changer

Modern CPUs are massively parallel internally. A single core has multiple execution units. While one unit is waiting for a cache miss, another can be doing useful work.

But here's a problem: what happens when you issue a store instruction? You want to write to memory (well, to your cache). If that cache line is owned by another core (in Modified state), you need to wait until you get it. That could take dozens of cycles.

So CPUs add a store buffer: a small FIFO between the execution unit and the cache. When you execute a store, the value sits in the store buffer. The execution unit can immediately continue. When the cache line becomes available, the store drains from the buffer and commits to cache.

Simple, right? Here's the problem: that store is now visible to your core but not to other cores. From the perspective of the rest of the system, it hasn't happened yet.

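On a strongly ordered CPU like x86, the store buffer is almost invisible to software. It drains in program order, so other cores see your stores in the order you issued them; the one effect that leaks out is that a later load can complete while an earlier store is still sitting in the buffer. On a weakly ordered CPU like ARM, stores can also become visible out of order relative to each other, and the store buffer is a big reason why.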

// On ARM, without ordering constraints, the two stores can become visible as:
// Store 1: ready = 1
// Store 2: data = 1337
// ... and a concurrent reader might see ready=1 but data=42 (the old value)

// To fix this, you need memory barriers. data stays a plain int;
// the fences order the accesses around the relaxed flag operations:
void writer_fixed() {
    data = 1337;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}

void reader_fixed() {
    if (ready.load(std::memory_order_relaxed)) {
        std::atomic_thread_fence(std::memory_order_acquire);
        assert(data == 1337);
    }
}

The release fence orders every earlier write before the later store to ready. The acquire fence orders the earlier load of ready before every later read. Once the reader observes ready == 1, the two fences synchronize, which creates a happens-before relationship that both the compiler and the CPU must respect.
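
The same publish pattern is more commonly written with the ordering attached to the flag accesses themselves rather than to standalone fences. A minimal sketch, assuming the hypothetical names payload, published, and publish_and_consume (none of them come from the code above):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration.
int payload = 0;                     // plain, non-atomic data
std::atomic<bool> published{false};  // the flag that guards it

// Runs one writer and one reader; returns what the reader saw in payload.
int publish_and_consume() {
    payload = 0;
    published.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread writer([] {
        payload = 1337;                                    // ordinary write
        published.store(true, std::memory_order_release);  // publish it
    });
    std::thread reader([&seen] {
        // Spin until the flag is set; acquire pairs with the release above.
        while (!published.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the acquire load, so it reads 1337
    });

    writer.join();
    reader.join();
    assert(published.load(std::memory_order_relaxed));
    return seen;
}
```

A release store paired with an acquire load gives the same happens-before edge as the fence version, and on AArch64 it usually compiles to dedicated store-release and load-acquire instructions instead of full barriers.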

The Store-Load Reordering Trap

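One of the subtlest reorderings is store-load reordering: a CPU can satisfy a later load before an earlier store to a different address has drained from its store buffer. It is the one reordering that even x86 permits. Weakly ordered CPUs go further and also reorder stores with stores and loads with loads, which is what breaks the pattern below once the atomics are relaxed (or replaced with plain variables):

std::atomic<bool> init_flag{false};
std::atomic<int> config{0};

void setup() {
    config.store(0xDEAD, std::memory_order_relaxed);    // Store A
    init_flag.store(true, std::memory_order_relaxed);   // Store B
}

void work() {
    if (init_flag.load(std::memory_order_relaxed)) {     // Load C
        int c = config.load(std::memory_order_relaxed);  // Load D
        // c may still be 0 here. Store B can become visible before
        // Store A (A is still waiting in the store buffer), or Load D
        // can be performed early, before Load C.
        // With the default memory_order_seq_cst, or with release on
        // Store B and acquire on Load C, c is guaranteed to be 0xDEAD.
    }
}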

This looks idiotic when written out, but in real code this pattern hides in the shadows. I found it in production code where a configuration structure was set up by one thread and read by worker threads. The "obvious" fix was adding a memory barrier, but the original programmer (not unreasonably) thought "the CPU will see the stores in order."

No. It won't.
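
For reference, store-load reordering in its purest form is usually shown with the store-buffering litmus test: each thread stores to one variable and then loads the other. A minimal sketch, assuming a hypothetical helper count_both_zero; it uses memory_order_seq_cst, which is the ordering that forbids the outcome where both threads read 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper for illustration. Runs the store-buffering litmus
// test `rounds` times with seq_cst atomics and returns how many rounds
// ended with BOTH threads reading 0.
int count_both_zero(int rounds) {
    assert(rounds >= 0);
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);   // store to x ...
            r1 = y.load(std::memory_order_seq_cst);  // ... then load y
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);   // store to y ...
            r2 = x.load(std::memory_order_seq_cst);  // ... then load x
        });
        t1.join();
        t2.join();

        // seq_cst puts all four operations in one total order, so at
        // least one thread must observe the other's store.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With seq_cst the function always returns 0. If the four operations are weakened to release/acquire or relaxed, the language permits r1 == 0 and r2 == 0 in the same round, and store buffers make that observable even on x86, although any single run may not happen to hit it.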

Why This Matters More Than You Think

"But wait," you might say, "I write in Java/Python/Go. The runtime handles this."

Does it? Partially. Managed runtimes often insert memory barriers at synchronization points. But:

Explicit atomic operations: If you use AtomicInteger in Java or sync/atomic in Go, you're back to memory ordering considerations.
JIT compilation: The JVM can reorder operations unless you use volatile (which is sequentially consistent, so stronger than plain acquire/release) or explicit VarHandle access modes such as getAcquire and setRelease.
Unsafe code paths: FFI calls, native libraries, and unsafe code segments often bypass the runtime's memory model guarantees.
Performance: Even if your code is correct, you might be paying synchronization costs you don't need. If you're on x86 and using memory_order_seq_cst (sequentially consistent) everywhere because you didn't understand the memory model, you're leaving performance on the table.

Let me show you what I mean. I once profiled a high-throughput message processing system where a single mutex protecting a counter was a significant bottleneck. The fix was replacing the mutex with std::atomic<uint64_t> and using fetch_add with relaxed ordering:

// Before: mutex lock/unlock + increment
// After:
std::atomic<uint64_t> counter{0};

// Increment (called millions of times per second):
counter.fetch_add(1, std::memory_order_relaxed);  // No synchronization needed

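On x86, an atomic increment compiles to a single lock-prefixed add whatever ordering you request, so it is no heavier than the locked instruction the mutex was already executing to take the lock. What we removed was everything around it: the separate lock and unlock steps, the bouncing of the mutex's own cache line between cores (the mutex was a separate cache line), and the lock contention. Throughput doubled.

The trick: we didn't need any memory ordering guarantees. The increment itself is still atomic, so no update is ever lost; relaxed only means the counter does not order the loads and stores around it. Whether each thread bumps its own counter to be summed later or all threads share one, nothing else was published through the counter, and its exact value at any given moment didn't matter. We just needed the total to be correct eventually. Relaxed ordering gave us that for free.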
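
Here is a small sketch of why relaxed is enough for a pure event counter. The helper count_events is hypothetical and is not the production code described above; it only shows that atomicity, not ordering, is what keeps the total exact.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each bump one shared counter
// `per_thread` times with relaxed ordering; returns the final total.
std::uint64_t count_events(int threads, std::uint64_t per_thread) {
    assert(threads > 0);
    std::atomic<std::uint64_t> counter{0};

    std::vector<std::thread> workers;
    workers.reserve(threads);
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                // Atomic read-modify-write: no increment can be lost,
                // even though no ordering is requested.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();

    // join() synchronizes with each worker, so this load sees every increment.
    return counter.load(std::memory_order_relaxed);
}
```

For example, count_events(4, 100000) returns 400000 on every run. The moment the counter is used to signal that some other data is ready, relaxed stops being enough and you are back to acquire/release.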

The Cache Line Ping-Pong Disaster

Here's a performance issue that isn't about correctness but will ruin your day anyway: false sharing.

Remember that cache coherency operates on cache lines, not individual variables. A cache line is typically 64 bytes. If two cores are modifying variables that happen to share a cache line, every write by one core invalidates the other's cache line. The second core has to re-fetch the line from the first core's cache.

struct alignas(64) Counter {  // standard C++11 spelling of gnu::aligned(64)
    std::atomic<uint64_t> value{0};
};

Counter counters[16];  // 16 cores, each gets its own cache line

void increment(int id) {
    counters[id].value.fetch_add(1, std::memory_order_relaxed);
}

Without the 64-byte alignment, each Counter is typically only 8 bytes, so eight adjacent Counter objects would pack into a single 64-byte cache line. You'd see throughput collapse as cores steal that line from each other.

I diagnosed this in a metrics collection system. The symptom: adding more worker threads made the system slower. The cause: four threads updating four metrics that happened to share a cache line. Each increment triggered an invalidation on the other cores' caches. The CPUs were spending most of their time waiting for cache transfers, not doing actual work.

The fix: padding and alignment to ensure each atomic counter lived on its own cache line.
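
A sketch of that fix with the layout checked at compile time. The 64-byte line size is an assumption that holds on most current x86 and many ARM cores but not all hardware; PaddedCounter, kCacheLine, kSlots, and sum_padded are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines
constexpr int kSlots = 8;               // one slot per worker thread

struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};
// alignas also rounds sizeof up, so array elements never share a line.
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "line-aligned");

// Each of kSlots threads increments only its own slot; returns the sum.
std::uint64_t sum_padded(std::uint64_t per_thread) {
    PaddedCounter slots[kSlots];
    std::vector<std::thread> workers;
    workers.reserve(kSlots);
    for (int id = 0; id < kSlots; ++id) {
        // Every slot starts on a cache-line boundary.
        assert(reinterpret_cast<std::uintptr_t>(&slots[id]) % kCacheLine == 0);
        workers.emplace_back([&slots, id, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                slots[id].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();

    std::uint64_t total = 0;
    for (const auto& s : slots) total += s.value.load(std::memory_order_relaxed);
    return total;
}
```

The result is the same with or without the padding; only the speed differs. To see the effect, time this against a version that drops alignas and the two static_asserts, on a machine with at least as many cores as slots.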

What You Should Actually Do

After years of fighting these issues, here's my practical advice:

1. Default to memory_order_seq_cst unless you have a reason not to. Sequential consistency is the simplest mental model, and correctness beats micro-optimization. Only drop to acquire/release or relaxed when profiling shows contention and you understand the correctness implications.

2. Learn to read assembly for your target architecture. On x86: look for lock prefix instructions, xchg, and mfence. On ARM: look for dmb ish (data memory barrier, inner shareable) and, on AArch64, the ldar/stlr load-acquire and store-release instructions. The assembly tells you what's actually happening.

3. Use tools. ThreadSanitizer (-fsanitize=thread in clang and GCC) catches data races and has saved me more than once. For low-level lock-free work, lean on established techniques such as RCU and hazard pointers rather than inventing your own.

4. Understand your hardware's memory model. If you're targeting ARM or RISC-V (ARM dominates mobile and is increasingly common in cloud servers), you cannot assume strong ordering. x86 is forgiving in ways ARM is not.

5. Profile before and after. Memory ordering changes affect performance in non-obvious ways. A barrier might seem "free" on one architecture and devastating on another.
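
For point 2, a tiny file like the following is a convenient thing to compile with -O2 -S and read. The instruction names in the comments are what mainstream GCC and clang typically emit, not a guarantee; exact output depends on compiler version and flags. The names flag, set_relaxed, set_release, set_seq_cst, and current_flag are hypothetical.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> flag{0};  // hypothetical variable for illustration

// Relaxed store: typically a plain `mov` on x86-64, a plain `str` on AArch64.
void set_relaxed(int v) { flag.store(v, std::memory_order_relaxed); }

// Release store: still a plain `mov` on x86-64 (its stores are already
// ordered with each other), but typically `stlr` on AArch64.
void set_release(int v) { flag.store(v, std::memory_order_release); }

// Sequentially consistent store: typically `xchg` (or `mov` + `mfence`)
// on x86-64, and `stlr` on AArch64.
void set_seq_cst(int v) { flag.store(v, std::memory_order_seq_cst); }

// Acquire load: a plain `mov` on x86-64, typically `ldar` on AArch64.
int current_flag() { return flag.load(std::memory_order_acquire); }

// Sanity check that all three variants store the value they are given.
void check_stores() {
    set_relaxed(1);
    assert(current_flag() == 1);
    set_release(2);
    assert(current_flag() == 2);
    set_seq_cst(3);
    assert(current_flag() == 3);
}
```

All three stores behave identically in a single thread; the difference only shows up in the generated instructions and in what other threads are allowed to observe, which is exactly why reading the assembly is worth the effort.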

The next time you write concurrent code, pause for a moment. Think about where your stores are going, what your loads might see, and whether the CPU you're running on respects the assumptions in your head. Because I'll guarantee this: the CPU doesn't read your comments.

It just executes.

And sometimes, what it executes is nothing like what you intended.

If you found this useful, you probably need to go audit your atomics usage right now. Just saying.