Why Your Store-Buffer-Ordered Code Is Lying to You

I spent three weeks hunting a bug that only manifested on ARM production machines while passing every test on our Intel dev boxes. The test suite was green. The code looked correct. The production crash was real.

What I found taught me more about modern CPU architecture than five years of reading Intel manuals. The culprit? The store buffer hiding in plain sight on every modern processor.

The Store Buffer: Your CPU's Dirty Little Secret

Every modern CPU core, x86 and ARM alike, has a store buffer. It's a small queue between the core's execution units and its L1 cache (strictly FIFO on x86, looser on weaker architectures). When the core executes a store instruction, it doesn't write to L1 right away. Instead, it drops the value in this buffer and continues executing.

Why? Because the target cache line is often owned by another core or not yet in a writable state, and waiting for ownership would stall the pipeline. So the core buffers the store and moves on. The store eventually propagates to cache (and thus becomes visible to other cores), but "eventually" is typically tens to hundreds of nanoseconds later, which is plenty of time for another core to read the old value.

This is fine. This is normal. This is how CPUs have worked for decades.

But here's where it gets interesting.

The Classic Example That Breaks Your Intuition

Consider this C11 code:

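#include <stdatomic.h>
#include <stddef.h>

atomic_int A = 0, B = 0;
int r1, r2;              // Each written by one thread, read after both threads are joined

void *thread0(void *arg) {
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);    // Store to A
    r1 = atomic_load_explicit(&B, memory_order_relaxed);   // Load from B
    return NULL;
}

void *thread1(void *arg) {
    (void)arg;
    atomic_store_explicit(&B, 1, memory_order_relaxed);    // Store to B
    r2 = atomic_load_explicit(&A, memory_order_relaxed);   // Load from A
    return NULL;
}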

After both threads complete, can we observe r1=0 and r2=0?

Intuitively, no. Thread 0 stores to A before reading B, and thread 1 stores to B before reading A. Sequential reasoning says at least one thread should see the other's store.

But on real hardware, yes: both loads can return 0. That holds for weakly-ordered architectures like ARM and RISC-V, and, as we'll see, for x86 too.

The execution goes like this:

Thread 0 issues store to A → goes into thread 0's store buffer
Thread 1 issues store to B → goes into thread 1's store buffer
Thread 0 loads from B → cache has 0, returns 0
Thread 1 loads from A → cache has 0, returns 0

Both stores are buffered, neither is globally visible, and both loads see stale values. The result: r1=0, r2=0. This isn't a bug. This is valid hardware behavior.
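
To make that trace concrete, here is a toy model of two cores with private store buffers in front of one shared cache. It is a sketch for building intuition, not a description of any real microarchitecture: the names (Core, core_store, core_load, core_drain, replay) and the buffer depth are all made up for this example. It replays the four steps above, optionally draining both buffers before the loads.

```c
#include <assert.h>

#define NUM_ADDRS 2      /* Address 0 models A, address 1 models B */
#define BUF_CAP   4      /* Hypothetical store-buffer depth */

typedef struct { int addr, value; } Store;

typedef struct {
    Store pending[BUF_CAP];  /* Stores not yet written to the cache, oldest first */
    int count;
} Core;

static int cache[NUM_ADDRS]; /* Stands in for the coherent cache both cores share */

/* A store only enters the issuing core's private buffer. */
static void core_store(Core *c, int addr, int value) {
    assert(c->count < BUF_CAP);
    c->pending[c->count++] = (Store){ addr, value };
}

/* A load checks the core's own buffer first (newest entry wins),
   then falls back to the shared cache. */
static int core_load(const Core *c, int addr) {
    for (int i = c->count - 1; i >= 0; i--)
        if (c->pending[i].addr == addr)
            return c->pending[i].value;
    return cache[addr];
}

/* Draining writes the buffered stores to the cache, oldest first. */
static void core_drain(Core *c) {
    for (int i = 0; i < c->count; i++)
        cache[c->pending[i].addr] = c->pending[i].value;
    c->count = 0;
}

/* Replays the trace. With drain_before_load set, each core empties its
   buffer before loading. Returns r1 * 10 + r2 so both results fit in one int. */
int replay(int drain_before_load) {
    Core c0 = { .count = 0 }, c1 = { .count = 0 };
    cache[0] = cache[1] = 0;

    core_store(&c0, 0, 1);               /* Thread 0: A = 1 */
    core_store(&c1, 1, 1);               /* Thread 1: B = 1 */
    if (drain_before_load) {
        core_drain(&c0);
        core_drain(&c1);
    }
    int r1 = core_load(&c0, 1);          /* Thread 0: r1 = B */
    int r2 = core_load(&c1, 0);          /* Thread 1: r2 = A */
    return r1 * 10 + r2;
}
```

In this model, replay(0) returns 0 (both loads miss the other core's buffered store) and replay(1) returns 11 (both stores reached the cache first). Real hardware is far messier, but the shape of the problem is the same.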

x86's stronger memory model rules out many other reorderings, but not this one. The store buffer still exists on x86, and a store followed by a load from a different address is exactly the case its memory model leaves unprotected.

Why x86 Seems Safe (And Why That's Misleading)

x86 provides "total store order" (TSO). Stores from a single CPU appear in program order to other CPUs. This sounds like it solves our problem, and for many cases, it does.

But TSO doesn't eliminate the store buffer. It just guarantees that stores from the same CPU are ordered. The stores are still buffered. Other CPUs still don't see them until they propagate.

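The one reordering TSO does allow is this: a load may complete before an older store from the same CPU, to a different address, has left the store buffer. A related mechanism is store-to-load forwarding (also called store buffer forwarding). When a CPU wants to load a value, it first checks its own store buffer. If there's a matching pending store, it uses that value instead of going to cache. So a CPU always sees its own stores, even while no other CPU can.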

Here's the scenario that actually happens on x86:

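// x and y are atomic_int, both initially 0

// CPU 0
atomic_store_explicit(&x, 1, memory_order_relaxed);      // Buffered
r1 = atomic_load_explicit(&y, memory_order_relaxed);     // Goes to cache, r1 = 0

// CPU 1
atomic_store_explicit(&y, 1, memory_order_relaxed);      // Buffered
r2 = atomic_load_explicit(&x, memory_order_relaxed);     // Goes to cache, r2 = 0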

Both stores are in their respective buffers. Both loads hit cache (no matching entry in that CPU's store buffer). Result: r1=0, r2=0.

This is the store buffering (SB) litmus test, and store-to-load reordering is the one weakness TSO leaves open. It's why even on x86, you need a full fence (or seq_cst atomics) if you care about this kind of cross-CPU ordering.

Memory Barriers: The Assembly of Concurrency

To fix our code, we need memory barriers (fences). These instructions drain the store buffer before allowing further operations:

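void *thread0_fixed(void *arg) {
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  // Full fence: drain store buffer
    r1 = atomic_load_explicit(&B, memory_order_relaxed);
    return NULL;
}
// thread1 needs the mirror-image fence between its store to B and its load of A.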

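On x86, atomic_thread_fence(memory_order_seq_cst) compiles to MFENCE or to a LOCK-prefixed instruction with the same effect, depending on the compiler. Either way, all preceding stores must drain to cache (and become globally visible) before any subsequent load executes.

On ARM, the same fence compiles to a full data memory barrier (DMB ISH). The fence itself looks similar on both architectures. The real difference is that ARM needs explicit ordering in many places where x86 needs none, because x86 already keeps loads ordered with loads and stores ordered with stores.

Neither fence is free. A full fence stalls the core until its store buffer is empty, which typically costs tens of cycles and can cost far more when the buffer is full of cache misses. The exact numbers vary by microarchitecture, so measure on your own hardware. This is why choosing the right memory ordering matters so much for performance.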
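
If you want to poke at this yourself, a crude harness along these lines is a starting point. It is a sketch under some assumptions: POSIX threads are available, and the names (sb_thread0, sb_thread1, count_both_zero) are mine. It runs the two-thread test repeatedly, with or without the fence in both threads, and counts how often both loads return 0.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int X, Y;
static int r1, r2;           /* Each written by one thread, read after join */
static int use_fence;        /* Set before the threads are created */

static void *sb_thread0(void *arg) {
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);
    return NULL;
}

static void *sb_thread1(void *arg) {
    (void)arg;
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);
    r2 = atomic_load_explicit(&X, memory_order_relaxed);
    return NULL;
}

/* Runs the test `iterations` times and returns how often r1 == 0 && r2 == 0. */
int count_both_zero(int iterations, int fenced) {
    int hits = 0;
    use_fence = fenced;
    for (int i = 0; i < iterations; i++) {
        pthread_t t0, t1;
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        int rc0 = pthread_create(&t0, NULL, sb_thread0, NULL);
        int rc1 = pthread_create(&t1, NULL, sb_thread1, NULL);
        assert(rc0 == 0 && rc1 == 0);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r1 == 0 && r2 == 0)
            hits++;
    }
    return hits;
}
```

With the fence in both threads, the C11 memory model forbids the both-zero outcome, so count_both_zero(n, 1) should always be 0. Without it, the count depends on timing. Because this harness creates fresh threads on every iteration, the two threads rarely overlap, and you may see zero hits even on hardware that permits the outcome. Dedicated litmus-testing tools line the threads up far more aggressively.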

The Memory Ordering Zoo

C11/C++11 gave us a vocabulary for this chaos:

memory_order_relaxed: No ordering guarantees. Just atomicity.
memory_order_acquire: Loads and stores after this load can't be reordered before it.
memory_order_release: Loads and stores before this store can't be reordered after it.
memory_order_seq_cst: Acquire/release semantics plus a single total order over all seq_cst operations that every thread agrees on.
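
The acquire/release pair is easiest to see in the classic message-passing pattern. Here is a minimal sketch, assuming POSIX threads; the names (payload, ready, run_message_passing) are invented for this example. The producer writes plain data and then release-stores a flag. The consumer acquire-loads the flag and only then reads the data.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;          /* Plain data, protected by the flag protocol */
static atomic_int ready;     /* 0 = not published, 1 = published */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                            /* 1. Write the data */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* 2. Publish it */
    return NULL;
}

static void *consumer(void *arg) {
    int *out = arg;
    /* Spin until the flag is set. The acquire load pairs with the release store. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    *out = payload;          /* Guaranteed to read the value written in step 1 */
    return NULL;
}

/* Runs one producer and one consumer and returns what the consumer read. */
int run_message_passing(void) {
    pthread_t p, c;
    int seen = -1;
    payload = 0;
    atomic_store(&ready, 0);
    int rc_c = pthread_create(&c, NULL, consumer, &seen);
    int rc_p = pthread_create(&p, NULL, producer, NULL);
    assert(rc_c == 0 && rc_p == 0);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Swap both orderings for memory_order_relaxed and the guarantee disappears: the consumer could see ready == 1 while payload still holds its old value, and in C11 terms the program has a data race.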

Most developers use seq_cst everywhere because it's "safe." They're right, but they're leaving performance on the table.

A lock-free queue using seq_cst throughout might be 2-3x slower than one using acquire and release where they suffice. I've seen production systems where that gap separated 100K ops/sec from 300K ops/sec.

But here's the trap: using an ordering that's too weak is a silent bug. Tests pass. Benchmarks pass. Production on ARM crashes. Why? Because the bugs only manifest under specific timing conditions that x86's stronger model makes unlikely, or rules out at the hardware level entirely.

Real Production Bug I Encountered

We had a lock-free MPMC queue that used memory_order_relaxed for a counter tracking the number of elements. Under heavy load on x86, it worked fine. On ARM (our containerized production environment), the counter would drift.

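The issue: relaxed operations on the counter imposed no ordering between the counter and the element slots it was supposed to describe. A thread could observe a count that didn't match the slot contents it could see, causing the queue to think it was empty when it wasn't (or vice versa). The fix was switching to memory_order_acquire for the load side and memory_order_release for the store side, so that observing a new count also guarantees observing the element written before it.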

The diff was trivial. The bug took three weeks to find because it only manifested under load on non-x86 hardware.

What About `volatile`?

Developers often ask me: "Can't I just use volatile for thread-safe communication?"

No.

volatile in C/C++ means the compiler won't optimize away accesses or reorder volatile accesses with respect to each other. That's all. It does not make an access atomic, it does not stop the compiler from moving ordinary loads and stores across a volatile access, and it emits no hardware barrier, so the CPU can still reorder operations. Two threads touching the same volatile variable without synchronization is still a data race.

On x86, volatile loads and stores often appear to work, because the strong memory model quietly orders most plain loads and stores for you. But on ARM, volatile gives you nothing you didn't have before.

Use C11 atomics, or use platform-specific intrinsics if you need precise control.
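
As a small illustration of what atomics buy you over volatile, here is a sketch of a shared counter, assuming POSIX threads; the names (hits, bump, count_with_atomics) are made up for the example. With a volatile long, hits++ compiles to a separate load, add, and store, so two threads can interleave and lose updates. atomic_fetch_add makes the whole increment indivisible.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define INCREMENTS_PER_THREAD 100000

static atomic_long hits;

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++)
        atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    return NULL;
}

/* Two threads increment the same counter; returns the final total. */
long count_with_atomics(void) {
    pthread_t t[2];
    atomic_store(&hits, 0);
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&t[i], NULL, bump, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&hits);
}
```

The total is always 2 * INCREMENTS_PER_THREAD. Relaxed ordering is enough here because nothing else depends on the counter's value mid-flight, and the final read happens after both threads are joined.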

The Cache Coherency Confusion

Here's something that trips up even experienced engineers: MESI cache coherency doesn't guarantee what you think it guarantees.

MESI ensures that once a write reaches a cache line, every core agrees on that line's contents: if CPU A's write has landed in cache and CPU B later reads the line, CPU B will see the written value. Cache coherency governs values that are already in the cache. It says nothing about when a store leaves the store buffer to get there, or about the order in which stores to different lines become visible.

The store buffer sits between the execution unit and the cache, so MESI doesn't help you there. While your store waits in the buffer, the coherence protocol hasn't seen it yet, and other cores keep reading the old value.

This is the conceptual gap that causes so many concurrency bugs. Developers understand coherency (or think they do) but ignore the store buffer entirely.

Practical Takeaways

After years of debugging these issues, here's what I tell junior engineers:

If you can avoid shared mutable state between threads, do that. Lock-free is cool until you have to debug it at 3 AM.
Test on ARM if your production runs on ARM. That means ARM laptops, or cross-compiling and running the tests on ARM servers. Some of these bugs exist on x86 but are far less likely to manifest there; others can't show up on x86 hardware at all.
Use the right memory ordering for your use case. Relaxed is fine for a statistics counter that nothing else depends on. If you're publishing data to another thread, whether through a pointer or a single-producer single-consumer index, you need a release store on the writer and an acquire load on the reader at minimum.
Profile your memory ordering. Tools like perf can show where a core is stalling, and some CPUs expose counters for memory-ordering events. If you're seeing unexpected stalls, check your fences.
Remember the store buffer exists. Every store is buffered. Until the buffer drains, other cores can still read the old value, and your own later loads from other addresses can run ahead of the store.
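
To show what "the right ordering" looks like for the single-producer single-consumer case, here is a minimal ring buffer sketch. The type and function names (SpscRing, ring_init, ring_push, ring_pop) and the capacity are assumptions for this example, not code from the system described above. Each side reads its own index relaxed, reads the other side's index with acquire, and publishes its progress with release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP 8
static_assert((RING_CAP & (RING_CAP - 1)) == 0,
              "capacity must be a power of two so index wrap-around stays consistent");

typedef struct {
    int slots[RING_CAP];
    atomic_size_t head;      /* Next slot to read; written only by the consumer */
    atomic_size_t tail;      /* Next slot to write; written only by the producer */
} SpscRing;

void ring_init(SpscRing *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false if the ring is full. */
bool ring_push(SpscRing *r, int value) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);  /* Own index */
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);  /* Consumer's index */
    if (tail - head == RING_CAP)
        return false;
    r->slots[tail % RING_CAP] = value;
    /* Release: the slot write above is visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
bool ring_pop(SpscRing *r, int *out) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);  /* Own index */
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);  /* Producer's index */
    if (head == tail)
        return false;
    *out = r->slots[head % RING_CAP];
    /* Release: the slot read above finishes before the producer may reuse it. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

This is only safe with exactly one producer thread and one consumer thread. With more of either, the plain load-then-store on each index becomes a race, and you need compare-and-swap loops or a different design.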

The Bigger Picture

What surprises most developers is that correctness and performance are in tension here. The strongest memory ordering (seq_cst) is also the slowest. The fastest ordering (relaxed) provides almost no guarantees.

This is a fundamental tradeoff, not a bug. Hardware engineers made deliberate choices to make CPUs faster. Memory reordering is a feature, not a defect: it enables optimizations that make single-threaded code faster.

The problem is that these optimizations assume single-threaded execution. When you add threads, you inherit the hardware's assumptions. Your job is to explicitly override those assumptions where you need stronger guarantees.

The store buffer will always exist. The weak memory models of ARM and RISC-V are here to stay. The only question is whether you understand what's happening underneath your code.

Most developers don't. Most tests don't catch it. Production does.

Your move.