Why Your Containers Are Lying to You About Memory

Last month, I got pulled into a P0 incident at 2 AM. A service was getting OOM-killed in production despite showing healthy memory metrics in our monitoring dashboard. The container had a 512MB limit. Our monitoring showed 400MB used. The kernel disagreed—violently—and killed the process anyway.

This happens more than you think. And the reason reveals something important: we've built an entire industry around containers while most engineers genuinely don't understand how memory accounting works beneath the abstraction.

Let me fix that.

The Prison You're Running Inside

When you run a container with docker run -m 512m, you're not just setting a suggestion. You're telling the Linux kernel to create a cgroup (control group) and constrain this process to a specific memory limit. But here's what surprises most people: the kernel measures memory differently than your application does, and differently than your monitoring tools do.

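The kernel charges memory to the cgroup, and that charge includes:

Anonymous memory your processes have actually touched (heap and stacks, roughly their RSS)
Page cache for files the container reads, writes, or maps
tmpfs and shared memory
Kernel memory allocated on the container's behalf: slab objects, kernel stacks, page tables, socket buffers

Memory that was allocated but never touched is not on this list. It isn't charged until the first write faults it in.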

Your application measures... whatever the runtime decides to expose. In Java, that's usually heap usage, which leaves out metaspace, thread stacks, and direct buffers. In Python, it's whatever you piece together from sys.getsizeof() (the shallow size of a single object) or tracemalloc, plus interpreter overhead you mostly can't see. In C, that's whatever malloc tracked internally, which might be wildly different from what the kernel sees.

These numbers are not the same thing.
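To make the gap concrete, here is a minimal Go sketch that prints the runtime's view next to the kernel's. It assumes cgroup v2 mounted at /sys/fs/cgroup and is meant to run inside the container; the helper names parseBytes and toMiB are illustrative, not part of any library.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseBytes parses the single decimal byte count that cgroup files such as
// memory.current contain (the value is followed by a newline).
func parseBytes(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// toMiB converts bytes to mebibytes for display.
func toMiB(b uint64) float64 {
	return float64(b) / (1024 * 1024)
}

func main() {
	// The runtime's own bookkeeping.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("runtime: live heap        %8.1f MiB\n", toMiB(m.HeapAlloc))
	fmt.Printf("runtime: obtained from OS %8.1f MiB\n", toMiB(m.Sys))
	fmt.Printf("runtime: returned to OS   %8.1f MiB\n", toMiB(m.HeapReleased))

	// The kernel's bookkeeping for the whole cgroup.
	// Assumption: cgroup v2, read from inside the container.
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.current")
	if err != nil {
		fmt.Println("cgroup view unavailable:", err)
		return
	}
	current, err := parseBytes(string(raw))
	if err != nil {
		fmt.Println("unexpected memory.current contents:", err)
		return
	}
	fmt.Printf("kernel:  memory.current   %8.1f MiB\n", toMiB(current))
}
```

None of the runtime lines is expected to match memory.current, since the cgroup figure also covers page cache, kernel memory, and every other process in the container.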

I tested this with a simple Go application that allocates 100MB of memory but doesn't write to it until you press Enter:

package main

import "fmt"

func main() {
    // Allocate but don't touch. Fresh pages from the OS are already zeroed,
    // so the Go runtime usually leaves them alone and Linux doesn't back
    // them with physical memory yet.
    allocations := make([][]byte, 100)
    for i := range allocations {
        allocations[i] = make([]byte, 1024*1024) // 1MB each
    }

    fmt.Println("Allocated. Press Enter to touch memory...")
    fmt.Scanln()

    // Write to every byte, forcing the kernel to fault in each page.
    for i := range allocations {
        for j := range allocations[i] {
            allocations[i][j] = byte(j % 256)
        }
    }

    fmt.Println("Touched. Press Enter to exit...")
    fmt.Scanln()
}

Run this inside a container with a memory limit and watch /sys/fs/cgroup/memory.current (memory.max only shows the limit itself). The kernel won't charge that first 100MB, because Linux uses demand paging: the memory isn't physically allocated until your code actually touches it. Only after you press Enter does the counter climb by roughly 100MB.

But here's the catch: once you touch it, you're committed. And this is where production systems get killed unexpectedly.

The Transparent Hugepage Trap

Modern Linux kernels use transparent hugepages (THP) to improve performance by backing memory with 2MB huge pages instead of 4KB pages. This is generally great for your applications. It can be terrible for containers with tight memory limits.

Why? With THP set to always, the first touch of a 2MB-aligned region can be backed by a whole 2MB page, even if your code only wrote a few bytes. The khugepaged daemon can also collapse small pages into huge ones in the background. Either way, the full 2MB is charged to your cgroup, including the parts your process never touched and doesn't see as "used" yet.

The result: your monitoring shows 450MB, kernel thinks it's at 512MB, OOM killer fires.
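The arithmetic behind that gap can be sketched in a few lines of Go. This is a simplified model, not a measurement: it assumes every fault is backed by exactly one page of the given size, which is the worst case for THP (real behavior depends on alignment, fragmentation, and the THP mode). The function name residentBytes is hypothetical.

```go
package main

import "fmt"

const (
	basePage = 4 << 10 // 4 KiB base page (typical on x86-64)
	hugePage = 2 << 20 // 2 MiB transparent huge page
)

// residentBytes returns how many bytes end up backed by physical memory when
// a program writes one byte at each offset, assuming each fault is served
// with a single page of pageSize bytes.
func residentBytes(offsets []uint64, pageSize uint64) uint64 {
	pages := make(map[uint64]struct{})
	for _, off := range offsets {
		pages[off/pageSize] = struct{}{}
	}
	return uint64(len(pages)) * pageSize
}

func main() {
	// Touch one byte every 2 MiB across a 200 MiB region: 100 writes.
	var offsets []uint64
	for off := uint64(0); off < 200<<20; off += hugePage {
		offsets = append(offsets, off)
	}
	fmt.Printf("4 KiB pages: %d KiB resident\n", residentBytes(offsets, basePage)>>10)
	fmt.Printf("2 MiB pages: %d MiB resident\n", residentBytes(offsets, hugePage)>>20)
}
```

Under this model the same 100 writes cost 400 KiB with base pages and 200 MiB with huge pages, which is the kind of difference an application-level metric never sees.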

You can see this in action. Check your THP status:

cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

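And watch what happens when you trigger THP allocations in a container:

# Start a workload that asks for huge pages, detached so you get the ID back.
# <stress-ng-image> is a placeholder for any image that ships stress-ng.
id=$(docker run -d --rm -m 512m <stress-ng-image> \
    stress-ng --vm 1 --vm-bytes 400m --vm-madvise hugepage --timeout 30s)

# Then watch the container's cgroup from the host until it exits
# (cgroup v2 with the systemd cgroup driver; adjust the path for v1)
while cat /sys/fs/cgroup/system.slice/docker-$id.scope/memory.current 2>/dev/null; do
    sleep 0.1
done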

What you'll see: the kernel memory counter jumps in ways that don't match what your application reports.

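The fix if you're running latency-sensitive workloads or containers with tight memory constraints: disable THP on the host. The setting is system-wide, not per-container. /sys is read-only in an unprivileged container, and there is no THP namespace. A gentler option is madvise mode, which limits huge pages to applications that explicitly ask for them.

# On the host, as root (for example from a systemd unit at boot)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled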

This costs you some performance. But for database workloads and other memory-sensitive services, it's often worth it for the predictability.

cgroup v2: The Silent Architecture Change

If you're running on a modern Linux distribution (Ubuntu 22.04+, RHEL 9+, and most Kubernetes nodes built on them), you're probably using cgroup v2. And if your mental model is still based on cgroup v1, you're probably wrong about how things work.

In cgroup v1, each controller (memory, cpu, io) was a separate hierarchy. You could have different processes in different cgroups for different controllers. This was flexible but led to weird edge cases.

In cgroup v2, there's a single unified hierarchy: a process belongs to exactly one cgroup, and every controller sees the same tree. This sounds obvious, but it changes file names and paths, and it breaks tooling written for v1.

One thing holds in both versions, and it trips up common monitoring patterns: child processes inherit their parent's cgroup. Say your container's main process spawns a helper subprocess. Both processes count against the same limit. Your monitoring tool that reports memory from the main process's ps output? That's not accounting for the helper.

I found this in our production system last quarter. We had a Python service that spawned a metrics exporter as a subprocess. The Python process reported 300MB. The subprocess added another 80MB. The container limit was 512MB. We were getting OOM-killed when the combined usage hit the limit, but our per-process metrics looked fine.
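A rough way to catch this from inside the container is to walk every PID in the cgroup instead of looking at one process. The Go sketch below assumes cgroup v2, where /sys/fs/cgroup/cgroup.procs lists the container's PIDs; parseVmRSS is an illustrative helper, not a library function.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseVmRSS extracts the VmRSS value (in KiB) from the text of
// /proc/<pid>/status. The second result is false if the field is missing,
// as it is for kernel threads.
func parseVmRSS(status string) (uint64, bool) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expected shape: "VmRSS:   307200 kB"
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			return kb, err == nil
		}
	}
	return 0, false
}

func main() {
	// Assumption: cgroup v2, run inside the container.
	procs, err := os.ReadFile("/sys/fs/cgroup/cgroup.procs")
	if err != nil {
		fmt.Println("cannot list cgroup processes:", err)
		return
	}
	var totalKB uint64
	for _, pid := range strings.Fields(string(procs)) {
		status, err := os.ReadFile("/proc/" + pid + "/status")
		if err != nil {
			continue // the process exited between listing and reading
		}
		if kb, ok := parseVmRSS(string(status)); ok {
			fmt.Printf("pid %s: %d MiB RSS\n", pid, kb/1024)
			totalKB += kb
		}
	}
	fmt.Printf("all processes: %d MiB RSS\n", totalKB/1024)
}
```

Even this sum is only an approximation of the cgroup charge: it double-counts pages shared between processes and leaves out page cache and kernel memory.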

The right way to see actual container memory:

# For cgroup v2 (systemd cgroup driver; <container-id> is the full 64-character ID)
cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.current

# For cgroup v1 (cgroupfs driver)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes

Or use docker stats:

docker stats --no-stream --format "table {{.Container}}\t{{.MemUsage}}"

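That MEM USAGE column? It comes from the kernel's cgroup counters. Not from what your application sees. Not from what your application metrics show. One caveat: the Docker CLI subtracts file cache before displaying it, so it can read lower than memory.current.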
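If a tool has to work on both cgroup versions, it first needs to know which one it is looking at. A common heuristic, sketched here in Go, is to check for cgroup.controllers at the root of the mount, which only exists on v2. The function name usageFile is hypothetical, and the paths assume the view from inside a container.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// usageFile returns the file that holds current memory usage under the given
// cgroup mount point. cgroup v2 is identified by the presence of
// cgroup.controllers at the root of the mount; anything else is treated as v1.
func usageFile(root string) string {
	if _, err := os.Stat(filepath.Join(root, "cgroup.controllers")); err == nil {
		return filepath.Join(root, "memory.current") // cgroup v2
	}
	return filepath.Join(root, "memory", "memory.usage_in_bytes") // cgroup v1
}

func main() {
	path := usageFile("/sys/fs/cgroup")
	raw, err := os.ReadFile(path)
	if err != nil {
		// On a v2 host outside any container the root cgroup has no
		// memory.current, so this branch is expected there.
		fmt.Println("cannot read", path+":", err)
		return
	}
	fmt.Printf("%s = %s bytes\n", path, strings.TrimSpace(string(raw)))
}
```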

The Swap Problem Nobody Talks About

Here's one that bites production systems regularly: swap usage isn't counted the same way across all tools.

In cgroup v2, you can set memory.swap.max to limit swap usage. Set it to max to allow full swap. Set it to 0 to disable swap entirely.

But here's the gotcha: if you disable swap (memory.swap.max=0), your container will hit OOM faster, because the kernel has fewer ways to relieve memory pressure.

With swap available, the kernel can page out inactive anonymous pages, buy time, and give your process a chance to handle memory pressure gracefully. With swap disabled, the only things it can reclaim at the limit are page cache and reclaimable slab. Once those are gone, anonymous memory has nowhere to go and the OOM killer fires.
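The cgroup keeps counters for how often the limit was reached and how often the OOM killer fired, which helps tell "close calls" from actual kills. The Go sketch below reads them from memory.events, assuming cgroup v2 and a view from inside the container; parseEvents is an illustrative helper.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseEvents parses the "key value" lines of a cgroup v2 memory.events file
// into a map. Lines that don't match that shape are skipped.
func parseEvents(raw string) map[string]uint64 {
	events := make(map[string]uint64)
	for _, line := range strings.Split(raw, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		if n, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			events[fields[0]] = n
		}
	}
	return events
}

func main() {
	// Assumption: cgroup v2, read from inside the container.
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.events")
	if err != nil {
		fmt.Println("cannot read memory.events:", err)
		return
	}
	ev := parseEvents(string(raw))
	// "max" counts how often usage was about to exceed memory.max;
	// "oom_kill" counts processes in this cgroup killed by the OOM killer.
	fmt.Printf("times the limit was reached: %d\n", ev["max"])
	fmt.Printf("processes OOM-killed:        %d\n", ev["oom_kill"])
}
```

A rising max count with a flat oom_kill count suggests the container is surviving on reclaim, which is a useful early warning before the first kill.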

For Java applications, this is particularly nasty. Allocation bursts between collections can spike memory usage temporarily. With swap, the kernel might page out cold pages to absorb the spike. Without swap, that spike pushes you over the limit.

The fix: set your memory limit slightly higher than your expected peak, and consider enabling some swap for production workloads.

# Kubernetes example - give yourself headroom
resources:
  requests:
    memory: 512Mi  # typical steady-state usage
  limits:
    memory: 640Mi  # room for a ~600Mi peak during GC spikes

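Or use JVM flags to pre-touch heap pages, so the whole heap is charged at startup instead of mid-request, and to opt the heap into huge pages explicitly:

JAVA_OPTS="-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages"

Wait, you wanted to disable THP for memory predictability, but now Java wants it enabled for performance? Welcome to the fun part of systems engineering. Setting THP to madvise on the host is the usual compromise: the JVM gets its huge pages and everything else doesn't.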

The Slab Cache Problem

Every Linux system allocates memory through the slab allocator for kernel objects. These include things like file handles, network buffers, and inode caches. These allocations count against your container's memory limit, but they're controlled by the kernel, not your process.

Run this test. Create a container with a 256MB limit. Inside it, create 100,000 files.

docker run --rm -m 256m -it ubuntu:22.04 bash

# Inside container
for i in $(seq 1 100000); do
    touch /tmp/testfile_$i
done

Now check memory usage:

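# From host (prints one line per running container)
cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.current

# Inside container (cgroup v2). free and /proc/meminfo report host-wide
# numbers here, so read the cgroup's own breakdown instead.
grep -E "^(anon|file|slab_reclaimable|slab_unreclaimable) " /sys/fs/cgroup/memory.stat

The kernel's dentry and inode caches for those files are eating into your 256MB limit. Your application might report 50MB used while the cgroup sits much closer to the limit. Most of this slab memory is reclaimable under pressure, but reclaim costs time, and whatever can't be freed fast enough ends in an OOM kill.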

This is why certain workloads—logging sidecars, proxies, anything that does heavy filesystem I/O—need larger memory limits than you'd expect from their application-level metrics.
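To see where a container's charge actually goes, the cgroup's memory.stat file splits it by category. Here is a small Go sketch that prints the categories discussed above; it assumes cgroup v2 read from inside the container, and statValue is an illustrative helper.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// statValue looks up one counter in memory.stat-style text, which holds one
// "key value" pair per line with values in bytes.
func statValue(raw, key string) (uint64, bool) {
	for _, line := range strings.Split(raw, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == key {
			n, err := strconv.ParseUint(fields[1], 10, 64)
			return n, err == nil
		}
	}
	return 0, false
}

func main() {
	// Assumption: cgroup v2, read from inside the container.
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.stat")
	if err != nil {
		fmt.Println("cannot read memory.stat:", err)
		return
	}
	// anon: touched heap and stacks; file: page cache;
	// slab_*: kernel objects such as dentries and inodes.
	for _, key := range []string{"anon", "file", "slab_reclaimable", "slab_unreclaimable"} {
		if v, ok := statValue(string(raw), key); ok {
			fmt.Printf("%-20s %6d MiB\n", key, v>>20)
		}
	}
}
```

For an I/O-heavy sidecar, a small anon figure next to large file and slab figures is the signature of the problem described here.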

What Actually Helps

After debugging these issues across multiple production systems, here's what actually works:

1. Monitor at the cgroup level, not the application level.

Use tools that read from /sys/fs/cgroup/ directly. cAdvisor (built into the kubelet) does this and exposes container_memory_usage_bytes and container_memory_working_set_bytes for Prometheus to scrape. Your application metrics are lying to you.

2. Add 20-30% headroom to your memory limits.

Memory limits aren't a target. They're a ceiling. Your application will have GC spikes, THP allocations, filesystem cache, and other things the kernel accounts for that your application doesn't.

3. Test memory behavior under real workloads before production.

Run your container at limit with production-like traffic patterns for at least 24 hours. Watch both application metrics AND cgroup metrics. You want to find the memory ceiling, not discover it during an incident.

4. Know your allocator's behavior.

Whether it's jemalloc in a Rust service, Go's runtime allocator, or the JVM's garbage collector, each handles memory differently. jemalloc caches freed memory, so what it reports as in use differs from what the kernel sees. Go's runtime holds onto memory and gives it back reluctantly. These aren't bugs. They're tradeoffs.

5. Use memory limits as guardrails, not targets.

Set limits to protect the node, not to optimize your application. Let your application use what it needs, up to the limit. If it's regularly at 95% of its limit and most of that is reclaimable page cache, that's fine. If it's regularly getting OOM-killed, that's your signal to investigate.
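Points 1 and 2 can be combined into a small check: read usage and limit from the cgroup, then work out what limit a given headroom would imply. This Go sketch assumes cgroup v2 read from inside the container and uses current usage as a stand-in for your measured peak; parseLimit and limitWithHeadroom are hypothetical names, and the 25% figure is just the midpoint of the range above.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"strconv"
	"strings"
)

const mib = 1 << 20

// parseLimit parses a cgroup v2 memory file that holds either a byte count
// or the literal "max" (no limit). ok is false for "max" or malformed input.
func parseLimit(raw string) (value uint64, ok bool) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, false
	}
	n, err := strconv.ParseUint(s, 10, 64)
	return n, err == nil
}

// limitWithHeadroom returns peakBytes plus the given fractional headroom,
// rounded up to a whole MiB.
func limitWithHeadroom(peakBytes uint64, headroom float64) uint64 {
	want := float64(peakBytes) * (1 + headroom)
	return uint64(math.Ceil(want/mib)) * mib
}

// readCgroupFile returns the file's contents, or "" if it can't be read.
func readCgroupFile(name string) string {
	b, _ := os.ReadFile("/sys/fs/cgroup/" + name)
	return string(b)
}

func main() {
	current, okCur := parseLimit(readCgroupFile("memory.current"))
	if !okCur {
		fmt.Println("memory.current unavailable (not cgroup v2, or not in a container?)")
		return
	}
	if limit, okMax := parseLimit(readCgroupFile("memory.max")); okMax && limit > 0 {
		pct := 100 * float64(current) / float64(limit)
		fmt.Printf("usage: %d / %d MiB (%.0f%%)\n", current/mib, limit/mib, pct)
	} else {
		fmt.Printf("usage: %d MiB, no memory limit set\n", current/mib)
	}
	fmt.Printf("limit with 25%% headroom: %d MiB\n", limitWithHeadroom(current, 0.25)/mib)
}
```

In practice you would feed limitWithHeadroom the peak observed over a long soak test rather than a single reading.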

The Real Takeaway

Containers gave us a wonderful abstraction: package your application, set some limits, run it anywhere. But abstractions leak. And the memory accounting abstraction leaks in ways that will silently kill your production processes if you don't understand what's underneath.

The kernel isn't guessing. It's always right about how much memory your container is using. The problem is that we've built monitoring tools and mental models that measure something different.

Next time you're debugging an OOM kill, don't trust your application metrics. Trust the kernel. It's keeping the real books.

And maybe set that memory limit a little higher. Just in case.

If you found this useful, you probably already have some war stories about container memory. The kind where you stared at dmesg output at 3 AM wondering why a process with "300MB used" got killed when the limit was "512MB." You're not alone. The kernel was right. We just weren't measuring the right thing.