Your Linux Server Is Lying to You About Memory

linux memory-management performance debugging systems-engineering operations page-cache kernel

I spent three years thinking I understood Linux memory. I was wrong.

It took a 2 AM production incident where our Java service kept getting OOM-killed to make me actually read the code. Not documentation—the actual kernel source. What I found reshaped how I think about every Linux system I've touched since.

Let me save you the three years.

The Scenario Everyone Recognizes

You deploy a new service. It runs fine for a week. Then, gradually, memory usage climbs. You check free -h:

              total        used        free      shared  buff/cache   available
Mem:            62Gi       33Gi       1.2Gi       200Mi        28Gi        27Gi

You panic. "Only 1.2Gi free, we're almost out of memory!" You start killing processes, adding swap, or, worse, blindly adding RAM.

But here's what's actually happening: that 28Gi in buff/cache is your page cache. It's not memory being "used" in the traditional sense. It's Linux being smart—holding file data in RAM because disk is slow and you might need it again.

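The available figure is what matters for new allocations: it is the kernel's estimate (MemAvailable in /proc/meminfo) of how much memory can be handed out without swapping, counting cache that can be reclaimed. It is still an estimate, so it can mislead you if you treat it as a guarantee.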
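To make the free versus available distinction concrete, here is a minimal sketch that pulls the relevant fields out of /proc/meminfo. The helper name mem_summary is made up for this post, and the buff/cache sum (Buffers + Cached + SReclaimable) is an assumption about how recent procps versions of free compute that column.

```shell
# mem_summary is a hypothetical helper: it reads a meminfo-style file
# (default /proc/meminfo) and prints free, cache, and available in MiB.
mem_summary() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemFree:"      { free  = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        # Assumption: buff/cache = Buffers + Cached + SReclaimable.
        $1 == "Buffers:" || $1 == "Cached:" || $1 == "SReclaimable:" { cache += $2 }
        END {
            if (total == 0) { print "MemTotal not found"; exit 1 }
            # /proc/meminfo reports KiB; divide by 1024 for MiB.
            printf "%-10s %6d MiB\n", "free",       free  / 1024
            printf "%-10s %6d MiB\n", "buff/cache", cache / 1024
            printf "%-10s %6d MiB\n", "available",  avail / 1024
        }
    ' "${1:-/proc/meminfo}"
}
```

Run mem_summary with no arguments on a live box. If available is close to free plus buff/cache, most of that cache is reclaimable and there is nothing to panic about.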

The Page Cache Is Not Your Enemy

Linux has a simple philosophy: unused RAM is wasted RAM.

When you read a file, the kernel doesn't just give you the data and forget about it. It caches the pages in RAM. Read that file again? It's already in memory. That's why cat on a file you just grepped feels instant—it's coming from RAM, not disk.

The same happens with writes. Write to a file? Linux usually writes to the page cache first, marking the pages as "dirty." The actual disk write happens later, in the background. This is called write-back caching, and it's why you can lose data if the machine crashes or loses power before the kernel flushes those pages. Calling sync (or fsync on the file) forces the flush.
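You can watch write-back happen. The sketch below sums the Dirty and Writeback fields of /proc/meminfo; dirty_kib is a hypothetical helper name, not a standard tool.

```shell
# dirty_kib prints how many KiB are dirty or currently being written back,
# read from a meminfo-style file (default /proc/meminfo).
dirty_kib() {
    # The trailing colon keeps WritebackTmp out of the match.
    awk '/^(Dirty|Writeback):/ { sum += $2 } END { print sum + 0 }' "${1:-/proc/meminfo}"
}
```

Something like `dirty_kib; sync; dirty_kib` after a large write should show the number dropping, although other writers on the system can push it back up between the two calls.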

Here's what trips people up: the kernel will reclaim page cache pages when applications need RAM. It's not stealing from your processes. It's a smart allocation strategy.

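Memory the kernel can reclaim
├── Clean file-backed pages (page cache; can be dropped immediately)
├── Dirty file-backed pages (page cache; must be written back first)
├── Shared memory / tmpfs (counted as cache, but needs swap to reclaim)
└── Anonymous pages (not page cache; heap and stack, needs swap to reclaim)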

When you see buff/cache climbing, that's typically page cache growing into memory that would otherwise sit free. When a process needs RAM, the kernel evicts roughly the least-recently-used page cache pages first. This is all working exactly as designed.

Where Things Get Weird: Memory Pressure and Reclaim

The problem starts when you have multiple processes competing for memory and the page cache is growing.

Here's a scenario I ran into: a logging service that writes heavily to disk. The page cache grows to cache those writes. Then our main application starts needing more memory. The kernel has to reclaim pages to satisfy the request.

There are two kinds of reclaim:

  1. Direct reclaim: A process asks for memory while free memory is below the "min" watermark, so the kernel reclaims pages right there in that process's context. This is slow: your process waits for the kernel to free memory before it can continue.

  2. kswapd: The kernel daemon that wakes when free memory drops below the "low" watermark. It runs asynchronously and keeps reclaiming until free memory is back above the "high" watermark.

When kswapd can't keep up, you hit direct reclaim. This is where latency spikes happen. This is where your P99s go to hell.
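One way to tell which kind of reclaim a box has been doing is to compare the scan counters in /proc/vmstat. This is a sketch, not a definitive tool: reclaim_share is a hypothetical name, and it assumes the consolidated counter names pgscan_kswapd and pgscan_direct used by recent kernels (older kernels split them per zone, which this does not handle).

```shell
# reclaim_share prints pages scanned by kswapd and by direct reclaim since
# boot, plus the direct-reclaim share, from a vmstat-style file.
reclaim_share() {
    awk '
        # Exact name matches, so pgscan_direct_throttle is not counted.
        $1 == "pgscan_kswapd" { kswapd = $2 }
        $1 == "pgscan_direct" { direct = $2 }
        END {
            total = kswapd + direct
            pct = (total > 0) ? 100 * direct / total : 0
            printf "kswapd=%.0f direct=%.0f direct_share=%.1f%%\n", kswapd, direct, pct
        }
    ' "${1:-/proc/vmstat}"
}
```

The counters are cumulative, so the interesting signal is how they change between two runs. A direct share that keeps growing suggests kswapd is not keeping up.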

The /proc/meminfo metrics that matter:

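Active(file):         File-backed pages used recently; reclaimed after the inactive list
Inactive(file):       File-backed pages not used recently; first to be reclaimed
Active(anon):         Anonymous pages (heap, stack) used recently; not file-backed
Inactive(anon):       Anonymous pages not used recently; first candidates for swap
Shmem:                Shared memory and tmpfs; swap-backed, but counted in Cached
SReclaimable:         Slab memory the kernel can reclaim (e.g. dentry and inode caches)
SUnreclaim:           Slab memory the kernel cannot reclaim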

Run this on your servers and actually read the output:

grep -E '^(Active|Inactive|Shmem|SReclaimable|SUnreclaim|Buffers|Cached|AnonPages|Writeback)' /proc/meminfo

Understanding these numbers tells you where your memory is actually going.
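Raw KiB counts are hard to eyeball, so here is a rough sketch that turns the main categories into percentages of MemTotal. The name meminfo_breakdown and the grouping are my own choices for illustration. Note that shmem pages sit on the anon lists, so they show up under anon here.

```shell
# meminfo_breakdown prints anon, file, and slab memory as a share of MemTotal,
# read from a meminfo-style file (default /proc/meminfo).
meminfo_breakdown() {
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "Active(anon):" || $1 == "Inactive(anon):" { anon += $2 }
        $1 == "Active(file):" || $1 == "Inactive(file):" { file += $2 }
        $1 == "SReclaimable:" { slab_r = $2 }
        $1 == "SUnreclaim:"   { slab_u = $2 }
        END {
            if (total == 0) exit 1
            printf "%-18s %5.1f%%\n", "anon",               100 * anon   / total
            printf "%-18s %5.1f%%\n", "file",               100 * file   / total
            printf "%-18s %5.1f%%\n", "slab_reclaimable",   100 * slab_r / total
            printf "%-18s %5.1f%%\n", "slab_unreclaimable", 100 * slab_u / total
        }
    ' "${1:-/proc/meminfo}"
}
```

A large file share is the page cache doing its job. A large anon or slab_unreclaimable share is memory the kernel cannot simply drop.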

The OOM Killer Is Not Random

When memory is truly exhausted and the kernel can't reclaim enough, the OOM killer activates. Most people know this. What they don't know is that it's not random, and you can influence it.

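Each process has an oom_score (readable at /proc/$PID/oom_score) based on:

  • Memory footprint: resident memory, swap, and page tables, as a share of the memory the process is allowed to use
  • oom_score_adj (-1000 to +1000, configurable; -1000 exempts the process entirely)

Older kernels also weighed process age and CPU time. Modern kernels do not.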

You can tune this:

# Make a process a very unlikely target (-1000 would exempt it entirely).
# Lowering the value requires root (CAP_SYS_RESOURCE).
echo -900 > /proc/$PID/oom_score_adj

# Make it a preferred target
echo 500 > /proc/$PID/oom_score_adj

Or in systemd:

[Service]
OOMScoreAdjust=-1000
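Before adjusting anything, it helps to see who the kernel would pick right now. The sketch below ranks processes by their current oom_score. oom_ranking is a hypothetical helper, and the optional proc-root argument exists only so it can be pointed at a test directory.

```shell
# oom_ranking prints "score adj pid name", highest oom_score first.
# $1: proc root (default /proc), $2: number of rows (default 10).
oom_ranking() {
    proc_root=${1:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        # Processes can exit mid-scan, so skip any that fail to read.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s %s %s %s\n' "$score" "$adj" "${dir##*/}" "$name"
    done | sort -k1,1nr | head -n "${2:-10}"
}
```

If the process at the top is the one you care about most, that is the moment to think about oom_score_adj, or better, about why it is so large.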

But here's what actually matters: if you're getting OOM-killed, you're asking the wrong question. The question isn't "which process should die?" The question is "why is memory being exhausted in the first place?"

Memory leaks. Memory bloat. Working sets larger than RAM. These are the real problems. The OOM killer is just the symptom.

Practical Debugging: What I Actually Use

After years of production incidents, here's my debugging workflow:

Step 1: Get the real picture

# Processes using more than 1% of RAM, largest first. %MEM is based on RSS,
# which includes mapped file pages as well as heap and stack.
ps aux --sort=-%mem | awk 'NR==1; $4 > 1.0' | head -20

# Per-process USS/PSS/RSS with unit suffixes (smem is a separate package)
smem -r -k

# See how much is free, available, and sitting in cache
grep -E '^(MemFree|MemAvailable|Buffers|Cached|Active|Inactive)' /proc/meminfo

Step 2: Understand the pressure

# vmstat with 1-second intervals shows reclaim activity
vmstat 1

# Look for non-zero "si" (swap in) and "so" (swap out) columns
# Any regular swapping is a problem
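If you want the same signal without watching a terminal, the cumulative pswpin and pswpout counters in /proc/vmstat can be diffed between two snapshots. This is a sketch under that assumption. swap_delta is a hypothetical helper, and it reports pages per second rather than the units vmstat prints.

```shell
# swap_delta BEFORE AFTER SECONDS
# Compares two vmstat-style snapshots and prints swap-in/out rates in pages/s.
swap_delta() {
    awk -v secs="${3:-1}" '
        # First file: remember every counter value by name.
        FNR == NR { first[$1] = $2; next }
        # Second file: turn the two swap counters into per-second rates.
        $1 == "pswpin"  { in_rate  = ($2 - first[$1]) / secs }
        $1 == "pswpout" { out_rate = ($2 - first[$1]) / secs }
        END { printf "swap-in %.1f pages/s, swap-out %.1f pages/s\n", in_rate, out_rate }
    ' "$1" "$2"
}
```

For example: `cat /proc/vmstat > /tmp/a; sleep 10; cat /proc/vmstat > /tmp/b; swap_delta /tmp/a /tmp/b 10`. Rates that stay above zero across several samples are the sustained swapping to worry about.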

Step 3: Find the leaks

# In Python/Java/Go? Use their memory profilers.
# But also check for:
# - fd leaks (lsof or /proc/$PID/fd)
# - mmap'd files
# - Memory that isn't shown in ps (POSIX shm, etc.)

# Check for mapped files
cat /proc/$PID/maps | head -20
cat /proc/$PID/smaps | grep -E '^(Size|Rss|Pss|Shared_Clean|Shared_Dirty|Private_Clean|Private_Dirty|Swap)'
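The smaps output runs to thousands of lines for a large process, so summing it is usually more useful than reading it. A minimal sketch, with smaps_totals as a hypothetical helper name:

```shell
# smaps_totals FILE
# Sums Rss, Pss, private (clean + dirty), and Swap in KiB across all mappings.
smaps_totals() {
    awk '
        $1 == "Rss:"  { rss += $2 }
        $1 == "Pss:"  { pss += $2 }
        $1 == "Private_Clean:" || $1 == "Private_Dirty:" { priv += $2 }
        # Exact match, so SwapPss is not counted twice.
        $1 == "Swap:" { swap += $2 }
        END { printf "Rss=%d Pss=%d Private=%d Swap=%d (KiB)\n", rss, pss, priv, swap }
    ' "$1"
}
```

Call it as `smaps_totals /proc/$PID/smaps`. On kernels that provide /proc/$PID/smaps_rollup, that file already holds the pre-summed values and is cheaper to read.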

Step 4: The one-liner I use constantly

# Top 20 processes by RSS (KiB). ps has no shared-memory column, so SHR comes from statm or top.
ps -eo pid,rss,comm --sort=-rss | head -21
# Shared resident pages for one process (field 3 of statm; top reports it as SHR)
awk '{print "shared pages:", $3}' /proc/$PID/statm

Shared memory matters more than people realize. Two processes sharing 10Gi of shared libraries only "cost" that 10Gi once in physical RAM, yet each process's RSS counts the full 10Gi. If you just add up RSS, you'll think you're using way more than you are.
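To see how far summed RSS overstates real usage, compare it with summed PSS, which splits each shared page evenly among the processes mapping it. The sketch assumes a kernel that exposes /proc/$PID/smaps_rollup and enough privileges to read it for the processes you care about. rss_vs_pss is a hypothetical helper.

```shell
# rss_vs_pss [PROC_ROOT]
# Adds up Rss and Pss (KiB) over every readable smaps_rollup file.
rss_vs_pss() {
    # Unreadable or vanished processes are silently skipped.
    cat "${1:-/proc}"/[0-9]*/smaps_rollup 2>/dev/null |
        awk '
            $1 == "Rss:" { rss += $2 }
            $1 == "Pss:" { pss += $2 }
            END { printf "sum(Rss)=%.0f KiB sum(Pss)=%.0f KiB\n", rss, pss }
        '
}
```

With two processes that each have 200 KiB private and share 800 KiB, RSS sums to 2000 KiB while PSS sums to 1200 KiB, which matches what is actually resident.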

The Swap Misconception

Everyone tells you "avoid swap." They're half right.

You want to avoid swapping—the active movement of pages to disk because RAM is full. That's terrible for performance.

But a little swap being used for rarely-accessed anonymous memory? That's fine. It lets the kernel keep more of the page cache hot, which is where you actually care about performance.

What you want to avoid is:

  • Constant swap in/out (thrashing)
  • Non-zero si/so in vmstat over sustained periods
  • Swap being used while page cache is small (that means you're short on RAM)

What you don't need to panic about:

  • A few gigabytes of swap used
  • swappiness settings (unless you're doing specific tuning)
  • Swap used while page cache is healthy
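To check whether the swap in use belongs to something idle or to your hot service, look at the VmSwap field in /proc/$PID/status. A minimal sketch follows. swap_by_process is a hypothetical helper, and it only keeps the first word of the process name.

```shell
# swap_by_process [PROC_ROOT] [ROWS]
# Prints "VmSwap_KiB pid name" for processes with pages in swap, largest first.
swap_by_process() {
    for status in "${1:-/proc}"/[0-9]*/status; do
        # Kernel threads have no VmSwap line and are skipped by the swap > 0 test.
        awk '
            $1 == "Name:"   { name = $2 }
            $1 == "Pid:"    { pid = $2 }
            $1 == "VmSwap:" { swap = $2 }
            END { if (swap > 0) print swap, pid, name }
        ' "$status" 2>/dev/null
    done | sort -k1,1nr | head -n "${2:-10}"
}
```

A few hundred MiB parked under a rarely used daemon is the harmless case. The same amount under your latency-sensitive service, together with non-zero si/so, is not.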

Working Set Size: The Metric Nobody Tracks

Here's the thing that finally made it click for me: your application doesn't use X GB of memory. Your application accesses Y GB of memory regularly.

That 8GB Java heap? Maybe only 3GB is actively in use. The rest is objects from old requests that just haven't been GC'd yet. RSS can't tell you this: it counts resident pages, not the pages you're actually touching. The kernel tracks recent access only coarsely, through per-page referenced bits.

This is why working set size (WSS) matters and why you can't just look at RSS and know if you have a problem.
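One rough way to estimate WSS is to clear a process's referenced bits, wait, and then sum the Referenced field in its smaps. The sketch below assumes a kernel with /proc/$PID/clear_refs and smaps enabled, plus privileges over the target process. Both function names are hypothetical. Clearing the bits also perturbs the kernel's own page aging, so treat this as a diagnostic, not something to run in a loop on production.

```shell
# referenced_kib FILE
# Sums the Referenced field (KiB) across all mappings in an smaps-style file.
referenced_kib() {
    awk '$1 == "Referenced:" { sum += $2 } END { print sum + 0 }' "$1"
}

# wss_estimate PID [SECONDS]
# Clears referenced bits, waits, then prints KiB touched during the window.
wss_estimate() {
    pid=$1
    window=${2:-10}
    # Writing 1 resets the referenced/accessed state for all of the pages.
    echo 1 > "/proc/$pid/clear_refs" || return 1
    sleep "$window"
    referenced_kib "/proc/$pid/smaps"
}
```

Calling `wss_estimate $PID 30` gives an approximation of how much memory the process touched in 30 seconds, which is usually far less than its RSS.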

Tools like memusage (from glibc) or language-specific profilers show what you allocate, not what you touch. A rough practical proxy is the page fault counters:

# Check page faults - high numbers mean you're touching pages that weren't mapped yet
# Major faults = had to read from disk
# Minor faults = no disk read needed (page already in RAM, or a fresh zero page)
ps -o pid,min_flt,maj_flt,cmd -p "$PID"

If you have high major page faults, you're swapping (or accessing memory-mapped files that got evicted). If you have high minor faults, you're just causing the kernel to map pages—less of a problem, but still overhead.
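The ps counters are totals since the process started, so what you want is the rate. The same numbers live in /proc/$PID/stat (fields 10 and 12 in proc(5)), and sampling them twice gives you that. A minimal sketch, with fault_counts as a hypothetical helper:

```shell
# fault_counts FILE
# Prints "minflt majflt" from a /proc/PID/stat-style file.
fault_counts() {
    # The command name (field 2) may contain spaces or parentheses, so strip
    # everything up to the last ") " first. After that, minflt and majflt
    # are fields 8 and 10.
    sed 's/^.*) //' "$1" | awk '{ print $8, $10 }'
}
```

Run `fault_counts /proc/$PID/stat` twice, a few seconds apart, and subtract. A major-fault count that keeps climbing while the service is slow points at pages being read back from disk.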

The Practical Takeaway

Here's what I want you to remember:

  1. Page cache is not your enemy. It's free performance. Stop panicking when buff/cache grows.

  2. Learn to read /proc/meminfo. The kernel is telling you exactly what's happening. Most people just look at free.

  3. OOM kills are not random. Tune oom_score_adj if needed, but find the root cause.

  4. Working set size matters more than total memory usage. An application with 60GB RSS might only be actively using 20GB. This is why overprovisioning isn't always the fix.

  5. Page faults matter more than you think. If your service is slow and has a high major fault rate, your memory access patterns might be the problem, not raw throughput.

The next time you SSH into a server and see free showing "low memory," don't panic. Read the full picture. The kernel has been doing this for decades—it knows what it's doing.

Your job is to understand what it's doing, not fight it.


If you've run Linux in production, you probably have a memory horror story of your own. The systems we run are stranger than we admit.