TCP Keepalive Is Not What You Think

I've been debugging network issues in production for longer than I'd like to admit. Last year, I watched a service eat 40% more CPU than expected because of a keepalive configuration nobody understood. The year before that, I found a critical service silently dropping connections after 2 hours because someone set tcp_keepalive_time to 7200 and called it a day.

TCP keepalive is one of those features that looks simple on the surface but has enough gotchas to fill a small book. Most engineers configure it once, copy-paste some values from Stack Overflow, and never think about it again. Until something breaks at 3 AM.

Let's dig into what keepalive actually does, because I promise it's not what you think.

What Keepalive Is Actually For

The core idea: TCP keepalive lets you detect, without sending any application data, that a peer has disappeared. If a connection sits idle for long enough, the kernel sends probe packets to check whether the other side is still there.

Sounds simple. It's not.

The keepalive mechanism has three tunable parameters (on Linux, at least):

tcp_keepalive_time    - Seconds a connection must sit idle before the first probe is sent
tcp_keepalive_intvl   - Seconds between probes once probing has started
tcp_keepalive_probes  - Number of unanswered probes before the kernel gives up

The default values on most Linux systems:

tcp_keepalive_time = 7200       (2 hours!)
tcp_keepalive_intvl = 75
tcp_keepalive_probes = 9

Here's what that means in practice: you open a connection, it goes idle, and the kernel waits 2 hours before even starting to check if the other side is alive. Two. Full. Hours.

If you're running a long-lived connection that needs to detect failures faster, these defaults will betray you completely.
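Before trusting those numbers, it's worth checking what your own machines are actually set to. Here's a minimal sketch that reads the three sysctls from procfs. It assumes Linux with /proc mounted, and the function names are mine, not anything standard.

```c
#include <stdio.h>

/* Read one integer setting from a procfs-style file.
 * Returns the value, or -1 if the file is missing or unparsable. */
long read_sysctl_long(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Print the three system-wide keepalive defaults (Linux only). */
void print_keepalive_defaults(void)
{
    static const char *names[] = {
        "tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"
    };
    char path[128];

    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        snprintf(path, sizeof(path), "/proc/sys/net/ipv4/%s", names[i]);
        printf("%s = %ld\n", names[i], read_sysctl_long(path));
    }
}
```

A value of -1 means the file couldn't be read, which usually just means you're not on Linux or you're in a container without /proc.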

The Probe Sequence Nobody Explains Correctly

Let me walk you through exactly what happens when keepalive triggers:

Connection idle for tcp_keepalive_time (default 7200 seconds)
Kernel sends first keepalive probe
If no response, wait tcp_keepalive_intvl seconds, send another probe
Repeat until you've sent tcp_keepalive_probes probes
If still no response, the kernel kills the connection, and your next read or write on the socket fails (typically with ETIMEDOUT)
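Your application only finds out about that last step when it next touches the socket. Here's a rough sketch of how I'd sort the outcomes of a recv() call. The enum and function names are hypothetical, and I'm assuming Linux behavior, where exhausted keepalive probes usually surface as ETIMEDOUT (other errors such as EHOSTUNREACH are possible if the network reported something along the way).

```c
#include <errno.h>
#include <sys/types.h>

enum conn_state { CONN_OK, CONN_CLOSED_BY_PEER, CONN_RETRY, CONN_DEAD };

/* Interpret the result of recv()/read() on a keepalive-enabled socket.
 * n is the call's return value; err is errno captured right after it. */
enum conn_state classify_recv(ssize_t n, int err)
{
    if (n > 0)
        return CONN_OK;               /* got data */
    if (n == 0)
        return CONN_CLOSED_BY_PEER;   /* orderly shutdown: peer sent FIN */
    if (err == EAGAIN || err == EWOULDBLOCK || err == EINTR)
        return CONN_RETRY;            /* nothing wrong, just try again */
    /* Everything else is treated as fatal. This is where ETIMEDOUT from
     * exhausted keepalive probes lands, alongside ECONNRESET and friends. */
    return CONN_DEAD;
}
```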

So with defaults, after 2 hours of idle time, you get:

Probe 1: after 7200 seconds
Probe 2: after 7200 + 75 seconds
Probe 3: after 7200 + 150 seconds
...
Probe 9: after 7200 + 600 seconds

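If probe 9 also goes unanswered, the kernel waits one more interval and then gives up. Total time to detect a dead connection: 7200 + (9 * 75) = 7875 seconds (about 2 hours and 11 minutes).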

This is unacceptable for most production services. Your Redis connection drops, your API gateway holds a stale connection, and your users get mysterious 5xx errors until something notices and reconnects.
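If you want to redo this arithmetic for your own settings, here's a tiny helper. Treat it as a sketch of the worst case: it assumes the peer vanishes right after the last real traffic, and it counts the final interval the kernel waits after the last probe before giving up. The function name is my own.

```c
/* Worst-case seconds between the last traffic on a connection and the
 * kernel declaring it dead: the idle period, then one interval per probe
 * (the last interval is the wait after the final probe). */
long keepalive_detection_seconds(long idle, long intvl, long probes)
{
    return idle + intvl * probes;
}
```

With the Linux defaults, keepalive_detection_seconds(7200, 75, 9) comes out to 7875.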

The SO_KEEPALIVE Trap

Here's where it gets weird. Most people set SO_KEEPALIVE on their socket and assume they're done:

int yes = 1;
setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &yes, sizeof(yes));

This enables keepalive, but it uses those terrible system-wide defaults. On Linux, you can override them per socket with additional options:

// Set the idle time before probes start
int idle = 60;  // 60 seconds
setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));

// Set interval between probes  
int interval = 10;
setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));

// Set number of probes before giving up
int probes = 3;
setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes));

With these values, you detect a dead connection in roughly 60 + (3 * 10) = 90 seconds instead of 2 hours. Much better. But wait...
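The snippets above ignore return values, which is exactly how misconfigurations go unnoticed. Here's a sketch of the same calls wrapped in one helper that checks each setsockopt. The helper name and signature are hypothetical, and TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are the Linux option names, so this won't port as-is to every platform.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on fd and override the three timers.
 * idle and intvl are in seconds; probes is a count.
 * Returns 0 on success, -1 on the first failing setsockopt (errno is set). */
int enable_keepalive(int fd, int idle, int intvl, int probes)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
    return 0;
}
```

Calling enable_keepalive(sockfd, 60, 10, 3) applies the same values as above. If you're paranoid (you should be), read them back with getsockopt to confirm they stuck.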

The Real Problem: Bandwidth and CPU

Here's the part nobody talks about in those Stack Overflow answers.

Every keepalive probe sends a TCP packet. The peer must respond. This is extra network traffic on your connection, extra CPU processing on both ends, and extra latency if you're trying to do low-latency networking.

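I benchmarked this once on a high-throughput service. With probes going out every 75 seconds (the default interval), we were generating roughly 1 probe per second across our connection pool. Not huge, but it adds up when you're running thousands of connections. One caveat: the interval only governs retries after an unanswered probe. On a healthy connection the probe is ACKed and the idle timer starts over, so steady-state probe traffic is set by the idle time.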

More importantly: the kernel has to wake up to process these probes. If you're in a latency-sensitive path, keepalive probes can add jitter. I've seen this cause occasional spikes in p99 latencies on services that shouldn't have any business sending packets during idle periods.
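To get a feel for the numbers before you measure, here's a back-of-the-envelope estimator. It's only a sketch, and it assumes every connection in the pool is idle and healthy, so each one sends a single probe per idle period and gets one ACK back. The function name and the example pool size are made up for illustration.

```c
/* Estimated keepalive probes per second sent by this host in steady state,
 * assuming every connection is idle and healthy (one probe per idle period).
 * Returns 0.0 for a non-positive idle time. */
double keepalive_probe_rate(long idle_connections, long idle_seconds)
{
    if (idle_seconds <= 0)
        return 0.0;
    return (double)idle_connections / (double)idle_seconds;
}
```

For a hypothetical pool of 6000 idle connections with a 60-second idle time, that works out to 100 probes per second outbound, plus the same number of ACKs coming back.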

What About HTTP/2 and gRPC?

This is where things get genuinely confusing. If you're using HTTP/2 or gRPC over TCP, you might think TCP keepalive covers you. It doesn't.

HTTP/2 has its own keepalive mechanism at the connection level: the PING frame. The distinction matters:

TCP keepalive: detects if the TCP connection is dead
HTTP/2 ping: detects if the application-level peer is alive

If your application freezes but the TCP socket is technically alive (no network failure), TCP keepalive won't help you. The peer's kernel answers keepalive probes on its own, without ever involving the application, so the connection looks healthy while your service is hung.

This is why gRPC clients typically implement their own application-level health checks. They don't trust TCP keepalive to tell them if the service is actually responsive.
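To make the difference concrete, here's a sketch of an application-level heartbeat over a plain socket. The "PING\n"/"PONG\n" wire format is something I invented for this example (it is not HTTP/2's PING frame or gRPC's protocol), and the timeout applies to each read rather than to the whole exchange, which is a simplification. The point is that the reply has to come from the peer's application code, not its kernel.

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical wire format: we send "PING\n", the peer's app replies "PONG\n". */
#define HB_PING "PING\n"
#define HB_PONG "PONG\n"
#define HB_LEN  (sizeof(HB_PING) - 1)

/* Send one heartbeat on fd and wait up to timeout_ms per read for the reply.
 * Returns 0 if the peer answered correctly, -1 on timeout, error, or bad reply. */
int heartbeat(int fd, int timeout_ms)
{
    char reply[HB_LEN];
    size_t got = 0;

    /* MSG_NOSIGNAL (Linux): get an error instead of SIGPIPE on a dead socket. */
    if (send(fd, HB_PING, HB_LEN, MSG_NOSIGNAL) != (ssize_t)HB_LEN)
        return -1;

    while (got < HB_LEN) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, timeout_ms) <= 0)
            return -1;                      /* timed out or poll failed */

        ssize_t n = recv(fd, reply + got, HB_LEN - got, 0);
        if (n <= 0)
            return -1;                      /* peer closed or socket error */
        got += (size_t)n;
    }
    return memcmp(reply, HB_PONG, HB_LEN) == 0 ? 0 : -1;
}
```

A hung service never writes the reply, so heartbeat() returns -1 after the timeout even though TCP keepalive would still be reporting a perfectly healthy connection.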

The SO_REUSEADDR Gotcha (Bonus Round)

While we're on the topic of socket options nobody understands: SO_REUSEADDR.

This is commonly explained as "lets you bind to a port that still has connections in TIME_WAIT." That's true, but it's not the full story.

SO_REUSEADDR also lets you bind to an address that's currently in use by another socket in certain states. This is intentional for multi-homed hosts and certain load balancing scenarios, but it can bite you if you're not careful.
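One defensive habit is to bind to an explicit address instead of the wildcard, and then check what you actually got. Here's a minimal sketch of that. The helper name is hypothetical and it only handles IPv4. Note that SO_REUSEADDR has to be set before bind() to have any effect.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket with SO_REUSEADDR set and bind it to one explicit
 * IPv4 address. Pass port 0 to let the kernel pick a free port.
 * Returns the bound fd, or -1 on failure. */
int bind_reuseaddr(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    /* Order matters: parse the address, set the option, then bind. */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After binding, a getsockname() call tells you which address and port the socket really ended up on, which is a cheap check to log at startup.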

I once spent two days debugging a "port already in use" error that turned out to be SO_REUSEADDR misbehaving on a system with multiple network interfaces. The socket bound to the wrong address, and traffic went somewhere unexpected.

The lesson: understand what each socket option actually does, not what the one-line Stack Overflow answer says it does.

What You Should Actually Do

Here's my practical take, based on shipping systems that need to detect failures quickly:

For short-lived request/response services: Don't use TCP keepalive at all. If your connection is idle, you probably want to close it and reconnect when needed. The overhead of keepalive isn't worth it.

For persistent connections (databases, message queues): Use keepalive with values tuned to your SLA. If you need sub-minute failure detection:

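setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &(int){1}, sizeof(int));  // the timers below do nothing without this
setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPIDLE, &(int){10}, sizeof(int));
setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPINTVL, &(int){5}, sizeof(int));
setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPCNT, &(int){3}, sizeof(int));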

This gives you detection in roughly 25 seconds.

For long-lived stateful connections: Consider application-level heartbeats instead. They give you more flexibility (you can include payload data, sequence numbers, round-trip time measurements) and they work correctly even when TCP keepalive wouldn't trigger.

For everything: Measure the actual impact. Enable keepalive metrics, watch your CPU usage, check your packet rates. Don't assume the defaults are fine just because nobody's complained yet.

The Bottom Line

TCP keepalive is a blunt instrument. It detects one very specific failure mode (the peer has vanished from the network) and misses many others (peer is frozen, peer is overloaded, network is partitioned at the application layer).

Use it as part of a layered approach: TCP keepalive for basic liveness, application-level health checks for actual responsiveness, and good observability so you know when things are failing before your users do.

And for the love of all that is holy, change those defaults. 7200 seconds is not a reasonable idle timeout for any modern service.

If you've got a production horror story involving keepalive misconfiguration, I've got about six of them sitting in my notes from the last decade. The pattern is always the same: someone copied an example, the example used defaults, and then something subtle broke at the worst possible time.

The defaults exist for a reason (minimizing unnecessary traffic on idle connections), but production services are rarely truly idle. Know what you're optimizing for.