Debugging TCP Keepalive Failures: Firewall Session Timeouts and Linux Server Configuration



When network teams insist "our firewall has no idle timeout" but connections still drop after exactly 40 minutes, you're likely dealing with an unacknowledged session tracking limitation. Many enterprise firewalls implement hard-coded session timeouts despite vendor claims to the contrary.

Your initial configuration (tcp_keepalive_time=300, tcp_keepalive_intvl=300, tcp_keepalive_probes=30000) worked because:

  • The 5-minute keepalive interval prevented NAT/firewall session table expiration
  • Extremely high probe count effectively made connections persistent
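
The same values can also be set per socket instead of system-wide. A minimal Python sketch, assuming Linux (TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are Linux-specific, and the kernel caps the per-socket probe count at 127, so the 30000 above has no exact per-socket equivalent):

import socket

# Per-socket equivalent of the sysctl settings above (Linux-only options)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)    # first probe after 5 min idle
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 300)   # then probe every 5 min
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 127)     # kernel max; effectively persistent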

The problematic configuration (time=300, intvl=180, probes=10) reveals several firewall behaviors:

# Current problematic sysctl settings
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 180  
net.ipv4.tcp_keepalive_probes = 10

Firewalls often implement these invisible behaviors:

  1. TCP Middlebox Interference: Some firewalls strip or respond to keepalive packets themselves
  2. Asymmetric Session Tracking: Firewalls may track only client→server traffic as "activity"
  3. Proprietary Health Checks: Vendor-specific keepalive mechanisms override standard TCP

To confirm firewall interference, capture on both sides of the firewall and compare:

# On the Linux server (keepalive probes appear as zero-length ACKs):
tcpdump -ni any "tcp port 1025 and (tcp[13] & 0x7f != 0)"

# On the client, if accessible (same filter, so the captures are comparable):
tcpdump -ni any "tcp port 1025 and (tcp[13] & 0x7f != 0)"

Key findings to look for:

  • Missing keepalive packets on client-side captures
  • Unexpected RST packets after exactly 2400 seconds (40 min); the RST-only filter below isolates these
  • Firewall-generated ACKs instead of client responses
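
To isolate those resets, narrow the same capture to the RST bit (0x04 in the TCP flags byte):

# Capture only RST segments on the Teradata port
tcpdump -ni any "tcp port 1025 and (tcp[13] & 0x04 != 0)"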

When standard TCP keepalive fails:

1. Application-Level Keepalive (for Teradata/SSH):

# Teradata-specific heartbeat (requires client modification)
HEARTBEAT 30; -- Send empty query every 30 seconds

# SSH client option (set in ssh_config); the server-side analogue is ClientAliveInterval
ServerAliveInterval 240

2. Firewall Policy Workarounds:

# Make sure the local netfilter rules on the Linux host never drop the flow
iptables -I OUTPUT -p tcp --dport 1025 -j ACCEPT
iptables -I INPUT -p tcp --sport 1025 -m state --state ESTABLISHED -j ACCEPT
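
If a Linux/netfilter device sits anywhere in the path, its connection tracker also has an idle timeout for established flows, which is worth confirming:

# Default is 432000 s (5 days); anything much shorter can drop idle connections
sysctl net.netfilter.nf_conntrack_tcp_timeout_established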

Common firewall vendors with known TCP session issues:

Vendor      Default Timeout    Hidden Setting
Palo Alto   30 min             tcp-timeout
Cisco ASA   60 min             timeout conn
FortiGate   3600 s (60 min)    set timeout-policy

The most reliable solution is to coordinate with network teams to:

  1. Identify the actual session timeout value
  2. Configure keepalive intervals to 50-75% of that value (see the sketch below)
  3. Implement bidirectional application heartbeats
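
For the 2400 s (40 min) timeout observed here, step 2 works out to something like the following (a sketch, not a drop-in config):

# Keepalive tuned to 50% of a 2400 s firewall timeout
net.ipv4.tcp_keepalive_time = 1200   # first probe at 20 min (50% of 40 min)
net.ipv4.tcp_keepalive_intvl = 60    # retry unanswered probes every minute
net.ipv4.tcp_keepalive_probes = 9    # give up after ~9 further minutes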

When our Teradata database connections started dropping like flies after exactly 40 minutes of inactivity, we initially suspected the firewall's idle timeout. But the network team insisted their firewall had no such timeout configured. This led us down a rabbit hole of TCP keepalive tuning and packet analysis.

Our first successful configuration used extremely aggressive keepalive settings:

# sysctl settings that maintained connections indefinitely
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 300  
net.ipv4.tcp_keepalive_probes = 30000

This brute-force approach kept connections alive for days, but wasn't ideal for detecting dead clients.

When we tried more reasonable settings to balance connection maintenance and dead peer detection:

# More balanced keepalive configuration
net.ipv4.tcp_keepalive_time = 300    # 5 minutes
net.ipv4.tcp_keepalive_intvl = 180   # 3 minutes
net.ipv4.tcp_keepalive_probes = 10

We expected:

  • Active probes every 5 minutes for alive clients
  • Connection termination after ~35 minutes for dead clients (300 + 10*180 = 2100 seconds)

Instead, Wireshark showed zero keepalive packets traversing the firewall, and connections still dropped at ~40 minutes.

Several firewall behaviors could explain this:

  1. TCP Normalization: Some firewalls rewrite TCP options, potentially stripping keepalive capability
  2. Proxy Behavior: Stateful inspection firewalls may maintain their own connection tracking
  3. Silent ACKing: The firewall might respond to keepalives on behalf of clients

To isolate the issue, we recommend:

# Check whether a keepalive timer is armed on the socket; look for
# "timer:(keepalive,...)" in the output
ss -tno state established '( sport = :1025 )'

# Alternative: check via /proc (0401 is port 1025 in hex; a leading "02" in
# the timer column indicates an active keepalive timer)
grep ':0401' /proc/net/tcp

Additionally, run simultaneous packet captures on both sides of the firewall to verify where packets disappear.

When OS-level keepalives fail, implement an application-level heartbeat:

# Python example of an application-level keepalive
import socket
import time

def maintain_connection(sock):
    """Send a heartbeat byte before the firewall's idle timeout can expire."""
    while True:
        try:
            # Only safe if the wire protocol tolerates a stray null byte;
            # substitute a protocol-legal no-op message otherwise
            sock.sendall(b'\x00')
            time.sleep(240)  # 4 min, well under the 40 min firewall timeout
        except OSError:
            # Connection is gone; let the caller reconnect
            break
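
In practice this runs in a daemon thread per connection, e.g. threading.Thread(target=maintain_connection, args=(sock,), daemon=True).start(), so heartbeats continue while the main thread does real work.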

If you can identify the firewall type:

  • Cisco ASA: Adjust the TCP idle timeout with timeout conn (example below)
  • Palo Alto: Modify TCP timeout in security policy
  • Check Point: Adjust 'keepalive' service settings
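
For example, on a Cisco ASA the idle timeout for established connections can be raised globally (a sketch; exact scope and syntax vary by version and policy):

! Raise the idle connection timeout from the 60 min default to 2 hours
timeout conn 2:00:00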

Based on our experience:

  1. Verify keepalives are actually being sent at the socket level
  2. Push for firewall configuration details or exception rules
  3. Consider application-level heartbeat as a fallback
  4. Document the 40-minute pattern as evidence for network teams