Debugging Persistent FIN_WAIT2 Connections in Kubernetes kube-proxy: TCP Socket Cleanup Strategies


During routine monitoring of our Kubernetes cluster, we noticed kube-proxy holding TCP connections in FIN_WAIT2 state indefinitely, despite tcp_fin_timeout being set to 60 seconds. That contradicts the behavior we expected from the kernel, where FIN_WAIT2 sockets are reaped once the timeout expires.

# Persistent connections found via:
$ ss -tanoe | grep FIN_WAIT2
tcp    FIN-WAIT-2 0      0      10.244.0.1:48340   10.244.0.35:56339   timer:(keepalive,119min,0)

The Linux kernel documentation says that tcp_fin_timeout (/proc/sys/net/ipv4/tcp_fin_timeout) controls how long an orphaned connection, one no longer referenced by any application, stays in FIN_WAIT2 before it is aborted. We observed three scenarios where sockets linger anyway (the sketch after this list shows how to check the first one):

  1. Sockets never orphaned: kube-proxy still holds the file descriptor, so the socket never counts as orphaned and tcp_fin_timeout never starts
  2. Keepalive conflicts: a TCP keepalive timer keeps running on the half-closed socket, as in the timer:(keepalive,119min,0) output above
  3. Network namespace issues: container network namespaces and their connection-tracking state affecting socket cleanup
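Scenario 1 can be checked directly from /proc. Below is a minimal sketch (assumptions: a Linux node, kube-proxy's PID taken from the netstat output later in this post, and the IPv4 table /proc/net/tcp; repeat for /proc/net/tcp6 as needed) that lists FIN_WAIT2 sockets and reports whether the given PID still holds their file descriptors, i.e. whether the kernel is even allowed to treat them as orphaned:

// finwait2check.go: list FIN_WAIT2 sockets and test whether a PID holds them.
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// finWait2Inodes parses /proc/net/tcp and returns the inode of every socket
// whose state column is 05 (TCP_FIN_WAIT2).
func finWait2Inodes() ([]string, error) {
    data, err := os.ReadFile("/proc/net/tcp")
    if err != nil {
        return nil, err
    }
    var inodes []string
    for _, line := range strings.Split(string(data), "\n")[1:] { // skip header
        fields := strings.Fields(line)
        if len(fields) > 9 && fields[3] == "05" {
            inodes = append(inodes, fields[9])
        }
    }
    return inodes, nil
}

// heldByPid reports whether the process still has an fd open on socket:[inode].
func heldByPid(pid, inode string) bool {
    fds, err := os.ReadDir(filepath.Join("/proc", pid, "fd"))
    if err != nil {
        return false
    }
    for _, fd := range fds {
        target, err := os.Readlink(filepath.Join("/proc", pid, "fd", fd.Name()))
        if err == nil && target == "socket:["+inode+"]" {
            return true
        }
    }
    return false
}

func main() {
    pid := "14125" // kube-proxy PID on our node; adjust for yours
    inodes, err := finWait2Inodes()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for _, inode := range inodes {
        fmt.Printf("FIN_WAIT2 socket inode %s still held by pid %s: %v\n",
            inode, pid, heldByPid(pid, inode))
    }
}

If this reports true for the kube-proxy PID, the socket is not orphaned, so tcp_fin_timeout is simply not in play, which matches the keepalive timer shown by ss above.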

Examining kube-proxy's connection handling reveals:

// Simplified kube-proxy connection flow
func handleConnection(inbound net.Conn) {
    outbound, err := net.Dial("tcp", backendAddr)
    if err != nil {
        return
    }
    
    // Bi-directional copying without proper cleanup
    go io.Copy(outbound, inbound)
    go io.Copy(inbound, outbound)
    
    // Missing connection state tracking
}

The key issues are:

  • No connection tracking between goroutines
  • Missing context cancellation handling
  • Lack of TCP state machine awareness

Immediate Mitigation:

# tcp_abort_on_overflow sends RST on accept-queue overflow; it does not reap
# FIN_WAIT2 sockets, but it limits further half-open buildup under load
echo 1 > /proc/sys/net/ipv4/tcp_abort_on_overflow
# Shorten the FIN_WAIT2 timeout for orphaned sockets
sysctl -w net.ipv4.tcp_fin_timeout=30

Code-level Fixes:

// Improved connection handling: context-aware, closes both halves of the
// proxy pair, and unblocks the copy goroutines on cancellation so the
// deferred Close calls actually run
func handleConnection(ctx context.Context, inbound net.Conn) {
    defer inbound.Close()

    var d net.Dialer
    outbound, err := d.DialContext(ctx, "tcp", backendAddr)
    if err != nil {
        return
    }
    defer outbound.Close()

    // Copy in both directions and signal when either side finishes
    done := make(chan struct{}, 2)
    go func() {
        io.Copy(outbound, inbound)
        done <- struct{}{}
    }()
    go func() {
        io.Copy(inbound, outbound)
        done <- struct{}{}
    }()

    select {
    case <-ctx.Done():
        // Force both io.Copy calls to return immediately so the sockets get
        // closed and orphaned; tcp_fin_timeout can then take effect
        inbound.SetDeadline(time.Now())
        outbound.SetDeadline(time.Now())
    case <-done:
    }
}

To diagnose stuck sockets:

# Check socket timers (ss shows them directly; in /proc/net/tcp, FIN_WAIT2 is state 05)
ss -tano state fin-wait-2

# Trace socket events
perf probe --add 'tcp_set_state'
perf record -e probe:tcp_set_state -a -g sleep 60

Recommended sysctl settings for kube-proxy nodes:

net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_rfc1337 = 1
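Where you control the proxying code itself, the same limit can also be applied per socket: the TCP_LINGER2 socket option overrides net.ipv4.tcp_fin_timeout for a single socket (see tcp(7)). A minimal sketch, assuming a Linux node; the backend address is a placeholder:

// Per-socket FIN_WAIT2 lifetime via TCP_LINGER2 (Linux-specific).
package main

import (
    "log"
    "net"
    "syscall"
)

// setFinWait2Timeout sets this socket's FIN_WAIT2 lifetime in seconds,
// overriding the global net.ipv4.tcp_fin_timeout for this socket only.
func setFinWait2Timeout(conn *net.TCPConn, seconds int) error {
    raw, err := conn.SyscallConn()
    if err != nil {
        return err
    }
    var sockErr error
    ctrlErr := raw.Control(func(fd uintptr) {
        sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_TCP, syscall.TCP_LINGER2, seconds)
    })
    if ctrlErr != nil {
        return ctrlErr
    }
    return sockErr
}

func main() {
    conn, err := net.Dial("tcp", "10.244.0.35:8080") // placeholder backend address
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    if err := setFinWait2Timeout(conn.(*net.TCPConn), 30); err != nil {
        log.Printf("setting TCP_LINGER2: %v", err)
    }
    // ... proxy traffic as usual; once the socket is orphaned it leaves
    // FIN_WAIT2 after 30 seconds regardless of the sysctl value
}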

To restate the broader symptom: in Kubernetes environments we often see TCP connections lingering in FIN_WAIT2 for hours despite tcp_fin_timeout being set to 60 seconds, and it is particularly common with kube-proxy handling service traffic between pods.

# Sample netstat output showing stuck connections
$ sudo netstat -tpn | grep FIN_WAIT2
tcp6 0 0 10.244.0.1:33132 10.244.0.35:48936 FIN_WAIT2 14125/kube-proxy
tcp6 0 0 10.244.0.1:48340 10.244.0.35:56339 FIN_WAIT2 14125/kube-proxy

The normal TCP connection termination sequence should follow:

  1. Local endpoint sends FIN (enters FIN_WAIT1)
  2. Receives ACK for FIN (enters FIN_WAIT2)
  3. Waits for remote FIN (should timeout per tcp_fin_timeout)

The Linux kernel documentation states that tcp_fin_timeout (default 60s) controls how long an orphaned connection, one no longer referenced by any application, remains in FIN_WAIT2 before the kernel forcibly closes it. However, we're seeing cases where:

$ cat /proc/sys/net/ipv4/tcp_fin_timeout
60

Yet connections remain stuck for hours. This suggests one of the following (a reproduction of the first case follows the list):

  • The socket is still referenced by userspace (kube-proxy holds the fd), so it never becomes orphaned and the timeout never starts
  • The kernel is failing to enforce the timeout
  • Special socket options (such as TCP_LINGER2) are in effect
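The first case is easy to reproduce outside of kube-proxy. In the sketch below (the peer address is a placeholder; the peer just has to keep its end open), the client half-closes the connection and then keeps the file descriptor alive, which is enough to pin the socket in FIN_WAIT2 indefinitely because it never becomes orphaned:

// finwait2repro.go: hold a socket in FIN_WAIT2 by never fully closing it.
package main

import (
    "log"
    "net"
    "time"
)

func main() {
    conn, err := net.Dial("tcp", "10.244.0.35:8080") // placeholder peer that never closes its side
    if err != nil {
        log.Fatal(err)
    }
    tcp := conn.(*net.TCPConn)

    // Send our FIN: FIN_WAIT1, then FIN_WAIT2 once the peer ACKs it
    if err := tcp.CloseWrite(); err != nil {
        log.Fatal(err)
    }

    // Never call Close(), so the kernel cannot orphan the socket and
    // tcp_fin_timeout never starts. Observe with: ss -tan state fin-wait-2
    time.Sleep(time.Hour)
}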

To diagnose the root cause:

# Check socket options for the process
$ sudo ls -l /proc/14125/fd
$ sudo cat /proc/14125/net/tcp6

# Verify iptables rules that might affect connection tracking
$ sudo iptables -t raw -L -n -v
$ sudo conntrack -L

# Monitor connection state changes
$ sudo tcpdump -i any 'host 10.244.0.35 and port 48936'

For kube-proxy specifically, we can implement these mitigations:

# Reduce conntrack timeouts for half-closed connections
$ echo 30 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_fin_wait
$ echo 30 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait

# Tune kube-proxy's own conntrack timeouts via its command-line flags
# (newer releases expose the same settings in the KubeProxyConfiguration
# conntrack section)
apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
spec:
  containers:
  - name: kube-proxy
    command:
    - /usr/local/bin/kube-proxy
    - --conntrack-tcp-timeout-close-wait=30s

For production systems, consider these sysctl tweaks:

# Force faster socket cleanup
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15

# Increase connection tracking buckets
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_max = 4194304
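Before raising the conntrack limits, it is worth confirming the table is actually under pressure. A small sketch that compares the live entry count with the configured maximum (standard /proc/sys paths on nodes with the nf_conntrack module loaded):

// conntrackpressure.go: print conntrack table utilisation.
package main

import (
    "fmt"
    "log"
    "os"
    "strconv"
    "strings"
)

// readInt reads a single integer value from a /proc/sys file.
func readInt(path string) (int, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return 0, err
    }
    return strconv.Atoi(strings.TrimSpace(string(data)))
}

func main() {
    count, err := readInt("/proc/sys/net/netfilter/nf_conntrack_count")
    if err != nil {
        log.Fatal(err)
    }
    limit, err := readInt("/proc/sys/net/netfilter/nf_conntrack_max")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("conntrack entries: %d / %d (%.1f%% full)\n",
        count, limit, 100*float64(count)/float64(limit))
}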

When developing network services that might encounter this issue:

// Go example for proper connection handling
conn, err := net.Dial("tcp", "remote:port")
if err != nil {
    log.Fatal(err)
}
defer func() {
    if err := conn.Close(); err != nil {
        log.Printf("Error closing connection: %v", err)
    }
}()

// Set socket options before using the connection
tcpConn := conn.(*net.TCPConn)
tcpConn.SetLinger(0)  // Close() sends RST instead of FIN, skipping FIN_WAIT/TIME_WAIT (unsent data is discarded)
tcpConn.SetKeepAlive(true)
tcpConn.SetKeepAlivePeriod(30 * time.Second)