During routine monitoring of our Kubernetes cluster, we noticed kube-proxy maintaining TCP connections in FIN_WAIT2 state indefinitely, despite tcp_fin_timeout being set to 60 seconds. At first glance this looks like the kernel ignoring its own FIN_WAIT2 cleanup, which is supposed to reap such sockets once the timeout expires.
# Persistent connections found via:
$ ss -tanoe | grep FIN-WAIT-2
tcp FIN-WAIT-2 0 0 10.244.0.1:48340 10.244.0.35:56339 timer:(keepalive,119min,0)
The Linux kernel documentation states that FIN_WAIT2 sockets should time out according to /proc/sys/net/ipv4/tcp_fin_timeout. However, we observed three scenarios where this doesn't hold:
- Socket Recycling: sockets that have been half-closed but are still referenced by a user-space process, and therefore never count as orphaned (reproduced in the sketch after this list)
- Keepalive Conflicts: TCP keepalive overriding fin_timeout in some implementations
- Network Namespace Issues: Container networking affecting socket cleanup
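The first scenario is easy to reproduce outside kube-proxy. A minimal sketch, assuming a backend that ACKs our FIN but keeps its own side open (the 10.244.0.35:8080 address is purely illustrative):
// Hold a half-closed socket so it parks in FIN_WAIT2. Because the process
// still references the fd, the socket is never "orphaned" and tcp_fin_timeout
// does not apply to it.
package main

import (
    "log"
    "net"
    "time"
)

func main() {
    conn, err := net.Dial("tcp", "10.244.0.35:8080") // illustrative backend
    if err != nil {
        log.Fatal(err)
    }
    tcpConn := conn.(*net.TCPConn)

    // Send our FIN but keep the fd open. If the peer ACKs without sending its
    // own FIN, `ss -tano state fin-wait-2` will show this socket for as long
    // as the process lives, regardless of tcp_fin_timeout.
    if err := tcpConn.CloseWrite(); err != nil {
        log.Fatal(err)
    }
    time.Sleep(10 * time.Minute) // simulate a goroutine that never cleans up
}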
Examining kube-proxy's connection handling reveals:
// Simplified kube-proxy connection flow
func handleConnection(inbound net.Conn) {
    outbound, err := net.Dial("tcp", backendAddr)
    if err != nil {
        return
    }
    // Bi-directional copying without proper cleanup:
    // neither connection is closed when the other side finishes
    go io.Copy(outbound, inbound)
    go io.Copy(inbound, outbound)
    // Missing connection state tracking
}
The key issues are:
- No connection tracking between goroutines
- Missing context cancellation handling
- Lack of TCP state machine awareness
Immediate Mitigation:
# Forcefully destroy lingering sockets (SOCK_DESTROY; requires CONFIG_INET_DIAG_DESTROY)
ss -K -t state fin-wait-2
# Shorten the FIN_WAIT2 timeout for orphaned sockets
sysctl -w net.ipv4.tcp_fin_timeout=30
Code-level Fixes:
// Improved connection handling
// (backendAddr is assumed to be defined elsewhere, as in the original sketch)
func handleConnection(ctx context.Context, inbound net.Conn) {
    defer inbound.Close()

    var dialer net.Dialer
    outbound, err := dialer.DialContext(ctx, "tcp", backendAddr)
    if err != nil {
        return
    }
    defer outbound.Close()

    // Copy in both directions and signal when either side finishes
    done := make(chan struct{}, 2)
    go func() {
        io.Copy(outbound, inbound)
        done <- struct{}{}
    }()
    go func() {
        io.Copy(inbound, outbound)
        done <- struct{}{}
    }()

    // Context-aware teardown: on cancellation or when one direction ends,
    // expire both deadlines so any blocked io.Copy returns
    select {
    case <-ctx.Done():
    case <-done:
    }
    inbound.SetDeadline(time.Now())
    outbound.SetDeadline(time.Now())
}
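The key change is the teardown: setting an already-expired deadline on both sockets forces any blocked io.Copy to return, so both goroutines exit and the deferred Close calls release the file descriptors. Once kube-proxy no longer references the socket, the kernel's tcp_fin_timeout can finally do its job.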
To diagnose stuck sockets:
# Check socket timers (note: /proc/net/tcp encodes FIN_WAIT2 as hex state 05, not a name)
ss -tano state fin-wait-2
# Trace socket events
perf probe --add 'tcp_set_state'
perf record -e probe:tcp_set_state -a -g sleep 60
Recommended sysctl settings for kube-proxy nodes:
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_max_orphans = 65536
# Note: a value of 0 here silently falls back to the kernel default of 8
net.ipv4.tcp_orphan_retries = 1
net.ipv4.tcp_rfc1337 = 1
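Keep in mind that tcp_fin_timeout and tcp_orphan_retries only govern orphaned sockets, so they cap cleanup time only after kube-proxy has actually closed its end; they do nothing for sockets the process still holds open.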
Stepping back: in Kubernetes environments we often observe TCP connections lingering in FIN_WAIT2 state for hours despite tcp_fin_timeout being set to 60 seconds. This is particularly common with kube-proxy components handling service traffic between pods.
# Sample netstat output showing stuck connections
$ sudo netstat -tpn | grep FIN_WAIT2
tcp6 0 0 10.244.0.1:33132 10.244.0.35:48936 FIN_WAIT2 14125/kube-proxy
tcp6 0 0 10.244.0.1:48340 10.244.0.35:56339 FIN_WAIT2 14125/kube-proxy
The normal TCP connection termination sequence should follow:
- Local endpoint sends FIN (enters FIN_WAIT1)
- Receives ACK for its FIN (enters FIN_WAIT2)
- Waits for the remote FIN; on receiving it, ACKs and moves to TIME_WAIT
- If the application has already closed the socket, the kernel instead drops the orphaned socket from FIN_WAIT2 after tcp_fin_timeout (see the sketch below)
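That wait can also be bounded at the application level. A minimal sketch, again with an illustrative backend address: send our FIN, wait a limited time for the peer's FIN, then close the socket so it either finishes cleanly or becomes orphaned and finally subject to tcp_fin_timeout.
// Bounded half-close: keep FIN_WAIT2 time under application control instead
// of relying on kernel cleanup.
package main

import (
    "io"
    "log"
    "net"
    "time"
)

func main() {
    conn, err := net.Dial("tcp", "10.244.0.35:8080") // illustrative backend
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    tcpConn := conn.(*net.TCPConn)

    // Step 1: send FIN -> kernel enters FIN_WAIT1, then FIN_WAIT2 once ACKed
    if err := tcpConn.CloseWrite(); err != nil {
        log.Fatal(err)
    }

    // Step 2: wait for the peer's FIN (read until EOF), but never longer than 30s
    tcpConn.SetReadDeadline(time.Now().Add(30 * time.Second))
    if _, err := io.Copy(io.Discard, tcpConn); err != nil {
        log.Printf("peer did not close cleanly: %v", err)
    }
    // Step 3: the deferred Close releases the fd; if the peer never sent its
    // FIN, the now-orphaned socket is finally subject to tcp_fin_timeout.
}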
The Linux kernel documentation describes tcp_fin_timeout (default 60s) as the time an orphaned connection, one no longer referenced by any application, will remain in FIN_WAIT2 before being aborted at the local end. However, we're seeing cases where:
$ cat /proc/sys/net/ipv4/tcp_fin_timeout
60
Yet connections remain stuck for hours. This suggests one of the following:
- The socket is still referenced by user space (kube-proxy), so it never counts as orphaned
- The kernel is failing to enforce the timeout
- Special socket options are in effect
To diagnose the root cause:
# Check which sockets the process still holds open
$ sudo ls -l /proc/14125/fd
$ sudo cat /proc/14125/net/tcp6
# Verify iptables rules that might affect connection tracking
$ sudo iptables -t raw -L -n -v
$ sudo conntrack -L
# Monitor connection state changes
$ sudo tcpdump -i any 'host 10.244.0.35 and port 48936'
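To confirm the first possibility, match the FIN_WAIT2 socket inodes against kube-proxy's open file descriptors. A minimal sketch; only the PID comes from the netstat output above, the rest is generic /proc parsing:
// List FIN_WAIT2 sockets from /proc/net/tcp6 and report whether their inodes
// are still open file descriptors of the process (run as root so /proc/<pid>/fd
// is readable).
package main

import (
    "fmt"
    "log"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    const pid = "14125" // kube-proxy PID observed earlier; adjust for your node

    // Collect the socket inodes the process still holds open.
    held := map[string]bool{}
    fds, err := os.ReadDir(filepath.Join("/proc", pid, "fd"))
    if err != nil {
        log.Fatal(err)
    }
    for _, fd := range fds {
        target, err := os.Readlink(filepath.Join("/proc", pid, "fd", fd.Name()))
        if err == nil && strings.HasPrefix(target, "socket:[") {
            inode := strings.TrimSuffix(strings.TrimPrefix(target, "socket:["), "]")
            held[inode] = true
        }
    }

    // Walk /proc/net/tcp6: column 4 is the state (05 = FIN_WAIT2), column 10 the inode.
    data, err := os.ReadFile("/proc/net/tcp6")
    if err != nil {
        log.Fatal(err)
    }
    for _, line := range strings.Split(string(data), "\n")[1:] {
        fields := strings.Fields(line)
        if len(fields) < 10 || fields[3] != "05" {
            continue
        }
        fmt.Printf("FIN_WAIT2 socket inode %s still held by pid %s: %v\n",
            fields[9], pid, held[fields[9]])
    }
}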
For kube-proxy specifically, we can implement these mitigations:
# Reduce conntrack timeouts for half-closed connections
$ echo 30 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_fin_wait
$ echo 30 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait
# Tighten kube-proxy's conntrack timeouts via its command-line flags
apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
spec:
  containers:
  - name: kube-proxy
    command:
    - kube-proxy
    - --conntrack-tcp-timeout-close-wait=30s
    - --conntrack-tcp-timeout-established=1h
For production systems, consider these sysctl tweaks:
# Force faster socket cleanup
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
# Increase connection tracking buckets
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_max = 4194304
When developing network services that might encounter this issue:
// Go example for proper connection handling
package main

import (
    "log"
    "net"
    "time"
)

func main() {
    conn, err := net.Dial("tcp", "remote:port")
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        if err := conn.Close(); err != nil {
            log.Printf("Error closing connection: %v", err)
        }
    }()

    // Set socket options
    tcpConn := conn.(*net.TCPConn)
    tcpConn.SetLinger(0) // Close() sends RST and skips FIN_WAIT/TIME_WAIT (unsent data is discarded)
    tcpConn.SetKeepAlive(true)
    tcpConn.SetKeepAlivePeriod(30 * time.Second)
}
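Keep in mind that SetLinger(0) is a blunt instrument: Close() resets the peer and discards any unsent data, so it suits proxies that have already forwarded the payload. For ordinary clients, a bounded half-close like the sketch shown earlier is the safer way to keep FIN_WAIT2 from lingering.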