How to Diagnose and Fix Excessive TCP CLOSE_WAIT Connections in Linux (lsof -i Analysis)



When running lsof -i | wc -l on my production server, I noticed something concerning: between 240 and 255 of roughly 420 connections were stuck in the CLOSE_WAIT state. This isn't just an academic observation; it directly impacts server performance and resource utilization.

In TCP's state machine, CLOSE_WAIT occurs when:

  1. The remote end initiates connection termination (sends FIN)
  2. Your kernel ACKs the FIN
  3. Your application hasn't closed its socket yet

This is part of the normal TCP shutdown sequence, but when these connections accumulate, it indicates an application-level problem: the kernel has done its part and is waiting for your code to call close().
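
You can reproduce the state in a few lines of Python (a minimal, self-contained sketch; watch the result with ss -tan state close-wait in another shell):


import socket
import time

# Server that accepts one connection and immediately closes it
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

# Client connects but never calls close()
client = socket.create_connection(("127.0.0.1", port))
peer, _ = server.accept()
peer.close()    # the "remote" end sends FIN; our kernel ACKs it

time.sleep(60)  # the client socket now sits in CLOSE_WAIT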

Persistent CLOSE_WAIT connections:

  • Consume file descriptors (risk hitting ulimit; see the check after this list)
  • Hold kernel resources (memory, TCP control blocks)
  • May indicate application bugs or resource leaks
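
You can see how close a process is to its descriptor limit with the standard resource module (a small sketch; the /proc path is Linux-only):


import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
open_fds = len(os.listdir("/proc/self/fd"))  # Linux-only fd listing
print(f"{open_fds} fds open, soft limit {soft}, hard limit {hard}")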

First, identify which processes are responsible:


sudo lsof -iTCP -sTCP:CLOSE_WAIT

For Java applications, you can narrow the output down to Java processes:


netstat -antp | grep CLOSE_WAIT | grep java

From production experience, these patterns frequently emerge:

1. Missing close() Calls

Example buggy Python code:


def handle_client(conn):
    try:
        data = conn.recv(1024)
        # process data
    except Exception as e:
        print(f"Error: {e}")
    # Missing conn.close()
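
The fix is to guarantee the close in a finally block (or use a context manager, shown later):


def handle_client(conn):
    try:
        data = conn.recv(1024)
        # process data
    except Exception as e:
        print(f"Error: {e}")
    finally:
        conn.close()  # runs on success and on error, releasing the fd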

2. Connection Pool Leaks

A Java connection pool issue:


// Bad practice
Connection conn = dataSource.getConnection();
try {
    // use connection
} finally {
    // Missing conn.close()
}
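
In modern Java the fix is try-with-resources; the Python equivalent is contextlib.closing. A runnable sketch using sqlite3 purely as a stand-in for a pooled connection:


import sqlite3
from contextlib import closing

# closing() guarantees close() runs even if an exception is raised,
# which is exactly what the finally block above fails to do
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    print(conn.execute("SELECT x FROM t").fetchone())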

Kernel-Level Monitoring


watch -n 1 'ss -tan state close-wait | wc -l'
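
If you'd rather collect the number programmatically (for a metrics agent, say), CLOSE_WAIT sockets can be counted straight from /proc. A Linux-only sketch; state code 08 is CLOSE_WAIT:


def count_close_wait(path="/proc/net/tcp"):
    # Field 3 of each row in /proc/net/tcp is the TCP state in hex;
    # 08 is CLOSE_WAIT (also check /proc/net/tcp6 for IPv6 sockets)
    with open(path) as f:
        next(f)  # skip header row
        return sum(line.split()[3] == "08" for line in f)

print(count_close_wait())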

Application Profiling

For Java apps using jstack:


jstack PID | grep -A 20 "java.net.Socket"

Implement Proper Resource Cleanup

Python context manager example:


import socket

with socket.create_connection(("example.com", 80)) as conn:
    conn.sendall(b"HEAD / HTTP/1.0\r\n\r\n")  # use the connection
# conn is closed automatically when the with block exits

Configure Kernel Parameters

Tune TCP keepalive timing (values in seconds). This helps the kernel notice dead peers sooner on sockets that enable SO_KEEPALIVE, but it cannot release file descriptors your application still holds:


sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=6
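
Keepalive only takes effect on sockets that opt in with SO_KEEPALIVE. A sketch of enabling it per-socket in Python, mirroring the sysctl values above (the TCP_KEEP* options are Linux-specific):


import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Per-socket overrides of the kernel-wide keepalive defaults
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6)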

Consider these thresholds:

Metric             Warning Level   Critical Level
CLOSE_WAIT count   > 100           > 50% of max_files
Duration in state  > 5 min         > 30 min
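
A minimal check you could drop into a cron job or metrics agent (a sketch; it uses the per-process fd limit as a stand-in for max_files, so adjust both thresholds to your own baseline):


import resource
import subprocess

# Count CLOSE_WAIT sockets; every line after the ss header is one socket
out = subprocess.run(["ss", "-tan", "state", "close-wait"],
                     capture_output=True, text=True).stdout
count = max(len(out.splitlines()) - 1, 0)

max_files = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
if count > max_files * 0.5:
    print(f"CRITICAL: {count} sockets in CLOSE_WAIT")
elif count > 100:
    print(f"WARNING: {count} sockets in CLOSE_WAIT")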

Here's how I diagnosed a Node.js issue:


# Find Node processes with CLOSE_WAIT
lsof -Pni | grep CLOSE_WAIT | grep node

# Enable the inspector (SIGUSR1) so a debugger can attach for stack traces
kill -USR1 $(pgrep node)

# Check for unhandled socket errors
grep "ECONNRESET" /var/log/node/error.log

The CLOSE_WAIT state in TCP connections indicates that the remote end has closed the connection, but your local application hasn't yet closed its socket. When you see many connections stuck in this state (like your 240-255 out of 420), it typically means your application isn't properly handling socket closure.

Here's the TCP state transition that leads to CLOSE_WAIT:

1. ESTABLISHED → CLOSE_WAIT (when remote sends FIN)
2. CLOSE_WAIT → LAST_ACK (when local app closes socket)
3. LAST_ACK → CLOSED (when remote ACKs our FIN)

From my experience debugging production systems, these are frequent culprits:

  • Application bugs not closing sockets after use
  • Connection leaks in connection pools
  • Threads blocked indefinitely while holding sockets
  • Improper error handling in network code

First, identify which processes are holding these connections:

lsof -iTCP -sTCP:CLOSE_WAIT

For a more detailed view of TCP connections:

ss -tnp state close-wait
netstat -antp | grep CLOSE_WAIT  # older systems

Here's how you might track unclosed sockets in a Python application:

import socket
import weakref

class TrackedSocket(socket.socket):
    # A WeakSet drops entries automatically when a socket is garbage
    # collected, so only live (potentially leaked) sockets remain
    _instances = weakref.WeakSet()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        TrackedSocket._instances.add(self)

    @classmethod
    def get_instances(cls):
        return list(cls._instances)

# Replace socket.socket with TrackedSocket in your code
# Periodically check TrackedSocket.get_instances()
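
A hedged usage sketch: in development you can monkey-patch the module-level name (as the comment above suggests) so every socket your code creates is tracked, then dump whatever is still alive:

import socket
socket.socket = TrackedSocket  # development only; affects all new sockets

# ... exercise the application, then report leftovers ...
for sock in TrackedSocket.get_instances():
    print("still open:", sock)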

For immediate relief, you can kill the offending processes, but proper fixes require code changes:

  • Always use context managers for sockets (Python example):

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect((host, port))
        # socket automatically closed when the block exits

  • Implement connection timeouts (see the sketch after this list)
  • Add resource tracking in development
  • Monitor CLOSE_WAIT counts in production
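
For the timeout point, a self-contained sketch (hypothetical host and port; timeouts bound both the connect and each recv so a stalled peer can't pin the socket open indefinitely):

import socket

host, port = "example.com", 80  # placeholders for illustration
try:
    with socket.create_connection((host, port), timeout=5) as s:
        s.settimeout(10)  # per-operation limit for send/recv
        s.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
        data = s.recv(1024)
except socket.timeout:
    pass  # log it; the with block has already closed the socket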

While some CLOSE_WAIT connections are normal, you should investigate when:

  • The count keeps growing over time
  • It exceeds your expected connection churn rate
  • You see related errors (too many open files, etc.)

In extreme cases, you might need to adjust TCP keepalive parameters (they help detect dead peers sooner, but they can't release descriptors your application still holds):

sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_probes=3
sysctl -w net.ipv4.tcp_keepalive_intvl=15

But these are workarounds - the real solution is fixing application code.