When running lsof -i | wc -l
on my production server and checking the connection states, I noticed something concerning: between 240 and 255 of the roughly 420 connections were stuck in the CLOSE_WAIT state. This isn't just an academic observation; it directly impacts server performance and resource utilization.
In TCP's state machine, CLOSE_WAIT occurs when:
- The remote end initiates connection termination (sends FIN)
- Your kernel ACKs the FIN
- Your application hasn't closed its socket yet
This is part of the normal TCP shutdown sequence, but when these connections accumulate it points to an application-level problem; a minimal reproduction is sketched below.
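To make the state concrete, here is a minimal sketch (assuming Linux, Python 3, and the ss utility) that manufactures a CLOSE_WAIT socket on localhost by closing only the remote end:

import socket
import subprocess
import time

# The "remote" side closes first and our side never does, so our socket is
# left in CLOSE_WAIT (Linux only; `ss` is used just to display the result).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))            # pick a free port
server.listen(1)
port = server.getsockname()[1]

client = socket.create_connection(("127.0.0.1", port))
remote, _ = server.accept()

remote.close()                           # remote end sends FIN
time.sleep(0.5)                          # our kernel ACKs it ...
# ... but `client` is never closed, so it now sits in CLOSE_WAIT:
subprocess.run(["ss", "-tn", "state", "close-wait"])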
Persistent CLOSE_WAIT connections:
- Consume file descriptors (risk hitting ulimit)
- Hold kernel resources (memory, TCP control blocks)
- May indicate application bugs or resource leaks
First, identify which processes are responsible:
sudo lsof -iTCP -sTCP:CLOSE_WAIT
For Java applications, you can narrow the list down to sockets held by the JVM:
netstat -antp | grep CLOSE_WAIT | grep java
From production experience, these patterns frequently emerge:
1. Missing close() Calls
Example buggy Python code:
def handle_client(conn):
    try:
        data = conn.recv(1024)
        # process data
    except Exception as e:
        print(f"Error: {e}")
    # Missing conn.close()
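One way to fix this pattern is to move the close into a finally block so it runs on every code path, including the error path:

def handle_client(conn):
    try:
        data = conn.recv(1024)
        # process data
    except Exception as e:
        print(f"Error: {e}")
    finally:
        conn.close()  # always runs, even when recv() or the processing raises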
2. Connection Pool Leaks
A Java connection pool issue:
// Bad practice
Connection conn = dataSource.getConnection();
try {
    // use connection
} finally {
    // Missing conn.close()
}
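In Java the usual fix is try-with-resources, which closes the connection automatically at the end of the block. The same idea expressed in Python, sketched here with a hypothetical get_connection() standing in for whatever pool or driver hands out the connection:

from contextlib import closing

# closing() guarantees conn.close() always runs, which is exactly the call
# missing from the snippet above. `get_connection` is a hypothetical stand-in.
def run_query(get_connection, sql):
    with closing(get_connection()) as conn:
        with closing(conn.cursor()) as cur:   # assumes a DB-API style connection
            cur.execute(sql)
            return cur.fetchall()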
Kernel-Level Monitoring
watch -n 1 'ss -tan | grep -c CLOSE-WAIT'
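If you would rather sample this from code (for a metrics exporter, say), here is a small sketch that reads /proc/net/tcp directly; the fourth column is the socket state in hex and 08 means CLOSE_WAIT (Linux only, add /proc/net/tcp6 for IPv6):

def count_close_wait(path="/proc/net/tcp"):
    # One socket per line after the header; column 4 ("st") is the state.
    count = 0
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            if line.split()[3] == "08":   # 08 == CLOSE_WAIT
                count += 1
    return count

print(count_close_wait())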
Application Profiling
For Java apps using jstack:
jstack PID | grep -A 20 "java.net.Socket"
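For Python services, a rough equivalent of the jstack trick is to register faulthandler on a signal; sending the process SIGUSR1 then dumps every thread's stack to stderr, which makes it easy to spot threads sitting on open sockets:

import faulthandler
import signal

# Dump all thread tracebacks on SIGUSR1 (not available on Windows);
# the Python counterpart of `jstack <pid>`.
faulthandler.register(signal.SIGUSR1)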
Implement Proper Resource Cleanup
Python context manager example:
with socket.create_connection((host, port)) as conn:
    ...  # use connection
# Automatically closed when the block exits
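On the server side, where the leaked sockets are usually the accepted ones, the same guarantee can be applied per client. A sketch assuming Python 3.8+ (for socket.create_server), an arbitrary port, and the handle_client() from the earlier example:

import socket
from contextlib import closing

with socket.create_server(("0.0.0.0", 8080)) as server:
    while True:
        conn, addr = server.accept()
        with closing(conn):          # conn.close() runs even if the handler raises
            handle_client(conn)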
Configure Kernel Parameters
Tighten the TCP keepalive settings so dead peers are detected sooner (values in seconds):
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=6
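These sysctls are system-wide; if you only want the behaviour for one service, the same knobs can be set per socket. A sketch assuming Linux, where the TCP_KEEP* options are exposed:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)       # turn keepalive on
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)     # idle seconds before the first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)    # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6)       # failed probes before the kernel drops it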
Consider these thresholds:
| Metric | Warning Level | Critical Level |
|---|---|---|
| CLOSE_WAIT count | > 100 | > 50% of max_files |
| Duration | > 5 min | > 30 min |
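To tie those thresholds to specific processes, a sketch using the third-party psutil package (run as root to see sockets owned by other users):

import psutil  # third-party: pip install psutil

# Count CLOSE_WAIT connections per process so the thresholds above can be
# checked against each process's own file-descriptor limit.
for proc in psutil.process_iter(["pid", "name"]):
    try:
        stuck = [c for c in proc.connections(kind="tcp")
                 if c.status == psutil.CONN_CLOSE_WAIT]
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue
    if stuck:
        print(f"{proc.info['pid']:>7}  {proc.info['name']:<20}  {len(stuck)} CLOSE_WAIT")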
Here's how I diagnosed a Node.js issue:
# Find Node processes with CLOSE_WAIT
lsof -Pni | grep CLOSE_WAIT | grep node
# SIGUSR1 starts Node's inspector; attach a debugger to inspect handles and stacks
kill -USR1 $(pgrep node)
# Check for unhandled socket errors
grep "ECONNRESET" /var/log/node/error.log
The CLOSE_WAIT state in TCP connections indicates that the remote end has closed the connection, but your local application hasn't yet closed its socket. When you see many connections stuck in this state (like your 240-255 out of 420), it typically means your application isn't properly handling socket closure.
Here's the TCP state transition that leads to CLOSE_WAIT:
1. ESTABLISHED → CLOSE_WAIT (when the remote sends FIN)
2. CLOSE_WAIT → LAST_ACK (when the local app closes the socket)
3. LAST_ACK → CLOSED (when the remote ACKs our FIN)
From my experience debugging production systems, these are frequent culprits:
- Application bugs not closing sockets after use
- Connection leaks in connection pools
- Threads blocked indefinitely while holding sockets
- Improper error handling in network code
First, identify which processes are holding these connections:
lsof -iTCP -sTCP:CLOSE_WAIT
For a more detailed view of TCP connections:
ss -tanp | grep CLOSE-WAIT
netstat -tanp | grep CLOSE_WAIT   # older systems
Here's how you might track unclosed sockets in a Python application:
import socket
import weakref

class TrackedSocket(socket.socket):
    _instances = set()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._instances.add(weakref.ref(self))

    @classmethod
    def get_instances(cls):
        return [ref() for ref in cls._instances if ref() is not None]

# Replace socket.socket with TrackedSocket in your code
# Periodically check TrackedSocket.get_instances()
For immediate relief, you can kill the offending processes, but proper fixes require code changes:
- Always use context managers for sockets (Python example):
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect(('host', port))
    # socket automatically closed when the block exits
While some CLOSE_WAIT connections are normal, you should investigate when:
- The count keeps growing over time
- It exceeds your expected connection churn rate
- You see related errors (too many open files, etc.)
In extreme cases, you might need to adjust the TCP keepalive parameters:
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_probes=3
sysctl -w net.ipv4.tcp_keepalive_intvl=15
But these are workarounds - the real solution is fixing application code.