Debugging Nginx Connection Leaks: Why “worker_connections are not enough” and How to Fix CLOSE_WAIT States



When your low-traffic Nginx server suddenly complains about worker_connections being exhausted despite minimal actual load, you're likely dealing with connection leaks. The smoking gun appears when you check for lingering connections:

# lsof | grep nginx | grep CLOSE_WAIT | wc -l
1271

A CLOSE_WAIT state indicates the remote end has closed the connection, but your Nginx process hasn't released it. Common causes include:

  • Application backends not properly closing connections
  • Keepalive misconfigurations
  • Upstream timeouts not being enforced
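
A quick way to see which worker processes hold the stuck sockets (assuming lsof is available):

# Count CLOSE_WAIT sockets per nginx PID
lsof -i -a -c nginx | grep CLOSE_WAIT | awk '{print $2}' | sort | uniq -c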

To investigate connection states in real time:

# Active connections breakdown
ss -tanp | grep nginx | awk '{print $1}' | sort | uniq -c

# Detailed connection tracking (replace PID)
strace -p [nginx_worker_pid] -e trace=network -s 10000
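
For a lighter-weight live view than strace, the stub_status module (compiled into most distro packages) exposes raw connection counters; the location path here is an arbitrary choice:

location = /nginx_status {
    stub_status;
    allow 127.0.0.1;  # keep the endpoint local
    deny all;
}

A curl against that path then reports active, reading, writing, and waiting connection counts.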

These settings often contribute to connection leaks:

# Bad: upstream keepalive effectively disabled
proxy_http_version 1.0;              # nginx proxies with HTTP/1.0 by default, which rules out keepalive
proxy_set_header Connection "close"; # forces a fresh TCP connection per request

# Good: allow upstream keepalive
proxy_http_version 1.1;
proxy_set_header Connection "";      # clear the header nginx would otherwise set to "close"
proxy_set_header Keep-Alive "";      # strip the hop-by-hop Keep-Alive header
keepalive_timeout 75s;               # client-side keepalive (75s is the nginx default)
keepalive_requests 100;
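
These directives only pay off if the upstream block also caches idle connections; without a keepalive directive there, nginx still opens a fresh connection per request. A minimal sketch, with a placeholder backend address:

upstream backend {
    server 127.0.0.1:8080;  # placeholder backend
    keepalive 32;           # cache up to 32 idle connections per worker
}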

For PHP/Python/Node backends, enforce timeouts so dead upstream connections are reaped instead of held open:

location @proxy {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;
    proxy_read_timeout 30s;
    proxy_send_timeout 30s;
    proxy_next_upstream error timeout invalid_header;
    proxy_buffer_size 4k;
    proxy_buffers 8 16k;
    reset_timedout_connection on;  # Critical: resets timed-out connections instead of letting them linger
}
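
To verify the timeouts actually fire, request a deliberately slow endpoint through the proxy (the /slow route here is hypothetical) and confirm a 504 arrives at roughly proxy_read_timeout:

# Expect HTTP 504 after ~30s if proxy_read_timeout is enforced
time curl -s -o /dev/null -w '%{http_code}\n' http://localhost/slow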

Create a simple monitoring script (/usr/local/bin/nginx_conn_check):

#!/bin/bash
THRESHOLD=50
# Note: ss prints the state as CLOSE-WAIT (hyphen), unlike lsof and netstat
COUNT=$(ss -tanp | grep nginx | grep -c 'CLOSE-WAIT')

if [ "$COUNT" -gt "$THRESHOLD" ]; then
    echo "$(date) - Found $COUNT CLOSE_WAIT connections" >> /var/log/nginx_conn.log
    # Optionally trigger a graceful reload; old workers exit once their connections drain
    nginx -s reload
fi
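
Make it executable before wiring it into cron:

chmod +x /usr/local/bin/nginx_conn_check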

Add to cron:

*/5 * * * * /usr/local/bin/nginx_conn_check

A four-digit count like the 1271 above is never normal churn on a low-traffic box; sockets are piling up faster than the workers release them. A few more commands round out the picture:

# Check overall connection states
ss -antp | grep nginx

# Monitor active connections in real time
watch -n 1 "netstat -anp | grep nginx"

# Trace network syscalls on one worker (pgrep alone matches the master first, which handles no traffic)
strace -p $(pgrep -f 'nginx: worker' | head -1) -e trace=network
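
ss can also filter by TCP state directly, which avoids grepping for the hyphenated state name:

# List only sockets stuck in CLOSE_WAIT, with the owning process
ss -tnp state close-wait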

Beyond the backend-side causes listed earlier, the usual suspects are:

  • Overly long keepalive_timeout values holding idle sockets open
  • Upstream servers half-closing connections without nginx noticing
  • Missing or incorrect proxy_* timeout directives
  • Client-side network issues (NAT timeouts, abruptly dropped links)
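
To confirm which side closes first (the peer's FIN is what moves a socket into CLOSE_WAIT), a short capture on the upstream port helps; the interface and port here are assumptions:

# Watch for FIN packets on the backend port (adjust interface and port)
tcpdump -i any -nn 'tcp port 8080 and (tcp[tcpflags] & tcp-fin != 0)'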

Add these to your nginx.conf:

http {
    # Reset timed-out client connections instead of letting them linger
    reset_timedout_connection on;
    
    # Tighten keepalive; shorter than the 75s shown earlier frees idle sockets sooner
    keepalive_timeout 30s;
    keepalive_requests 100;
    
    # Upstream connection management
    proxy_connect_timeout 5s;
    proxy_send_timeout 10s;
    proxy_read_timeout 30s;
    proxy_next_upstream_timeout 0;
    proxy_next_upstream error timeout invalid_header;
    
    # Buffer management
    proxy_buffers 16 16k;
    proxy_buffer_size 16k;
}
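
Validate before applying, so a typo doesn't take the proxy down:

nginx -t && systemctl reload nginx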

Kernel tuning cannot expire CLOSE_WAIT sockets (releasing them is the application's job), but it relieves the side effects, such as port exhaustion and TIME_WAIT buildup, while you fix the leak itself. Adjust /etc/sysctl.conf:

# Allow reuse of TIME_WAIT sockets for new outbound connections
net.ipv4.tcp_tw_reuse = 1

# Raise the listen() backlog ceiling
net.core.somaxconn = 65535

# Widen the ephemeral port range for connections to upstreams
net.ipv4.ip_local_port_range = 1024 65535

# Apply changes
sysctl -p
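
Confirm the values took effect:

sysctl net.ipv4.tcp_tw_reuse net.core.somaxconn net.ipv4.ip_local_port_range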

On busier hosts, raise THRESHOLD in the monitoring script above to a few hundred and tighten its cron interval to every minute; systemctl reload nginx works as an alternative to nginx -s reload. If the CLOSE_WAIT count keeps climbing after all of these changes, the leak lives in the upstream application itself, and the strace output is the place to look next.