Optimizing NGINX for High Concurrent Connections: Solving 200+ Timeout Issues

When testing NGINX with 200+ concurrent connections using blitz.io, requests start timing out even though the server has CPU and memory to spare. That pattern points to a configuration bottleneck rather than a hardware limit.

Let's examine the key parameters that need adjustment:

# System-level TCP optimizations
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65536

Here's an optimized NGINX configuration template for high concurrency:

worker_processes auto;
worker_rlimit_nofile 100000;

events {
    worker_connections 4096;
    multi_accept on;
    use epoll;
}

http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 30;
    keepalive_requests 100;
    
    open_file_cache max=200000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;
    
    client_body_timeout 15;
    client_header_timeout 15;
    send_timeout 15;
    
    reset_timedout_connection on;
    
    # Buffer sizes
    client_body_buffer_size 128k;
    client_header_buffer_size 8k;
    large_client_header_buffers 8 16k;
    output_buffers 4 32k;
    postpone_output 1460;
    
    # Gzip settings
    gzip on;
    gzip_min_length 10240;
    gzip_proxied expired no-cache no-store private auth;
    gzip_types text/plain text/css text/xml text/javascript application/json;
    gzip_disable "msie6";
    gzip_vary on;
}
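
As a rough capacity check for the template above, the theoretical connection ceiling is worker_processes multiplied by worker_connections. The sketch below only illustrates that arithmetic: the function name max_clients is hypothetical, and the worker count of 4 is an assumed stand-in for whatever "auto" resolves to on your machine.

```shell
# Hypothetical helper: theoretical upper bound on simultaneous connections.
# $1 = number of worker processes, $2 = worker_connections
max_clients() {
    echo $(( $1 * $2 ))
}

# Assumed example: 4 workers with the worker_connections value from the template.
max_clients 4 4096
```

When NGINX proxies to an upstream, each client request also holds a connection to the backend, so the practical ceiling is roughly half of this figure.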

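These additional sysctl settings raise kernel queue and port limits that can otherwise cap throughput at high connection counts:

# Increase the queue of packets received from the NIC that are waiting for kernel processing
net.core.netdev_max_backlog = 65536

# Increase the maximum ancillary (option) buffer size allowed per socket
net.core.optmem_max = 25165824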

# Increase the maximum number of remembered connection requests
net.ipv4.tcp_max_syn_backlog = 65536

# Increase the local port range
net.ipv4.ip_local_port_range = 1024 65535

# Detect dead peers sooner: first keepalive probe after 300 s idle, then one every 60 s
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
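
To see what the local port range setting buys, it helps to count the ports it makes available. The range matters for connections the machine originates (for example NGINX connecting to an upstream, or a load generator opening sockets); it does not limit inbound connections to a listening port. The helper name port_count below is hypothetical and only illustrates the arithmetic.

```shell
# Hypothetical helper: number of ephemeral ports in an inclusive "low high" range.
# $1 = lowest port, $2 = highest port
port_count() {
    echo $(( $2 - $1 + 1 ))
}

# Range from the sysctl setting above.
port_count 1024 65535

# To check the live value instead (Linux only):
# port_count $(cat /proc/sys/net/ipv4/ip_local_port_range)
```

That count is the maximum number of simultaneous outbound connections from one source address to a single destination address and port.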

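The flags in the following command are ApacheBench (ab) options rather than blitz.io syntax, so run it locally with ab to get a baseline:

# Test command example (-n must follow -t, because -t resets the request count to 50000)
ab -k -t 60 -n 10000 -c 500 http://yourserver.com/test.txt

# Where:
# -k : keep-alive connections
# -t : maximum time to spend benchmarking, in seconds
# -n : total requests
# -c : concurrent connections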

Essential commands to monitor performance:

# Watch connection counts per TCP state
watch -n 1 "netstat -n | awk '/^tcp/ {++S[\$NF]} END {for(a in S) print a, S[a]}'"

# Follow the NGINX access and error logs
tail -f /var/log/nginx/{access,error}.log

# Check the resource limits of the running NGINX master process
cat /proc/$(cat /var/run/nginx.pid)/limits
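
The limits file printed by the last command is a fixed-width table, and the row that matters here is "Max open files". A minimal sketch for pulling out just the soft limit follows; the function name soft_nofile is hypothetical, and it assumes the standard Linux /proc/<pid>/limits layout.

```shell
# Hypothetical helper: print the soft "Max open files" limit from
# /proc/<pid>/limits-formatted text supplied on stdin.
soft_nofile() {
    # Fields: Max(1) open(2) files(3) <soft>(4) <hard>(5) <units>(6)
    awk '/^Max open files/ { print $4 }'
}

# Usage, with the PID file path used above:
# soft_nofile < /proc/$(cat /var/run/nginx.pid)/limits
```

If the number printed is lower than worker_connections, the limit is being applied somewhere other than where it was raised.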

Common pitfalls to avoid:

Setting worker_connections higher than worker_rlimit_nofile
Forgetting to update ulimit for the nginx user
Not enabling keepalive connections
Leaving the TCP stack at its default settings
Overlooking file descriptor limits at the system level
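
The first item in the list above, worker_connections exceeding worker_rlimit_nofile, can be caught with a simple comparison. The following is a sketch rather than a full config parser: check_fd_budget is a hypothetical name, and the two numbers are passed in by hand instead of being read from nginx.conf.

```shell
# Hypothetical helper: warn when worker_connections exceeds the per-worker
# file descriptor limit.
# $1 = worker_connections, $2 = worker_rlimit_nofile
check_fd_budget() {
    if [ "$1" -gt "$2" ]; then
        echo "WARN: worker_connections ($1) exceeds worker_rlimit_nofile ($2)"
        return 1
    fi
    echo "OK: $1 connections fit within $2 file descriptors"
}

# Values from the configuration template earlier in this article.
check_fd_budget 4096 100000
```

The non-zero return status on failure makes the check usable in a deployment script that should stop before reloading NGINX.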

When dealing with timeout issues in NGINX under high concurrency (>200 connections), we need to examine multiple layers of the stack. From your configuration and test results, several potential culprits emerge:

# Key metrics during test:
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20225 nginx     20   0 48140 6248 1672 S 16.0  0.0   0:21.68 nginx

Your current sysctl configuration is good but needs refinement for extreme concurrency:

# Critical TCP stack optimizations
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_window_scaling = 1
net.core.netdev_max_backlog = 65536

The worker configuration needs adjustment based on the number of CPU cores in your server:

worker_processes auto; # Better than static count
worker_rlimit_nofile 100000; # Match ulimit settings

events {
    worker_connections 65536;
    use epoll; # Critical for Linux
    multi_accept on;
    accept_mutex off; # For high contention scenarios
}

For static file serving under load, these directives are crucial:

http {
    sendfile on; # Required for sendfile_max_chunk and tcp_nopush to take effect
    sendfile_max_chunk 512k;
    tcp_nopush on;
    tcp_nodelay on;
    reset_timedout_connection on;

    # Keepalive tuning
    keepalive_requests 100000;
    keepalive_timeout 30s;
}

When testing with blitz.io, consider these parameters:

# Recommended test command:
--region ireland --rampup 1-1000:30 --hold-for 60s \
-T 5000 --timeout 45 http://dev.anuary.com/test.txt

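# Key metrics to monitor:
- TCP retransmits (netstat -s | grep retransmit)
- Listen queue depth (ss -lntp | grep nginx; on listening sockets Recv-Q is the current accept queue and Send-Q is its limit)
- File descriptor usage per NGINX process (for p in $(pgrep nginx); do echo "$p $(ls /proc/$p/fd | wc -l)"; done)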

When timeouts persist, enable these diagnostic tools:

# In nginx.conf (the debug level requires an NGINX binary built with --with-debug):
error_log /var/log/nginx/error.log debug;

# Monitor kernel drops and listen queue overflows:
watch -n 1 'netstat -s | grep -iE "drop|overflow"'

# Real-time connection states:
ss -antop | awk '{print $1}' | sort | uniq -c

After implementing these changes, verify that:

ulimit -n shows at least 65535 for the nginx user
sysctl values are applied (run sysctl -p, then read them back with sysctl -n)
NOFILE limits are raised in /etc/security/limits.conf
NGINX worker processes have sufficient memory (check with pmap)
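
The sysctl check in the list above amounts to comparing each expected value with the one the kernel reports. A minimal sketch, assuming the expected values are the ones from this article; verify_setting is a hypothetical name, and the live value is passed in as an argument so the comparison itself has no system dependencies.

```shell
# Hypothetical helper: compare an expected setting with the value actually in effect.
# $1 = key, $2 = expected value, $3 = actual value
verify_setting() {
    if [ "$2" = "$3" ]; then
        echo "ok   $1 = $3"
    else
        echo "diff $1: expected $2, found $3"
        return 1
    fi
}

# Usage on a live system:
# verify_setting net.core.somaxconn 65535 "$(sysctl -n net.core.somaxconn)"
# verify_setting net.ipv4.tcp_fin_timeout 15 "$(sysctl -n net.ipv4.tcp_fin_timeout)"
```

A "diff" line after running sysctl -p usually means the setting lives in a file that was not loaded, or that a later file overrides it.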
