Optimizing NGINX for High Concurrent Connections: Solving Timeouts at 200+ Connections


When testing NGINX with 200+ concurrent connections using blitz.io, we're observing significant timeout issues despite adequate server resources. The symptoms suggest a configuration bottleneck rather than hardware limitations.

Let's examine the key parameters that need adjustment:

# System-level TCP optimizations
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65536
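
Before reloading anything, it can help to confirm which of these values the running kernel already uses. The following is a minimal sketch, assuming a Linux host where sysctl keys map to files under /proc/sys; the function names (parse_sysctl, live_value, mismatches) are hypothetical helpers, not part of any NGINX or sysctl tooling.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse sysctl.conf-style text into {key: value}, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Collapse runs of whitespace so "1024   65535" compares equal to "1024 65535".
            settings[key.strip()] = " ".join(value.split())
    return settings


def live_value(key, root="/proc/sys"):
    """Read the running kernel's value for a dotted sysctl key, or None if unreadable."""
    path = Path(root) / key.replace(".", "/")
    try:
        return " ".join(path.read_text().split())
    except OSError:
        return None


def mismatches(wanted, root="/proc/sys"):
    """Return {key: (wanted, live)} for every key whose live value differs."""
    result = {}
    for key, value in wanted.items():
        live = live_value(key, root)
        if live != value:
            result[key] = (value, live)
    return result
```

Calling mismatches(parse_sysctl(text)) on the settings above returns an empty dict once every value has been applied; a live value of None means the key does not exist on that kernel.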

Here's an optimized NGINX configuration template for high concurrency:

worker_processes auto;
worker_rlimit_nofile 100000;

events {
    worker_connections 4096;
    multi_accept on;
    use epoll;
}

http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 30;
    keepalive_requests 100;
    
    open_file_cache max=200000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;
    
    client_body_timeout 15;
    client_header_timeout 15;
    send_timeout 15;
    
    reset_timedout_connection on;
    
    # Buffer sizes
    client_body_buffer_size 128k;
    client_header_buffer_size 8k;
    large_client_header_buffers 8 16k;
    output_buffers 4 32k;
    postpone_output 1460;
    
    # Gzip settings
    gzip on;
    gzip_min_length 10240;
    gzip_proxied expired no-cache no-store private auth;
    gzip_types text/plain text/css text/xml text/javascript application/json;
    gzip_disable "msie6";
    gzip_vary on;
}

These additional sysctl settings raise kernel queue, buffer, and port limits that are commonly exhausted under load:

# Increase the per-CPU queue of packets received from the NIC but not yet processed
net.core.netdev_max_backlog = 65536

# Increase maximum amount of option memory buffers
net.core.optmem_max = 25165824

# Increase the maximum number of remembered connection requests
net.ipv4.tcp_max_syn_backlog = 65536

# Increase the local port range
net.ipv4.ip_local_port_range = 1024 65535

# Reduce TCP keepalive time
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60

The flags in this example are ApacheBench (ab) flags, so for a comparable local run alongside blitz.io, use ab with these parameters:

# Test command example
ab -k -n 10000 -c 500 -s 60 http://yourserver.com/test.txt

# Where:
# -k : keep-alive connections
# -n : total requests
# -c : concurrent connections
# -s : socket timeout in seconds
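
For a quick cross-check that does not depend on an external load tool, a small standard-library script can issue concurrent requests and count how many time out. This is only a sketch under stated assumptions: run_load and fetch are hypothetical names, each request opens a fresh connection (no keep-alive), and a thread pool on one client machine will not reproduce a distributed test.

```python
import socket
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=5.0):
    """Issue one GET and classify the outcome as 'ok', 'timeout', or 'error'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            return "ok"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except urllib.error.URLError as exc:
        # urllib wraps connect-phase timeouts in URLError.reason.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    except OSError:
        return "error"


def run_load(url, total=1000, concurrency=200, fetch_one=fetch):
    """Send `total` requests with at most `concurrency` in flight; return outcome counts."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch_one, [url] * total))
```

For example, run_load("http://yourserver.com/test.txt", total=10000, concurrency=500) returns a Counter such as one with "ok" and "timeout" keys; a non-zero timeout count at a concurrency that the hardware should handle points back at the configuration limits discussed here.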

Essential commands to monitor performance:

# Watch active connections
watch -n 1 "netstat -n | awk '/^tcp/ {++S[\$NF]} END {for(a in S) print a, S[a]}'"

# Follow the NGINX access and error logs
tail -f /var/log/nginx/{access,error}.log

# Check system limits
cat /proc/$(cat /var/run/nginx.pid)/limits

Common configuration mistakes to avoid:

  • Setting worker_connections higher than worker_rlimit_nofile
  • Forgetting to raise the open-file limit (ulimit -n) for the nginx user
  • Not enabling keepalive connections
  • Using default TCP stack settings
  • Overlooking file descriptor limits at system level

When dealing with timeout issues in NGINX under high concurrency (>200 connections), we need to examine multiple layers of the stack. From your configuration and test results, several potential culprits emerge:

# Key metrics during test:
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20225 nginx     20   0 48140 6248 1672 S 16.0  0.0   0:21.68 nginx

Your current sysctl configuration is good but needs refinement for extreme concurrency:

# Critical TCP stack optimizations
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_window_scaling = 1
net.core.netdev_max_backlog = 65536

The worker configuration needs adjustment based on your server's CPU core count:

worker_processes auto; # Better than static count
worker_rlimit_nofile 100000; # Match ulimit settings

events {
    worker_connections 65536;
    use epoll; # Selected automatically on Linux; set explicitly here for clarity
    multi_accept on;
    accept_mutex off; # Default since NGINX 1.11.3; all workers are notified of new connections
}
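
To see what these two directives add up to, the theoretical connection ceiling is worker_processes multiplied by worker_connections. The sketch below works through that arithmetic. It approximates `auto` with the CPU count reported by Python, and it halves the figure when proxying because worker_connections also counts connections to upstream servers; resolve_workers and max_clients are hypothetical names, and the result is an upper bound rather than a measured capacity.

```python
import os


def resolve_workers(worker_processes):
    """Map a worker_processes value to a count; 'auto' is approximated by the CPU count."""
    if str(worker_processes).strip() == "auto":
        return os.cpu_count() or 1
    return int(worker_processes)


def max_clients(worker_processes, worker_connections, proxying=False):
    """Upper bound on simultaneous client connections for the given settings."""
    total = resolve_workers(worker_processes) * worker_connections
    # When proxying, each client connection also consumes one upstream connection slot.
    return total // 2 if proxying else total
```

With 4 workers and worker_connections 65536, max_clients(4, 65536) gives 262144 for static serving and 131072 with proxying=True, both far above the 200 connections where the timeouts appear, which suggests the bottleneck lies elsewhere in the stack.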

For static file serving under load, these directives are crucial:

http {
    sendfile on; # Required for sendfile_max_chunk to take effect
    sendfile_max_chunk 512k;
    tcp_nopush on;
    tcp_nodelay on;
    reset_timedout_connection on;

    # Keepalive tuning
    keepalive_requests 100000;
    keepalive_timeout 30s;
}

When testing with blitz.io, consider these parameters:

# Recommended test command:
--region ireland --rampup 1-1000:30 --hold-for 60s \
-T 5000 --timeout 45 http://dev.anuary.com/test.txt

# Key metrics to monitor:
- TCP retransmits (netstat -s | grep -i retransmit)
- Listen queue depth (ss -lntp | grep nginx; Recv-Q is the current accept queue, Send-Q its limit)
- File descriptor usage per process (for p in $(pgrep nginx); do ls /proc/$p/fd | wc -l; done)
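
Listen queue overflows can also be read as raw counters from /proc/net/netstat, which stores each group as a header line of names followed by a line of values. The sketch below pairs those lines up; it assumes that two-line layout and the TcpExt counter names ListenOverflows and ListenDrops as found on current Linux kernels, and the function names are hypothetical.

```python
def parse_proc_netstat(text):
    """Parse /proc/net/netstat text into {'Group.Name': int} from header/value line pairs."""
    counters = {}
    lines = text.splitlines()
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[f"{prefix}.{name}"] = int(number)
    return counters


def listen_queue_counters(text):
    """Return the accept-queue overflow and drop counters (0 if a counter is missing)."""
    counters = parse_proc_netstat(text)
    return {name: counters.get(f"TcpExt.{name}", 0)
            for name in ("ListenOverflows", "ListenDrops")}
```

Reading the file before and after a test run and comparing the two results shows whether connections were dropped at the listen queue during the test; the counters are cumulative since boot, so only the difference is meaningful.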

When timeouts persist, enable these diagnostic tools:

# In nginx.conf (the debug level requires an NGINX binary built with --with-debug):
error_log /var/log/nginx/error.log debug;

# Monitor kernel drops:
watch -n 1 'netstat -s | grep -iE "drop|overflow"'

# Real-time connection states:
ss -ant | awk 'NR>1 {print $1}' | sort | uniq -c
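
The same per-state tally can be produced without ss by reading /proc/net/tcp, where the fourth column holds the socket state as a hexadecimal code. The following sketch assumes the standard Linux layout of that file (one header line, then one socket per line) and covers IPv4 only; count_states is a hypothetical helper name.

```python
from collections import Counter

# Kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING", "0C": "NEW_SYN_RECV",
}


def count_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3:
            code = fields[3].upper()
            counts[TCP_STATES.get(code, code)] += 1
    return counts
```

Feeding it the contents of /proc/net/tcp (and /proc/net/tcp6 for IPv6 sockets) during a test shows, for instance, whether TIME_WAIT or SYN_RECV entries pile up as concurrency passes 200.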

After implementing these changes, verify:

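  1. ulimit -n shows at least 65535 for the nginx user (or check "Max open files" in /proc/<pid>/limits for the running workers)
  2. sysctl values are loaded (sysctl -p) and read back as expected (for example, sysctl net.core.somaxconn)
  3. NOFILE limits are raised in /etc/security/limits.conf, or via LimitNOFILE in the unit file when NGINX runs under systemd
  4. NGINX worker processes have sufficient memory (check with pmap)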
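
The first check in this list can be scripted by parsing the "Max open files" row of /proc/<pid>/limits. The sketch below assumes the usual column layout of that file (limit name, soft limit, hard limit, units); open_file_limits and limit_is_sufficient are hypothetical helper names, and the 65535 default simply mirrors the figure in step 1.

```python
def open_file_limits(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            # "unlimited" is mapped to infinity so numeric comparisons still work.
            return tuple(float("inf") if v == "unlimited" else int(v)
                         for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def limit_is_sufficient(limits_text, required=65535):
    """True when the soft open-file limit is at least `required`."""
    soft, _hard = open_file_limits(limits_text)
    return soft >= required
```

Reading /proc/<pid>/limits for each worker PID and passing the text to limit_is_sufficient confirms that the limit seen by the running process, rather than the one configured on paper, is high enough.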