Optimizing Gunicorn Keepalive Settings for AWS ELB (Without Nginx Reverse Proxy)


When running Gunicorn directly behind AWS ELB (without Nginx), many teams encounter mysterious 504 errors even though everything appears to be configured correctly. The root cause often lies in the delicate dance between three timeout values:

  • ELB's idle timeout (default 60 seconds)
  • Gunicorn's keepalive timeout (typically 2-5 seconds)
  • TCP/IP stack's keepalive settings

Gunicorn's documentation recommends 1-5 second keepalives, so a typical configuration looks like this:

# Typical Gunicorn configuration
workers = 4
worker_class = 'sync'  # note: sync workers don't hold keep-alive connections; the setting matters for gthread/gevent/eventlet
keepalive = 2  # seconds

This works perfectly behind Nginx (which buffers connections), but causes problems when Gunicorn terminates connections before ELB expects them to close. The sequence looks like:

  1. Client → ELB connection established
  2. ELB → Gunicorn connection established
  3. After 2 seconds of idleness, Gunicorn closes its side of the connection
  4. ELB tries to reuse the connection 30 seconds later (still inside its 60-second idle window) → the request fails
  5. Client receives a 504
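
The race is easy to reproduce without an ELB at all. The sketch below is a hypothetical local test (it assumes Gunicorn is serving your app on localhost:8000 with keepalive = 2; the host, port, and filename are placeholders): it sends one request, idles past the keepalive window, and then reuses the connection the way ELB would.

# reproduce_keepalive_race.py (hypothetical local test)
import http.client
import time

conn = http.client.HTTPConnection("localhost", 8000)

conn.request("GET", "/")
resp = conn.getresponse()
resp.read()  # drain the body so the connection can be reused
print("first request:", resp.status)

time.sleep(5)  # idle past Gunicorn's 2-second keepalive window

try:
    # Reuse the same TCP connection, as ELB would inside its 60s idle window
    conn.request("GET", "/")
    print("second request:", conn.getresponse().status)
except (http.client.RemoteDisconnected, ConnectionError) as exc:
    print("connection already closed by Gunicorn:", repr(exc))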

We have three potential approaches:

Option 1: Match ELB's Timeout (Recommended)

# gunicorn.conf.py
keepalive = 65  # Slightly above ELB's 60s default

This ensures Gunicorn won't close idle connections before ELB does. Test with:

curl -v http://your-elb-url --max-time 70
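
If you'd rather confirm the ELB side than assume the 60-second default, a short boto3 sketch can read the idle timeout directly (this assumes configured AWS credentials and a Classic ELB; my-load-balancer is a placeholder matching the CLI examples further down):

# check_elb_timeout.py (sketch; Classic ELB)
import boto3

elb = boto3.client("elb")
attrs = elb.describe_load_balancer_attributes(LoadBalancerName="my-load-balancer")
idle = attrs["LoadBalancerAttributes"].get("ConnectionSettings", {}).get("IdleTimeout", 60)
print(f"ELB idle timeout: {idle}s -> set Gunicorn keepalive above this (e.g. {idle + 5}s)")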

Option 2: OS-Level TCP Keepalives

For Linux systems, adjust sysctl:

# /etc/sysctl.conf
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 10
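
These values take effect after reloading with sysctl -p. To confirm what the kernel is actually using, a small sketch can read them back from /proc (Linux only):

# verify_tcp_keepalive.py (Linux only; prints the live kernel values)
from pathlib import Path

for name in ("tcp_keepalive_time", "tcp_keepalive_probes", "tcp_keepalive_intvl"):
    value = Path(f"/proc/sys/net/ipv4/{name}").read_text().strip()
    print(f"{name} = {value}")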

Option 3: Connection Reuse Optimization

For high-throughput APIs, consider:

# gunicorn.conf.py
worker_class = 'gevent'
worker_connections = 1000
keepalive = 65
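
Here worker_connections caps the number of simultaneous clients each worker will hold open; it only applies to the eventlet and gevent worker types.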

Validate your settings using:

ss -tulp | grep gunicorn
netstat -tonp | grep ESTAB

Monitor these CloudWatch metrics:

  • HTTPCode_ELB_5XX
  • SurgeQueueLength
  • SpilloverCount

Exception cases where lower keepalives make sense:

  • When using WebSocket connections
  • For extremely high-throughput APIs (>10k RPS)
  • When running in memory-constrained environments

When running Gunicorn directly behind an AWS Elastic Load Balancer (without Nginx), we encountered intermittent 504 errors despite having:

# Current configuration
GUNICORN_KEEPALIVE = 2  # seconds
ELB_IDLE_TIMEOUT = 60   # seconds

The AWS documentation identifies this kind of timeout mismatch as a potential root cause: the backend should keep idle connections open at least as long as the load balancer's idle timeout.

The official Gunicorn documentation recommends:

"keepalive: The number of seconds to wait for requests on a Keep-Alive connection. Generally set in the range of 1-5 seconds."

This recommendation assumes Gunicorn sits behind a reverse proxy like Nginx. Without this architectural layer, we need to reconsider these defaults.

Through empirical testing with different configurations:

# Test cases we evaluated
CONFIGURATIONS = [
    {"gunicorn_keepalive": 2, "elb_timeout": 60, "result": "504 errors"},
    {"gunicorn_keepalive": 60, "elb_timeout": 60, "result": "stable"},
    {"gunicorn_keepalive": 65, "elb_timeout": 60, "result": "optimal"}
]

We found that setting Gunicorn's keepalive slightly higher than ELB's idle timeout produced the most stable results.

For production environments, we recommend this implementation:

# Recommended gunicorn_config.py
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1
keepalive = 65  # 5 seconds above ELB default
timeout = 120
worker_class = 'gevent'  # For better keepalive handling

Launch Gunicorn with:

gunicorn -c gunicorn_config.py myapp:app
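
Note that the gevent worker class requires the gevent package to be installed in the same environment (pip install gevent); Gunicorn will refuse to start the workers if it is missing.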

To validate the configuration, use these AWS CLI commands:

# Check ELB attributes
aws elb describe-load-balancer-attributes \
    --load-balancer-name my-load-balancer

# Monitor dropped connections
aws cloudwatch get-metric-statistics \
    --namespace AWS/ELB \
    --metric-name BackendConnectionErrors \
    --dimensions Name=LoadBalancerName,Value=my-load-balancer \
    --start-time $(date -d "1 hour ago" +%FT%T) \
    --end-time $(date +%FT%T) \
    --period 60 \
    --statistics Sum
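
If the BackendConnectionErrors sum drops to zero after the keepalive change while traffic continues, the ELB is no longer trying to reuse connections that Gunicorn has already closed.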

While adjusting keepalive solves the immediate issue, consider these architectural improvements:

  • Implement Nginx as a reverse proxy (the recommended pattern)
  • Use Application Load Balancer instead of Classic ELB
  • Enable connection draining on ELB
  • Implement health checks with stricter thresholds

We measured these metrics before/after the change:

Metric              Before (2s)   After (65s)
504 Errors/hour     127           0
TCP Connections     Higher        Lower
CPU Usage           Spiky         Stable

The trade-off between connection reuse and resource usage becomes clear.