When running Gunicorn directly behind AWS ELB (without Nginx), many teams encounter mysterious 504 errors despite everything appearing to be configured correctly. The root cause often lies in the delicate dance between three timeout values:
- ELB's idle timeout (default 60 seconds)
- Gunicorn's keepalive timeout (typically 2-5 seconds)
- TCP/IP stack's keepalive settings
Gunicorn's documentation recommends 1-5 second keepalives, and a typical configuration follows that advice:
# Typical Gunicorn configuration
workers = 4
worker_class = 'sync'
keepalive = 2 # Seconds
This works perfectly behind Nginx (which buffers connections), but causes problems when Gunicorn terminates connections before ELB expects them to close. The sequence looks like this (a minimal reproduction sketch follows the list):
- Client → ELB connection established
- ELB → Gunicorn connection established
- After 2 idle seconds, Gunicorn closes its side of the connection
- ELB later tries to reuse that connection (say 30 seconds on, still within its 60-second idle window) → the request fails
- Client receives a 504
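The premature close is easy to reproduce without ELB by talking to Gunicorn directly over a keep-alive connection. A minimal sketch, assuming a keepalive-capable worker (gthread or gevent) listening on localhost:8000 with keepalive = 2; host and port are placeholders:
# Hypothetical reproduction: reuse a keep-alive connection after the
# server's keepalive window has expired
import http.client
import time

conn = http.client.HTTPConnection("localhost", 8000)
conn.request("GET", "/")
conn.getresponse().read()  # first request succeeds

time.sleep(5)  # stay idle past keepalive = 2

try:
    conn.request("GET", "/")  # reuse the same TCP connection
    print(conn.getresponse().status)
except (http.client.RemoteDisconnected, BrokenPipeError, ConnectionResetError) as exc:
    print("server already closed the connection:", exc)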
We have three potential approaches:
Option 1: Match ELB's Timeout (Recommended)
# gunicorn.conf.py
keepalive = 65 # Slightly above ELB's 60s default
This ensures Gunicorn won't close idle connections before ELB does. Test with:
curl -v http://your-elb-url --max-time 70
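To check the ELB side programmatically instead of timing curl, a boto3 sketch for a Classic ELB (the load balancer name is a placeholder) can read the idle timeout and compare it to the planned keepalive:
# Compare the Classic ELB idle timeout with the planned Gunicorn keepalive
import boto3

KEEPALIVE = 65  # value we intend to set in gunicorn.conf.py

elb = boto3.client("elb")
attrs = elb.describe_load_balancer_attributes(LoadBalancerName="my-load-balancer")
idle_timeout = attrs["LoadBalancerAttributes"]["ConnectionSettings"]["IdleTimeout"]

print(f"ELB idle timeout: {idle_timeout}s, Gunicorn keepalive: {KEEPALIVE}s")
assert KEEPALIVE > idle_timeout, "Gunicorn would close connections before ELB"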
Option 2: OS-Level TCP Keepalives
For Linux systems, adjust sysctl:
# /etc/sysctl.conf
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 10
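These sysctls apply host-wide. The same three knobs can also be set per socket when you only want keepalive probes on specific connections; a Linux-only sketch:
# Per-socket equivalent of the sysctl settings above (Linux-only constants)
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)    # tcp_keepalive_time
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)      # tcp_keepalive_probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)   # tcp_keepalive_intvl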
Option 3: Connection Reuse Optimization
For high-throughput APIs, consider:
# gunicorn.conf.py
worker_class = 'gevent'
worker_connections = 1000
keepalive = 65
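Note that gevent is not bundled with Gunicorn itself; it usually has to be installed separately (for example via the gunicorn[gevent] extra), and worker_connections only takes effect with the async (gevent/eventlet) worker types.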
Validate your settings using:
ss -tulp | grep gunicorn
netstat -tonp | grep ESTAB
Monitor these CloudWatch metrics (a query sketch follows the list):
- HTTPCode_ELB_5XX
- SurgeQueueLength
- SpilloverCount
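As an example, a boto3 sketch (load balancer name is a placeholder) that sums HTTPCode_ELB_5XX over the last hour:
# Sum HTTPCode_ELB_5XX over the last hour for one Classic ELB
import datetime
import boto3

cw = boto3.client("cloudwatch")
now = datetime.datetime.now(datetime.timezone.utc)
resp = cw.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="HTTPCode_ELB_5XX",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-load-balancer"}],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=60,
    Statistics=["Sum"],
)
print(sum(p["Sum"] for p in resp["Datapoints"]))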
Exception cases where lower keepalives make sense:
- When using WebSocket connections
- For extremely high-throughput APIs (>10k RPS)
- When running in memory-constrained environments
In our own deployment, running Gunicorn directly behind an AWS Elastic Load Balancer (without Nginx), we encountered intermittent 504 errors despite having:
# Current configuration
GUNICORN_KEEPALIVE = 2 # seconds
ELB_IDLE_TIMEOUT = 60 # seconds
The AWS documentation identifies this kind of timeout mismatch as a potential root cause of 504 errors.
The official Gunicorn documentation recommends:
"keepalive: The number of seconds to wait for requests on a Keep-Alive connection. Generally set in the range of 1-5 seconds."
This recommendation assumes Gunicorn sits behind a reverse proxy like Nginx. Without this architectural layer, we need to reconsider these defaults.
Through empirical testing with different configurations:
# Test cases we evaluated
CONFIGURATIONS = [
    {"gunicorn_keepalive": 2, "elb_timeout": 60, "result": "504 errors"},
    {"gunicorn_keepalive": 60, "elb_timeout": 60, "result": "stable"},
    {"gunicorn_keepalive": 65, "elb_timeout": 60, "result": "optimal"},
]
We found that setting Gunicorn's keepalive slightly higher than ELB's idle timeout produced the most stable results.
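One rough way to gather a "result" column like the one above is to reuse a single keep-alive session against the ELB and count server errors over time; a probe sketch with a hypothetical URL:
# Hypothetical probe: reuse one keep-alive session and count 5xx responses
import time
import requests

session = requests.Session()
errors = 0
for _ in range(100):
    resp = session.get("http://my-elb-url.example.com/health", timeout=70)
    if resp.status_code >= 500:
        errors += 1
    time.sleep(5)  # idle long enough for a 2-second keepalive to expire

print(f"5xx responses: {errors}/100")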
For production environments, we recommend this implementation:
# Recommended gunicorn_config.py
import multiprocessing
workers = multiprocessing.cpu_count() * 2 + 1
keepalive = 65 # 5 seconds above ELB default
timeout = 120
worker_class = 'gevent' # keepalive applies with threaded/async workers; the default sync worker closes each connection
Launch Gunicorn with:
gunicorn -c gunicorn_config.py myapp:app
To validate the configuration, use these AWS CLI commands:
# Check ELB attributes
aws elb describe-load-balancer-attributes \
--load-balancer-name my-load-balancer
# Monitor dropped connections
aws cloudwatch get-metric-statistics \
--namespace AWS/ELB \
--metric-name BackendConnectionErrors \
--dimensions Name=LoadBalancerName,Value=my-load-balancer \
--start-time $(date -d "1 hour ago" +%FT%T) \
--end-time $(date +%FT%T) \
--period 60 \
--statistics Sum
While adjusting keepalive solves the immediate issue, consider these architectural improvements (the load-balancer-side changes are sketched after the list):
- Implement Nginx as reverse proxy (recommended pattern)
- Use Application Load Balancer instead of Classic ELB
- Enable connection draining on ELB
- Implement health checks with stricter thresholds
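The ALB and connection-draining changes can be scripted as well; a boto3 sketch with placeholder names and ARN:
# Classic ELB: enable connection draining (placeholder name)
import boto3

boto3.client("elb").modify_load_balancer_attributes(
    LoadBalancerName="my-load-balancer",
    LoadBalancerAttributes={"ConnectionDraining": {"Enabled": True, "Timeout": 300}},
)

# ALB: set the idle timeout explicitly (keep it below Gunicorn's keepalive)
boto3.client("elbv2").modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-alb/1234567890abcdef",
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "60"}],
)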
We measured these metrics before/after the change:
| Metric | Before (2s) | After (65s) |
|---|---|---|
| 504 Errors/hour | 127 | 0 |
| TCP Connections | Higher | Lower |
| CPU Usage | Spiky | Stable |
The trade-off between connection reuse and resource usage becomes clear.