Debugging Gunicorn Critical Worker Timeout: EPIPE Errors and Nginx 502/504 Gateway Solutions


When your Gunicorn workers repeatedly time out with EPIPE errors despite identical server configurations, you're likely facing one of a handful of underlying issues. The log pattern usually looks like this:

# Typical error sequence in logs
[CRITICAL] WORKER TIMEOUT (pid:4994)
[INFO] Booting worker with pid: 22140  
[DEBUG] Ignoring EPIPE
[CRITICAL] WORKER TIMEOUT (pid:4993)
[ERROR] 502 Bad Gateway (Nginx)

These gunicorn.conf.py settings have resolved timeout issues in production environments:

import multiprocessing

# A common starting point: (2 x CPU cores) + 1 workers
workers = (2 * multiprocessing.cpu_count()) + 1
timeout = 120
keepalive = 75
graceful_timeout = 30
worker_class = 'gevent'  # or 'uvicorn.workers.UvicornWorker' for ASGI apps

# For I/O-bound apps also set:
worker_connections = 1000
max_requests = 1000
max_requests_jitter = 50

The "Ignoring EPIPE" messages indicate broken pipe connections between Nginx and Gunicorn. Common triggers include:

  • Network instability between containers/VMs
  • OS-level socket buffer limits (see the quick check after this list)
  • Keepalive misconfiguration
  • DNS resolution delays
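If you suspect the buffer-limit item, the relevant kernel settings are quick to inspect. Here is a minimal Python sketch that reads them straight from /proc (Linux only; appropriate target values depend on your traffic, so treat the comments as rough guidance):

# check_socket_limits.py - inspect kernel settings that affect proxy socket behaviour
from pathlib import Path

SETTINGS = {
    "net.core.somaxconn": "/proc/sys/net/core/somaxconn",            # cap on listen() backlog
    "net.core.rmem_max": "/proc/sys/net/core/rmem_max",              # max receive buffer (bytes)
    "net.core.wmem_max": "/proc/sys/net/core/wmem_max",              # max send buffer (bytes)
    "net.ipv4.tcp_fin_timeout": "/proc/sys/net/ipv4/tcp_fin_timeout",
}

for name, path in SETTINGS.items():
    print(f"{name} = {Path(path).read_text().strip()}")

The same values are available via sysctl; raising them is a host-level decision, not something the application should attempt at runtime.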

Add these directives to your nginx.conf:

location / {
    proxy_pass http://unix:/tmp/gunicorn.sock;
    proxy_read_timeout 300s;
    proxy_connect_timeout 75s;
    proxy_send_timeout 60s;
    
    proxy_buffer_size 128k;
    proxy_buffers 4 256k;
    proxy_busy_buffers_size 256k;
    
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}
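For the proxy_pass line above to work, Gunicorn has to listen on the same unix socket. A minimal companion snippet for gunicorn.conf.py, assuming the /tmp/gunicorn.sock path shown above:

# gunicorn.conf.py - bind to the unix socket that Nginx proxies to
bind = "unix:/tmp/gunicorn.sock"
# Give the group read/write access to the socket file so the Nginx
# worker user can connect (run both services under a shared group).
umask = 0o007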

Run these when timeouts occur:

# Check system resource limits
ulimit -a
cat /proc/$(pgrep -o gunicorn)/limits   # -o targets the Gunicorn master process

# Monitor socket connections
ss -tulpn | grep gunicorn
netstat -tnlp | grep ':80'

# Debug a hung worker (pgrep -n picks one worker PID; substitute a specific PID if needed)
strace -p $(pgrep -nf "gunicorn: worker")
gdb -p $(pgrep -nf "gunicorn: worker") -ex "thread apply all bt" -batch
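strace and gdb look at the worker from the outside; for a Python-level view of where a worker was stuck, Gunicorn's server hooks can dump a traceback at the moment of the timeout. A sketch you could add to gunicorn.conf.py (worker_abort fires when the master aborts a timed-out worker):

# gunicorn.conf.py - dump Python stack traces when a worker is aborted for timing out
import sys
import faulthandler

def worker_abort(worker):
    # The master sends SIGABRT to a worker that exceeds the timeout; this hook
    # runs inside that worker, so the dump shows exactly where it was stuck.
    worker.log.warning("worker %s aborted, dumping stack traces", worker.pid)
    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)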

For more reliable worker management with systemd:

[Unit]
Description=gunicorn daemon
After=network.target

[Service]
User=www-data
Group=www-data
WorkingDirectory=/your/project/path
ExecStart=/path/to/gunicorn --config gunicorn.conf.py wsgi:application
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID  
Restart=on-failure
RestartSec=5s
KillSignal=SIGQUIT
TimeoutStopSec=5
PrivateTmp=true

[Install]  
WantedBy=multi-user.target
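The wsgi:application target in ExecStart means a wsgi.py module that exposes a WSGI callable named application. If your project does not already have one, a minimal placeholder (assuming Flask; substitute your framework's app object) looks like this:

# wsgi.py - the "wsgi:application" entry point referenced in ExecStart above
from flask import Flask

application = Flask(__name__)  # replace with your project's app object or factory call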

When running a Python web application with Gunicorn and Nginx, you might encounter persistent worker timeouts accompanied by these telltale signs:

2023-08-20 14:29:53 [1267] [CRITICAL] WORKER TIMEOUT (pid:4994)
2023-08-20 14:29:53 [22140] [INFO] Booting worker with pid: 22140
2023-08-20 14:29:53 [22140] [DEBUG] Ignoring EPIPE

The cycle typically continues until you manually restart Gunicorn, with Nginx returning 502/504 errors to end users.

From debugging similar setups, I've found these frequent culprits:

  • Resource starvation (CPU/memory contention)
  • Blocking operations in application code (see the example after this list)
  • Insufficient worker timeout configuration
  • Network connectivity issues between Nginx and Gunicorn
  • Socket buffer overflow conditions
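To illustrate the second point, the most common blocking culprit is an outbound HTTP call with no timeout: one slow upstream service pins a sync worker until Gunicorn's own timeout kills it. A hedged sketch (the /report route and upstream URL are invented for illustration):

# Example Flask view: an unbounded outbound call vs. one that fails fast
import requests
from flask import Flask

app = Flask(__name__)

@app.route("/report")
def report():
    # BAD: no timeout - if the upstream stalls, this worker blocks until
    # Gunicorn's WORKER TIMEOUT fires and the request is lost:
    # resp = requests.get("https://upstream.example.com/stats")

    # BETTER: bound connect and read time so the view fails fast instead:
    resp = requests.get("https://upstream.example.com/stats", timeout=(3.05, 10))
    resp.raise_for_status()
    return resp.text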

Here's a battle-tested Gunicorn configuration that handles heavy workloads:

# gunicorn_config.py
workers = 4
worker_class = 'gevent'
worker_connections = 1000
timeout = 120
keepalive = 60
graceful_timeout = 30
limit_request_line = 4094
limit_request_fields = 100

Key adjustments:

  • Increased timeout from default 30s to 120s
  • Added gevent worker class for async operations
  • Configured keepalive to maintain stable connections

The "Ignoring EPIPE" messages typically indicate broken pipe conditions. This Nginx configuration helps stabilize the proxy connection:

# nginx.conf
proxy_connect_timeout 600s;
proxy_send_timeout 600s;
proxy_read_timeout 600s;
send_timeout 600s;
proxy_buffer_size 128k;
proxy_buffers 4 256k;
proxy_busy_buffers_size 256k;

For production systems, implement this monitoring snippet to catch issues early:

#!/bin/bash
# monitor_gunicorn.sh

while true; do
    if ! curl -fs --max-time 5 http://localhost:8000/health-check | grep -q 'OK'; then
        systemctl restart gunicorn
        echo "$(date) - Restarted Gunicorn" >> /var/log/gunicorn_monitor.log
    fi
    sleep 30  # pause between checks (and after a restart) to avoid a tight restart loop
done
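The script assumes the application exposes a /health-check endpoint that returns the literal text OK. If yours does not have one yet, a minimal Flask version (route name chosen to match the script above) might look like this:

# Minimal health-check endpoint matching the monitor script's expectations
from flask import Flask

app = Flask(__name__)

@app.route("/health-check")
def health_check():
    # Keep this deliberately cheap: no database or external calls, so a slow
    # dependency cannot trigger unnecessary restarts.
    return "OK", 200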

If timeouts persist after these adjustments:

  1. Run strace -p [worker_pid] to identify blocking syscalls
  2. Check dmesg for OOM killer activity
  3. Profile application with py-spy or cProfile
  4. Consider moving CPU-bound tasks to Celery (see the sketch below)
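For the last point, the idea is to return a response immediately and run the heavy work outside the request/worker lifecycle. A minimal Celery sketch, assuming a local Redis broker (the task name and broker URL are placeholders):

# tasks.py - offload CPU-bound work so web workers never hit the timeout
from celery import Celery

celery_app = Celery("tasks", broker="redis://localhost:6379/0")

@celery_app.task
def generate_report(report_id):
    # Stand-in for CPU-heavy work (report generation, image processing, etc.);
    # it runs in a Celery worker process, not a Gunicorn web worker.
    return sum(i * i for i in range(10_000_000))

# In a web view, enqueue and return immediately instead of computing inline:
# generate_report.delay(report_id=42)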