Nginx “no live upstreams” Error: Diagnosing and Fixing 502 Bad Gateway with Load Balancing

You're seeing intermittent 502 Bad Gateway errors with the specific error message "no live upstreams while connecting to upstream". This occurs:

  • During page transitions between site sections
  • On the homepage when accessed via internal redirects
  • Particularly affecting JavaScript file delivery

Your current load balancing setup shows several potential issues:

upstream example.com {
  # ip_hash;
  server php01 max_fails=3 fail_timeout=15s;
  server php02 max_fails=3 fail_timeout=15s;
}

server {
  listen IP:80;
  server_name example.com;
  
  location / {
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_pass http://$server_name/$uri;
    # ... other proxy settings ...
  }
}

After analyzing similar cases, these are the most likely causes:

1. Upstream Health Checks

The max_fails and fail_timeout parameters might be too aggressive. When both upstreams fail simultaneously, Nginx has nowhere to route traffic.
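
One way to make this less brittle, as a rough sketch using your hostnames (the parameter values and the optional php03 host are illustrative, not taken from your setup):

upstream example_backend {
  server php01 max_fails=5 fail_timeout=30s;
  server php02 max_fails=5 fail_timeout=30s;
  # Optional last-resort server, only used when both primaries are marked down
  # (note: backup cannot be combined with ip_hash)
  # server php03 backup;
}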

2. DNS Resolution

Using the variable $server_name in proxy_pass makes the target dynamic: the request only reaches your upstream block because its name happens to match the domain, and the appended /$uri bypasses Nginx's normal URI handling. Give the upstream group a distinct name and reference it directly.
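
A minimal sketch of the safer pattern (php_backend is a placeholder name, not from your config):

upstream php_backend {
  server php01;
  server php02;
}

server {
  location / {
    proxy_set_header Host $host;
    # Fixed upstream name; Nginx forwards the original request URI on its own
    proxy_pass http://php_backend;
  }
}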

3. Session Persistence

The commented ip_hash suggests you considered session stickiness, which might be necessary for your application.
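
ip_hash is enabled in the improved configuration below; if your application sets a session cookie, hashing on that cookie is another open-source option. A sketch, assuming the default PHP session cookie name (php_backend is again a placeholder):

upstream php_backend {
  ip_hash;                               # stick clients to a backend by IP
  # hash $cookie_PHPSESSID consistent;   # alternative: stick by session cookie
  server php01;
  server php02;
}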

Here's an improved configuration that addresses these issues:

upstream backend_cluster {
  ip_hash;
  server 192.168.1.10:80 max_fails=3 fail_timeout=30s;
  server 192.168.1.11:80 max_fails=3 fail_timeout=30s;
  keepalive 32;
}

server {
  listen 80;
  server_name example.com;
  
  location / {
    proxy_pass http://backend_cluster;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    
    # Failover / retry settings (passive checks via proxy_next_upstream)
    proxy_next_upstream error timeout http_502 http_503;
    proxy_next_upstream_timeout 2s;
    proxy_next_upstream_tries 3;
    
    # Timeout configurations
    proxy_connect_timeout 5s;
    proxy_send_timeout 10s;
    proxy_read_timeout 30s;
  }
  
  # Special handling for JS files
  location ~* \.js$ {
    proxy_cache my_js_cache;                 # zone must be declared with proxy_cache_path at the http level
    proxy_cache_valid 200 302 1h;
    proxy_cache_bypass $http_cache_control;
    expires 1h;
    add_header Cache-Control "public";
    proxy_set_header Host $host;             # set explicitly: headers from "location /" are not inherited here
    proxy_pass http://backend_cluster;
  }
}
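
The my_js_cache zone referenced above must be declared at the http level before it can be used; a minimal sketch (the path and sizes are assumptions, adjust to your environment):

# In the http {} block
proxy_cache_path /var/cache/nginx/js_cache levels=1:2 keys_zone=my_js_cache:10m
                 max_size=100m inactive=60m use_temp_path=off;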

Key changes in this configuration:

  • Explicit IP addresses instead of hostnames
  • Added connection keepalive
  • Proper health check configuration
  • Special handling for JavaScript assets
  • More realistic timeout values
  • Enabled ip_hash for session consistency

After implementing changes:

  1. Test with curl -I http://example.com to check headers
  2. Monitor error logs: tail -f /var/log/nginx/error.log
  3. Verify the upstream configuration: nginx -T | grep -A10 upstream

If issues persist:

# Check active connections
ss -ant | grep '80'

# Verify upstream reachability
for ip in 192.168.1.10 192.168.1.11; do
  curl -v --connect-timeout 3 http://$ip/health-check
done

# Nginx debug logging (these are config directives, not shell commands; add to the server block and reload)
error_log /var/log/nginx/debug.log debug;
rewrite_log on;

When dealing with intermittent 502 errors in an Nginx load-balancing setup, the specific failure pattern tells us a lot:

  • First homepage request succeeds (indicates basic connectivity works)
  • Subsequent page transitions fail (suggestive of connection handling issues)
  • JavaScript files occasionally fail (points to timeout/keepalive problems)

The current setup has several problematic areas:

upstream example.com {
    # ip_hash;
    server php01 max_fails=3 fail_timeout=15s;
    server php02 max_fails=3 fail_timeout=15s;
}

Key problems in this configuration:

  1. No DNS re-resolution for the upstream hostnames (the resolve parameter covers this on NGINX Plus; an open-source workaround is sketched after this list)
  2. No health check configuration
  3. Basic round-robin without session persistence
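
On open-source Nginx, a common workaround for stale DNS is a resolver directive plus a variable in proxy_pass, which forces periodic re-resolution. A rough sketch (the resolver address is an assumption; point it at your own DNS server):

server {
    listen 80;
    server_name example.com;
    resolver 127.0.0.53 valid=30s;   # assumed local stub resolver
    
    location / {
        set $backend "php01";        # hostname re-resolved according to valid= above
        proxy_set_header Host $host;
        proxy_pass http://$backend;
    }
}

The trade-off is that a variable proxy_pass bypasses the upstream block, so you lose load balancing between php01 and php02 in that location.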

Add these directives to your nginx configuration (they are valid at the http, server, or location level):

proxy_connect_timeout 5s;
proxy_send_timeout 10s;
proxy_read_timeout 30s;
keepalive_timeout 60s;
keepalive_requests 100;

Here's the optimized version with all necessary fixes:

upstream example.com {
    zone backend 64k;
    # Note: "resolve" also needs a "resolver" directive and has historically
    # been an NGINX Plus feature; drop it on open-source builds
    server php01:80 max_fails=3 fail_timeout=15s resolve;
    server php02:80 max_fails=3 fail_timeout=15s resolve;
    
    keepalive 32;
    keepalive_timeout 60s;
}

server {
    listen IP:80;
    server_name example.com;
    
    # Connection handling
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    
    # Timeout settings
    proxy_connect_timeout 5s;
    proxy_send_timeout 10s;
    proxy_read_timeout 30s;
    
    # Standard proxy headers
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    
    location / {
        proxy_pass http://example.com;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_next_upstream_timeout 5s;
        proxy_next_upstream_tries 3;
    }
    
    # Static asset handling
    location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
        expires 30d;
        add_header Cache-Control "public, no-transform";
        proxy_pass http://example.com;
    }
}

Verify your configuration with these commands:

# Check nginx syntax
sudo nginx -t

# Check connection stats (needs a stub_status location, sketched below)
curl http://localhost/nginx_status

# Monitor TCP connections
ss -ant | grep ESTAB | grep 80

# Check error patterns
tail -f /var/log/nginx/example.com.error | grep -E '502|upstream'
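
The /nginx_status endpoint above only exists if you expose stub_status yourself; a minimal sketch:

# Inside the server {} block
location /nginx_status {
    stub_status;
    allow 127.0.0.1;   # restrict to localhost
    deny all;
}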

For more reliable upstream monitoring, NGINX Plus supports active health checks driven by a match block (open-source Nginx would need a third-party module such as nginx_upstream_check_module instead):

match server_ok {
    status 200-399;
    header Content-Type ~ "text/html";
    body !~ "maintenance";
}

upstream example.com {
    zone backend 64k;
    server php01:80;
    server php02:80;
}

server {
    location / {
        proxy_pass http://example.com;
        health_check interval=5 fails=3 passes=2 match=server_ok;
    }
}

Common pitfalls to avoid in this kind of setup:

  • Not reusing keepalive connections between Nginx and the backends
  • Setting proxy timeouts too low for PHP applications
  • Missing the proxy_http_version 1.1 directive
  • Overlooking DNS caching (re-resolve upstream hostnames via resolve or a resolver)