Nginx “no live upstreams” Error: Diagnosing and Fixing 502 Bad Gateway with Load Balancing



You're seeing intermittent 502 Bad Gateway errors with the specific error message "no live upstreams while connecting to upstream". This occurs:

  • During page transitions between site sections
  • On the homepage when accessed via internal redirects
  • Particularly affecting JavaScript file delivery

Your current load balancing setup shows several potential issues:

upstream example.com {
  # ip_hash;
  server php01 max_fails=3 fail_timeout=15s;
  server php02 max_fails=3 fail_timeout=15s;
}

server {
  listen IP:80;
  server_name example.com;
  
  location / {
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_pass http://$server_name/$uri;
    # ... other proxy settings ...
  }
}

After analyzing similar cases, these are the most likely causes:

1. Upstream Health Checks

The max_fails and fail_timeout parameters might be too aggressive. With max_fails=3 fail_timeout=15s, three failed attempts mark a backend as unavailable for 15 seconds; if both backends trip at the same time, Nginx has no server left to route to and logs "no live upstreams" until the window expires.
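
One way to soften this is to widen the passive-check window and keep a backup server that only takes traffic once both primaries are marked down. A minimal sketch (the 192.168.1.12 address is an assumption):

upstream example.com {
  # it now takes 5 failures within 30s to mark a backend down,
  # and a marked backend is retried again after 30s
  server php01 max_fails=5 fail_timeout=30s;
  server php02 max_fails=5 fail_timeout=30s;

  # hypothetical last-resort backend: only receives traffic when both
  # primaries are marked down, avoiding the "no live upstreams" dead end
  server 192.168.1.12:80 backup;
}

Note that the backup parameter cannot be combined with the ip_hash or hash balancing methods, so drop it if you enable stickiness further down.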

2. DNS Resolution

Using $server_name in proxy_pass makes Nginx evaluate the target at request time (the name is looked up among defined upstream groups first, and otherwise needs a resolver), and appending /$uri both doubles the leading slash and silently drops the query string. Pointing proxy_pass at the upstream group name directly avoids all of this.
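
For illustration only, the change is confined to the proxy_pass target (a sketch against the current upstream name):

location / {
  # problematic: evaluated per request, doubles the slash and drops ?query
  # proxy_pass http://$server_name/$uri;

  # preferred: static reference to the upstream block; the original
  # request URI and query string are forwarded unchanged
  proxy_pass http://example.com;
}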

3. Session Persistence

The commented ip_hash suggests you considered session stickiness, which might be necessary for your application.

Here's an improved configuration that addresses these issues:

upstream backend_cluster {
  ip_hash;
  server 192.168.1.10:80 max_fails=3 fail_timeout=30s;
  server 192.168.1.11:80 max_fails=3 fail_timeout=30s;
  keepalive 32;
}

server {
  listen 80;
  server_name example.com;
  
  location / {
    proxy_pass http://backend_cluster;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    
    # Health check settings
    proxy_next_upstream error timeout http_502 http_503;
    proxy_next_upstream_timeout 2s;
    proxy_next_upstream_tries 3;
    
    # Timeout configurations
    proxy_connect_timeout 5s;
    proxy_send_timeout 10s;
    proxy_read_timeout 30s;
  }
  
  # Special handling for JS files
  location ~* \.js$ {
    # NOTE: the "my_js_cache" zone must be declared with proxy_cache_path
    # in the http context (see the note after the improvements list below)
    proxy_cache my_js_cache;
    proxy_cache_valid 200 302 1h;
    proxy_cache_bypass $http_cache_control;
    expires 1h;
    add_header Cache-Control "public";
    proxy_pass http://backend_cluster;
  }
}

Key improvements in this configuration:

  • Explicit IP addresses instead of hostnames
  • Connection keepalive between Nginx and the backends
  • Failover/retry behaviour defined with proxy_next_upstream
  • Special handling for JavaScript assets
  • More realistic timeout values
  • Enabled ip_hash for session consistency
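
One detail the block above assumes: the my_js_cache zone has to exist before proxy_cache can reference it. A minimal declaration (path and sizes are placeholders, tune them to your disk and traffic):

# in the http {} context
proxy_cache_path /var/cache/nginx/js levels=1:2 keys_zone=my_js_cache:10m
                 max_size=256m inactive=60m use_temp_path=off;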

After implementing changes:

  1. Test with curl -I http://example.com to check headers
  2. Monitor error logs: tail -f /var/log/nginx/error.log
  3. Verify upstream status: nginx -T | grep -A10 upstream
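
It can also help to hit each backend directly with the public Host header, so you know whether a 502 originates on the backends or in the proxy layer. A sketch, assuming the php01/php02 hostnames resolve from the proxy host:

for host in php01 php02; do
  # print only the status code per backend; point the path at a real page
  curl -s -o /dev/null -w "$host -> %{http_code}\n" -H "Host: example.com" "http://$host/"
done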

If issues persist:

# Check active connections
ss -ant | grep '80'

# Verify upstream reachability
for ip in 192.168.1.10 192.168.1.11; do
  curl -v --connect-timeout 3 http://$ip/health-check
done

# Nginx debug logging (these go in nginx.conf, then reload; the debug level
# requires a binary built with --with-debug, check "nginx -V")
error_log /var/log/nginx/debug.log debug;
rewrite_log on;

When dealing with intermittent 502 errors in an Nginx load-balancing setup, the pattern of failures narrows the cause down considerably:

  • First homepage request succeeds (indicates basic connectivity works)
  • Subsequent page transitions fail (suggestive of connection handling issues)
  • JavaScript files occasionally fail (points to timeout/keepalive problems)

The current setup has several problematic areas:

upstream example.com {
    # ip_hash;
    server php01 max_fails=3 fail_timeout=15s;
    server php02 max_fails=3 fail_timeout=15s;
}

Key problems in this configuration:

  1. No runtime DNS re-resolution of the backend hostnames (the resolve parameter covers this, but see the caveats in the config below)
  2. No active health checks, only the default passive max_fails/fail_timeout behaviour
  3. Basic round-robin without session persistence (see the sketch after this list)
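
For the third point, open-source Nginx offers ip_hash or a generic hash key for stickiness (cookie-based sticky sessions are an NGINX Plus feature). A minimal sketch, with the cookie name purely illustrative:

upstream example.com {
    ip_hash;                              # pin each client IP to one backend
    # hash $cookie_sessionid consistent;  # alternative: key on a session cookie
    server php01:80 max_fails=3 fail_timeout=15s;
    server php02:80 max_fails=3 fail_timeout=15s;
}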

Add these directives to your Nginx configuration (contexts noted in the comments):

# http, server or location context: how long Nginx waits on the backends
proxy_connect_timeout 5s;
proxy_send_timeout 10s;
proxy_read_timeout 30s;

# client-side keepalive (http or server context); keepalive towards the
# backends is configured separately in the upstream block below
keepalive_timeout 60s;
keepalive_requests 100;

Here's the optimized version with all necessary fixes:

upstream example.com {
    zone backend 64k;
    # "resolve" re-resolves php01/php02 at runtime; it needs a resolver
    # directive (sketched below) and NGINX Plus or a recent open-source
    # release; drop the parameter if neither applies
    server php01:80 max_fails=3 fail_timeout=15s resolve;
    server php02:80 max_fails=3 fail_timeout=15s resolve;

    keepalive 32;
    keepalive_timeout 60s;    # valid in the upstream context since nginx 1.15.3
}
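
The resolver itself goes in the http or server context; which nameserver to point it at is environment-specific (127.0.0.53 below assumes systemd-resolved and is only an example):

# http {} or server {} context
resolver 127.0.0.53 valid=30s ipv6=off;
resolver_timeout 5s;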

server {
    listen IP:80;
    server_name example.com;
    
    # Connection handling
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    
    # Timeout settings
    proxy_connect_timeout 5s;
    proxy_send_timeout 10s;
    proxy_read_timeout 30s;
    
    # Standard proxy headers
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    
    location / {
        proxy_pass http://example.com;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_next_upstream_timeout 5s;
        proxy_next_upstream_tries 3;
    }
    
    # Static asset handling
    location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
        expires 30d;
        add_header Cache-Control "public, no-transform";
        proxy_pass http://example.com;
    }
}

Verify your configuration with these commands:

# Check nginx syntax
sudo nginx -t

# Check basic connection counters (requires a stub_status location, see below)
curl http://localhost/nginx_status

# Monitor TCP connections
ss -ant | grep ESTAB | grep 80

# Check error patterns
tail -f /var/log/nginx/example.com.error | grep -E '502|upstream'
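
The /nginx_status endpoint is not there by default; it comes from the stub_status module and a location like the following (path and allow-list are assumptions):

location = /nginx_status {
    stub_status;
    allow 127.0.0.1;
    deny all;
}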

For more reliable upstream monitoring, NGINX Plus adds active health checks driven by a match block (open-source Nginx only has the passive max_fails/fail_timeout checks used above):

match server_ok {
    status 200-399;
    header Content-Type ~ "text/html";
    body !~ "maintenance";
}

upstream example.com {
    zone backend 64k;
    server php01:80;
    server php02:80;
}

# inside the location that proxies to this upstream:
# health_check interval=5 passes=2 fails=3 match=server_ok;

Common pitfalls to avoid (summarised in the sketch after this list):

  • Not reusing keepalive connections between Nginx and the backends
  • Setting proxy timeouts too low for PHP applications
  • Missing the proxy_http_version 1.1 directive
  • Overlooking stale DNS in proxied hostnames (the resolve parameter helps where it is available)
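
The first and third points come down to three directives that only work as a set; a minimal sketch reusing the backend_cluster name from earlier:

upstream backend_cluster {
    server 192.168.1.10:80;
    server 192.168.1.11:80;
    keepalive 32;                       # pool of idle upstream connections per worker
}

server {
    listen 80;
    location / {
        proxy_pass http://backend_cluster;
        proxy_http_version 1.1;         # upstream keepalive needs HTTP/1.1
        proxy_set_header Connection ""; # clear "close" so connections are reused
    }
}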