How to Implement Automatic Failover in Nginx Load Balancing with Health Checks


When using Nginx's basic upstream configuration for load balancing, you'll quickly encounter a critical limitation - it doesn't automatically detect and exclude failed backend servers. Here's the problematic configuration many developers start with:

upstream lb {
    server 127.0.0.1:8081;
    server 127.0.0.1:8082;
}

With this setup, Nginx keeps rotating requests across both servers in round-robin fashion. Its default passive checks will eventually sideline a dead backend, but with untuned parameters roughly half of incoming requests are still sent to the failed server first and stall until the proxy timeout expires - completely unacceptable for production environments.
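The round-robin behaviour can be sketched in a few lines of Python (a model of the balancer's rotation, not Nginx itself): with no health information, every other request still lands on the dead backend.

```python
from itertools import cycle

# Plain round-robin over two backends; 8082 is assumed to be down.
backends = ["127.0.0.1:8081", "127.0.0.1:8082"]
alive = {"127.0.0.1:8081"}

rr = cycle(backends)
results = [("ok" if next(rr) in alive else "timeout") for _ in range(6)]
print(results)  # half of the simulated requests hit the dead backend
```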

The first step is to tune Nginx's built-in passive health checks with the max_fails and fail_timeout parameters; if you are on Nginx Plus, you can additionally enable active health checks with the health_check directive. Here's how to modify your configuration:

upstream lb {
    zone lb 64k;    # shared memory zone, required by health_check
    server 127.0.0.1:8081 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8082 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name localhost;

    location / {
        proxy_pass http://lb;
        # Active checks require Nginx Plus; open-source Nginx falls back
        # on the passive max_fails/fail_timeout checks above.
        health_check interval=5s fails=3 passes=2 uri=/health;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
  • max_fails: Number of failed attempts within fail_timeout before the server is marked unavailable
  • fail_timeout: Both the window in which failures are counted and the time the server is then considered down
  • health_check: Enables active health probes at a configurable interval (Nginx Plus only)
  • proxy_next_upstream: Defines which error conditions make Nginx retry the request on the next server
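The passive-check semantics can be modelled in a short Python sketch (a simplified model of the accounting, not Nginx's actual implementation): failures are counted within a fail_timeout window, and once max_fails is reached the server is skipped for fail_timeout seconds.

```python
import time

class PassiveCheck:
    """Simplified model of Nginx's max_fails / fail_timeout accounting."""

    def __init__(self, max_fails=3, fail_timeout=30.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = 0
        self.window_start = 0.0
        self.down_until = 0.0

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        # Reset the counter if earlier failures fell out of the window.
        if now - self.window_start > self.fail_timeout:
            self.fails = 0
            self.window_start = now
        self.fails += 1
        if self.fails >= self.max_fails:
            # Mark the server down for fail_timeout seconds.
            self.down_until = now + self.fail_timeout
            self.fails = 0

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self.down_until

check = PassiveCheck(max_fails=3, fail_timeout=30.0)
for t in (0, 1, 2):              # three failures inside the window
    check.record_failure(now=t)
print(check.available(now=3))    # False: marked down until t=32
print(check.available(now=40))   # True: fail_timeout has elapsed
```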

For mission-critical applications, consider these additional optimizations:

upstream lb {
    zone backend 64k;    # shared memory so all worker processes share state
    least_conn;          # route to the server with the fewest active connections
    
    server backend1.example.com:8081 max_fails=3 fail_timeout=30s;
    server backend2.example.com:8082 max_fails=3 fail_timeout=30s;
    server backup.example.com:8083 backup;    # used only when both primaries are down
}

# Note: health probes are sent to the upstream servers themselves, so a
# /health location like this belongs in each backend's server block.
server {
    location /health {
        access_log off;    # keep probes out of the access log
        add_header Content-Type text/plain;
        return 200 "OK";
    }
}
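For the /health URI to mean anything, each backend application must actually answer it. A minimal backend sketch in Python's standard library (the port is taken from the examples above and is an assumption):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        # Keep the frequent health probes out of the log, mirroring
        # the access_log off setting on the Nginx side.
        pass

# Run one of these per backend, e.g.:
# HTTPServer(("127.0.0.1", 8081), Handler).serve_forever()
```

In a real deployment the handler would also verify downstream dependencies (database, cache) before returning 200, so a sick backend is pulled out of rotation rather than merely a dead one.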

Verify your configuration with these commands:

nginx -t                          # validate the configuration syntax
nginx -s reload                   # apply it without dropping connections
tail -f /var/log/nginx/error.log  # watch for upstream failure messages

Check the current status of your upstream servers using the stub_status module or a commercial solution like Nginx Plus, which adds a detailed live dashboard.


When using Nginx's basic upstream configuration like this:

upstream lb {
    server 127.0.0.1:8081;
    server 127.0.0.1:8082;
}

Nginx will happily distribute traffic even to servers that are down, resulting in timeouts and poor user experience. This is particularly problematic in production environments where high availability is crucial.

Nginx Plus ships with advanced active health checks; with open-source Nginx, we rely on the built-in passive checks instead. Here's how to configure proper failover:

upstream lb {
    server 127.0.0.1:8081 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8082 max_fails=3 fail_timeout=30s;
}

The key parameters are:

  • max_fails: Number of unsuccessful attempts within fail_timeout before the server is marked as down
  • fail_timeout: Both the window for counting failures and the duration the server then stays marked down

For more robust setups, consider using backup servers:

upstream lb {
    server 127.0.0.1:8081;
    server 127.0.0.1:8082;
    server 127.0.0.1:8083 backup;
    server 127.0.0.1:8084 backup;
}

Backup servers only receive traffic when all primary servers are down.
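That selection rule is easy to sketch in Python (a model of the behaviour, not Nginx's code): backup servers are considered only once every primary is unavailable.

```python
def pick_pool(primaries, backups, alive):
    """Return the servers eligible for traffic under Nginx's backup rule."""
    live_primaries = [s for s in primaries if s in alive]
    if live_primaries:
        return live_primaries
    # All primaries are down: fall back to the live backup servers.
    return [s for s in backups if s in alive]

primaries = ["127.0.0.1:8081", "127.0.0.1:8082"]
backups = ["127.0.0.1:8083", "127.0.0.1:8084"]

print(pick_pool(primaries, backups, alive={"127.0.0.1:8082", "127.0.0.1:8083"}))
# only the live primary serves: ['127.0.0.1:8082']
print(pick_pool(primaries, backups, alive={"127.0.0.1:8083"}))
# all primaries down, so the backup serves: ['127.0.0.1:8083']
```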

For more sophisticated active health checking, the health_check directive can probe backends on a schedule:

upstream lb {
    zone backend 64k;
    server 127.0.0.1:8081;
    server 127.0.0.1:8082;
    
    health_check interval=5s uri=/health;
}

This requires the ngx_http_upstream_hc_module, which ships only with Nginx Plus; on open-source Nginx, third-party modules such as nginx_upstream_check_module provide similar functionality (with a different syntax).

Here's a comprehensive example:

upstream backend {
    server backend1.example.com:8080 weight=5 max_fails=3 fail_timeout=30s;
    server backend2.example.com:8080 weight=5 max_fails=3 fail_timeout=30s;
    server backup1.example.com:8080 backup;
    server backup2.example.com:8080 backup;
    
    keepalive 32;    # pool of idle upstream connections kept per worker
}

server {
    listen 80;
    server_name example.com;
    
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_next_upstream_timeout 0;   # 0 = no time limit on retries (the default)
        proxy_next_upstream_tries 0;     # 0 = no limit on the number of tries (the default)
        
        # Required for the keepalive pool in the upstream block to take effect:
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
    
    location /health {
        access_log off;
        return 200;
    }
}

This configuration includes:

  • Weighted load balancing
  • Proper failover settings
  • Keepalive connections for performance
  • Comprehensive proxy settings
  • Health check endpoint