Implementing High-Availability Nginx Load Balancers: Redundancy Architectures and Failover Strategies


When building self-hosted load balancing solutions with Nginx, the load balancer itself becomes a critical single point of failure: we meticulously design backend server redundancy while the LB layer goes overlooked. Traditional setups like:

upstream backend {
    server 10.0.1.101;
    server 10.0.1.102;
}

don't address what happens when the Nginx instance managing these backends goes down. Let's examine practical approaches to eliminating this vulnerability.

The most common solution is Keepalived, which floats a virtual IP (VIP) between nodes using VRRP:

# /etc/keepalived/keepalived.conf (Master)
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass yourpassword    # only the first 8 characters are used
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0
    }
}

# /etc/keepalived/keepalived.conf (Backup)
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass yourpassword    # must match the master
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0
    }
}
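
By itself this only covers whole-host failure. To fail over when the Nginx process dies while the host stays up, track a check script on both nodes. A minimal sketch; the pgrep test is an assumption, so substitute whatever health command fits your environment:

# Add to keepalived.conf on both nodes
vrrp_script chk_nginx {
    script "/usr/bin/pgrep -x nginx"   # non-zero exit marks the node unhealthy
    interval 2                         # check every 2 seconds
    weight -20                         # subtract 20 from priority on failure
    fall 2                             # failures required to trigger
    rise 2                             # successes required to recover
}

vrrp_instance VI_1 {
    # ...existing settings...
    track_script {
        chk_nginx
    }
}

With priorities 101 and 100, the 20-point penalty drops a failed master below the backup and the VIP moves automatically.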

While DNS failover works, its effectiveness depends entirely on TTL behavior, and typical settings make it too slow on its own:

  • Typical TTLs of 300s mean 5+ minutes of downtime
  • DNS caching by clients and resolvers extends the actual failover time
  • Health-check results take additional time to propagate
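
That said, aggressively low TTLs combined with provider-side health checks (for example, Route53 health checks with 60-second TTLs) can make DNS failover a useful complement to VIP failover:

# BIND Zone File Example
$TTL 60
@       IN      SOA     ns1.example.com. admin.example.com. (
                        2023081501 ; serial
                        3600       ; refresh
                        900        ; retry
                        604800     ; expire
                        60         ; minimum TTL
                        )
        IN      NS      ns1.example.com.
        IN      NS      ns2.example.com.

www     IN      A       192.168.1.100
        IN      A       192.168.1.101 ; secondary IP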

For high-traffic or globally distributed environments, consider an active-active setup with BGP anycast. This requires support from your network infrastructure:

# Example BGP configuration (using BIRD)
protocol bgp {
    local as 64512;
    neighbor 192.168.1.254 as 64511;
    import none;                      # learn no routes from the router
    export where proto = "direct";    # announce only locally bound VIPs
}

This allows multiple load balancers to announce the same anycast IP; when one node stops advertising, the router withdraws its route and traffic shifts to the remaining announcers.
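
For that export filter to announce anything, BIRD must see the VIP as a locally configured address. A sketch of the remaining wiring, with 203.0.113.10 as a placeholder address (some setups use a dummy interface instead of lo):

# bird.conf: generate routes from locally bound addresses
protocol direct {
    interface "lo";
}

# On each LB node, bind the shared anycast IP to the loopback
ip addr add 203.0.113.10/32 dev lo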

Essential health checks should cover both L4 and L7. At minimum, expose a lightweight endpoint on each load balancer:

location /nginx-health {
    access_log off;
    default_type text/plain;
    return 200 "healthy";
}

# Connection statistics for monitoring hosts
location /lb-status {
    stub_status on;
    allow 127.0.0.1;
    allow 192.168.1.0/24;
    deny all;
}
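
An external monitor or a keepalived check script can then probe both layers; a sketch, using the VIP from the examples above:

# L4 probe: verify a TCP connection can be opened at all
nc -z -w 2 192.168.1.100 80

# L7 probe: verify Nginx answers the health endpoint with a 200
curl -fsS -m 2 http://192.168.1.100/nginx-health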

Keep configurations identical across LB nodes using rsync, git, or similar:

# Example crontab entry for config sync (validate, then reload)
*/5 * * * * rsync -az /etc/nginx/ backup-lb:/etc/nginx/ && ssh backup-lb "nginx -t && systemctl reload nginx"

Consider tools like Ansible for more complex synchronization:

- hosts: loadbalancers
  become: yes
  tasks:
    - name: Sync nginx config
      synchronize:
        src: /etc/nginx/
        dest: /etc/nginx/
      notify: Reload nginx
  handlers:
    - name: Reload nginx
      service:
        name: nginx
        state: reloaded
Sticky sessions complicate failover. With ip_hash, affinity is deterministic: every LB node hashes the client address the same way over the same server list, so a client reaches the same backend no matter which load balancer takes over:

upstream backend {
    ip_hash;
    server 10.0.1.101;
    server 10.0.1.102;
}

Better still, move session state into shared storage such as Redis so any backend can serve any client and affinity stops mattering. If your backends expect the session identifier in a header, forward it explicitly:

proxy_set_header X-Sticky-Session $cookie_JSESSIONID;
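
If you run NGINX Plus, cookie-based affinity is another option that survives LB failover, because the chosen backend is encoded in the client's cookie rather than in LB-local state (a sketch; the zone name is arbitrary):

# NGINX Plus cookie-based session affinity
upstream backend {
    zone backend 64k;                        # shared memory for upstream state
    server 10.0.1.101;
    server 10.0.1.102;
    sticky cookie srv_id expires=1h path=/;  # backend choice stored client-side
}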
