How to Make Nginx Ignore Unreachable Upstreams During Startup

When running Nginx with multiple site configurations, a common frustration occurs during server reboots: if the hostname of any upstream server in your config cannot be resolved at that moment (because of a DNS failure or a network issue), Nginx refuses to start altogether. One problematic upstream then becomes a single point of failure that keeps healthy sites from serving traffic. An upstream listed by IP address that is merely unreachable does not block startup; only a failed hostname lookup does.

By default, Nginx performs hostname resolution during configuration parsing. If an upstream hostname can't be resolved (for example, when your internal DNS has deregistered example2.service.example.com because the service is down), Nginx treats this as a fatal configuration error and reports "host not found in upstream".

# The hostname is resolved once, while the configuration is loaded (nginx -t or startup)
upstream example2 {
    server example2.service.example.com; # Fails if DNS returns NXDOMAIN
}

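1. Using a DNS Resolver for Runtime Resolution

Configure a resolver so that Nginx resolves the upstream hostname at runtime instead of only at startup. The valid parameter is not a timeout: it overrides the TTL of DNS answers and controls how long they are cached. The resolve parameter requires a shared memory zone in the upstream group, and in open-source Nginx it is available only from version 1.27.3 (earlier releases need NGINX Plus):

http {
    resolver 8.8.8.8 valid=30s;

    upstream example2 {
        zone example2 64k;  # resolve requires a shared memory zone
        server example2.service.example.com resolve;
    }
}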
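On open-source releases that lack resolve, a commonly used alternative is to put the hostname in a variable, which makes Nginx look it up per request through the resolver rather than at startup. The following is a minimal sketch under that assumption; the variable name $example2_host and the explicit port are illustrative and not part of the original configuration.

```nginx
http {
    resolver 8.8.8.8 valid=30s;

    server {
        listen 80;
        server_name example2.com;

        location / {
            # Hypothetical variable name. Because proxy_pass contains a
            # variable, DNS resolution is deferred from config load to
            # request time and goes through the resolver above.
            set $example2_host example2.service.example.com;
            proxy_pass http://$example2_host:80;
        }
    }
}
```

With this approach a failed lookup produces a 502 response for that site only, while Nginx itself starts normally. The trade-off is that no upstream block is involved, so upstream features such as backup servers and max_fails do not apply.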

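2. IP Fallback with Health Checks

Combine IP addresses with passive health checks (max_fails and fail_timeout). Servers listed by IP address need no DNS lookup, so they can never block startup:

upstream backend {
    server 192.168.1.1:80 max_fails=3 fail_timeout=30s;
    server backup.example.com:80 backup;  # Hostname: still resolved at startup
    server 127.0.0.1:8080 backup;  # Local fallback (down would disable it entirely)
}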

3. Configuration Splitting

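Separate your configs into critical and non-critical includes. Splitting does not isolate failures by itself: Nginx still parses every included file, and a fatal error in any one of them prevents startup. The benefit is organizational, since all sites with unreliable upstreams live in one directory that can be audited, restricted to runtime-resolved upstreams, or disabled quickly:

http {
    include /etc/nginx/conf.d/core/*.conf;  # Always available services
    include /etc/nginx/conf.d/optional/*.conf;  # Sites whose upstreams may be unavailable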
}

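4. Using the 'resolve' Parameter (Nginx Plus, or open-source Nginx 1.27.3+)

For NGINX Plus users, and for open-source Nginx 1.27.3 or later, the resolve parameter re-resolves the hostname at runtime. It requires a zone in the upstream group and a resolver directive in the http block: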

upstream dynamic {
    zone upstream_dynamic 64k;
    server example2.service.example.com resolve;
}

Here's a complete solution combining multiple techniques:

http {
    resolver 8.8.8.8 1.1.1.1 valid=10s;
    
    # Main configuration
    include /etc/nginx/sites-enabled/_stable/*.conf;
    
    # Optional configurations (upstreams here should use resolve so they cannot block startup)
    include /etc/nginx/sites-enabled/_optional/*.conf;
}

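# In your optional config file:
upstream example2 {
    zone example2 64k;  # required by resolve
    server example2.service.example.com resolve max_fails=2;
    server fallback.example.com:80 backup resolve;  # without resolve this hostname would be looked up at startup
    server 127.0.0.1:8080 down;  # kept in the config but excluded from rotation
}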

server {
    listen 80;
    server_name example2.com;
    
    location / {
        proxy_pass http://example2;
        proxy_next_upstream error timeout invalid_header;
        proxy_next_upstream_timeout 0;
        proxy_next_upstream_tries 3;
    }
}

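Hostnames without resolve are looked up once when the configuration is loaded and are not refreshed until a reload or restart
The 'resolve' parameter requires NGINX Plus, or open-source Nginx 1.27.3 or later
Always test with nginx -t before applying changes
Passive health checks (max_fails, fail_timeout) add minimal overhead but improve reliability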

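Implement basic monitoring so you can confirm that Nginx came up and is serving connections after a reboot (stub_status requires the ngx_http_stub_status_module):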

# In your server block
location /nginx_status {
    stub_status;
    allow 127.0.0.1;
    deny all;
}

When running multiple websites through Nginx with separate upstream configurations, a common pain point occurs during server reboots. If the hostname of any upstream server cannot be resolved at startup, Nginx fails to start entirely, even for perfectly healthy sites. This creates unnecessary downtime for working services.

Nginx performs DNS resolution during configuration parsing, and by default treats unresolvable upstream hosts as fatal errors. When using dynamic DNS where hosts automatically register/deregister based on availability (common in microservices architectures), this becomes particularly problematic.

upstream problematic {
    server down.service.example.com;  # Causes entire Nginx to fail if DNS can't resolve
}

We can make Nginx more resilient using these techniques:

1. DNS Resolution Directive

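Add a resolver with the valid parameter to control how long DNS answers are cached, and mark the server with resolve so the name is looked up at runtime. The upstream group needs a zone for resolve to work:

http {
    resolver 8.8.8.8 valid=30s;  # Use your own DNS server here

    upstream example2 {
        zone example2 64k;  # required by resolve
        server example2.service.example.com resolve;
    }
}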

2. Backup Server Fallback

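Configure a backup that always responds (like a local maintenance page). The backup covers runtime failures of the primary; the primary's hostname must still resolve at startup unless it uses the resolve parameter: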

upstream resilient {
    server primary.service.example.com;
    server 127.0.0.1:8080 backup;  # Local maintenance server
}

server {
    listen 8080;
    return 503 'Service Temporarily Unavailable';
}
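To show how the backup takes effect, here is a hedged sketch of a site that proxies to the resilient group above. The server_name and the timeout value are assumptions chosen for illustration, not values from the original setup.

```nginx
server {
    listen 80;
    server_name example.com;  # hypothetical site name

    location / {
        proxy_pass http://resilient;
        proxy_connect_timeout 2s;  # assumed value; fail fast when the primary is down
        # Pass the request to the next server (here, the backup) on these failures
        proxy_next_upstream error timeout http_502 http_504;
    }
}
```

When the primary cannot be reached, the request is retried against 127.0.0.1:8080, so clients see the maintenance response instead of a bare gateway error.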

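3. Dynamic Upstream with Health Checks

Use NGINX Plus for active health checks. The health_check directive belongs in the location that proxies to the group, not in the upstream block. OpenResty does not provide this directive; there, active checks are typically implemented with a Lua health-check library instead:

upstream dynamic {
    zone upstream_dynamic 64k;
    server example1.service.example.com resolve;
    server example2.service.example.com resolve;
}

server {
    location / {
        proxy_pass http://dynamic;
        health_check interval=5 fails=1 passes=1;  # NGINX Plus only
    }
}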

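Here's a complete solution combining these approaches. An upstream's server directive takes an address or hostname, not the name of another upstream group, so the maintenance server is referenced by its address:

http {
    resolver 10.0.0.2 valid=10s;  # Internal DNS server

    # Local maintenance responder used as the backup for every upstream
    server {
        listen 8080;
        default_type application/json;  # add_header is skipped on 503 unless "always" is set
        location / {
            return 503 '{"status":"maintenance"}';
        }
    }

    # Actual site configuration
    upstream example1 {
        zone example1 64k;  # required by resolve
        server example1.service.example.com resolve;
        server 127.0.0.1:8080 backup;
    }

    upstream example2 {
        zone example2 64k;  # required by resolve
        server example2.service.example.com resolve;
        server 127.0.0.1:8080 backup;
    }
}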
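As an alternative to a dedicated backup server, the same maintenance response can be produced with error_page. This is only a sketch under stated assumptions: the server_name is hypothetical, the block is meant to sit inside the http block next to the example1 upstream, and it handles only the 502 and 504 errors that Nginx generates itself when the upstream cannot be reached.

```nginx
server {
    listen 80;
    server_name example1.com;  # hypothetical site name

    location / {
        proxy_pass http://example1;
        # Hand gateway errors to the named location below
        error_page 502 504 = @maintenance;
    }

    location @maintenance {
        default_type application/json;
        return 503 '{"status":"maintenance"}';
    }
}
```

This keeps the fallback local to each site, at the cost of repeating the named location wherever it is needed.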

The resolve parameter requires NGINX Plus, or open-source Nginx 1.27.3 or later, plus a zone in the upstream group
Always test configurations with nginx -t before applying
Consider implementing proper circuit breakers in your application code
For complex environments, explore service discovery integration

If you can't modify Nginx configurations:

Use static IPs in /etc/hosts as fallback
Implement startup scripts that verify upstream availability before starting Nginx
Consider container orchestration solutions that handle this at the infrastructure level
