NGINX Reverse Proxy Fails Daily: Troubleshooting ELB Connectivity Issues


I've encountered a puzzling issue where one of my NGINX reverse proxy configurations stops working approximately every 24 hours. The setup involves:

  • Amazon ELB as the entry point
  • NGINX reverse proxy handling requests
  • 6 backend instances behind internal ELBs

The configuration looks like this:

server {
    listen 3000;
    location / {
        proxy_pass http://internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com:3000;
        include /etc/nginx/proxy.conf;
    }
}

Surprisingly, there are no error logs, access log anomalies, or system messages indicating what's wrong. The only temporary solution is restarting NGINX, which suggests a possible connection handling issue rather than a configuration error.

Here are several approaches to address this issue:

1. DNS Resolution Improvement

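The IP addresses behind an ELB hostname change over time, but NGINX resolves a hostname written literally in proxy_pass only once, at startup or reload. Passing the hostname through a variable and adding a resolver directive makes NGINX re-resolve it at request time (169.254.169.253 is the Amazon-provided DNS server inside a VPC):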

server {
    listen 3000;
    resolver 169.254.169.253 valid=30s;
    set $backend "internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com";
    location / {
        proxy_pass http://$backend:3000;
        include /etc/nginx/proxy.conf;
    }
}

2. Connection Keepalive Settings

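Add these directives to your proxy configuration. proxy_http_version 1.1 and the empty Connection header concern the connection to the backend, and they only enable connection reuse when proxy_pass points at an upstream block that contains a keepalive directive. keepalive_timeout and keepalive_requests, when placed at http, server, or location level, govern connections from the front ELB to NGINX; 75s is longer than the ELB's default 60-second idle timeout, so the ELB, not NGINX, is the side that closes idle connections: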

proxy_http_version 1.1;
proxy_set_header Connection "";
keepalive_timeout 75s;
keepalive_requests 1000;

3. Monitoring and Automatic Recovery

Implement a simple monitoring script:

#!/bin/bash
# Probe the proxy; curl reports 000 if the connection fails or times out.
status=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" http://localhost:3000/healthcheck)
if [ "$status" != "200" ]; then
    systemctl restart nginx
    echo "$(date) - NGINX restarted (status $status)" >> /var/log/nginx/auto_restart.log
fi
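
Restarting on a single failed probe can cause needless restarts during a brief backend hiccup. The following is a minimal sketch, not a drop-in replacement, of a counter that asks for a restart only after several consecutive failures. The function name should_restart, the state file path, and the threshold of 3 are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count consecutive health-check failures in a state file.

# should_restart STATUS STATE_FILE THRESHOLD
# Prints "ok" for a 200 status, "wait" while failures are below THRESHOLD,
# and "restart" once THRESHOLD consecutive non-200 statuses have been seen.
should_restart() {
    status=$1
    state_file=$2
    threshold=$3

    if [ "$status" = "200" ]; then
        echo 0 > "$state_file"          # healthy: reset the counter
        echo "ok"
        return 0
    fi

    failures=0
    # Only read the counter if the file exists and is non-empty.
    [ -s "$state_file" ] && failures=$(cat "$state_file")
    failures=$((failures + 1))

    if [ "$failures" -ge "$threshold" ]; then
        echo 0 > "$state_file"          # start counting afresh after a restart
        echo "restart"
    else
        echo "$failures" > "$state_file"
        echo "wait"
    fi
}

# Example wiring (assumed paths; run from cron once a minute):
#   status=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" http://localhost:3000/healthcheck)
#   if [ "$(should_restart "$status" /var/run/nginx_hc.count 3)" = "restart" ]; then
#       systemctl restart nginx
#   fi
```

With a one-minute cron interval and a threshold of 3, the proxy would be restarted roughly three minutes after it stops answering, which trades a little recovery time for fewer false alarms.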

When the issue occurs, collect the following before restarting NGINX, since a restart clears the evidence:

# Check current connections
ss -tnp | grep nginx

# Verify DNS resolution
dig internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com

# Examine TCP connections
tcpdump -i eth0 port 3000 -nn -v
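
The first two checks above can be combined to test the stale-DNS theory directly: if NGINX holds connections to addresses that the ELB hostname no longer resolves to, the cached addresses are the likely culprit. The sketch below assumes IPv4 only; the function name stale_upstreams is made up for illustration, and the ss and dig invocations in the comments are suggestions whose output format may differ between versions.

```shell
#!/bin/sh
# stale_upstreams CONNECTED_IPS CURRENT_IPS
# Both arguments are newline-separated lists of IPv4 addresses.
# Prints each connected address that is missing from the current DNS answer.
stale_upstreams() {
    connected=$1
    current=$2
    printf '%s\n' "$connected" | sort -u | while IFS= read -r ip; do
        [ -n "$ip" ] || continue
        # -x matches whole lines only, -F disables regex interpretation of dots.
        if ! printf '%s\n' "$current" | grep -qxF "$ip"; then
            printf '%s\n' "$ip"
        fi
    done
}

# Example wiring (assumes the peer address is the last column of ss output):
#   connected=$(ss -tn state established '( dport = :3000 )' \
#       | awk 'NR>1 {sub(/:[0-9]+$/, "", $NF); print $NF}')
#   current=$(dig +short internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com)
#   stale_upstreams "$connected" "$current"
```

Any output while the proxy is failing points at cached addresses; empty output suggests looking at connection handling instead.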

Consider these architectural improvements:

  • Implement service discovery instead of hardcoded ELB endpoints
  • Use NGINX Plus for better monitoring and debugging
  • Set up proper health checks with failover mechanisms

When an NGINX reverse proxy stops reaching its AWS ELB backends roughly once a day without any visible errors in the logs, you're facing one of those infrastructure gremlins that keep sysadmins awake. Let's dissect this systematically.

Several subtle factors could cause this daily disruption:

DNS Caching Issues

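NGINX resolves a hostname written literally in proxy_pass once, at startup or reload, and keeps using those addresses until the next reload or restart. AWS rotates the IP addresses behind an ELB, so the cached addresses eventually go stale, which fits a proxy that recovers only when restarted. Add a resolver and pass the hostname through a variable so that it is re-resolved at request time (replace the resolver address with the DNS server for your own network):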

resolver 172.16.0.23 valid=10s;
server {
    listen 3000;
    set $backend "internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com";
    location / {
        proxy_pass http://$backend:3000;
        include /etc/nginx/proxy.conf;
    }
}
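
To confirm that address rotation lines up with the daily failure, it can help to record when the ELB's DNS answer actually changes. This is a rough sketch under assumptions: the function name log_dns_change and the file paths are hypothetical, and the first run always logs a change because there is no baseline yet.

```shell
#!/bin/sh
# log_dns_change STATE_FILE LOG_FILE IPS
# IPS is a newline-separated address list (for example from `dig +short`).
# Appends a timestamped line to LOG_FILE when the address set differs from
# the one saved in STATE_FILE. Returns 0 if a change was logged, 1 otherwise.
log_dns_change() {
    state_file=$1
    log_file=$2
    new_ips=$(printf '%s\n' "$3" | sort -u)   # order-independent comparison

    old_ips=""
    [ -f "$state_file" ] && old_ips=$(cat "$state_file")

    if [ "$new_ips" != "$old_ips" ]; then
        printf '%s\n' "$new_ips" > "$state_file"
        printf '%s ELB IPs changed: %s\n' \
            "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
            "$(printf '%s' "$new_ips" | tr '\n' ' ')" >> "$log_file"
        return 0
    fi
    return 1
}

# Example wiring (assumed paths; run from cron every minute):
#   ips=$(dig +short internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com)
#   log_dns_change /var/tmp/elb_ips.state /var/log/elb_ip_changes.log "$ips"
```

Comparing the timestamps in the log with the times at which the proxy stops responding shows whether the two events are related.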

Connection Pool Timeouts

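Add these parameters to your proxy.conf. The first two affect connections to the backend ELB; the last two affect connections arriving from the public ELB: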

proxy_http_version 1.1;
proxy_set_header Connection "";
keepalive_timeout 75s;
keepalive_requests 1000;

Strace While Reproducing the Issue

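When the problem occurs, attach strace to the worker processes, which handle the connections (the PID in /var/run/nginx.pid belongs to the master process, which does not):

sudo strace -f -e trace=network -s 10000 $(pgrep -f 'nginx: worker' | sed 's/^/-p /')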

Additional Logging Configuration

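Enable debug-level logging temporarily. This requires an NGINX binary built with --with-debug (check with nginx -V). The debug_connection directive goes inside the existing events block and limits debug output to clients in the given address range: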

error_log /var/log/nginx/debug.log debug;
events {
    debug_connection 172.31.0.0/16;
}

Implement this simple health check script to detect failures before users do:

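#!/bin/bash
# Probe each proxy listener; --fail makes curl treat HTTP 4xx/5xx as failures.
for port in {3000..3005}; do
    if ! curl -sI --fail --connect-timeout 3 --max-time 10 "http://localhost:$port" &>/dev/null; then
        logger "NGINX proxy $port failed health check"
        systemctl restart nginx
        break   # one restart covers every listener
    fi
done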

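Instead of pointing directly to ELB DNS, consider using Amazon Route 53 private hosted zones with health checks and DNS failover configured. NGINX still needs a resolver and a variable in proxy_pass to pick up record changes without a restart.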