I've encountered a puzzling issue where one of my NGINX reverse proxy configurations stops working approximately every 24 hours. The setup involves:
- Amazon ELB as the entry point
- NGINX reverse proxy handling requests
- 6 backend instances behind internal ELBs
The configuration looks like this:
server {
    listen 3000;

    location / {
        proxy_pass http://internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com:3000;
        include /etc/nginx/proxy.conf;
    }
}
Surprisingly, there are no error logs, access log anomalies, or system messages indicating what's wrong. The only temporary solution is restarting NGINX, which suggests a possible connection handling issue rather than a configuration error.
Here are several approaches to address this issue:
1. DNS Resolution Improvement
ELB hostnames map to IP addresses that AWS rotates over time, but nginx resolves a hostname written directly into proxy_pass only once, when the configuration is loaded. Forcing periodic re-resolution requires a resolver directive plus a variable in proxy_pass (169.254.169.253 is the Amazon-provided Route 53 Resolver address, reachable from within any VPC):

server {
    listen 3000;
    resolver 169.254.169.253 valid=30s;
    set $backend "internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com";

    location / {
        proxy_pass http://$backend:3000;
        include /etc/nginx/proxy.conf;
    }
}
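Before relying on this fix, it's worth confirming that the ELB's records really do rotate on roughly the same cadence as the outage. A minimal sketch, assuming dig is available; the helper functions and the five-minute interval are illustrative, not part of the original setup:

```shell
#!/bin/bash
# Compare two DNS answers for the ELB name, ignoring record ordering.

# Sort and join a newline-separated record set into one canonical string.
normalize() { printf '%s\n' "$1" | sort | paste -sd, -; }

# Exit 0 (true) when two record sets differ after normalization.
ips_changed() { [ "$(normalize "$1")" != "$(normalize "$2")" ]; }

# Live usage (commented out here since it needs network access and time):
# a=$(dig +short internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com)
# sleep 300
# b=$(dig +short internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com)
# ips_changed "$a" "$b" && echo "ELB rotated its IPs"
```

If the rotation lines up with the ~24-hour failure window, the DNS fix above is almost certainly the right one.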
2. Connection Keepalive Settings
Add these directives so requests to the backend use HTTP/1.1 and connections can be reused instead of opened fresh per request. Note that keepalive_timeout and keepalive_requests placed in an included proxy.conf govern client-side connections; reusing upstream connections additionally requires a keepalive directive in an upstream block:

proxy_http_version 1.1;
proxy_set_header Connection "";
keepalive_timeout 75s;
keepalive_requests 1000;
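For the upstream side, the two proxy_* directives only enable reuse when nginx also has a keepalive pool to draw from. A sketch of the upstream form (the pool name and size are arbitrary; be aware this form resolves the ELB name once at startup, so it trades against the DNS re-resolution fix):

```nginx
upstream backend_pool {
    server internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com:3000;
    keepalive 32;   # idle upstream connections kept open per worker
}

server {
    listen 3000;

    location / {
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass http://backend_pool;
        include /etc/nginx/proxy.conf;
    }
}
```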
3. Monitoring and Automatic Recovery
Implement a simple monitoring script:
#!/bin/bash
# Restart nginx when the local health endpoint stops returning 200.
# --max-time keeps a hung proxy from hanging the check itself.
status=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" http://localhost:3000/healthcheck)
if [ "$status" != "200" ]; then
    systemctl restart nginx
    echo "$(date) - NGINX restarted" >> /var/log/nginx/auto_restart.log
fi
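The script only helps if something runs it on a schedule; a cron entry is the simplest option (the script path and the one-minute interval here are assumptions):

```
# /etc/cron.d/nginx-healthcheck (path assumed) -- run the check every minute
* * * * * root /usr/local/bin/nginx_healthcheck.sh
```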
When the issue occurs, check these:
# Check current connections
ss -tnp | grep nginx
# Verify DNS resolution
dig internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com
# Examine TCP connections
tcpdump -i eth0 port 3000 -nn -v
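The most telling comparison combines the first two checks: whether the connections nginx is holding open still point at addresses the ELB name currently resolves to. A sketch; the ss/dig plumbing is commented out since it depends on the live host, and the set-difference helper is the illustrative part:

```shell
#!/bin/bash
# Print entries of the first newline-separated set that are absent from the second.
stale() {
    comm -23 <(printf '%s\n' "$1" | sort -u) <(printf '%s\n' "$2" | sort -u)
}

# Live usage (commented out): upstream IPs nginx holds vs. current DNS answers.
# held=$(ss -tn state established '( dport = :3000 )' | awk 'NR>1 {sub(/:[0-9]+$/, "", $4); print $4}')
# current=$(dig +short internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com)
# stale "$held" "$current"   # any output means nginx is talking to retired IPs
```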
Consider these architectural improvements:
- Implement service discovery instead of hardcoded ELB endpoints
- Use NGINX Plus for better monitoring and debugging
- Set up proper health checks with failover mechanisms
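On open-source nginx the last point is limited to passive checks, but even those shorten an outage by marking a failing peer down and retrying the next one. A sketch (thresholds are arbitrary; note that an ELB name with several A records expands into several tracked peers at config load):

```nginx
upstream backend_pool {
    server internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com:3000
           max_fails=3 fail_timeout=30s;   # mark a peer down after 3 errors, retry after 30s
}

server {
    listen 3000;

    location / {
        proxy_pass http://backend_pool;
        proxy_next_upstream error timeout http_502;
        include /etc/nginx/proxy.conf;
    }
}
```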
When your nginx reverse proxy starts dropping connections to AWS ELB backends roughly once a day, with nothing visible in the logs, you're facing one of those infrastructure gremlins that keep sysadmins awake. Let's dissect this systematically.
The current configuration shows:
server {
    listen 3000;

    location / {
        proxy_pass http://internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com:3000;
        include /etc/nginx/proxy.conf;
    }
}
Several subtle factors could cause this daily disruption:
DNS Caching Issues
Nginx resolves a hostname written directly into proxy_pass at configuration load and caches the result for the life of the process, while AWS ELBs rotate their IP addresses over time. Add a resolver (use the Amazon-provided VPC resolver, i.e. the VPC base address plus two, such as 172.16.0.2 for a 172.16.0.0/16 VPC, or the link-local 169.254.169.253) and move the hostname into a variable:

resolver 169.254.169.253 valid=10s;

server {
    listen 3000;
    set $backend "internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com";

    location / {
        proxy_pass http://$backend:3000;
        include /etc/nginx/proxy.conf;
    }
}
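If you're unsure which resolver address is right for your VPC, the instance's own resolv.conf usually already points at it. A small helper, with the file path parameterized so it can be exercised against any resolv.conf-style file (the function name is mine, not a standard tool):

```shell
#!/bin/bash
# Print the first nameserver from a resolv.conf-style file (default: the system's).
first_nameserver() {
    awk '/^nameserver/ { print $2; exit }' "${1:-/etc/resolv.conf}"
}

# Usage: paste the result into the nginx resolver directive, e.g.
# echo "resolver $(first_nameserver) valid=10s;"
```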
Connection Pool Timeouts
Add these parameters to your proxy.conf so backend requests use HTTP/1.1 and connections can be kept alive rather than torn down after every request:

proxy_http_version 1.1;
proxy_set_header Connection "";
keepalive_timeout 75s;
keepalive_requests 1000;
Strace While Reproducing the Issue
When the problem occurs, attach to the worker processes rather than the master (the master holds no client connections, so tracing only its PID shows little):

sudo strace -f -e trace=network -s 10000 -p "$(pgrep -n -f 'nginx: worker')"

With several workers, pass a -p flag for each PID that pgrep -f 'nginx: worker' reports.
Additional Logging Configuration
Enable debug-level logging temporarily (this requires an nginx binary built with --with-debug; confirm with nginx -V before relying on it):

error_log /var/log/nginx/debug.log debug;

events {
    debug_connection 172.31.0.0/16;
}
Implement this simple health check script to detect failures before users do:
#!/bin/bash
# Probe each local proxy port; restart nginx on the first one that fails.
for port in {3000..3005}; do
    if ! curl -I --connect-timeout 3 "http://localhost:$port" &>/dev/null; then
        logger "NGINX proxy $port failed health check"
        systemctl restart nginx
        break
    fi
done
Instead of pointing nginx directly at the ELB DNS name, consider an Amazon Route 53 private hosted zone with health checks and DNS failover configured.