Why DNS Failover is Problematic: Alternatives for High Availability Web Servers


While DNS failover appears straightforward for routing traffic between redundant web servers, it suffers from several technical limitations that make it unreliable for mission-critical applications:

  • TTL Propagation Delays: Resolvers keep serving a cached record until its TTL expires, so a change reaches clients only gradually. Practical TTLs bottom out around 30-60 seconds, and resolvers that enforce a longer minimum can stretch the delay to hours
  • Client-Side Caching: Browsers and operating systems keep their own DNS caches and may hold a record longer than its TTL specifies
  • Partial Outages: Basic DNS failover swaps one record for another, so it cannot respond to partial failures or migrate traffic gradually
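
To make the caching problem concrete, here is a minimal sketch of a resolver cache that raises low TTLs to a floor. The `ClampingCache` class and its `min_ttl` parameter are hypothetical and model no particular resolver; the addresses are documentation placeholders.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    ip: str
    expires_at: float  # absolute time in seconds


class ClampingCache:
    """Toy resolver cache that enforces a minimum TTL (hypothetical model)."""

    def __init__(self, min_ttl: float) -> None:
        self.min_ttl = min_ttl
        self._records = {}

    def resolve(self, name: str, now: float, authoritative: tuple) -> str:
        """Return the cached IP if still valid, else cache the authoritative answer.

        `authoritative` is the (ip, ttl) pair the zone would return right now.
        """
        cached = self._records.get(name)
        if cached is not None and now < cached.expires_at:
            return cached.ip  # the cache keeps answering, stale or not
        ip, ttl = authoritative
        effective_ttl = max(ttl, self.min_ttl)  # low TTLs are raised to the floor
        self._records[name] = CachedRecord(ip, now + effective_ttl)
        return ip


cache = ClampingCache(min_ttl=300)
# t=0: the primary is healthy and published with a 30-second TTL
first = cache.resolve("www.example.com", 0, ("192.0.2.1", 30))
# t=60: the zone already points at the secondary, but the entry has not expired
during = cache.resolve("www.example.com", 60, ("192.0.2.2", 30))
# t=300: the clamped entry expires and the new address is finally picked up
after = cache.resolve("www.example.com", 300, ("192.0.2.2", 30))
print(first, during, after)
```

In this model the record was published with a 30-second TTL, yet the client keeps receiving the failed address until the 300-second floor runs out.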

For web servers distributed across different subnets, consider these more robust solutions:

1. Global Server Load Balancing (GSLB)

GSLB solutions perform health checks at the application layer and route traffic accordingly:

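# Example application-layer health checks using NGINX Plus
resolver 192.0.2.53 valid=10s;  # placeholder address; "resolve" needs a resolver

upstream backend {
    zone backend_servers 64k;
    server server1.example.com resolve;
    server server2.example.com resolve;
}

server {
    location / {
        proxy_pass http://backend;
        health_check interval=5s fails=3 passes=2;  # belongs in the location, not the upstream
    }
}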
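
The `fails=3 passes=2` thresholds are easier to reason about as a small state machine. The sketch below is an illustrative model of consecutive-result thresholds, not NGINX's actual implementation; the `HealthState` class is a hypothetical name.

```python
class HealthState:
    """Tracks one backend using consecutive-result thresholds."""

    def __init__(self, fails: int = 3, passes: int = 2) -> None:
        self.fails = fails
        self.passes = passes
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_ok: bool) -> bool:
        """Feed one probe result and return the resulting health state."""
        if check_ok == self.healthy:
            self._streak = 0  # result agrees with the current state
            return self.healthy
        self._streak += 1
        threshold = self.fails if self.healthy else self.passes
        if self._streak >= threshold:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


state = HealthState(fails=3, passes=2)
probes = [False, False, True, False, False, False, True, True]
history = [state.record(ok) for ok in probes]
print(history)
```

A single successful probe in the middle resets the failure count, which is why the backend is only marked down after three failures in a row. With a 5-second interval that puts detection at roughly 15 seconds in this model.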

2. Anycast Routing

Anycast announces the same IP address from multiple locations, letting BGP handle failover:

# BGP configuration snippet (Cisco IOS)
router bgp 64512
 network 203.0.113.0 mask 255.255.255.0
 neighbor 192.0.2.1 remote-as 64513

3. Cloud Provider Solutions

Major cloud platforms offer managed failover services. Some are still DNS-based but add integrated health checks, while others route at the load-balancer layer:

  • AWS: Route 53 Application Recovery Controller
  • Azure: Traffic Manager with endpoint monitoring
  • GCP: Global External HTTP(S) Load Balancer

Despite its limitations, DNS failover can be appropriate for:

  • Non-critical services with acceptable recovery times
  • Geographically distributed static content
  • Secondary failover behind other mechanisms

Here's a Python script demonstrating health checks that could trigger multiple failover mechanisms:

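import requests
import dns.query
import dns.update

def check_server(url):
    """Return True if the server answers HTTP 200 within 5 seconds."""
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

# zone, record and dns_server are illustrative parameters (not in the original
# snippet), e.g. 'example.com', 'www' and the address of the primary DNS server.
def update_dns_failover(primary_ip, secondary_ip, zone, record, dns_server):
    if not check_server(f"http://{primary_ip}"):
        # DNS is the slowest layer; GSLB/anycast should already have reacted
        dns_update = dns.update.Update(zone)
        dns_update.replace(record, 300, 'A', secondary_ip)
        # Send an RFC 2136 dynamic update; the server must allow it
        # (production setups normally authenticate it with TSIG)
        dns.query.tcp(dns_update, dns_server, timeout=10)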

Effective failover requires comprehensive monitoring:

# Prometheus alert rule for failover triggering
- alert: BackendDown
  expr: up{job="webserver"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} down"
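
To connect an alert like this to a failover action, something has to consume the notification. The sketch below assumes the JSON layout of Alertmanager's webhook receiver (a top-level `alerts` list whose entries carry `status` and `labels`); the function name and the sample payload are hypothetical.

```python
import json


def instances_to_fail_over(payload: str, alert_name: str = "BackendDown") -> list:
    """Return the instance labels of firing alerts with the given name."""
    body = json.loads(payload)
    return sorted(
        alert["labels"]["instance"]
        for alert in body.get("alerts", [])
        if alert.get("status") == "firing"
        and alert.get("labels", {}).get("alertname") == alert_name
        and "instance" in alert.get("labels", {})
    )


# Hand-written sample payload: one backend is down, one has recovered
sample = json.dumps({
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "BackendDown", "instance": "192.0.2.1:80"}},
        {"status": "resolved",
         "labels": {"alertname": "BackendDown", "instance": "192.0.2.2:80"}},
    ],
})
print(instances_to_fail_over(sample))
```

Only firing alerts are returned, so a backend that has already recovered does not trigger a second failover.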

DNS failover remains controversial despite being offered by services like DNSmadeeasy.com. The core issue is DNS caching behavior: TTL (Time-To-Live) values aren't always respected by resolvers or ISPs. When a server fails, cached DNS records may keep routing traffic to the dead endpoint for hours or even days.


# Example: with a 300-second TTL, a compliant resolver can serve a dead IP for up to five minutes
$ dig example.com

;; ANSWER SECTION:
example.com.        300     IN      A       192.0.2.1
example.com.        300     IN      A       192.0.2.2
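
A quick way to audit TTLs like these is to parse the answer section and flag records that exceed your caching budget. This is a minimal sketch that only handles simple A/AAAA answer lines; the 60-second budget is an assumption chosen for illustration.

```python
from typing import NamedTuple


class Answer(NamedTuple):
    name: str
    ttl: int
    rtype: str
    value: str


def parse_answers(dig_output: str) -> list:
    """Extract A/AAAA records from the text of a dig answer section."""
    answers = []
    for line in dig_output.splitlines():
        fields = line.split()
        # Answer lines look like: name TTL class type value
        if len(fields) == 5 and fields[2] == "IN" and fields[3] in ("A", "AAAA"):
            answers.append(Answer(fields[0], int(fields[1]), fields[3], fields[4]))
    return answers


sample = """\
;; ANSWER SECTION:
example.com.        300     IN      A       192.0.2.1
example.com.        300     IN      A       192.0.2.2
"""

records = parse_answers(sample)
max_acceptable_ttl = 60  # assumption: a 60-second caching budget
too_long = [r for r in records if r.ttl > max_acceptable_ttl]
print(len(records), [r.value for r in too_long])
```

Both records exceed the budget here, so either address could stay cached well beyond the window a fast failover would need.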

For web servers distributed across different subnets, consider these more reliable approaches:

1. Anycast Routing

Anycast announces the same IP address from multiple locations. BGP routing automatically directs traffic to the nearest available node.


# Sample BGP configuration snippet (Cisco IOS)
router bgp 64512
 network 203.0.113.0 mask 255.255.255.0
 neighbor 192.0.2.1 remote-as 64513

2. Global Server Load Balancing (GSLB)

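GSLB solutions like F5 BIG-IP or AWS Route 53 Traffic Flow health-check endpoints continuously and change their DNS answers as soon as a check fails, rather than waiting for a manual or scripted record update. Resolvers still cache those answers, so low TTLs remain important.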


# AWS Route 53 health check configuration (Terraform)
resource "aws_route53_health_check" "web" {
  ip_address        = "192.0.2.1"
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  failure_threshold = 2
  request_interval  = 30
}

3. Reverse Proxy with Health Checks

HAProxy or Nginx can monitor backend servers and stop sending traffic to a failed one within a few check intervals, with no DNS change involved:


# HAProxy configuration example
backend web_servers
    balance roundrobin
    option httpchk GET /health
    server web1 192.0.2.1:80 check
    server web2 192.0.2.2:80 check backup
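
The effect of the `backup` keyword can be modeled in a few lines. This sketch is a simplified illustration, not HAProxy's code: it round-robins over healthy primaries and falls back to the first healthy backup only when no primary is left. The `Server` and `Backend` classes are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Server:
    name: str
    healthy: bool = True
    backup: bool = False


class Backend:
    def __init__(self, servers):
        self.servers = servers
        self._counter = count()

    def pick(self):
        """Round-robin over healthy primaries; else use the first healthy backup."""
        primaries = [s for s in self.servers if s.healthy and not s.backup]
        backups = [s for s in self.servers if s.healthy and s.backup]
        pool = primaries or backups[:1]
        if not pool:
            return None  # nothing left to serve the request
        return pool[next(self._counter) % len(pool)]


web1 = Server("web1")
web2 = Server("web2", backup=True)
backend = Backend([web1, web2])

before = backend.pick().name   # web1 is healthy, so the backup stays idle
web1.healthy = False
during = backend.pick().name   # the health check failed: traffic moves to web2
web1.healthy = True
after = backend.pick().name    # web1 recovered: traffic returns to it
print(before, during, after)
```

Because the decision is made per request inside the proxy, no client has to re-resolve anything for the switch to take effect.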

DNS failover can work for:

  • Non-critical services with high TTL tolerance
  • Complementary failover alongside other methods
  • Scenarios where sub-second failover isn't required

The key is understanding your RTO (Recovery Time Objective) and whether DNS propagation delays fit within that window.
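
A rough budget calculation helps with that comparison. The sketch below adds detection time to the record TTL; the function names are hypothetical, and `resolver_overrun_s` is an assumed allowance for caches that hold records past their TTL.

```python
def worst_case_dns_failover(check_interval_s: float,
                            failure_threshold: int,
                            record_ttl_s: float,
                            resolver_overrun_s: float = 0.0) -> float:
    """Estimate the seconds until clients stop reaching the failed server."""
    detection = check_interval_s * failure_threshold  # time to notice the outage
    return detection + record_ttl_s + resolver_overrun_s


def fits_rto(estimate_s: float, rto_s: float) -> bool:
    """True if the estimated failover time is within the recovery objective."""
    return estimate_s <= rto_s


# 30-second checks, 2 failures to trip, and a 300-second TTL
estimate = worst_case_dns_failover(30, 2, 300)
print(estimate, fits_rto(estimate, rto_s=300), fits_rto(estimate, rto_s=600))
```

With these inputs the estimate is 360 seconds, which misses a five-minute RTO before any misbehaving caches are taken into account.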

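When architecting your failover solution, measure which server clients actually reach and how long each request takes:


# Python script to test failover responsiveness
import requests
from time import monotonic

def test_failover(url, expected_ip):
    """Check which backend answered and how long the request took.

    Assumes each backend sets a custom (non-standard) X-Server-IP response header.
    """
    start = monotonic()
    try:
        response = requests.get(url, timeout=5)
        server_ip = response.headers.get('X-Server-IP')
        return {
            'success': server_ip == expected_ip,
            'response_time': monotonic() - start
        }
    except requests.RequestException as e:
        return {'success': False, 'error': str(e)}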

Always test your failover mechanism under controlled conditions before relying on it in production.