Why DNS Failover is Problematic: Alternatives for High Availability Web Servers


While DNS failover appears straightforward for routing traffic between redundant web servers, it suffers from several technical limitations that make it unreliable for mission-critical applications:

  • TTL Propagation Delays: Cached records stay in use until they expire, and resolvers that clamp low TTLs can stretch that further, so a change may take minutes to hours to reach every client even with TTLs of 30-60 seconds (the practical minimum)
  • Client-Side Caching: Some browsers and operating systems keep their own DNS caches and may hold records longer than the TTL specifies
  • Partial Outages: DNS failover is all-or-nothing, unable to handle partial failures or gradual traffic migration

For web servers distributed across different subnets, consider these more robust solutions:

1. Global Server Load Balancing (GSLB)

GSLB solutions perform health checks at the application layer and route traffic accordingly:

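# Example GSLB configuration using Nginx Plus
resolver 192.0.2.53 valid=10s;  # placeholder address; "resolve" needs a resolver
upstream backend {
    zone backend_servers 64k;
    server server1.example.com resolve;
    server server2.example.com resolve;
}
server {
    location / {
        proxy_pass http://backend;
        # health_check belongs in the location that proxies to the upstream
        health_check interval=5s fails=3 passes=2;
    }
}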
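The fails=3 passes=2 parameters describe a small state machine: a backend is marked down only after three consecutive failed probes, and marked up again only after two consecutive successes. The following is a minimal Python sketch of that logic, not Nginx's actual implementation; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    """Tracks one backend using consecutive pass/fail thresholds."""
    fails: int = 3       # consecutive failures before marking unhealthy
    passes: int = 2      # consecutive successes before marking healthy again
    healthy: bool = True
    _streak: int = 0     # run of probe results that disagree with `healthy`

    def record(self, check_ok: bool) -> bool:
        """Feed in one probe result and return the resulting health status."""
        if check_ok == self.healthy:
            # The result agrees with the current state, so any opposing run is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.passes if check_ok else self.fails
        if self._streak >= threshold:
            self.healthy = check_ok
            self._streak = 0
        return self.healthy


state = HealthState()
results = [state.record(ok) for ok in (False, False, False, True, True)]
print(results)  # [True, True, False, False, True]
```

Requiring consecutive results in both directions keeps a single dropped probe from removing a server and keeps a flapping server from being readmitted too early.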

2. Anycast Routing

Anycast announces the same IP address from multiple locations, letting BGP handle failover:

# BGP configuration snippet (Cisco IOS)
router bgp 64512
 network 203.0.113.0 mask 255.255.255.0
 neighbor 192.0.2.1 remote-as 64513

3. Cloud Provider Solutions

Major cloud platforms offer native solutions that outperform DNS failover:

  • AWS: Route 53 Application Recovery Controller
  • Azure: Traffic Manager with endpoint monitoring
  • GCP: Global External HTTP(S) Load Balancer

Despite its limitations, DNS failover can be appropriate for:

  • Non-critical services with acceptable recovery times
  • Geographically distributed static content
  • Secondary failover behind other mechanisms

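Here's a Python script demonstrating a health check that triggers a DNS update as a last-resort failover step:

import dns.query
import dns.update
import requests

def check_server(url):
    """Return True if the server answers with HTTP 200 within 5 seconds."""
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

def update_dns_failover(primary_ip, secondary_ip, zone, record, dns_server):
    """Point `record` at secondary_ip if the primary fails its health check.

    zone, record, and dns_server are supplied by the caller,
    e.g. 'example.com', 'www', and the authoritative server's IP.
    """
    if not check_server(f"http://{primary_ip}"):
        # Slow path: faster mechanisms (GSLB/Anycast) should already have reacted
        dns_update = dns.update.UpdateMessage(zone)
        dns_update.replace(record, 300, 'A', secondary_ip)
        # Send the dynamic update (RFC 2136); production setups should sign it with TSIG
        dns.query.tcp(dns_update, dns_server, timeout=10)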

Effective failover requires comprehensive monitoring:

# Prometheus alert rule for failover triggering
- alert: BackendDown
  expr: up{job="webserver"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} down"
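An alert only helps if something acts on it. As a rough sketch, assuming Alertmanager is configured to post its standard webhook JSON to a small receiver, the function below extracts the instances that should be failed over. The function name and the sample payload are illustrative; only the `alerts`, `status`, and `labels` fields of the webhook format are relied on.

```python
import json


def failed_instances(payload: dict, alert_name: str = "BackendDown") -> list[str]:
    """Return the instances named in firing alerts with the given alert name."""
    instances = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") == "firing" and labels.get("alertname") == alert_name:
            instances.append(labels.get("instance", "unknown"))
    return instances


# Trimmed example of the JSON body Alertmanager posts to a webhook receiver.
sample = json.loads("""
{
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "BackendDown", "instance": "192.0.2.1:80"}},
    {"status": "resolved",
     "labels": {"alertname": "BackendDown", "instance": "192.0.2.2:80"}}
  ]
}
""")
print(failed_instances(sample))  # ['192.0.2.1:80']
```

Resolved alerts are skipped on purpose, so the same handler can be reused to decide when to fail back.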

DNS failover remains controversial despite being offered by services like DNSmadeeasy.com. The core issue lies in DNS caching behavior: TTL (Time-To-Live) values aren't always respected by resolvers or ISPs. When a server fails, cached DNS records may continue routing traffic to the dead endpoint for hours or, with badly behaved resolvers, even days.


# Example of problematic DNS TTL settings: with a 300-second TTL, resolvers may keep serving a dead address for up to five minutes
$ dig example.com

;; ANSWER SECTION:
example.com.        300     IN      A       192.0.2.1
example.com.        300     IN      A       192.0.2.2

For web servers distributed across different subnets, consider these more reliable approaches:

1. Anycast Routing

Anycast announces the same IP address from multiple locations. BGP routing automatically directs traffic to the nearest available node.


# Sample BGP configuration snippet (Cisco IOS)
router bgp 64512
 network 203.0.113.0 mask 255.255.255.0
 neighbor 192.0.2.1 remote-as 64513
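BGP on its own only notices when the router or link goes away, not when the web server behind it stops answering. One common pattern is a local health checker that withdraws the anycast prefix while the service is down. The sketch below assumes ExaBGP is the BGP speaker and uses its text API, where a helper process prints `announce route` and `withdraw route` commands on stdout; the function names are hypothetical.

```python
from typing import Iterable, Iterator, Optional


def route_command(healthy: bool, prefix: str) -> str:
    """Build the ExaBGP text-API command for the current health state."""
    action = "announce" if healthy else "withdraw"
    return f"{action} route {prefix} next-hop self"


def commands(probes: Iterable[bool], prefix: str) -> Iterator[str]:
    """Yield a command only when the probe result changes."""
    last: Optional[bool] = None
    for healthy in probes:
        if healthy != last:
            yield route_command(healthy, prefix)
            last = healthy


# Simulated probe results; a real helper would poll the local service on a timer.
demo = list(commands([True, True, False, True], "203.0.113.0/24"))
for line in demo:
    print(line, flush=True)  # the BGP speaker reads commands line by line
```

Emitting a command only on state changes avoids re-announcing the prefix on every probe, which would otherwise generate needless BGP updates.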

2. Global Server Load Balancing (GSLB)

GSLB solutions like F5 BIG-IP or AWS Route 53 Traffic Flow pair continuous health checks with short-TTL DNS answers, so they stop handing out a failed endpoint sooner than manually managed DNS failover.


# AWS Route 53 health check configuration (Terraform)
resource "aws_route53_health_check" "web" {
  ip_address        = "192.0.2.1"
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  failure_threshold = 2
  request_interval  = 30
}

3. Reverse Proxy with Health Checks

HAProxy or Nginx can monitor backend servers and reroute traffic within a few check intervals, with no dependence on DNS caches:


# HAProxy configuration example
backend web_servers
    balance roundrobin
    option httpchk GET /health
    server web1 192.0.2.1:80 check
    server web2 192.0.2.2:80 check backup
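The `backup` keyword changes which servers are eligible: web2 receives traffic only when no healthy non-backup server remains. A minimal Python sketch of that selection rule follows. It models HAProxy's default behavior, in which only the first healthy backup is used unless `option allbackups` is set; the type and function names are hypothetical.

```python
from typing import NamedTuple


class Server(NamedTuple):
    name: str
    healthy: bool
    backup: bool = False


def eligible(servers: list[Server]) -> list[Server]:
    """Healthy primaries if any exist, otherwise the first healthy backup."""
    primaries = [s for s in servers if s.healthy and not s.backup]
    if primaries:
        return primaries
    backups = [s for s in servers if s.healthy and s.backup]
    return backups[:1]


pool = [Server("web1", healthy=True), Server("web2", healthy=True, backup=True)]
before = [s.name for s in eligible(pool)]
pool[0] = pool[0]._replace(healthy=False)   # web1 fails its /health check
after = [s.name for s in eligible(pool)]
print(before, after)  # ['web1'] ['web2']
```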

DNS failover can work for:

  • Non-critical services that can tolerate stale records for at least one TTL period
  • Complementary failover alongside other methods
  • Scenarios where sub-second failover isn't required

The key is understanding your RTO (Recovery Time Objective) and whether DNS propagation delays fit within that window.
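To make that comparison concrete, the worst case can be estimated as detection time plus cache lifetime. The sketch below is a rough model under stated assumptions: detection takes one check interval per required failure, caches hold a record for the larger of its TTL and any resolver-enforced minimum, and the time to push the DNS change is ignored. The function and parameter names are illustrative.

```python
def worst_case_failover_seconds(check_interval: float, failure_threshold: int,
                                ttl: float, resolver_min_ttl: float = 0.0) -> float:
    """Rough upper bound on how long clients may keep hitting a dead endpoint."""
    detection = check_interval * failure_threshold
    # A resolver that enforces its own minimum TTL caches longer than the record asks.
    cache_lifetime = max(ttl, resolver_min_ttl)
    return detection + cache_lifetime


def fits_rto(rto_seconds: float, **params: float) -> bool:
    """True if the estimated worst case stays within the recovery time objective."""
    return worst_case_failover_seconds(**params) <= rto_seconds


# 30-second checks, 2 failures to trip, 300-second TTL
window = worst_case_failover_seconds(check_interval=30, failure_threshold=2, ttl=300)
print(window)  # 360
print(fits_rto(300, check_interval=30, failure_threshold=2, ttl=300))  # False
```

With these numbers the estimate is six minutes, so a five-minute RTO would already rule out DNS failover as the primary mechanism.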

When architecting your failover solution, measure how quickly traffic actually reaches the standby server:


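# Python script to test failover responsiveness
import requests
from time import perf_counter

def test_failover(url, expected_ip):
    """Report which backend answered and how long the request took.

    Assumes each backend adds a custom X-Server-IP response header;
    HTTP does not provide this header by default.
    """
    start = perf_counter()
    try:
        # A timeout keeps the test from hanging on a dead endpoint
        response = requests.get(url, timeout=5)
        server_ip = response.headers.get('X-Server-IP')
        return {
            'success': server_ip == expected_ip,
            'response_time': perf_counter() - start
        }
    except requests.RequestException as e:
        return {'error': str(e)}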

Always test your failover mechanism under controlled conditions before relying on it in production.