While DNS failover appears straightforward for routing traffic between redundant web servers, it suffers from several technical limitations that make it unreliable for mission-critical applications:
- TTL Propagation Delays: DNS changes can take hours to propagate globally due to caching, even with low TTL values (30-60 seconds is a practical minimum); see the sketch after this list
- Client-Side Caching: Browsers and operating systems often ignore TTLs and cache DNS records longer than specified
- Partial Outages: DNS failover is all-or-nothing, unable to handle partial failures or gradual traffic migration
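To make the caching problem concrete, here is a minimal sketch (assuming the dnspython package and an arbitrary public resolver) that reports how long a given resolver will keep serving its cached copy of a record; even a "low" TTL of 300 seconds means up to five minutes of traffic sent to a dead server, and misbehaving caches can hold it longer:

import dns.resolver  # pip install dnspython

def remaining_ttl(name, rdtype="A", nameserver="8.8.8.8"):
    # Ask a specific resolver (8.8.8.8 is just an example) for the record
    res = dns.resolver.Resolver()
    res.nameservers = [nameserver]
    answer = res.resolve(name, rdtype)
    # rrset.ttl is how many more seconds this resolver will serve the cached answer
    return answer.rrset.ttl

print(remaining_ttl("example.com"))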
For web servers distributed across different subnets, consider these more robust solutions:
1. Global Server Load Balancing (GSLB)
GSLB solutions perform health checks at the application layer and route traffic accordingly:
# Example GSLB configuration using Nginx Plus
resolver 192.0.2.53;  # placeholder resolver address, required by the "resolve" parameter
upstream backend {
    zone backend_servers 64k;
    server server1.example.com resolve;
    server server2.example.com resolve;
}
server {
    location / {
        proxy_pass http://backend;
        health_check interval=5s fails=3 passes=2;  # active checks belong in a proxying location
    }
}
2. Anycast Routing
Anycast announces the same IP address from multiple locations, letting BGP handle failover:
# BGP configuration snippet (Cisco IOS)
router bgp 64512
network 203.0.113.0 mask 255.255.255.0
neighbor 192.0.2.1 remote-as 64513
3. Cloud Provider Solutions
Major cloud platforms offer native solutions that outperform DNS failover:
- AWS: Route53 Application Recovery Controller
- Azure: Traffic Manager with endpoint monitoring
- GCP: Global External HTTP(S) Load Balancer
Despite its limitations, DNS failover can be appropriate for:
- Non-critical services with acceptable recovery times
- Geographically distributed static content
- A secondary failover mechanism layered behind other solutions
Here's a Python sketch of a health check that could trigger multiple failover mechanisms (the zone and record names below are placeholders):
import requests
from dns import update  # dnspython

def check_server(url):
    # Consider the server healthy only if it answers HTTP 200 within 5 seconds
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

def update_dns_failover(primary_ip, secondary_ip):
    if not check_server(f"http://{primary_ip}"):
        # This would be called after slower GSLB/Anycast fails over
        # "example.com" and "www" stand in for your own zone and record
        dns_update = update.Update("example.com")
        dns_update.replace("www", 300, "A", secondary_ip)
        # Send update to DNS servers...
Effective failover requires comprehensive monitoring:
# Prometheus alert rule for failover triggering
- alert: BackendDown
  expr: up{job="webserver"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} down"
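If an external failover script needs to act on the same signal, it can consult the Prometheus HTTP query API directly. A minimal sketch (the Prometheus address below is a placeholder):

import requests

PROMETHEUS = "http://prometheus.internal:9090"  # placeholder address

def backend_is_up(job="webserver"):
    # Ask Prometheus whether any instance of the job currently reports up == 1
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query",
        params={"query": f'max(up{{job="{job}"}})'},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return bool(result) and float(result[0]["value"][1]) == 1.0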
DNS failover remains controversial despite being offered by services like DNSmadeeasy.com. The core issue lies in DNS caching behavior - TTL (Time-To-Live) values aren't always respected by resolvers or ISPs. When a server fails, cached DNS records may continue routing traffic to the dead endpoint for hours or even days.
# Example of problematic DNS TTL settings
$ dig example.com
;; ANSWER SECTION:
example.com. 300 IN A 192.0.2.1
example.com. 300 IN A 192.0.2.2
For web servers distributed across different subnets, consider these more reliable approaches:
1. Anycast Routing
Anycast announces the same IP address from multiple locations. BGP routing automatically directs traffic to the nearest available node.
# Sample BGP configuration snippet (Cisco IOS)
router bgp 64512
network 203.0.113.0 mask 255.255.255.0
neighbor 192.0.2.1 remote-as 64512
2. Global Server Load Balancing (GSLB)
GSLB solutions like F5 BIG-IP or AWS Route 53 Traffic Flow use real-time health checks and faster DNS updates than traditional DNS failover.
# AWS Route 53 health check configuration
resource "aws_route53_health_check" "web" {
ip_address = "192.0.2.1"
port = 80
type = "HTTP"
resource_path = "/health"
failure_threshold = 2
request_interval = 30
}
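The health check only matters once a record references it. A hedged sketch using boto3 (the hosted zone ID, health check ID, names, and addresses are all placeholders) showing a PRIMARY failover record tied to such a check; a matching SECONDARY record would point at the standby server:

import boto3

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com",
                "Type": "A",
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "192.0.2.1"}],
                "HealthCheckId": "00000000-0000-0000-0000-000000000000",  # placeholder
            },
        }],
    },
)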
3. Reverse Proxy with Health Checks
HAProxy or Nginx can monitor backend servers and instantly reroute traffic:
# HAProxy configuration example
backend web_servers
    balance roundrobin
    option httpchk GET /health
    server web1 192.0.2.1:80 check
    server web2 192.0.2.2:80 check backup
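Both the Route 53 health check and the HAProxy httpchk above poll a /health path. As a rough, stdlib-only sketch of what that endpoint needs to do (in a real deployment it would live inside your application framework, and the port is illustrative):

from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)  # healthy: the checker keeps routing traffic here
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()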
DNS failover can work for:
- Non-critical services with high TTL tolerance
- Complementary failover alongside other methods
- Scenarios where sub-second failover isn't required
The key is understanding your RTO (Recovery Time Objective) and whether DNS propagation delays fit within that window.
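As a back-of-envelope aid, the sketch below estimates the worst-case DNS failover time (health-check detection plus TTL expiry) and compares it with an assumed RTO; all the numbers are illustrative, and it deliberately ignores resolvers that overstay the TTL:

def dns_failover_worst_case(check_interval_s, failure_threshold, ttl_s):
    detection = check_interval_s * failure_threshold  # time to declare the endpoint down
    return detection + ttl_s                          # plus time for well-behaved caches to expire

rto_s = 120  # assumed RTO of two minutes
worst_case = dns_failover_worst_case(check_interval_s=30, failure_threshold=2, ttl_s=300)
print(f"Worst case ~{worst_case}s vs RTO {rto_s}s: {'fits' if worst_case <= rto_s else 'too slow'}")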
When architecting your failover solution, also measure how quickly traffic actually moves:
# Python script to test failover responsiveness
import requests
from time import time

def test_failover(url, expected_ip):
    start = time()
    try:
        # Assumes the backend identifies itself via a custom X-Server-IP response header
        resolved_ip = requests.get(url, timeout=5).headers.get('X-Server-IP')
        return {
            'success': resolved_ip == expected_ip,
            'response_time': time() - start
        }
    except requests.RequestException as e:
        return {'error': str(e)}
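During a controlled drill you might point it at your public URL and the address you expect traffic to land on after failover (both placeholders here):

result = test_failover("http://www.example.com/", expected_ip="192.0.2.2")
print(result)  # e.g. {'success': True, 'response_time': 0.42} once traffic has moved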
Always test your failover mechanism under controlled conditions before relying on it in production.