DNS Failover Techniques: Implementing Backup A Records and Alternatives for High Availability


DNS has specialized record types that support failover: MX (mail exchange) records carry preference values (e.g., MX 10 and MX 20) that give priority-based failover, and NS (name server) records provide redundancy by listing several servers. Standard A records have no equivalent field, which makes primary-backup server architectures hard to implement at the DNS level alone.

For specific services, DNS does provide built-in priority or redundancy mechanisms:


; Mail server example with priorities
example.com.  IN  MX  10 mail1.example.com.
example.com.  IN  MX  20 mail2.example.com.

; Nameserver example (no priority field; resolvers try another listed server if one does not answer)
example.com.  IN  NS  ns1.example.com.
example.com.  IN  NS  ns2.example.com.

When you need backup A records, consider these approaches:

1. DNS TTL Optimization

Reduce TTL (Time To Live) to allow rapid DNS changes during outages:


example.com.  300  IN  A  192.0.2.1  ; 5 minute TTL
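
A low TTL only bounds how long resolvers may cache the old answer; the time needed to detect the outage adds to it. The following is a rough back-of-the-envelope sketch, assuming a hypothetical monitor that polls at a fixed interval and fails over after a set number of consecutive failed checks. The function and parameter names are illustrative only.

```python
def worst_case_failover_seconds(ttl, check_interval, failures_required):
    """Rough upper bound on how long clients may keep hitting a dead server.

    Assumes resolvers honor the TTL and the DNS update itself is instant;
    real-world numbers are often worse.
    """
    detection = check_interval * failures_required  # time to notice the outage
    return detection + ttl                          # plus cached answers expiring


# 5-minute TTL, checks every 30 s, 3 failed checks needed before failover
print(worst_case_failover_seconds(ttl=300, check_interval=30, failures_required=3))  # 390
```

With these assumed numbers, lowering the TTL from 300 to 60 seconds would cut the bound from 390 to 150 seconds, so both the TTL and the check schedule matter.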

2. Round-Robin DNS

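List multiple IPs for the same name. Resolvers return all of them, and each client picks one (many clients try the next address if the first connection fails):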


example.com.  IN  A  192.0.2.1
example.com.  IN  A  192.0.2.2
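
How a client might walk a round-robin answer can be sketched as follows. This is a minimal illustration, not a definitive implementation: `connect_first_available` is a hypothetical helper, and its `resolve` and `connect` parameters exist only so the logic can be exercised without a network.

```python
import socket


def connect_first_available(host, port, timeout=2.0,
                            resolve=socket.getaddrinfo,
                            connect=socket.create_connection):
    """Try each address returned for `host` in order; return the first open socket.

    `resolve` and `connect` default to the standard-library calls and are
    injectable only for testing.
    """
    last_error = None
    # getaddrinfo returns one 5-tuple per address; sockaddr[0] is the IP string
    for *_, sockaddr in resolve(host, port, type=socket.SOCK_STREAM):
        ip = sockaddr[0]
        try:
            return connect((ip, port), timeout=timeout)
        except OSError as exc:  # refused, timed out, unreachable, ...
            last_error = exc
    raise ConnectionError(f"all addresses for {host} failed") from last_error
```

Python's own `socket.create_connection` already behaves this way when given a host name, so the explicit loop is mainly useful when you want to log or reorder the attempts.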

3. Health-Check Based DNS

Implement dynamic DNS updates with health checks using tools like:

  • AWS Route 53 failover
  • Azure Traffic Manager
  • PowerDNS with health check scripts

Here's a Python script that monitors server health and updates the A record with an RFC 2136 dynamic update (using the requests and dnspython libraries):


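import socket

import requests
import dns.query
import dns.update

def check_server_health(ip):
    """Return True if the server answers its health endpoint with HTTP 200."""
    try:
        response = requests.get(f"http://{ip}/health", timeout=2)
        return response.status_code == 200
    except requests.RequestException:
        # Connection errors and timeouts both count as unhealthy
        return False

def update_dns_record(primary_ip, backup_ip):
    """Point the zone apex at the primary if it is healthy, else at the backup."""
    target_ip = primary_ip if check_server_health(primary_ip) else backup_ip
    update = dns.update.Update('example.com')
    update.replace('@', 300, 'A', target_ip)
    # dns.query.tcp() expects an IP address, so resolve the nameserver first.
    # Most servers also require a TSIG key (keyring=...) for dynamic updates.
    nameserver_ip = socket.gethostbyname('ns1.example.com')
    dns.query.tcp(update, nameserver_ip, timeout=5)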

Provider        Feature              Implementation
Cloudflare      Load balancing       Health checks + failover
AWS Route 53    Failover routing     Active-passive configuration
DNS Made Easy   Failover A records   HTTP/S monitoring

Whichever approach you choose, keep these limitations in mind:

  • DNS propagation delays (even with low TTL)
  • Client-side DNS caching behavior
  • Health check frequency and monitoring costs
  • False positive scenarios
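
One common way to reduce false positives is to require several consecutive failed checks before switching to the backup, and several consecutive successes before switching back. The sketch below illustrates that idea; the `FailoverState` class and its threshold defaults are assumptions for illustration, not part of any library mentioned here.

```python
class FailoverState:
    """Tracks consecutive health-check results to avoid flapping."""

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.failures = 0
        self.successes = 0
        self.on_backup = False

    def record(self, healthy):
        """Feed one check result; return True if traffic should use the backup."""
        if healthy:
            self.successes += 1
            self.failures = 0
            if self.on_backup and self.successes >= self.recover_threshold:
                self.on_backup = False
        else:
            self.failures += 1
            self.successes = 0
            if not self.on_backup and self.failures >= self.fail_threshold:
                self.on_backup = True
        return self.on_backup


state = FailoverState()
# A single failed check does not trigger failover; three in a row do
print([state.record(ok) for ok in [False, True, False, False, False]])
# [False, False, False, False, True]
```

Higher thresholds make failover slower but less jumpy, so they trade directly against the detection time discussed under TTL optimization.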

While DNS supports redundant NS (nameserver) records and MX (mail server) records with preference values, the DNS protocol itself has no native mechanism for A record failover. When you query for A records, DNS servers typically return all available records, often in rotated or shuffled order, with no inherent priority system.

Here are several proven approaches to implement high availability for your services:

; Example DNS zone file showing multiple A records
example.com.    300 IN  A   192.0.2.1
example.com.    300 IN  A   192.0.2.2
example.com.    300 IN  A   192.0.2.3

Modern applications should implement their own failover logic when multiple IPs are returned:

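// JavaScript implementation of client-side failover
// Note: browsers do not let scripts set the Host header, and HTTPS
// certificates will not match a bare IP, so this suits plain-HTTP,
// server-side use only.
async function fetchWithRetry(url, ips, options = {}) {
  const original = new URL(url);
  for (const ip of ips) {
    try {
      // Keep the path and query, but connect to the candidate IP over HTTP
      const target = new URL(original);
      target.protocol = 'http:';
      target.hostname = ip;
      const response = await fetch(target, {
        ...options,
        headers: { ...options.headers, Host: original.hostname }
      });
      return response;
    } catch (error) {
      console.log(`Failed to connect to ${ip}, trying next...`);
    }
  }
  throw new Error('All servers unavailable');
}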

Some DNS providers offer custom solutions:

  • DNS Made Easy: Failover system that monitors servers
  • Amazon Route 53: Health checks and DNS failover
  • Cloudflare: Load balancing with health checks
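
As an illustration of provider-side failover, the sketch below builds the change batch for a Route 53 active-passive pair: a PRIMARY record tied to a health check and a SECONDARY record that is served when the primary is unhealthy. The helper name `failover_change_batch` and the health check ID are assumptions; the health check itself would have to be created separately in Route 53.

```python
def failover_change_batch(name, primary_ip, backup_ip, health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY failover A records."""

    def record(role, ip, extra=None):
        rrset = {
            'Name': name,
            'Type': 'A',
            'SetIdentifier': f'{role.lower()}-{ip}',  # must be unique per record
            'Failover': role,                         # 'PRIMARY' or 'SECONDARY'
            'TTL': ttl,
            'ResourceRecords': [{'Value': ip}],
        }
        rrset.update(extra or {})
        return {'Action': 'UPSERT', 'ResourceRecordSet': rrset}

    return {'Changes': [
        # Only the primary is tied to the health check in this sketch
        record('PRIMARY', primary_ip, {'HealthCheckId': health_check_id}),
        record('SECONDARY', backup_ip),
    ]}
```

The returned dictionary can be passed as the `ChangeBatch` argument of boto3's `change_resource_record_sets`, together with your hosted zone ID. Once the records exist, Route 53 performs the switch itself, so no external script has to rewrite DNS during an outage.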

For critical services, consider implementing a monitoring system that updates DNS records dynamically:

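# Python example using Route 53 API
import boto3
# Assumed local helper module providing check_server(ip) -> bool
from healthcheck import check_server

def update_dns_based_on_health():
    """Publish only the IPs that currently pass their health check."""
    route53 = boto3.client('route53')
    healthy_ips = [ip for ip in ['192.0.2.1', '192.0.2.2'] if check_server(ip)]

    # If every server is down, leave the existing record untouched
    if healthy_ips: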
        route53.change_resource_record_sets(
            HostedZoneId='Z1PA6795UKMFR9',
            ChangeBatch={
                'Changes': [{
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': 'example.com',
                        'Type': 'A',
                        'TTL': 60,
                        'ResourceRecords': [{'Value': ip} for ip in healthy_ips]
                    }
                }]
            }
        )

A few practices help keep DNS failover reliable:

  • Use low TTL values (30-60 seconds) for dynamic records
  • Implement monitoring for both primary and backup servers
  • Consider geographic distribution of backup servers
  • Test failover procedures regularly
  • Document your failover strategy clearly