Round-Robin DNS for High Availability: Client Failover Behavior Analysis



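Round-robin DNS works by returning all of a domain's address records in each response and rotating their order between queries. With two A records configured for example.com, successive responses typically list the addresses in alternating order, so clients that pick the first address are spread across both: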


example.com.    IN  A  192.0.2.1
example.com.    IN  A  203.0.113.2
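To make the rotation concrete, the sketch below models a name server that rotates its record set by one position per query. The `RoundRobinZone` class is hypothetical, and real servers may shuffle or order answers differently, so read it as an illustration rather than a description of any particular DNS implementation.

```python
from collections import deque


class RoundRobinZone:
    """Toy model of a name server that rotates its A records per query."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def query(self):
        # Return the full record set, then rotate so the next query
        # starts with a different address.
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


zone = RoundRobinZone(["192.0.2.1", "203.0.113.2"])
first = zone.query()   # ['192.0.2.1', '203.0.113.2']
second = zone.query()  # ['203.0.113.2', '192.0.2.1']
```

Every answer still contains both addresses; only the order changes, which is why a failed address keeps being handed out.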

The fundamental limitation emerges when one IP becomes unavailable. Standard DNS behavior doesn't include health checks - the failed IP remains in rotation. Client behavior varies:

  • Modern browsers implement "Happy Eyeballs" (RFC 8305), which staggers connection attempts across the returned addresses instead of waiting for each one to time out
  • Many applications simply try the first returned IP and fail on timeout
  • DNS cache TTLs delay failover to alternative IPs
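The effect of cache TTLs on failover can be sketched with a toy caching resolver. `CachingResolver`, its `lookup` callable, and the injected clock are all assumptions made for illustration; real stub resolvers and OS caches are more involved.

```python
import time


class CachingResolver:
    """Toy resolver that reuses an answer until its TTL expires."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup   # callable: name -> (addresses, ttl_seconds)
        self._clock = clock
        self._cache = {}        # name -> (addresses, expiry_time)

    def resolve(self, name):
        cached = self._cache.get(name)
        if cached and self._clock() < cached[1]:
            return cached[0]    # still fresh: upstream changes are invisible
        addresses, ttl = self._lookup(name)
        self._cache[name] = (addresses, self._clock() + ttl)
        return addresses


# Simulated time and upstream zone data (hypothetical values).
now = [0.0]
upstream = {"example.com": (["192.0.2.1", "203.0.113.2"], 300)}
resolver = CachingResolver(lambda name: upstream[name], clock=lambda: now[0])

before = resolver.resolve("example.com")
upstream["example.com"] = (["203.0.113.2"], 300)  # operator removes the dead IP
now[0] = 299.0
stale = resolver.resolve("example.com")  # cached answer still lists 192.0.2.1
now[0] = 300.0
fresh = resolver.resolve("example.com")  # TTL expired, dead IP is gone
```

Under these assumptions, a client can keep seeing the failed address for up to one full TTL after it has been removed from the zone.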

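The illustrative transcript below shows what a client that tries only one address per resolution experiences. curl itself normally moves on to the next returned address within the same run, so treat the output as a sketch of the failure mode and not as a literal capture: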


# First attempt (gets unresponsive IP)
$ curl -v http://example.com
* Trying 192.0.2.1:80...
* connect to 192.0.2.1 port 80 failed: Connection timed out

# Second attempt after DNS cache expires
$ curl -v http://example.com
* Trying 203.0.113.2:80...
* Connected to example.com (203.0.113.2) port 80

For true high availability, consider these enhanced approaches:


# DNS-based solution with health checks (AWS Route 53, Terraform)
# Note: the check only changes DNS answers once a record references it via health_check_id
resource "aws_route53_health_check" "backend" {
  ip_address        = "192.0.2.1"
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  failure_threshold = 2
  request_interval  = 30
}

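# Application-layer retry pattern (Python example)
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,
    connect=3,
    read=3,
    status=3,
    backoff_factor=0.5,
    # Without a status_forcelist, plain 5xx responses are never retried
    status_forcelist=[502, 503, 504],
    # POST is not idempotent; include it only if repeating the request is safe
    allowed_methods=frozenset(['GET', 'POST'])
)
# Mount the adapter for both schemes so HTTPS requests are retried as well
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))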

Despite limitations, round-robin works well for:

  • Load distribution across healthy endpoints
  • Blue-green deployments with controlled cutovers
  • Geographic distribution when combined with EDNS Client Subnet

Round-robin DNS distributes requests across multiple IP addresses in a rotating fashion, but DNS itself provides no health checking or automatic failover. What happens when an address fails depends on how the client handles a response that carries several records, such as:


; Example DNS response with two A records (TTL 300 seconds)
example.com.    300 IN  A  192.0.2.1
example.com.    300 IN  A  203.0.113.2

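Many operating systems and HTTP clients fall back across the returned addresses, and some implement "Happy Eyeballs" style algorithms that stagger their attempts. In outline:

  1. The client gets both IP addresses from DNS
  2. It attempts to connect to the first IP
  3. If the connection does not succeed within the client's fallback delay, it tries the next IP. Happy Eyeballs implementations wait only a fraction of a second (RFC 8305 recommends a 250 ms default), while clients that wait for a full TCP connect timeout can stall for many seconds
  4. This happens at the TCP layer, before any application protocol handshake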
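The steps above can be sketched as a sequential fallback loop. This is a minimal illustration under stated assumptions: `connect_with_fallback` and its `resolve` and `open_socket` parameters are hypothetical names (the defaults are the standard library's `socket.getaddrinfo` and `socket.create_connection`), and it does not implement the staggered timing of Happy Eyeballs.

```python
import socket


def connect_with_fallback(host, port, per_address_timeout=1.0,
                          resolve=socket.getaddrinfo,
                          open_socket=socket.create_connection):
    """Try each resolved address in order; return the first socket that connects."""
    infos = resolve(host, port, type=socket.SOCK_STREAM)
    last_error = None
    for _family, _type, _proto, _canonname, sockaddr in infos:
        address = sockaddr[0]
        try:
            # Only the TCP handshake happens here; no HTTP has been sent yet.
            return open_socket((address, port), timeout=per_address_timeout)
        except OSError as exc:
            # Timeouts and refusals are both OSError subclasses:
            # remember the failure and move on to the next address.
            last_error = exc
    raise ConnectionError(f"all addresses for {host} failed") from last_error
```

A call such as `connect_with_fallback("example.com", 80)` would spend at most `per_address_timeout` seconds on each unreachable address before moving on, so the worst-case delay grows with the number of dead addresses listed ahead of a healthy one.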

Different technologies handle this differently:

Web Browsers

Modern browsers (Chrome, Firefox) implement sophisticated connection strategies:


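// Simplified Node.js sketch of racing connection attempts (illustrative; not Chrome's actual implementation)
const dns = require('dns').promises;
const net = require('net');
async function connect(hostname, port = 80) {
  const ips = await dns.resolve4(hostname);
  // One TCP attempt per address; the first to connect wins (real code would also close the losers)
  const attempts = ips.map(ip => new Promise((resolve, reject) => {
    const socket = net.connect({ host: ip, port }, () => resolve(socket));
    socket.once('error', reject);
  }));
  return Promise.any(attempts);
}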

Programming Languages

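Many HTTP libraries will automatically try alternative IPs:


# Python requests example
import requests

try:
    # The connect timeout applies to each address attempt, not to the request as a whole
    response = requests.get('http://example.com', timeout=5)
except requests.exceptions.ConnectionError as exc:
    # Raised (ConnectTimeout is a subclass) once every resolved address has been tried
    print(f"all addresses failed: {exc}")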

Round-robin DNS alone isn't a complete HA solution because:

  • DNS caching means clients may continue trying dead IPs until TTL expires
  • No awareness of server health or load
  • Uneven distribution if some clients cache DNS longer than others
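A rough way to quantify the lack of health awareness: if answers rotate evenly and each client uses only the first address, the share of failing requests matches the share of dead addresses. The helper below is a hypothetical back-of-the-envelope model that ignores caching and client-side fallback.

```python
from itertools import cycle, islice


def failed_fraction(addresses, dead, requests=1000):
    """Fraction of requests that land on a dead address when every client
    uses only the first address of an evenly rotating answer."""
    rotation = cycle(addresses)
    hits = sum(1 for ip in islice(rotation, requests) if ip in dead)
    return hits / requests


two_ip_loss = failed_fraction(["192.0.2.1", "203.0.113.2"], {"192.0.2.1"})
```

With the two example addresses and one of them down, half of such first-address-only requests fail until the record is removed and caches expire.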

For production systems, consider combining with:

Health-Checking DNS

Services like Amazon Route 53 or NS1 provide DNS with health checks:


# Route 53 health check configuration
resource "aws_route53_health_check" "example" {
  ip_address        = "192.0.2.1"
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  failure_threshold = 3
}

Client-Side Retry Logic

Implement explicit retries in your application code:


// Node.js with retry logic
// (with axios-retry 4.x under CommonJS, use require('axios-retry').default)
const axiosRetry = require('axios-retry');
const axios = require('axios');

axiosRetry(axios, { 
  retries: 3,
  retryCondition: (error) => {
    return axiosRetry.isNetworkError(error) || 
      (error.response && error.response.status >= 500);
  }
});

When using round-robin DNS, implement:

  • DNS resolution monitoring to ensure all IPs are returned
  • Endpoint availability checks for each IP
  • TTL expiration tracking to detect caching issues
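The first two checks can be sketched with the standard library alone. `check_endpoints` is a hypothetical helper; it resolves through the local resolver (so cached answers may hide changes at the authoritative servers), looks at IPv4 only, and treats a successful TCP connect as "available", all of which are simplifying assumptions.

```python
import socket


def check_endpoints(host, port, expected, timeout=2.0,
                    resolve=socket.getaddrinfo,
                    open_socket=socket.create_connection):
    """Compare resolved addresses with `expected` and probe each one over TCP."""
    infos = resolve(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    resolved = {info[4][0] for info in infos}
    report = {
        "missing_from_dns": sorted(set(expected) - resolved),
        "unreachable": [],
    }
    for address in sorted(resolved):
        try:
            # A completed TCP handshake counts as reachable in this sketch.
            open_socket((address, port), timeout=timeout).close()
        except OSError:
            report["unreachable"].append(address)
    return report
```

Running `check_endpoints("example.com", 80, ["192.0.2.1", "203.0.113.2"])` on a schedule would flag addresses that dropped out of the answer or stopped accepting connections. Tracking TTLs needs a DNS library that exposes record TTLs, which `socket.getaddrinfo` does not.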