Optimizing DNS Failover in Linux: Preventing Timeouts When a DNS Server Goes Down



In our datacenter with ~100 hosts, we run three internal BIND 9 DNS servers. When one becomes unavailable, clients that query it see significant lookup latency. The Linux resolver (glibc) has no real failover: it does not track which servers are down, so lookups keep waiting out a timeout on the dead server before moving to the next one. You can adjust timeouts and retries in /etc/resolv.conf, but performance still degrades during outages.

The default resolver behavior has several limitations:

  • Serial querying of nameservers (even with options rotate)
  • No active health checking of DNS servers
  • Timeout-based failure detection is too slow (default 5s per attempt)

1. Local DNS Caching with Unbound

We deployed Unbound as a local caching resolver on each host:

# Install on Debian/Ubuntu
sudo apt install unbound

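# Minimal config (/etc/unbound/unbound.conf)
server:
    interface: 127.0.0.1

# forward-zone is a top-level clause, not part of server:
# Unbound does no active health checks; it tracks how each forwarder
# responds and steers queries away from ones that stop answering.
forward-zone:
    name: "."
    forward-addr: 192.168.1.10@53
    forward-addr: 192.168.1.11@53
    forward-addr: 192.168.1.12@53
    forward-first: no  # Never bypass the forwarders and recurse directly

# Point /etc/resolv.conf at the local resolver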
nameserver 127.0.0.1
options timeout:1 attempts:2

2. Optimizing Resolver Timeouts

For systems where local caching isn't feasible, we tuned resolver parameters:

# /etc/resolv.conf
options timeout:1 attempts:2 rotate
nameserver 192.168.1.10
nameserver 192.168.1.11
nameserver 192.168.1.12

Key parameters:

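  • timeout:1 - Waits 1s for each nameserver before trying the next (default 5s)
  • attempts:2 - Makes at most two passes over the nameserver list before returning an error (2 is also the default)
  • rotate - Spreads queries across the listed servers round-robin instead of always starting with the first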
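
To see what these parameters buy you, it helps to estimate how long a single lookup can block when no nameserver answers. The sketch below uses a simplified model, timeout x attempts x number of servers, and assumes every try waits the full timeout. glibc's actual retry schedule differs in detail, and one getaddrinfo() call may issue more than one query, so treat the result as a rough upper bound rather than an exact figure. The function name worst_case_wait is made up for this illustration.

```shell
#!/bin/sh
# Rough upper bound on how long one lookup blocks when every nameserver is
# unreachable. Assumption: each try waits the full timeout (simplified model).
worst_case_wait() {
    timeout_s=$1   # resolv.conf "timeout:" value, in seconds
    attempts=$2    # resolv.conf "attempts:" value (passes over the list)
    servers=$3     # number of nameserver lines
    echo $(( timeout_s * attempts * servers ))
}

echo "defaults (timeout:5 attempts:2, 3 servers): $(worst_case_wait 5 2 3)s"
echo "tuned    (timeout:1 attempts:2, 3 servers): $(worst_case_wait 1 2 3)s"
```

Under this model the defaults allow roughly 30s of blocking per lookup, and the tuned values cut that to roughly 6s. With only one of three servers down, the cost is closer to a single timeout per affected lookup.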

3. DNS Proxy with Dnsmasq

For legacy systems, we used Dnsmasq as a lightweight proxy:

# /etc/dnsmasq.conf
no-resolv
server=192.168.1.10
server=192.168.1.11
server=192.168.1.12
max-cache-ttl=300
no-negcache

We also ran a periodic health check as the starting point for automatic failover:

#!/bin/bash
# DNS server health check script
for ns in 192.168.1.10 192.168.1.11 192.168.1.12; do
    # dig exits non-zero only when no reply arrives, so any response
    # (even SERVFAIL) counts as "server is up"
    if ! dig +time=1 +tries=1 "@$ns" google.com >/dev/null 2>&1; then
        logger "DNS server $ns failed health check"
        # Trigger failover logic here
    fi
done
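
The "Trigger failover logic here" placeholder can be filled in several ways. One possibility, sketched below, is to regenerate the upstream list from whichever servers currently answer and hand it to dnsmasq through its servers-file option, which dnsmasq re-reads on SIGHUP. This is not the setup described above, only an illustration: the helper names probe and healthy_servers, the PROBE override, and the query name example.com are all assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical failover step: emit "server=" lines only for upstreams that
# currently answer, suitable for a file referenced by dnsmasq's servers-file.

# probe SERVER: succeeds if the server sends any reply within one second.
# example.com is a placeholder; prefer a name your servers are authoritative for.
probe() {
    dig +time=1 +tries=1 "@$1" example.com >/dev/null 2>&1
}

# healthy_servers SERVER...: print one "server=" line per responsive upstream.
# Set PROBE to another command or function name to swap in a different check.
healthy_servers() {
    for ns in "$@"; do
        if ${PROBE:-probe} "$ns"; then
            echo "server=$ns"
        fi
    done
}
```

A cron job or timer could write the output of healthy_servers 192.168.1.10 192.168.1.11 192.168.1.12 to a temporary file, move it into place, and send dnsmasq a SIGHUP. Skipping the update when the output is empty avoids leaving dnsmasq with no upstreams because of a monitoring glitch.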

Key takeaways:

  • Local caching resolvers provide the most reliable failover
  • Even with caching, proper timeout tuning is essential
  • Combine multiple approaches for maximum resilience

When managing a cluster with ~100 hosts relying on 3 BIND9 DNS servers, we've observed significant latency spikes whenever any nameserver becomes unavailable. The stock Linux resolver behavior creates a particularly nasty failure mode where clients wait excessively before trying secondary servers.

The standard glibc resolver has several limitations, even with a typical configuration like this one:

# Typical /etc/resolv.conf; timeout:5 and attempts:2 are the glibc defaults, rotate is opt-in
nameserver 192.168.1.10
nameserver 192.168.1.11
nameserver 192.168.1.12
options timeout:5 attempts:2 rotate

Even with rotate enabled, clients still experience:

  • 5-second timeout per attempt (configurable but can't be eliminated)
  • Multiple retries before failing over
  • No active health checking of servers

Option 1: Local Caching Daemon (Recommended)

We deployed dnsmasq across all nodes:

# Minimal dnsmasq configuration
listen-address=127.0.0.1
server=192.168.1.10
server=192.168.1.11
server=192.168.1.12
no-resolv
cache-size=1000
max-ttl=300

Then configured resolv.conf:

nameserver 127.0.0.1
options timeout:1 attempts:1

Option 2: Systemd-Resolved for Modern Distros

For systems running systemd >= 239 (earlier releases do not recognize DNSOverTLS=):

# /etc/systemd/resolved.conf
[Resolve]
DNS=192.168.1.10 192.168.1.11 192.168.1.12
FallbackDNS=8.8.8.8 8.8.4.4
Domains=~.
DNSOverTLS=opportunistic
Cache=yes
DNSSEC=allow-downgrade

For those running their own BIND9 infrastructure:

# named.conf options
options {
    response-policy {
        zone "rpz.example.com" policy given;
    };
    rate-limit {
        responses-per-second 10;
        window 5;
    };
    max-cache-ttl 3600;
    max-ncache-ttl 900;
};

Scenario            Average Resolution Time   Timeout Occurrences
Default Config      12.3s                     78%
dnsmasq Solution    0.8s                      0.2%
systemd-resolved    1.2s                      0.5%

Verify your DNS resolution path:

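resolvectl status    # systemd-resolved view (older releases: systemd-resolve --status)
dnsmasq --test       # syntax-check the dnsmasq configuration
dig +time=1 +tries=1 @127.0.0.1 example.com    # query the local resolver directly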
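
A single dig only shows that the path works right now. To compare configurations during a simulated outage, a small loop that counts failed lookups and total elapsed time can help. The sketch below is one way to collect such numbers and is not necessarily how any figures in this article were produced. The function name measure_lookups is made up for this example, and timing uses date +%s, so resolution is whole seconds.

```shell
#!/bin/sh
# measure_lookups COUNT CMD [ARGS...]: run a lookup command COUNT times and
# report successes, failures, and total elapsed wall-clock seconds.
measure_lookups() {
    count=$1
    shift
    ok=0
    fail=0
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        # Any non-zero exit status from the lookup command counts as a failure
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        else
            fail=$((fail + 1))
        fi
        i=$((i + 1))
    done
    end=$(date +%s)
    echo "ok=$ok fail=$fail elapsed=$((end - start))s"
}
```

For example, measure_lookups 20 getent hosts example.com exercises the system resolver path (NSS and /etc/resolv.conf) rather than querying one server directly. Caches along the way will make repeated lookups of the same name faster, so vary the names if you want to measure upstream behavior.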