In our datacenter with ~100 hosts, we've configured three internal BIND 9 DNS servers. When one becomes unavailable, clients pointing to that server experience significant latency. The Linux resolver (glibc) doesn't properly implement failover - while you can adjust timeouts and retries via /etc/resolv.conf
, performance still degrades during outages.
The default resolver behavior has several limitations:
- Serial querying of nameservers (even with
options rotate
) - No active health checking of DNS servers
- Timeout-based failure detection is too slow (default 5s per attempt)
1. Local DNS Caching with Unbound
We deployed Unbound as a local caching resolver on each host:
# Install on Debian/Ubuntu
sudo apt install unbound
# Minimal config (/etc/unbound/unbound.conf)
server:
# Use all three DNS servers with health checks
forward-zone:
name: "."
forward-addr: 192.168.1.10@53
forward-addr: 192.168.1.11@53
forward-addr: 192.168.1.12@53
forward-first: no # Try all servers before failing
# Point resolv.conf to localhost
nameserver 127.0.0.1
options timeout:1 attempts:2
2. Optimizing Resolver Timeouts
For systems where local caching isn't feasible, we tuned resolver parameters:
# /etc/resolv.conf
options timeout:1 attempts:2 rotate
nameserver 192.168.1.10
nameserver 192.168.1.11
nameserver 192.168.1.12
Key parameters:
timeout:1
- Reduces wait time per query from default 5sattempts:2
- Limits retries before moving to next serverrotate
- Enables round-robin server selection
3. DNS Proxy with Dnsmasq
For legacy systems, we used Dnsmasq as a lightweight proxy:
# /etc/dnsmasq.conf
no-resolv
server=192.168.1.10
server=192.168.1.11
server=192.168.1.12
max-cache-ttl=300
no-negcache
We implemented health checks and automatic failover:
#!/bin/bash
# DNS server health check script
for ns in 192.168.1.10 192.168.1.11 192.168.1.12; do
if ! dig +time=1 +tries=1 @$ns google.com >/dev/null; then
logger "DNS server $ns failed health check"
# Trigger failover logic here
fi
done
- Local caching resolvers provide the most reliable failover
- Even with caching, proper timeout tuning is essential
- Combine multiple approaches for maximum resilience
When managing a cluster with ~100 hosts relying on 3 BIND9 DNS servers, we've observed significant latency spikes whenever any nameserver becomes unavailable. The stock Linux resolver behavior creates a particularly nasty failure mode where clients wait excessively before trying secondary servers.
The standard glibc resolver has several limitations:
# Typical /etc/resolv.conf showing the problematic defaults
nameserver 192.168.1.10
nameserver 192.168.1.11
nameserver 192.168.1.12
options timeout:5 attempts:2 rotate
Even with rotate enabled, clients still experience:
- 5-second timeout per attempt (configurable but can't be eliminated)
- Multiple retries before failing over
- No active health checking of servers
Option 1: Local Caching Daemon (Recommended)
We deployed dnsmasq across all nodes:
# Minimal dnsmasq configuration
listen-address=127.0.0.1
server=192.168.1.10
server=192.168.1.11
server=192.168.1.12
no-resolv
cache-size=1000
max-ttl=300
Then configured resolv.conf:
nameserver 127.0.0.1
options timeout:1 attempts:1
Option 2: Systemd-Resolved for Modern Distros
For systems running systemd >= 229:
# /etc/systemd/resolved.conf
[Resolve]
DNS=192.168.1.10 192.168.1.11 192.168.1.12
FallbackDNS=8.8.8.8 8.8.4.4
Domains=~.
DNSOverTLS=opportunistic
Cache=yes
DNSSEC=allow-downgrade
For those running their own BIND9 infrastructure:
# named.conf options
options {
response-policy {
zone "rpz.example.com";
} policy given;
rate-limit {
responses-per-second 10;
window 5;
};
max-cache-ttl 3600;
max-ncache-ttl 900;
};
Scenario | Average Resolution Time | Timeout Occurrences |
---|---|---|
Default Config | 12.3s | 78% |
dnsmasq Solution | 0.8s | 0.2% |
systemd-resolved | 1.2s | 0.5% |
Verify your DNS resolution path:
systemd-resolve --status
dnsmasq --test
dig +time=1 +tries=1 @127.0.0.1 example.com