When a recursive resolver encounters multiple NS records, its behavior isn't as straightforward as simple round-robin selection. The resolver typically:
- Maintains internal rankings of nameserver responsiveness
- Implements various fallback strategies
- May cache failed attempts beyond the TTL period
```bash
# Example showing NS record TTL inspection
dig example.com NS +noall +answer

example.com.    3600    IN    NS    ns1.example.com.
example.com.    3600    IN    NS    ns2.example.com.
```
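The responsiveness ranking a resolver builds can be approximated from the outside by timing queries against each authoritative server yourself. A minimal sketch against the example zone's two nameservers:

```bash
# Time a query against each authoritative server; resolvers keep a
# similar per-server RTT estimate and prefer the fastest responder
for ns in ns1.example.com ns2.example.com; do
    rtt=$(dig @"$ns" example.com A +noall +stats | awk '/Query time/ {print $4}')
    echo "$ns: ${rtt:-no response} msec"
done
```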
Several factors contribute to persistent resolution failures:
- Negative Caching: Some resolvers cache SERVFAIL responses (RFC 2308); the polling sketch after this list shows one way to observe it
- Sticky Nameserver Selection: Resolvers often stick with previously successful servers
- Implementation Differences: OpenDNS handles fallback differently than Google DNS
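To watch negative caching in action, poll a single resolver and track the status field over time. A rough sketch, using OpenDNS's public resolver address:

```bash
# Poll one resolver each minute; a SERVFAIL that persists past the
# NS TTL suggests the failure has been negatively cached
while true; do
    status=$(dig @208.67.222.222 example.com A +noall +comments | grep -o 'status: [A-Z]*')
    echo "$(date -u '+%H:%M:%S') ${status:-no response}"
    sleep 60
done
```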
While the SOA record's primary nameserver (MNAME) often gets preference, this isn't standardized. Testing shows:
```bash
# Checking SOA record (note the MNAME field)
dig example.com SOA +short

ns1.example.com. hostmaster.example.com. 2023081501 3600 1800 604800 3600
```
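Because some implementations prefer the MNAME server, it is worth confirming that MNAME actually matches one of the published NS records. A quick check:

```bash
# Extract MNAME from the SOA and confirm it appears in the NS set;
# a mismatch can skew resolvers that prefer the MNAME server
mname=$(dig example.com SOA +short | awk '{print $1}')
if dig example.com NS +short | grep -qx "$mname"; then
    echo "MNAME $mname is a published NS"
else
    echo "MNAME $mname is NOT in the NS set" >&2
fi
```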
Major public resolvers exhibit these patterns:
| Resolver   | Fallback Speed | SOA Preference |
|------------|----------------|----------------|
| Google DNS | ~5 minutes     | Moderate       |
| OpenDNS    | ~15 minutes    | Strong         |
| Cloudflare | ~2 minutes     | Weak           |
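These timings vary by vantage point and load, so measure rather than assume. A rough sketch that timestamps each resolver's answers during a controlled outage of the primary nameserver:

```bash
# Log when each resolver starts answering again after ns1 goes down;
# the gap between failure and first answer approximates fallback speed
while true; do
    for r in 8.8.8.8 208.67.222.222 1.1.1.1; do
        answer=$(dig @"$r" www.example.com A +short +time=3 +tries=1)
        echo "$(date -u '+%H:%M:%S') $r ${answer:-no answer}"
    done
    sleep 30
done
```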
Common implementations use variations of these approaches:
```js
// Pseudocode for typical resolver logic
function selectNameserver(nsList, soaMname) {
    // Sticky selection: prefer a server that answered successfully before
    if (hasCachedSuccess(nsList)) {
        return getFastestCached(nsList);
    }
    // Some implementations fall back to the SOA MNAME, then list order
    if (implementation === 'bind') {
        return soaMname || nsList[0];
    }
    // Others use randomized selection to spread load
    return shuffled(nsList)[0];
}
```
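Note the trade-off embedded here: randomized selection spreads query load across the NS set, while sticky selection converges on the fastest server but can keep a resolver pinned to a failed one until its negative cache expires. To defend against both failure modes: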
- Monitor all nameservers independently
- Consider Anycast implementations for critical DNS
- Test failure scenarios with various public resolvers
- Implement DNS health checks that verify resolution from multiple networks
```bash
#!/bin/bash
# Health check script example. dig exits 0 even on SERVFAIL, so test
# for a non-empty answer instead of relying on the exit status alone.
RESOLVERS=("8.8.8.8" "1.1.1.1" "208.67.222.222")
DOMAIN="example.com"

for resolver in "${RESOLVERS[@]}"; do
    if [[ -z "$(dig @"$resolver" "$DOMAIN" A +short +time=3 +tries=1)" ]]; then
        echo "FAIL: $resolver" >&2
    fi
done
```
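To get the multi-network coverage the last recommendation asks for, the script could run on a schedule from hosts in different networks; a hypothetical cron entry (the script path is assumed):

```
# Run the health check every 5 minutes (path is hypothetical)
*/5 * * * * /usr/local/bin/dns-healthcheck.sh
```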
During a recent incident where our primary nameserver (ns1.example.com) became unavailable, we noticed persistent resolution failures from major public resolvers (OpenDNS, Verizon, Earthlink) even after the 1-hour NS record TTL expired. Manual verification confirmed the secondary server (ns2.example.com) was operational:
```bash
dig @ns2.example.com www.example.com A +noall +answer

www.example.com.    300    IN    A    192.0.2.1
```
Contrary to common assumptions, DNS resolvers don't implement simple round-robin or strict TTL-based failover. Key behaviors observed:
- SOA Preference: Many resolvers prioritize the server listed first in NS records or the SOA MNAME field
- Negative Caching: SERVFAIL responses may be cached despite NS TTL expiration (RFC 2308)
- Health Checking: resolvers typically avoid retrying a recently failed server until a retry interval has elapsed:

```js
/* Pseudocode for resolver health-check logic */
if (last_query_failed && (now - last_failure < retry_interval)) {
    continue_using_alternate_server();
} else {
    retry_primary_server();
}
```
To diagnose resolver behavior during failures:
```bash
# Check NS record propagation from the root down
dig +trace example.com NS | grep -w "NS"

# Query a specific resolver's cache without triggering recursion
dig @8.8.8.8 example.com +norecurse

# Validate a local dnsmasq rule that forwards the zone to the secondary
# (--test only checks syntax; --server expects an IP address, so
# 192.0.2.53 stands in for ns2.example.com here)
dnsmasq --test --server=/example.com/192.0.2.53
```
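A note on the second command: with +norecurse the resolver is asked to answer from cache only, so an empty NOERROR response suggests nothing is cached yet, while a SERVFAIL here points to a negatively cached failure. Some public resolvers refuse non-recursive queries outright, so interpret the result per resolver.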
| Resolver   | Failover Time | Retry Logic                 |
|------------|---------------|-----------------------------|
| Google DNS | ~5 minutes    | Exponential backoff         |
| OpenDNS    | TTL-based     | Persistent SERVFAIL caching |
| Cloudflare | Immediate     | Simultaneous NS queries     |
For optimal resilience:
```
; Zone file snippet
example.com. 86400 IN SOA ns1.example.com. hostmaster.example.com. (
        2023081501 ; serial
        3600       ; refresh (1 hour)
        600        ; retry (10 min)
        1209600    ; expire (2 weeks)
        300 )      ; minimum / negative-caching TTL (5 min)

example.com. 3600 IN NS ns1.example.com.
example.com. 3600 IN NS ns2.example.com.
example.com. 3600 IN NS ns3.example.com.
```
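Before deploying a change like this, validating the zone file catches syntax and consistency errors early; with BIND's checker (the file path is hypothetical):

```bash
# Syntax- and sanity-check the zone before reloading the server
named-checkzone example.com /etc/bind/zones/example.com.db
```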
Key considerations:
- Maintain at least 3 geographically distributed nameservers
- Use identical zone data across all servers (AXFR/IXFR synchronization); the serial check below verifies this
- Monitor resolver behavior using `dnstap` or packet capture
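A simple way to verify that all three servers are serving the same zone data is to compare their SOA serials; a minimal sketch:

```bash
# All nameservers should report the same SOA serial once transfers complete
for ns in ns1.example.com ns2.example.com ns3.example.com; do
    echo "$ns: serial $(dig @"$ns" example.com SOA +short | awk '{print $3}')"
done
```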