DNS Nameserver Fallback Behavior: Investigating Resolution Failures and NS Record Selection Algorithms


When a recursive resolver encounters multiple NS records, its behavior isn't as straightforward as simple round-robin selection. The resolver typically:

  • Maintains internal rankings of nameserver responsiveness
  • Implements various fallback strategies
  • May cache failed attempts beyond the TTL period
# Example showing NS record TTL inspection
dig example.com NS +nocmd +nocomments +nostats
;; ANSWER SECTION:
example.com.        3600    IN  NS  ns1.example.com.
example.com.        3600    IN  NS  ns2.example.com.
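
To get a feel for the responsiveness ranking a resolver keeps internally (often tracked as a smoothed RTT per server), you can time each authoritative server yourself. A minimal sketch using the example nameservers above:

# Compare per-nameserver response times; dig reports them on the "Query time" line
for ns in ns1.example.com ns2.example.com; do
  echo "== $ns"
  dig @"$ns" example.com SOA +stats | grep "Query time"
done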

Several factors contribute to persistent resolution failures:

  • Negative Caching: Some resolvers implement SERVFAIL caching (RFC 2308)
  • Sticky Nameserver Selection: Resolvers often stick with previously successful servers
  • Implementation Differences: OpenDNS handles fallback differently than Google DNS
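
A quick way to see whether a given resolver is returning (and possibly caching) a failure is to inspect the status field in its response header. A sketch using OpenDNS's public address and the example domain:

# Show the response code (NOERROR, SERVFAIL, ...) returned by a specific resolver
dig @208.67.222.222 example.com A | grep "status:"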

While the SOA record's primary nameserver (MNAME) often gets preference, this isn't standardized. Testing shows:

# Checking SOA record (note the MNAME field)
dig example.com SOA +short
ns1.example.com. hostmaster.example.com. 2023081501 3600 1800 604800 3600
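
MNAME preference can only matter if the MNAME host is actually part of the delegation, so it is worth a sanity check. A small sketch against the example zone:

# Confirm the SOA MNAME also appears in the published NS set
MNAME=$(dig example.com SOA +short | awk '{print $1}')
if dig example.com NS +short | grep -Fqx "$MNAME"; then
  echo "MNAME ($MNAME) is listed in the NS set"
else
  echo "MNAME ($MNAME) is NOT in the NS set"
fi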

Major public resolvers exhibit these patterns:

Resolver      Fallback Speed   SOA Preference
-----------   --------------   --------------
Google DNS    ~5 minutes       Moderate
OpenDNS       ~15 minutes      Strong
Cloudflare    ~2 minutes       Weak

Common implementations use variations of these approaches:

// Pseudocode for typical resolver logic
function selectNameserver(nsList, soaMname) {
  if (hasCachedSuccess(nsList)) {
    return getFastestCached(nsList);
  }
  if (implementation === 'bind') {
    return soaMname || nsList[0];
  }
  return shuffled(nsList)[0]; // Some use randomized selection
}

To mitigate these behaviors:

  • Monitor all nameservers independently
  • Consider Anycast implementations for critical DNS
  • Test failure scenarios with various public resolvers
  • Implement DNS health checks that verify resolution from multiple networks
# Health check script example
#!/bin/bash
RESOLVERS=("8.8.8.8" "1.1.1.1" "208.67.222.222")
DOMAIN="example.com"

for resolver in "${RESOLVERS[@]}"; do
  # dig exits 0 even when the answer is empty, so test the output instead
  if [[ -z "$(dig @"$resolver" "$DOMAIN" A +short +time=2 +tries=1)" ]]; then
    echo "FAIL: $resolver" >&2
  fi
done

This matches what we observed in practice. During a recent incident in which our primary nameserver (ns1.example.com) became unavailable, resolution failures from major public resolvers (OpenDNS, Verizon, Earthlink) persisted even after the 1-hour NS record TTL had expired. Manual verification confirmed that the secondary server (ns2.example.com) was operational:

dig @ns2.example.com www.example.com +noall +answer
www.example.com.    300    IN    A    192.0.2.1

Contrary to common assumptions, DNS resolvers don't implement simple round-robin or strict TTL-based failover. Key behaviors observed:

  • SOA Preference: Many resolvers prioritize the server listed first in NS records or the SOA MNAME field
  • Negative Caching: SERVFAIL responses may be cached despite NS TTL expiration (RFC 2308); a polling check is sketched after this list
  • Health Checking:
    /* Pseudocode for resolver health check logic */
    if (last_query_failed && (now - last_failure < retry_interval)) {
        continue_using_alternate_server();
    } else {
        retry_primary_server();
    }
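
To measure how long a negative result actually persists (the polling check referenced in the Negative Caching item), log a resolver's response status over time. A rough sketch; the resolver, name, and interval are arbitrary illustration choices:

# Log a resolver's response status once a minute to see when it recovers
while true; do
  status=$(dig @8.8.8.8 www.example.com A +time=3 +tries=1 | awk -F', ' '/->>HEADER<<-/ {print $2}')
  echo "$(date -u +%H:%M:%S)  ${status:-no response}"
  sleep 60
done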

To diagnose resolver behavior during failures:

# Check NS record propagation
dig +trace example.com NS | awk '$1 == "example.com." && $4 == "NS"'

# Query specific resolver's cache status
dig @8.8.8.8 example.com +norecurse +ttlid

# Bypass resolver caches by querying an authoritative server directly
dig @ns2.example.com example.com A +norecurse

Observed failover behavior by resolver:

Resolver      Failover Time   Retry Logic
-----------   -------------   ---------------------------
Google DNS    ~5 minutes      Exponential backoff
OpenDNS       TTL-based       Persistent SERVFAIL caching
Cloudflare    Immediate       Simultaneous NS queries

For optimal resilience:

; Zone file snippet
example.com. 86400 IN SOA ns1.example.com. hostmaster.example.com. (
    2023081501 ; serial
    3600       ; refresh (1 hour)
    600        ; retry (10 min)
    1209600    ; expire (2 weeks)
    300 )      ; minimum (5 min)

example.com. 3600 IN NS ns1.example.com.
example.com. 3600 IN NS ns2.example.com.
example.com. 3600 IN NS ns3.example.com.

Key considerations:

  • Maintain at least 3 geographically distributed nameservers
  • Use identical zone data across all servers (AXFR/IXFR synchronization); a serial-consistency check is sketched below
  • Monitor resolver behavior using dnstap or packet capture
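
For the zone-data consistency point above, a quick check is to compare the SOA serial returned by every delegated nameserver. A minimal sketch, assuming each server answers authoritatively for the zone:

# Compare the SOA serial served by each delegated nameserver
DOMAIN="example.com"
for ns in $(dig "$DOMAIN" NS +short); do
  serial=$(dig @"$ns" "$DOMAIN" SOA +short | awk '{print $3}')
  echo "$ns serial=${serial:-unreachable}"
done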