DNS Nameserver Fallback Behavior: Investigating Resolution Failures and NS Record Selection Algorithms


When a recursive resolver encounters multiple NS records, its behavior isn't as straightforward as simple round-robin selection. The resolver typically:

  • Maintains internal rankings of nameserver responsiveness
  • Implements various fallback strategies
  • May cache failed attempts beyond the TTL period
# Example showing NS record TTL inspection
dig example.com NS +nocmd +nocomments +nostats
;; ANSWER SECTION:
example.com.        3600    IN  NS  ns1.example.com.
example.com.        3600    IN  NS  ns2.example.com.
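
To get a feel for the responsiveness ranking a resolver keeps internally (often tracked as a smoothed RTT per server), you can time each authoritative server yourself. A minimal sketch using the example nameservers above:

# Compare per-nameserver response times; dig reports them on the "Query time" line
for ns in ns1.example.com ns2.example.com; do
  echo "== $ns"
  dig @"$ns" example.com SOA +stats | grep "Query time"
done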

Several factors contribute to persistent resolution failures:

  • Negative Caching: Some resolvers implement SERVFAIL caching (RFC 2308)
  • Sticky Nameserver Selection: Resolvers often stick with previously successful servers
  • Implementation Differences: OpenDNS handles fallback differently than Google DNS
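
A quick way to see whether a given resolver is returning (and possibly caching) a failure is to inspect the status field in its response header. A sketch using OpenDNS's public address and the example domain:

# Show the response code (NOERROR, SERVFAIL, ...) returned by a specific resolver
dig @208.67.222.222 example.com A | grep "status:"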

While the SOA record's primary nameserver (MNAME) often gets preference, this isn't standardized. Testing shows:

# Checking SOA record (note the MNAME field)
dig example.com SOA +short
ns1.example.com. hostmaster.example.com. 2023081501 3600 1800 604800 3600
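
MNAME preference can only matter if the MNAME host is actually part of the delegation, so it is worth a sanity check. A small sketch against the example zone:

# Confirm the SOA MNAME also appears in the published NS set
MNAME=$(dig example.com SOA +short | awk '{print $1}')
if dig example.com NS +short | grep -Fqx "$MNAME"; then
  echo "MNAME ($MNAME) is listed in the NS set"
else
  echo "MNAME ($MNAME) is NOT in the NS set"
fi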

Major public resolvers exhibit these patterns:

Resolver      Fallback Speed   SOA Preference
-----------   --------------   --------------
Google DNS    ~5 minutes       Moderate
OpenDNS       ~15 minutes      Strong
Cloudflare    ~2 minutes       Weak

Common implementations use variations of these approaches:

// Pseudocode for typical resolver logic
function selectNameserver(nsList, soaMname) {
  if (hasCachedSuccess(nsList)) {
    return getFastestCached(nsList);
  }
  if (implementation === 'bind') {
    return soaMname || nsList[0];
  }
  return shuffled(nsList)[0]; // Some use randomized selection
}

To mitigate these behaviors:

  • Monitor all nameservers independently
  • Consider Anycast implementations for critical DNS
  • Test failure scenarios with various public resolvers
  • Implement DNS health checks that verify resolution from multiple networks
# Health check script example
#!/bin/bash
RESOLVERS=("8.8.8.8" "1.1.1.1" "208.67.222.222")
DOMAIN="example.com"

for resolver in "${RESOLVERS[@]}"; do
  # dig exits 0 even when the answer is empty, so test the output instead
  if [[ -z "$(dig @"$resolver" "$DOMAIN" A +short +time=2 +tries=1)" ]]; then
    echo "FAIL: $resolver" >&2
  fi
done

This matches what we observed in practice. During a recent incident in which our primary nameserver (ns1.example.com) became unavailable, resolution failures from major public resolvers (OpenDNS, Verizon, Earthlink) persisted even after the 1-hour NS record TTL had expired. Manual verification confirmed that the secondary server (ns2.example.com) was operational:

dig @ns2.example.com www.example.com +noall +answer
www.example.com.    300    IN    A    192.0.2.1

Contrary to common assumptions, DNS resolvers don't implement simple round-robin or strict TTL-based failover. Key behaviors observed:

  • SOA Preference: Many resolvers prioritize the server listed first in NS records or the SOA MNAME field
  • Negative Caching: SERVFAIL responses may be cached despite NS TTL expiration (RFC 2308); a polling check is sketched after this list
  • Health Checking:
    /* Pseudocode for resolver health check logic */
    if (last_query_failed && (now - last_failure < retry_interval)) {
        continue_using_alternate_server();
    } else {
        retry_primary_server();
    }
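
To measure how long a negative result actually persists (the polling check referenced in the Negative Caching item), log a resolver's response status over time. A rough sketch; the resolver, name, and interval are arbitrary illustration choices:

# Log a resolver's response status once a minute to see when it recovers
while true; do
  status=$(dig @8.8.8.8 www.example.com A +time=3 +tries=1 | awk -F', ' '/->>HEADER<<-/ {print $2}')
  echo "$(date -u +%H:%M:%S)  ${status:-no response}"
  sleep 60
done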

To diagnose resolver behavior during failures:

# Check NS record propagation
dig +trace example.com NS | awk '$1 == "example.com." && $4 == "NS"'

# Query specific resolver's cache status
dig @8.8.8.8 example.com +norecurse +ttlid

# Bypass resolver caches by querying an authoritative server directly
dig @ns2.example.com example.com A +norecurse

Observed failover behavior by resolver:

Resolver      Failover Time   Retry Logic
-----------   -------------   ---------------------------
Google DNS    ~5 minutes      Exponential backoff
OpenDNS       TTL-based       Persistent SERVFAIL caching
Cloudflare    Immediate       Simultaneous NS queries

For optimal resilience:

; Zone file snippet
example.com. 86400 IN SOA ns1.example.com. hostmaster.example.com. (
    2023081501 ; serial
    3600       ; refresh (1 hour)
    600        ; retry (10 min)
    1209600    ; expire (2 weeks)
    300 )      ; minimum (5 min)

example.com. 3600 IN NS ns1.example.com.
example.com. 3600 IN NS ns2.example.com.
example.com. 3600 IN NS ns3.example.com.

Key considerations:

  • Maintain at least 3 geographically distributed nameservers
  • Use identical zone data across all servers (AXFR/IXFR synchronization); a serial-consistency check is sketched below
  • Monitor resolver behavior using dnstap or packet capture
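
For the zone-data consistency point above, a quick check is to compare the SOA serial returned by every delegated nameserver. A minimal sketch, assuming each server answers authoritatively for the zone:

# Compare the SOA serial served by each delegated nameserver
DOMAIN="example.com"
for ns in $(dig "$DOMAIN" NS +short); do
  serial=$(dig @"$ns" "$DOMAIN" SOA +short | awk '{print $3}')
  echo "$ns serial=${serial:-unreachable}"
done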