Implementing 100% Uptime Web Applications: DNS Failover Strategies for External Traffic During On-Premise Outages


When clients demand "100% uptime," experienced engineers immediately recognize this as a miscommunication about fault tolerance rather than an absolute requirement. In our case, the client wants continuous external access even during catastrophic on-premise failures (floods, power outages, etc.), while acknowledging internal access limitations.

The core challenge lies in DNS propagation delays during failover events. Traditional approaches rely on aggressively low TTLs in the zone configuration:

; Basic DNS zone with an aggressively low TTL
$ORIGIN example.com.
$TTL 60    ; default record TTL (60 seconds, deliberately low for fast failover)
@ IN SOA ns1.example.com. admin.example.com. (
  2023081501 ; serial
  3600       ; refresh
  600        ; retry
  86400      ; expire
  60         ; negative caching TTL
)

Even with aggressive TTL settings (60 seconds in this example), real-world DNS caching often causes longer delays: some resolvers enforce their own minimum TTLs, and operating systems and browsers cache lookups independently.
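
To see what a resolver is actually serving, query the record and inspect the remaining TTL directly. A minimal sketch using the dnspython package (the domain is a placeholder):

# Check the TTL a resolver currently reports for a record
# (requires `pip install dnspython`; the domain is a placeholder)
import dns.resolver

resolver = dns.resolver.Resolver()
answer = resolver.resolve("www.example.com", "A")

# rrset.ttl is the remaining TTL the resolver reports; a cached answer
# shows a value lower than the authoritative TTL in the zone file.
print("Addresses:", [rdata.address for rdata in answer])
print("Remaining TTL reported by resolver:", answer.rrset.ttl, "seconds")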

For a truly seamless transition, consider these architectural approaches:

1. Global Server Load Balancing (GSLB)

// Illustrative GSLB health check configuration (simplified F5 BIG-IP-style syntax)
ltm dns gslb wideip example.com {
  pools {
    primary_pool {
      members {
        primary-server:80 {
          address 192.0.2.1
        }
        secondary-server:80 {
          address 203.0.113.1
        }
      }
      monitor http
    }
  }
  topology {
    region "NA" {
      server primary-server
      alternate secondary-server
    }
  }
}

2. Anycast Routing

Example BGP configuration for anycast (every site announces the same prefix, so traffic reaches the nearest site that is still advertising it):

! FRR/Quagga-style syntax; repeat at each site that should attract anycast traffic
router bgp 65530
 neighbor 192.0.2.2 remote-as 65530
 network 203.0.113.0/24
exit

A practical architecture might combine:

  • DNS-based failover for regional outages
  • Anycast for immediate traffic redirection
  • Cloudflare Argo Smart Routing for optimized paths

Solution costs scale with desired recovery time:

Solution          Approximate Cost    Failover Time
Basic DNS         $                   5-60 min
GSLB              $$$                 30-120 sec
Anycast + GSLB    $$$$                Instant

For this specific scenario where the client insists on external continuity during on-premise failures:

  1. Implement multi-cloud deployment with GSLB
  2. Use persistent storage solutions like Amazon S3 Cross-Region Replication
  3. Establish automated health checks with sub-second intervals (a minimal sketch follows below)
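
Managed health checks such as Route 53's run at 10- or 30-second intervals, so sub-second detection generally means running your own probe. A minimal sketch of such a watchdog in Python, with a hypothetical health URL and a placeholder failover hook:

# Sub-second health-check watchdog (sketch); URL, thresholds, and the
# failover hook are placeholders to adapt to your environment.
import time
import urllib.request

def is_healthy(url, timeout=0.5):
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def trigger_failover():
    # Placeholder: call your DNS/GSLB API here (e.g. the Route 53 change
    # batch shown later) or withdraw the anycast route.
    print("Primary unhealthy - triggering failover")

def watch(url, interval=0.5, failures_needed=3):
    consecutive_failures = 0
    while True:
        if is_healthy(url):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_needed:
                trigger_failover()
                consecutive_failures = 0
        time.sleep(interval)

if __name__ == "__main__":
    watch("https://primary.example.com/health")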

To address the elephant in the room: true 100% uptime is not achievable in practice. Even major cloud providers like AWS and Google Cloud offer a 99.99% SLA for many services, which still allows for about 52 minutes of downtime per year. However, we can architect systems that approach 100% availability through intelligent failover mechanisms.
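
For a sense of scale, the downtime each SLA tier permits per year can be computed directly:

# Allowed downtime per year for common SLA tiers (back-of-the-envelope)
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for sla in (99.9, 99.99, 99.999):
    allowed = MINUTES_PER_YEAR * (1 - sla / 100)
    print(f"{sla}% uptime allows ~{allowed:.1f} minutes of downtime per year")

# 99.9%   -> ~525.6 minutes (~8.8 hours)
# 99.99%  -> ~52.6 minutes
# 99.999% -> ~5.3 minutes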

The core challenge isn't the application architecture (which you've already designed for horizontal scaling), but rather the networking layer. Here are some proven approaches:

#!/bin/bash
# Example of a health check script for automatic failover
PRIMARY_ENDPOINT="https://primary.example.com"
SECONDARY_ENDPOINT="https://failover.example.com"

# Probe the primary's health endpoint; on failure, push a pre-built
# Route 53 change batch (failover.json) that repoints DNS at the standby.
if curl --silent --fail --max-time 5 "$PRIMARY_ENDPOINT/health"; then
    echo "Primary is healthy"
    exit 0
else
    echo "Primary down, failing over"
    aws route53 change-resource-record-sets \
        --hosted-zone-id Z1PA6795UKMFR9 \
        --change-batch file://failover.json
    exit 1
fi
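
The failover.json change batch referenced above simply UPSERTs the public record so it points at the standby address. A minimal sketch of the equivalent call via boto3; the zone ID, record name, and standby IP are placeholders:

# Repoint a Route 53 record at the standby address (sketch; all values are placeholders)
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z1PA6795UKMFR9",  # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "Fail over to secondary site",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "203.0.113.1"}],  # standby IP
                },
            }
        ],
    },
)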

While DNS failover (like Route53) can work, it's not instantaneous due to:

  • TTL propagation delays (even with low TTLs)
  • Client-side DNS caching
  • ISP-level DNS caching

A better approach combines DNS with other techniques:

// Example of client-side failover detection
const endpoints = [
  'https://primary.api.example.com',
  'https://secondary.api.example.com'
];

// Try each endpoint in order, making up to `retries` passes over the list
// before giving up.
async function fetchWithFailover(urls, options, retries = 3) {
  for (let attempt = 0; attempt < retries; attempt++) {
    for (const url of urls) {
      try {
        const response = await fetch(url, options);
        if (response.ok) return response;
        console.warn(`Endpoint ${url} returned ${response.status}, trying next`);
      } catch (e) {
        console.warn(`Endpoint ${url} failed, trying next`);
      }
    }
  }
  throw new Error('All endpoints failed');
}

// Usage: const res = await fetchWithFailover(endpoints, { method: 'GET' });

Enterprise solutions like:

  • F5 BIG-IP DNS
  • Citrix ADC
  • Cloudflare Load Balancing

can provide sub-second failover through:

  1. Continuous health monitoring
  2. Anycast IP routing
  3. BGP route propagation (withdrawing the route when health checks fail, as sketched below)
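
One common open-source pattern for the anycast piece uses ExaBGP, which reads announce/withdraw commands from a helper process's stdout; a minimal sketch of such a helper, with placeholder prefix and health URL:

# ExaBGP helper process (sketch): announce the anycast prefix while the local
# service is healthy, withdraw it when health checks fail so BGP reroutes
# traffic to another site. Prefix and health URL are placeholders.
import sys
import time
import urllib.request

PREFIX = "203.0.113.0/24"
HEALTH_URL = "http://127.0.0.1/health"
announced = False

def healthy():
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=1) as resp:
            return resp.status == 200
    except Exception:
        return False

def send(command):
    # ExaBGP reads routing commands from this process's stdout
    sys.stdout.write(command + "\n")
    sys.stdout.flush()

while True:
    ok = healthy()
    if ok and not announced:
        send(f"announce route {PREFIX} next-hop self")
        announced = True
    elif not ok and announced:
        send(f"withdraw route {PREFIX} next-hop self")
        announced = False
    time.sleep(0.5)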

As you correctly noted, internal users present an unsolvable challenge during complete site outages. The only realistic options are:

  • VPN failover to external endpoints
  • SD-WAN solutions with automatic path selection
  • Starlink as backup internet (for physical location disasters)

For maximum availability:

# Example multi-region deployment with Terraform
# (a real multi-region setup would give the secondary resources a provider
#  alias for the second region and a region-appropriate AMI)
resource "aws_instance" "primary" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.large"
  subnet_id     = aws_subnet.primary.id
  count         = 3
}

resource "aws_instance" "secondary" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.large"
  subnet_id     = aws_subnet.secondary.id
  count         = 3
}

# Route 53 health checkers probe from the public internet, so the target
# address must be publicly reachable; attach this check to a failover
# routing policy record so DNS shifts automatically when it fails.
resource "aws_route53_health_check" "primary" {
  ip_address        = aws_instance.primary[0].public_ip
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  failure_threshold = "2"
  request_interval  = "30"
}

When clients insist on 100% uptime, educate them about:

  • The law of diminishing returns (99.99% vs 99.999% costs)
  • Single points of failure in their own infrastructure
  • The actual business impact of minutes vs hours of downtime

Ultimately, the most reliable solution is to host in a cloud provider with multiple availability zones and let them handle the infrastructure redundancy.