When clients demand "100% uptime," experienced engineers immediately recognize this as a miscommunication about fault tolerance rather than an absolute requirement. In our case, the client wants continuous external access even during catastrophic on-premise failures (floods, power outages, etc.), while acknowledging internal access limitations.
The core challenge lies in DNS propagation delays during failover events. The traditional mitigation is to publish records with very low TTLs:
; Basic low-TTL zone configuration example
$ORIGIN example.com.
$TTL 60 ; default record TTL of 60 seconds
@ IN SOA ns1.example.com. admin.example.com. (
    2023081501 ; serial
    3600       ; refresh
    600        ; retry
    86400      ; expire
    60         ; negative-caching TTL (SOA "minimum")
)
Even with aggressive TTL settings (60 seconds in this example), real-world DNS caching often causes longer delays.
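You can watch this yourself by querying a public resolver and observing how the cached answer's TTL counts down; a stale entry can outlive the 60 seconds you configured. A quick check (assuming dig is installed and using 1.1.1.1 as the resolver):
# Observe the remaining TTL a resolver is serving (second column of the answer)
dig +noall +answer example.com A @1.1.1.1
sleep 30
dig +noall +answer example.com A @1.1.1.1   # TTL should have counted down, not reset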
For true seamless transition, consider these architectural approaches:
1. Global Server Load Balancing (GSLB)
// Illustrative GSLB wide-IP configuration (simplified F5 BIG-IP DNS/GTM-style
// syntax; exact object names vary by TMOS version, so treat this as a sketch)
gtm wideip a example.com {
    pools {
        primary_pool {
            members {
                primary-server:80 {
                    address 192.0.2.1
                }
                secondary-server:80 {
                    address 203.0.113.1
                }
            }
            monitor http
        }
    }
    topology {
        region "NA" {
            server primary-server
            alternate secondary-server
        }
    }
}
2. Anycast Routing
Example BGP configuration for anycast:
! Illustrative anycast advertisement (FRR/IOS-style syntax); every site
! announces the same 203.0.113.0/24 prefix, so clients reach the nearest healthy one
router bgp 65530
 neighbor 192.0.2.2 remote-as 65530
 network 203.0.113.0/24
exit
A practical architecture might combine:
- DNS-based failover for regional outages
- Anycast for immediate traffic redirection
- Cloudflare Argo Smart Routing for optimized paths
Solution costs scale with desired recovery time:
| Solution       | Approximate Cost | Typical Failover Time  |
| -------------- | ---------------- | ---------------------- |
| Basic DNS      | $                | 5-60 min               |
| GSLB           | $$$              | 30-120 sec             |
| Anycast + GSLB | $$$$             | Near-instant (seconds) |
For this specific scenario, where the client insists on external continuity during on-premise failures:
- Implement a multi-cloud deployment fronted by GSLB
- Use replicated storage such as Amazon S3 Cross-Region Replication (a sketch follows this list)
- Run automated health checks at aggressive intervals (a few seconds is realistic for load-balancer probes; DNS-based checks are coarser)
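As a rough sketch of the storage-replication piece (bucket names, account ID, and IAM role ARN below are placeholders, and both buckets must have versioning enabled):
# Enable S3 Cross-Region Replication from the primary bucket to a DR bucket
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
  "Rules": [
    {
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::example-dr-bucket" }
    }
  ]
}
EOF
aws s3api put-bucket-replication \
  --bucket example-primary-bucket \
  --replication-configuration file://replication.json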
Let's address the elephant in the room first: true 100% uptime is mathematically impossible. Even major cloud providers like AWS and Google Cloud offer 99.99% SLAs, which still allows for about 52 minutes of downtime per year. However, we can architect systems that approach 100% availability through intelligent failover mechanisms.
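The arithmetic is worth showing clients explicitly; for a non-leap year of 525,600 minutes (assuming bc is installed):
# Allowed downtime per year for common availability targets
for target in 99.9 99.99 99.999; do
  minutes=$(echo "525600 * (100 - $target) / 100" | bc -l)
  printf '%s%% availability -> %.1f minutes of downtime per year\n' "$target" "$minutes"
done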
The core challenge isn't the application architecture (which you've already designed for horizontal scaling), but rather the networking layer. Here are some proven approaches:
#!/bin/bash
# Example health check script for automatic failover: if the primary
# endpoint stops answering, push a Route 53 change batch that repoints
# DNS at the secondary (see failover.json below)
PRIMARY_ENDPOINT="primary.example.com"
SECONDARY_ENDPOINT="failover.example.com"

if curl --silent --fail --max-time 5 "http://${PRIMARY_ENDPOINT}/health"; then
    echo "Primary is healthy"
    exit 0
else
    echo "Primary down, failing over to ${SECONDARY_ENDPOINT}"
    aws route53 change-resource-record-sets \
        --hosted-zone-id Z1PA6795UKMFR9 \
        --change-batch file://failover.json
    exit 1
fi
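One plausible shape for the failover.json change batch referenced above (the record name app.example.com is hypothetical, and UPSERT creates the record if it does not yet exist):
# Repoint the public name at the secondary site
cat > failover.json <<'EOF'
{
  "Comment": "Fail over to the secondary endpoint",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "failover.example.com" }
        ]
      }
    }
  ]
}
EOF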
While DNS failover (like Route53) can work, it's not instantaneous due to:
- TTL propagation delays (even with low TTLs)
- Client-side DNS caching
- ISP-level DNS caching
A better approach combines DNS with other techniques:
// Example of client-side failover detection
const endpoints = [
  'https://primary.api.example.com',
  'https://secondary.api.example.com'
];

// Try each endpoint in order; cycle through the list up to `retries` times
async function fetchWithFailover(urls, options, retries = 3) {
  for (let attempt = 0; attempt < retries; attempt++) {
    for (const url of urls) {
      try {
        const response = await fetch(url, options);
        if (response.ok) return response;
        console.warn(`Endpoint ${url} returned ${response.status}, trying next`);
      } catch (e) {
        console.warn(`Endpoint ${url} unreachable, trying next`);
      }
    }
  }
  throw new Error('All endpoints failed');
}
Enterprise solutions such as F5 BIG-IP DNS, Citrix ADC, and Cloudflare Load Balancing can provide sub-second failover through:
- Continuous health monitoring
- Anycast IP routing
- BGP route propagation
As you correctly noted, internal users remain a problem that no architecture fully solves during a complete site outage. The most realistic mitigations are:
- VPN failover to external endpoints
- SD-WAN solutions with automatic path selection
- Starlink as backup internet (for physical location disasters)
For maximum availability:
# Example redundant deployment with Terraform (two subnets shown here;
# a true multi-region setup would also need a provider alias per region)
resource "aws_instance" "primary" {
  count         = 3
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.large"
  subnet_id     = aws_subnet.primary.id
}

resource "aws_instance" "secondary" {
  count         = 3
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.large"
  subnet_id     = aws_subnet.secondary.id
}

# Route 53 health checkers probe from the public internet, so the target
# address must be publicly reachable (not a private IP)
resource "aws_route53_health_check" "primary" {
  ip_address        = aws_instance.primary[0].public_ip
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  failure_threshold = "2"
  request_interval  = "30"
}
When clients insist on 100% uptime, educate them about:
- The law of diminishing returns (99.99% vs 99.999% costs)
- Single points of failure in their own infrastructure
- The actual business impact of minutes vs hours of downtime
Ultimately, the most reliable option is to host with a cloud provider across multiple availability zones (or regions) and let them handle the infrastructure redundancy.