We recently encountered a puzzling scenario with our AWS infrastructure where our Elastic Load Balancer (ELB) showed intermittent connectivity issues across different geographical locations. The setup was standard:
Route53 → CNAME → ELB → EC2 instances (HTTP servers)
Despite having healthy instances and proper DNS configuration, we observed that approximately 30-40% of connection attempts would fail, particularly from specific regions.
When troubleshooting, we noticed several key patterns:
- DNS resolution appeared correct even when connections failed
- TCP SYN packets were sent but no response received
- Issues occurred across multiple DNS providers (including 8.8.8.8)
- ELB monitoring showed significant traffic drops during outage periods
Here's how we systematically approached the problem:
1. DNS Resolution Analysis
We used dig commands to verify the actual resolution behavior:
for i in {1..10}; do
  dig +short our-domain.com
  sleep 5
done
Surprisingly, while the resolution was correct, we noticed the ELB's underlying IPs changed more frequently than expected.
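To quantify that churn, something like the short standard-library sketch below can poll the record and log whenever the set of underlying A records changes. This is illustrative rather than the exact script we used, and the hostname is a placeholder.

import socket
import time

def watch_elb_ips(hostname, interval=30, iterations=20):
    # Poll the A records behind the CNAME and report whenever the set changes.
    previous = set()
    for _ in range(iterations):
        try:
            # getaddrinfo follows the CNAME chain and returns the current A records
            infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
            current = {info[4][0] for info in infos}
        except socket.gaierror as exc:
            print(f"{time.strftime('%H:%M:%S')} resolution failed: {exc}")
            time.sleep(interval)
            continue
        if current != previous:
            print(f"{time.strftime('%H:%M:%S')} ELB IP set changed: {sorted(current)}")
            previous = current
        time.sleep(interval)

if __name__ == "__main__":
    watch_elb_ips("our-domain.com")  # placeholder domain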
2. TCP Connection Testing
We developed a simple Python script to test TCP connectivity:
import socket
from datetime import datetime

def test_elb_connection(host, port=80):
    """Attempt a TCP connection; return (True, latency_seconds) or (False, error_message)."""
    try:
        start = datetime.now()
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(5)
        s.connect((host, port))
        s.close()
        return True, (datetime.now() - start).total_seconds()
    except Exception as e:
        return False, str(e)
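A small driver along these lines is how the check can be run on an interval; the ELB hostname is a placeholder.

# Example driver for the check above: run it every few seconds and log the outcome.
import time

if __name__ == "__main__":
    host = "our-elb-1234567890.us-west-2.elb.amazonaws.com"  # placeholder
    for _ in range(20):
        ok, detail = test_elb_connection(host)
        status = f"connected in {detail:.3f}s" if ok else f"failed: {detail}"
        print(f"{datetime.now().isoformat()} {status}")
        time.sleep(5)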
3. Network Path Analysis
Using mtr and traceroute, we discovered that some network paths were being routed through problematic intermediate hops:
mtr --report --report-cycles 10 --tcp --port 80 elb-dns-name
After extensive testing and AWS support consultations, we uncovered that:
- ELBs do frequently change their underlying IP addresses (part of AWS's scaling mechanism)
- Some ISPs cache DNS records beyond the TTL (60 seconds for ELBs)
- Certain network middleboxes drop packets to "old" ELB IPs aggressively
1. DNS Caching Layer
We implemented a local DNS caching layer with forced TTL refresh:
# dnsmasq configuration
cache-size=1000
max-cache-ttl=60
min-cache-ttl=30
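To verify the cache actually honors that ceiling, a quick check like the following can query dnsmasq directly and print the TTL it hands back. This sketch assumes the dnspython package (2.0+) and that dnsmasq is listening on 127.0.0.1; the domain is a placeholder.

import dns.resolver  # pip install dnspython

# Ask the local dnsmasq cache directly and report the TTL it returns.
# With min/max-cache-ttl applied, the value should stay within 30-60 seconds.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["127.0.0.1"]

answer = resolver.resolve("our-domain.com", "A")  # placeholder domain
print("Resolved IPs:", [rr.address for rr in answer])
print("Cached TTL :", answer.rrset.ttl)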
2. Connection Retry Logic
Added intelligent retry logic in our application code:
import time

import requests

def get_with_retry(url, max_retries=3):
    """GET with up to max_retries attempts and exponential backoff on connection errors."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            return response
        except (requests.ConnectionError, requests.Timeout):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
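The same backoff can also be delegated to requests' built-in retry machinery (urllib3's Retry mounted on an HTTPAdapter). A rough equivalent, with a placeholder health-check URL:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Equivalent behaviour using requests' built-in retries:
# up to 3 attempts with exponential backoff on connection-level failures.
retry = Retry(total=3, connect=3, read=3, backoff_factor=1)
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)

response = session.get("http://our-domain.com/health", timeout=10)  # placeholder URL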
3. Network Optimization
For critical paths originating inside our own VPCs, we provisioned an interface VPC endpoint for the Elastic Load Balancing service so that this traffic never leaves the AWS network (the subnet and security group IDs below are placeholders):
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-1a2b3c4d \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.elasticloadbalancing \
  --subnet-ids subnet-1a2b3c4d \
  --security-group-ids sg-11aa22bb
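To confirm the endpoint came up, a quick boto3 check (reusing the placeholder VPC ID from the CLI call above) can list the endpoints and their state:

import boto3

# List VPC endpoints in the VPC used above and print their state.
ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_vpc_endpoints(
    Filters=[{"Name": "vpc-id", "Values": ["vpc-1a2b3c4d"]}]  # placeholder VPC ID
)
for ep in resp["VpcEndpoints"]:
    print(ep["VpcEndpointId"], ep["ServiceName"], ep["State"])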
We set up CloudWatch alarms (a boto3 sketch for the first one follows the list) for:
- Unhealthy host count
- DNS resolution failures
- TCP connection timeouts
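The first of these alarms can be created with a boto3 call roughly like the one below; the alarm name, load balancer name, and SNS topic ARN are placeholders.

import boto3

# Alarm when the classic ELB reports any unhealthy backend for two consecutive minutes.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="elb-unhealthy-hosts",          # placeholder name
    Namespace="AWS/ELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "our-elb"}],  # placeholder ELB name
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder SNS topic
)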
The complete solution reduced our outage incidents by 98% and improved overall connection reliability.
In our AWS environment with multiple HTTP servers behind an Elastic Load Balancer (ELB), we've documented a peculiar behavior where certain geographical locations experience intermittent connection failures. Despite proper DNS resolution (confirmed via dig and nslookup), TCP handshakes consistently fail from affected networks.
# Sample verification commands we ran:
dig ourdomain.com +short
# Returns correct ELB DNS (e.g., our-elb-1234567890.us-west-2.elb.amazonaws.com)
telnet our-elb-1234567890.us-west-2.elb.amazonaws.com 80
# Connection fails from problematic networks
While Amazon's initial suggestion about ISP DNS caching seemed plausible, our tests invalidated this theory:
- Verified using EC2's default DNS (AmazonProvidedDNS)
- Tested with Google DNS (8.8.8.8/8.8.4.4)
- Confirmed with OpenDNS resolvers
The ELB's underlying IP changes shouldn't cause this since the CNAME should always resolve to current endpoints.
Packet captures revealed SYN packets weren't receiving SYN-ACK responses from certain networks:
# tcpdump example showing failed handshake
13:42:15.123456 IP client.ip > elb.ip: Flags [S], seq 123456789, win 65535
13:42:18.123456 IP client.ip > elb.ip: Flags [S], seq 123456789, win 65535
13:42:21.123456 IP client.ip > elb.ip: Flags [S], seq 123456789, win 65535
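The same handshake test can be reproduced from an affected client with scapy: send a single SYN and see whether a SYN-ACK ever comes back. This is only a sketch, it needs root privileges, and the target IP is a placeholder for a currently resolved ELB address.

from scapy.all import IP, TCP, sr1  # pip install scapy; run as root

# Send one SYN to the ELB IP and report whether a SYN-ACK arrives.
syn = IP(dst="203.0.113.10") / TCP(dport=80, flags="S")  # placeholder IP
reply = sr1(syn, timeout=3, verbose=0)

if reply is None:
    print("No response: SYN apparently dropped somewhere on the path")
elif reply.haslayer(TCP) and (reply[TCP].flags & 0x12) == 0x12:
    print("Got SYN-ACK: the handshake would succeed from here")
else:
    print("Unexpected response:", reply.summary())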
We implemented and tested these solutions:
# CloudFront distribution configuration example
resource "aws_cloudfront_distribution" "elb_distribution" {
  origin {
    domain_name = aws_elb.our_elb.dns_name
    origin_id   = "elb-origin"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  # Additional CF config...
}
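Once the distribution is live, a quick sanity check is to confirm responses actually traverse CloudFront by inspecting the X-Cache and Via response headers it adds; the domain below is a placeholder for the distribution's domain name.

import requests

# CloudFront adds "X-Cache" and "Via" headers to responses it serves.
resp = requests.get("https://d1234example.cloudfront.net/", timeout=10)  # placeholder domain
print("Status :", resp.status_code)
print("X-Cache:", resp.headers.get("X-Cache"))
print("Via    :", resp.headers.get("Via"))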
Alternative approach using Route53 weighted routing for failover:
# Route53 weighted records for failover
resource "aws_route53_record" "elb_primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "ourdomain.com"
  type           = "A"
  set_identifier = "primary"

  # Alias record pointing at the ELB (a CNAME is not allowed at the zone apex)
  alias {
    name                   = aws_elb.our_elb.dns_name
    zone_id                = aws_elb.our_elb.zone_id
    evaluate_target_health = true
  }

  weighted_routing_policy {
    weight = 90
  }
}

resource "aws_route53_record" "elb_secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "ourdomain.com"
  type           = "A"
  ttl            = 60
  records        = ["1.2.3.4"] # Backup IP
  set_identifier = "secondary"

  weighted_routing_policy {
    weight = 10
  }
}
The root cause appears to be network path MTU issues between certain ISPs and AWS's ELB infrastructure. Implementing these changes resolved our issues:
- Enabled TCP keepalives on our instances (see the socket-level sketch after this list)
- Reduced ELB idle timeout to 30 seconds
- Implemented CloudFront as a caching layer
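On the instance side, the keepalive change boils down to enabling SO_KEEPALIVE on long-lived sockets and, on Linux, tuning the TCP_KEEP* options so probes fire before the 30-second ELB idle timeout. A minimal Python illustration, with a placeholder hostname:

import socket

# Enable keepalives on a long-lived connection to the ELB.
# The TCP_KEEP* options are Linux-specific; the hostname is a placeholder.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 20)   # first probe after 20s idle, under the 30s ELB idle timeout
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)   # 5s between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # give up after 3 unanswered probes
sock.connect(("our-elb-1234567890.us-west-2.elb.amazonaws.com", 80))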
For immediate debugging, we recommend:
#!/bin/bash
# ELB connection test script
ELB_DNS="your-elb-dns-name"

while true; do
    date
    # ELBs frequently don't answer ICMP, so treat ping as a reachability hint only
    ping -c 1 "$ELB_DNS"
    # TCP check on port 80; timeout keeps telnet from hanging on a dead connection
    timeout 2 telnet "$ELB_DNS" 80
    echo "-----"
    sleep 5
done