Troubleshooting ELB DNS Resolution and TCP Connection Failures in AWS: A Deep Dive


We recently encountered a puzzling scenario with our AWS infrastructure where our Elastic Load Balancer (ELB) showed intermittent connectivity issues across different geographical locations. The setup was standard:

Route53 → CNAME → ELB → EC2 instances (HTTP servers)

Despite having healthy instances and proper DNS configuration, we observed that approximately 30-40% of connection attempts would fail, particularly from specific regions.

When troubleshooting, we noticed several key patterns:

  • DNS resolution appeared correct even when connections failed
  • TCP SYN packets were sent but no response received
  • Issues occurred across multiple DNS providers (including 8.8.8.8)
  • ELB monitoring showed significant traffic drops during outage periods

Here's how we systematically approached the problem:

1. DNS Resolution Analysis

We used dig commands to verify the actual resolution behavior:

for i in {1..10}; do 
  dig +short our-domain.com
  sleep 5
done

Surprisingly, while the resolution was correct, we noticed the ELB's underlying IPs changed more frequently than expected.
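
To make that churn visible, a small resolver loop along these lines can log every distinct address the name resolves to (a sketch; our-domain.com is the placeholder domain used above, and the interval and round count are arbitrary):

import socket
import time

SEEN = set()

def log_resolutions(hostname, interval=5, rounds=60):
    """Resolve hostname repeatedly and print any address not seen before."""
    for _ in range(rounds):
        try:
            # getaddrinfo returns every A record the resolver currently hands out
            addrs = {info[4][0] for info in socket.getaddrinfo(hostname, 80, socket.AF_INET)}
        except socket.gaierror as e:
            print(f"resolution failed: {e}")
            addrs = set()
        new = addrs - SEEN
        if new:
            print(f"new ELB addresses: {sorted(new)}")
            SEEN.update(new)
        time.sleep(interval)

log_resolutions("our-domain.com")  # placeholder domain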

2. TCP Connection Testing

We developed a simple Python script to test TCP connectivity:

import socket
from datetime import datetime

def test_elb_connection(host, port=80):
    try:
        start = datetime.now()
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(5)
        s.connect((host, port))
        s.close()
        return True, (datetime.now() - start).total_seconds()
    except Exception as e:
        return False, str(e)
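
A minimal driver for that function might look like the following (the hostname is the placeholder domain used above; probing every 5 seconds is arbitrary):

import time

# Hypothetical driver: probe the ELB every 5 seconds and print the outcome
if __name__ == "__main__":
    host = "our-domain.com"  # placeholder
    while True:
        ok, detail = test_elb_connection(host)
        status = "OK" if ok else "FAIL"
        print(f"{status}: {detail}")  # detail is latency in seconds or the error text
        time.sleep(5)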

3. Network Path Analysis

Using mtr and traceroute, we discovered that some network paths were being routed through problematic intermediate hops:

mtr --report --report-cycles 10 --tcp --port 80 elb-dns-name

After extensive testing and AWS support consultations, we uncovered that:

  • ELBs do frequently change their underlying IP addresses (part of AWS's scaling mechanism)
  • Some ISPs cache DNS records beyond the TTL (60 seconds for ELBs); a quick check for this is sketched after this list
  • Certain network middleboxes drop packets to "old" ELB IPs aggressively
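
For the caching point in particular, a resolver that honors the record should never report more than 60 seconds of remaining TTL. A quick check (a sketch that assumes the third-party dnspython package; the domain and resolver IP are placeholders):

import dns.resolver  # third-party package: dnspython

def check_remaining_ttl(name, resolver_ip, limit=60):
    """Flag a resolver that reports more remaining TTL than the ELB's 60-second record allows."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [resolver_ip]
    answer = resolver.resolve(name, "A")
    ttl = answer.rrset.ttl  # remaining TTL as reported by that resolver
    addrs = sorted(r.address for r in answer)
    print(f"{resolver_ip}: TTL={ttl}s addresses={addrs}")
    if ttl > limit:
        print(f"warning: {resolver_ip} is serving this record beyond the {limit}s TTL")

check_remaining_ttl("our-domain.com", "8.8.8.8")  # placeholder domain, Google resolver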

Based on these findings, we implemented three mitigations.

1. DNS Caching Layer

We implemented a local DNS caching layer (dnsmasq) that never holds records past the ELB's 60-second TTL:

# dnsmasq configuration
# cache up to 1000 names, never hold an answer past the ELB's 60s TTL,
# and keep answers at least 30s to absorb bursts of lookups
cache-size=1000
max-cache-ttl=60
min-cache-ttl=30

2. Connection Retry Logic

Added intelligent retry logic in our application code:

import time

import requests

def get_with_retry(url, max_retries=3):
    """Fetch a URL, retrying transient connection failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            return response
        except (requests.ConnectionError, requests.Timeout):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s
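
Callers then use it in place of a bare requests.get call, for example (the URL is a placeholder):

# Retries transient ELB connection failures before surfacing the error
response = get_with_retry("http://our-domain.com/")
print(response.status_code)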

3. Network Optimization

For critical paths originating inside AWS, we kept calls to the Elastic Load Balancing service on the AWS network by creating an interface VPC endpoint:

aws ec2 create-vpc-endpoint \
    --vpc-endpoint-type Interface \
    --vpc-id vpc-1a2b3c4d \
    --service-name com.amazonaws.us-east-1.elasticloadbalancing \
    --subnet-ids subnet-1a2b3c4d \
    --security-group-ids sg-1a2b3c4d

We set up CloudWatch alarms for:

  • Unhealthy host count (an example alarm definition follows this list)
  • DNS resolution failures
  • TCP connection timeouts
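
As an illustration of the first alarm, a boto3 sketch might look like this (the alarm name, load balancer name, and SNS topic ARN are placeholders):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when any instance behind the Classic ELB is unhealthy for 3 consecutive minutes
cloudwatch.put_metric_alarm(
    AlarmName="elb-unhealthy-hosts",                                # placeholder name
    Namespace="AWS/ELB",                                            # Classic ELB metric namespace
    MetricName="UnHealthyHostCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "our-elb"}],  # placeholder ELB name
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"], # placeholder SNS topic
)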

The complete solution reduced our outage incidents by 98% and improved overall connection reliability.


In our AWS environment with multiple HTTP servers behind an Elastic Load Balancer (ELB), we've documented a peculiar behavior where certain geographical locations experience intermittent connection failures. Despite proper DNS resolution (confirmed via dig and nslookup), TCP handshakes fail consistently from affected networks.

# Sample verification commands we ran:
dig ourdomain.com +short
# Returns correct ELB DNS (e.g., our-elb-1234567890.us-west-2.elb.amazonaws.com)

telnet our-elb-1234567890.us-west-2.elb.amazonaws.com 80
# Connection fails from problematic networks

While Amazon's initial suggestion about ISP DNS caching seemed plausible, our tests invalidated this theory:

  • Verified using EC2's default DNS (AmazonProvidedDNS)
  • Tested with Google DNS (8.8.8.8/8.8.4.4)
  • Confirmed with OpenDNS resolvers

The ELB's underlying IP changes shouldn't cause this since the CNAME should always resolve to current endpoints.
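
One way to reproduce that resolver comparison in a repeatable form (a sketch assuming the third-party dnspython package; the domain is a placeholder):

import dns.resolver  # third-party package: dnspython

RESOLVERS = {
    "Google": "8.8.8.8",
    "Google-2": "8.8.4.4",
    "OpenDNS": "208.67.222.222",
}

def compare_resolvers(name):
    """Print the A records each public resolver currently returns for name."""
    for label, ip in RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        try:
            answer = resolver.resolve(name, "A")
            addrs = sorted(r.address for r in answer)
        except Exception as e:
            addrs = [f"lookup failed: {e}"]
        print(f"{label} ({ip}): {addrs}")

compare_resolvers("ourdomain.com")  # placeholder domain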

Packet captures revealed SYN packets weren't receiving SYN-ACK responses from certain networks:

# tcpdump example showing failed handshake
13:42:15.123456 IP client.ip > elb.ip: Flags [S], seq 123456789, win 65535
13:42:18.123456 IP client.ip > elb.ip: Flags [S], seq 123456789, win 65535
13:42:21.123456 IP client.ip > elb.ip: Flags [S], seq 123456789, win 65535

We implemented and tested these solutions:

# CloudFront distribution configuration example
resource "aws_cloudfront_distribution" "elb_distribution" {
  origin {
    domain_name = aws_elb.our_elb.dns_name
    origin_id   = "elb-origin"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  # Additional required arguments (enabled, default_cache_behavior,
  # restrictions, viewer_certificate) omitted for brevity...
}

Alternative approach using Route53 weighted routing to shift a share of traffic to a backup endpoint:

# Route53 weighted records for failover
resource "aws_route53_record" "elb_primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "ourdomain.com"
  type           = "A"
  set_identifier = "elb-primary"

  # Alias to the ELB so the record always tracks its current IPs
  alias {
    name                   = aws_elb.our_elb.dns_name
    zone_id                = aws_elb.our_elb.zone_id
    evaluate_target_health = true
  }

  weighted_routing_policy {
    weight = 90
  }
}

resource "aws_route53_record" "elb_secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "ourdomain.com"
  type           = "A"
  ttl            = 60
  records        = ["1.2.3.4"] # Backup IP
  set_identifier = "elb-secondary"

  weighted_routing_policy {
    weight = 10
  }
}

The root cause appears to be network path MTU issues between certain ISPs and AWS's ELB infrastructure. Implementing these changes resolved our issues:

  1. Enabled TCP keepalives on our instances (a per-socket sketch follows this list)
  2. Reduced ELB idle timeout to 30 seconds
  3. Implemented CloudFront as a caching layer
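
For the first item, keepalives can also be enabled per socket from application code rather than system-wide; a sketch (the TCP_KEEP* options are Linux-specific, and the values are placeholders chosen so probes fire well before the 30-second idle timeout):

import socket

def keepalive_socket(idle=20, interval=5, probes=3):
    """Create a TCP socket with keepalive probes enabled (Linux-specific socket options)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Start probing after `idle` seconds of silence, every `interval` seconds,
    # and give up after `probes` unanswered probes.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return s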

For immediate debugging, we recommend:

#!/bin/bash
# ELB connection test script
ELB_DNS="your-elb-dns-name"
while true; do
  date
  # ELBs generally do not answer ICMP, so ping mainly confirms DNS resolution
  ping -c 1 "$ELB_DNS"
  # TCP check on port 80; timeout keeps telnet from hanging
  timeout 2 telnet "$ELB_DNS" 80
  echo "-----"
  sleep 5
done