Having nearly identical AWS RDS instances where one connects flawlessly while the other times out is more common than you'd think. Let me walk through a comprehensive diagnostic approach I've developed after encountering this exact scenario multiple times in production environments.
While security groups and VPC settings are the usual suspects, here are the less obvious factors I always check first:
# First, verify endpoint resolution
nslookup your-new-rds-endpoint.rds.amazonaws.com
nslookup your-old-rds-endpoint.rds.amazonaws.com
# Check route tables for the subnet
aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-xyz"
Security groups get all the attention, but Network ACLs can block traffic without warning. Run this to compare:
aws ec2 describe-network-acls \
--filters "Name=association.subnet-id,Values=subnet-xyz" \
--query "NetworkAcls[*].{Entries:Entries,SubnetAssociations:Associations}"
Identical instances can behave differently if they're using different parameter groups. Verify with:
aws rds describe-db-instances \
--db-instance-identifier your-new-rds \
--query "DBInstances[0].DBParameterGroups"
When all else fails, this sequence never lets me down:
- Test basic connectivity:
telnet your-rds-endpoint 3306
- Verify security group ingress rules:
aws ec2 describe-security-groups --group-ids sg-xyz
- Check VPC flow logs for blocked traffic
- Test from an EC2 instance in the same subnet
- Try connecting via an AWS Session Manager port-forwarding tunnel (see the sketch after this list)
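When the subnet isn't reachable from your workstation at all, a Session Manager port-forwarding tunnel through an EC2 instance in the VPC is the quickest workaround. A minimal sketch, assuming an instance i-0123456789abcdef0 with the SSM agent and an appropriate instance profile, plus the Session Manager plugin installed locally (all identifiers are placeholders):
# Forward local port 13306 to the RDS endpoint via SSM
aws ssm start-session \
--target i-0123456789abcdef0 \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["your-rds-endpoint.rds.amazonaws.com"],"portNumber":["3306"],"localPortNumber":["13306"]}'
# Then, in another terminal:
mysql -h 127.0.0.1 -P 13306 -u admin -p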
In one memorable debugging session, the issue turned out to be an implicit dependency on a VPC peering connection whose route had never been added to the subnet's route table (peering routes are not propagated automatically). The solution involved:
# Add an explicit route to the peer VPC CIDR that the RDS endpoint resolves into
aws ec2 create-route \
--route-table-id rtb-abc123 \
--destination-cidr-block 10.0.0.0/16 \
--vpc-peering-connection-id pcx-xyz789
For MySQL/MariaDB users, add these diagnostic queries to your toolkit:
SHOW VARIABLES LIKE '%connect%';
SHOW STATUS LIKE 'Threads_connected';
SHOW PROCESSLIST;
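To run these non-interactively, and to tell a network timeout apart from an authentication error, the mysql client can be driven from the shell with a short connect timeout. A sketch, assuming the mysql client is installed and admin is a valid user (both placeholders); a timeout here points at the network, while "Access denied" means you actually reached the instance:
mysql -h your-rds-endpoint.rds.amazonaws.com -P 3306 -u admin -p \
--connect-timeout=5 \
-e "SHOW VARIABLES LIKE 'max_connections'; SHOW STATUS LIKE 'Threads_connected';"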
For PostgreSQL, these often reveal connection bottlenecks:
SELECT * FROM pg_settings WHERE name LIKE '%connect%';
SELECT count(*) FROM pg_stat_activity;
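The same idea works for PostgreSQL via psql, using libpq's connect_timeout parameter (endpoint and user are placeholders):
psql "host=your-rds-endpoint.rds.amazonaws.com port=5432 dbname=postgres user=admin connect_timeout=5" \
-c "SELECT count(*) FROM pg_stat_activity;"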
Never overlook CloudWatch metrics; DatabaseConnections is particularly telling:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=your-new-rds \
--start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ" --date="-5 minutes") \
--end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--period 60 \
--statistics Maximum
When time is critical, I use this script to rebuild with proper connectivity from the start:
#!/bin/bash
# Create new RDS with verified connectivity settings
aws rds create-db-instance \
--db-instance-identifier new-rds-verified \
--db-instance-class db.t3.micro \
--engine mysql \
--allocated-storage 20 \
--vpc-security-group-ids sg-verified123 \
--db-subnet-group-name default-vpc-xyz \
--publicly-accessible \
--no-multi-az \
--master-username admin \
--master-user-password "secure-password" \
--backup-retention-period 0
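Creation takes a few minutes; blocking until the instance is available and then pulling its endpoint keeps the rest of the cutover scriptable (same identifier as in the script above):
# Wait for the instance, then print its endpoint
aws rds wait db-instance-available --db-instance-identifier new-rds-verified
aws rds describe-db-instances \
--db-instance-identifier new-rds-verified \
--query "DBInstances[0].Endpoint.Address" \
--output text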
To make this concrete, here's a recent head-scratcher: two nearly identical RDS instances in the same VPC and subnet group behaved differently - one connected flawlessly while the other threw connection timeouts. Here's how I methodically diagnosed and resolved it.
Before diving deep, let's verify the basics:
# Connectivity test for both instances
telnet problematic-db.123456789012.us-east-1.rds.amazonaws.com 3306
telnet working-db.123456789012.us-east-1.rds.amazonaws.com 3306
# Security group verification (AWS CLI)
aws ec2 describe-security-groups --group-ids sg-12345678
Even though both instances shared the same DB subnet group, a subnet group spans multiple subnets across availability zones, so the two instances had landed in different subnets. The key discovery:
# Check route tables for each subnet
aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-12345678"
# Compare output between working and non-working subnets
The problematic instance was in a subnet lacking a route to the internet gateway. While RDS doesn't need outbound internet access for basic operation, missing routes can indicate broader networking issues.
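To put the routes side by side quickly, a small loop over both subnet IDs works (the IDs are placeholders for the working and problematic subnets):
# Dump routes for the working vs. problematic subnet
for subnet in subnet-12345678 subnet-87654321; do
  echo "== $subnet =="
  aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=$subnet" \
  --query "RouteTables[].Routes[].[DestinationCidrBlock,GatewayId,NatGatewayId,VpcPeeringConnectionId,State]" \
  --output table
done
Note that a subnet with no explicit route table association falls back to the VPC's main route table and returns nothing here, which is itself a useful clue.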
Parameter groups can silently affect connectivity. Check the actual parameter values (describe-db-parameter-groups only lists the groups themselves) with:
aws rds describe-db-parameters \
--db-parameter-group-name my-parameter-group
Key parameters to check:
- skip_name_resolve
- bind_address
- max_connections
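To pull just those values out of the group, a filtered describe-db-parameters call does the job (the group name matches the placeholder above); parameters that don't exist in the group simply return no rows:
for p in skip_name_resolve bind_address max_connections; do
  aws rds describe-db-parameters \
  --db-parameter-group-name my-parameter-group \
  --query "Parameters[?ParameterName=='$p'].[ParameterName,ParameterValue,Source]" \
  --output table
done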
When all else fails, enable VPC flow logs to see where packets are being dropped:
aws ec2 create-flow-logs \
--resource-type Subnet \
--resource-ids subnet-12345678 \
--traffic-type ALL \
--log-group-name VPCFlowLogs \
--deliver-logs-permission-arn arn:aws:iam::123456789012:role/VPCFlowLogsRole  # IAM role that lets VPC Flow Logs publish to CloudWatch Logs
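Delivery can lag by several minutes, but once records arrive, rejected packets carry a REJECT action and are easy to surface from the log group created above:
# Show recent rejected traffic from the flow logs
aws logs filter-log-events \
--log-group-name VPCFlowLogs \
--filter-pattern "REJECT" \
--max-items 20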
Here's a Python script to automate connectivity checks:
import socket
import time

def test_rds_connection(endpoint, port=3306, timeout=5):
    """Return (True, latency) if a TCP connection succeeds, else (False, error message)."""
    try:
        start = time.time()
        with socket.create_connection((endpoint, port), timeout=timeout):
            return True, time.time() - start
    except Exception as e:
        return False, str(e)

if __name__ == "__main__":
    endpoints = [
        "problematic-db.123456789012.us-east-1.rds.amazonaws.com",
        "working-db.123456789012.us-east-1.rds.amazonaws.com",
    ]
    for endpoint in endpoints:
        success, result = test_rds_connection(endpoint)
        print(f"{endpoint}: {'Success' if success else 'Failed'} ({result})")
In my case, these actions resolved the timeout:
- Verified NACL rules weren't blocking traffic
- Recreated the RDS instance in a different AZ
- Triple-checked parameter group assignments
- Confirmed proper DNS resolution in the VPC (checked as sketched below)
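For that last check, the VPC's DNS attributes are what I verify first, since private resolution of the RDS endpoint inside the VPC depends on them (vpc-abc123 is a placeholder):
aws ec2 describe-vpc-attribute --vpc-id vpc-abc123 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-abc123 --attribute enableDnsHostnames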