Having nearly identical AWS RDS instances where one connects flawlessly while the other times out is more common than you'd think. Let me walk through a comprehensive diagnostic approach I've developed after encountering this exact scenario multiple times in production environments.
While security groups and VPC settings are the usual suspects, here are the less obvious factors I always check first:
# First, verify endpoint resolution
nslookup your-new-rds-endpoint.rds.amazonaws.com
nslookup your-old-rds-endpoint.rds.amazonaws.com
# Check route tables for the subnet
aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-xyz"
Security groups get all the attention, but Network ACLs can block traffic without warning. Run this to compare:
aws ec2 describe-network-acls \
--filters "Name=association.subnet-id,Values=subnet-xyz" \
--query "NetworkAcls[*].{Entries:Entries,SubnetAssociations:Associations}"
Identical instances can behave differently if they're using different parameter groups. Verify with:
aws rds describe-db-instances \
--db-instance-identifier your-new-rds \
--query "DBInstances[0].DBParameterGroups"
When all else fails, this sequence never lets me down:
- Test basic connectivity:
telnet your-rds-endpoint 3306
- Verify security group ingress rules:
aws ec2 describe-security-groups --group-ids sg-xyz
- Check VPC flow logs for blocked traffic
- Test from an EC2 instance in the same subnet
- Try connecting via an AWS Session Manager port-forwarding tunnel (see the sketch after this list)
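When the subnet isn't reachable from your workstation at all, a Session Manager port-forwarding tunnel through an EC2 instance in the VPC is the quickest workaround. A minimal sketch, assuming an instance i-0123456789abcdef0 with the SSM agent and an appropriate instance profile, plus the Session Manager plugin installed locally (all identifiers are placeholders):
# Forward local port 13306 to the RDS endpoint via SSM
aws ssm start-session \
--target i-0123456789abcdef0 \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["your-rds-endpoint.rds.amazonaws.com"],"portNumber":["3306"],"localPortNumber":["13306"]}'
# Then, in another terminal:
mysql -h 127.0.0.1 -P 13306 -u admin -p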
In one memorable debugging session, the issue turned out to be an implicit dependency on a VPC peering connection whose route had never been added to the subnet's route table (peering routes are not propagated automatically). The solution involved:
# Add an explicit route to the peer VPC CIDR that the RDS endpoint resolves into
aws ec2 create-route \
--route-table-id rtb-abc123 \
--destination-cidr-block 10.0.0.0/16 \
--vpc-peering-connection-id pcx-xyz789
For MySQL/MariaDB users, add these diagnostic queries to your toolkit:
SHOW VARIABLES LIKE '%connect%';
SHOW STATUS LIKE 'Threads_connected';
SHOW PROCESSLIST;
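To run these non-interactively, and to tell a network timeout apart from an authentication error, the mysql client can be driven from the shell with a short connect timeout. A sketch, assuming the mysql client is installed and admin is a valid user (both placeholders); a timeout here points at the network, while "Access denied" means you actually reached the instance:
mysql -h your-rds-endpoint.rds.amazonaws.com -P 3306 -u admin -p \
--connect-timeout=5 \
-e "SHOW VARIABLES LIKE 'max_connections'; SHOW STATUS LIKE 'Threads_connected';"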
For PostgreSQL, these often reveal connection bottlenecks:
SELECT * FROM pg_settings WHERE name LIKE '%connect%';
SELECT count(*) FROM pg_stat_activity;
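The same idea works for PostgreSQL via psql, using libpq's connect_timeout parameter (endpoint and user are placeholders):
psql "host=your-rds-endpoint.rds.amazonaws.com port=5432 dbname=postgres user=admin connect_timeout=5" \
-c "SELECT count(*) FROM pg_stat_activity;"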
Never overlook CloudWatch metrics; DatabaseConnections is particularly telling:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=your-new-rds \
--start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ" --date="-5 minutes") \
--end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--period 60 \
--statistics Maximum
When time is critical, I use this script to rebuild with proper connectivity from the start:
#!/bin/bash
# Create new RDS with verified connectivity settings
aws rds create-db-instance \
--db-instance-identifier new-rds-verified \
--db-instance-class db.t3.micro \
--engine mysql \
--allocated-storage 20 \
--vpc-security-group-ids sg-verified123 \
--db-subnet-group-name default-vpc-xyz \
--publicly-accessible \
--no-multi-az \
--master-username admin \
--master-user-password "secure-password" \
--backup-retention-period 0
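Creation takes a few minutes; blocking until the instance is available and then pulling its endpoint keeps the rest of the cutover scriptable (same identifier as in the script above):
# Wait for the instance, then print its endpoint
aws rds wait db-instance-available --db-instance-identifier new-rds-verified
aws rds describe-db-instances \
--db-instance-identifier new-rds-verified \
--query "DBInstances[0].Endpoint.Address" \
--output text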
To make this concrete, here's a recent head-scratcher: two nearly identical RDS instances in the same VPC and subnet group behaved differently - one connected flawlessly while the other threw connection timeouts. Here's how I methodically diagnosed and resolved it.
Before diving deep, let's verify the basics:
# Connectivity test for both instances
telnet problematic-db.123456789012.us-east-1.rds.amazonaws.com 3306
telnet working-db.123456789012.us-east-1.rds.amazonaws.com 3306
# Security group verification (AWS CLI)
aws ec2 describe-security-groups --group-ids sg-12345678
Even though both instances shared the same DB subnet group, a subnet group spans multiple subnets across availability zones, so the two instances had landed in different subnets. The key discovery:
# Check route tables for each subnet
aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-12345678"
# Compare output between working and non-working subnets
The problematic instance was in a subnet lacking a route to the internet gateway. While RDS doesn't need outbound internet access for basic operation, missing routes can indicate broader networking issues.
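To put the routes side by side quickly, a small loop over both subnet IDs works (the IDs are placeholders for the working and problematic subnets):
# Dump routes for the working vs. problematic subnet
for subnet in subnet-12345678 subnet-87654321; do
  echo "== $subnet =="
  aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=$subnet" \
  --query "RouteTables[].Routes[].[DestinationCidrBlock,GatewayId,NatGatewayId,VpcPeeringConnectionId,State]" \
  --output table
done
Note that a subnet with no explicit route table association falls back to the VPC's main route table and returns nothing here, which is itself a useful clue.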
Parameter groups can silently affect connectivity. Check the actual parameter values (describe-db-parameter-groups only lists the groups themselves) with:
aws rds describe-db-parameters \
--db-parameter-group-name my-parameter-group
Key parameters to check:
- skip_name_resolve
- bind_address
- max_connections
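To pull just those values out of the group, a filtered describe-db-parameters call does the job (the group name matches the placeholder above); parameters that don't exist in the group simply return no rows:
for p in skip_name_resolve bind_address max_connections; do
  aws rds describe-db-parameters \
  --db-parameter-group-name my-parameter-group \
  --query "Parameters[?ParameterName=='$p'].[ParameterName,ParameterValue,Source]" \
  --output table
done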
When all else fails, enable VPC flow logs to see where packets are being dropped:
aws ec2 create-flow-logs \
--resource-type Subnet \
--resource-ids subnet-12345678 \
--traffic-type ALL \
--log-group-name VPCFlowLogs \
--deliver-logs-permission-arn arn:aws:iam::123456789012:role/VPCFlowLogsRole  # IAM role that lets VPC Flow Logs publish to CloudWatch Logs
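Delivery can lag by several minutes, but once records arrive, rejected packets carry a REJECT action and are easy to surface from the log group created above:
# Show recent rejected traffic from the flow logs
aws logs filter-log-events \
--log-group-name VPCFlowLogs \
--filter-pattern "REJECT" \
--max-items 20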
Here's a Python script to automate connectivity checks:
import socket
import time

def test_rds_connection(endpoint, port=3306, timeout=5):
    """Return (True, latency) if a TCP connection succeeds, else (False, error message)."""
    try:
        start = time.time()
        with socket.create_connection((endpoint, port), timeout=timeout):
            return True, time.time() - start
    except Exception as e:
        return False, str(e)

if __name__ == "__main__":
    endpoints = [
        "problematic-db.123456789012.us-east-1.rds.amazonaws.com",
        "working-db.123456789012.us-east-1.rds.amazonaws.com",
    ]
    for endpoint in endpoints:
        success, result = test_rds_connection(endpoint)
        print(f"{endpoint}: {'Success' if success else 'Failed'} ({result})")
In my case, these actions resolved the timeout:
- Verified NACL rules weren't blocking traffic
- Recreated the RDS instance in a different AZ
- Triple-checked parameter group assignments
- Confirmed proper DNS resolution in the VPC (checked as sketched below)
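For that last check, the VPC's DNS attributes are what I verify first, since private resolution of the RDS endpoint inside the VPC depends on them (vpc-abc123 is a placeholder):
aws ec2 describe-vpc-attribute --vpc-id vpc-abc123 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-abc123 --attribute enableDnsHostnames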