ALB Connection Draining Not Completing Early Despite No Active Connections: AWS ECS Deployment Issue


During our ECS deployments using CloudFormation, we noticed ALB target groups consistently taking the full deregistration delay (default 300 seconds) to remove old containers, even when no active connections existed. This contradicts AWS's official documentation which states:

"Elastic Load Balancing immediately completes the deregistration process [...] if a deregistering target has no in-flight requests and no active connections."

After extensive testing with minimal traffic (only developer requests), we identified several non-obvious factors:

# CloudFormation snippet showing problematic health check configuration
TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    HealthCheckIntervalSeconds: 30
    HealthCheckTimeoutSeconds: 5
    HealthyThresholdCount: 2
    UnhealthyThresholdCount: 2
    HealthCheckPath: /status
    HealthCheckPort: 8080
    HealthCheckProtocol: HTTP

  1. Health check connections: the ALB keeps sending health checks to a target while it drains
  2. HTTP keep-alive: modern HTTP clients hold persistent connections open, and idle connections appear to count as active
  3. ECS service timing: the container shutdown sequence determines when those connections are actually closed

We implemented these adjustments across our deployment pipeline:

# AWS CLI command to shorten the deregistration delay
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30
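
  • Reduced the deregistration delay from 300s to 30s for faster cycling
  • Added connection termination to the application shutdown hook
  • Configured a short keep-alive timeout (3s). This has to be set in the application or web server, since ALB target groups have no TCP keepalive attribute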
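
To confirm that the shorter delay actually took effect, the attribute can be read back from the target group. This is a minimal sketch that assumes the AWS CLI is installed and configured; the function name `get_deregistration_delay` is hypothetical.

```shell
# Print the current deregistration delay (in seconds) for a target group.
# Usage: get_deregistration_delay <target-group-arn>
# (hypothetical helper name, not part of the AWS CLI)
get_deregistration_delay() {
  aws elbv2 describe-target-group-attributes \
    --target-group-arn "$1" \
    --query "Attributes[?Key=='deregistration_delay.timeout_seconds'].Value" \
    --output text
}
```

Running it before and after `modify-target-group-attributes` shows whether a later CloudFormation update silently reset the value.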

Use these CloudWatch metrics to monitor actual draining behavior:

Metric            Namespace           Significance
HealthyHostCount  AWS/ApplicationELB  Shows actual registration state
RequestCount      AWS/ApplicationELB  Verifies zero traffic during drain
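
One way to pull these metrics for the drain window is `aws cloudwatch get-metric-statistics`. The sketch below assumes both the `TargetGroup` and `LoadBalancer` dimensions are supplied; the function name and all argument values are placeholders to replace with your own.

```shell
# Print per-minute sums of an ALB metric for one target group.
# Usage: alb_metric_sum <metric> <targetgroup-dim> <loadbalancer-dim> <start> <end>
#   metric:           e.g. RequestCount
#   targetgroup-dim:  the "targetgroup/<name>/<id>" suffix of the target group ARN
#   loadbalancer-dim: the "app/<name>/<id>" suffix of the load balancer ARN
#   start/end:        ISO 8601 timestamps covering the drain window
# (hypothetical helper name)
alb_metric_sum() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name "$1" \
    --dimensions "Name=TargetGroup,Value=$2" "Name=LoadBalancer,Value=$3" \
    --start-time "$4" \
    --end-time "$5" \
    --period 60 \
    --statistics Sum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Sum]' \
    --output text
}
```

If `RequestCount` stays at zero for the whole window while the target still reports `draining`, the delay is not caused by real traffic.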

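ECS task definitions have no preStop lifecycle hook (that is a Kubernetes feature). The closest equivalent is to trap the stop signal in the container's entry point and allow enough time with stopTimeout. In this JSON snippet the entry point waits 5 seconds after the stop signal before signalling nginx, and stopTimeout gives the container 30 seconds before ECS force-kills it:

{
  "containerDefinitions": [
    {
      "name": "web",
      "image": "nginx:latest",
      "stopTimeout": 30,
      "entryPoint": ["sh", "-c"],
      "command": [
        "trap 'sleep 5; kill -TERM $(cat /var/run/nginx.pid)' TERM QUIT; nginx -g 'daemon off;' & wait $!; wait $!"
      ]
    }
  ]
}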

For containers not responding to SIGTERM, consider adding TCP connection tracking and forced termination in your application's shutdown sequence.
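
The shutdown sequence described above can be sketched as a small wrapper used as the container entry point. This is an illustration under assumptions, not the original setup: the function name `run_with_graceful_stop` is hypothetical, and it assumes the wrapped process exits when it receives SIGTERM.

```shell
# run_with_graceful_stop <drain_seconds> <command> [args...]
# Runs the command in the background. On SIGTERM it waits drain_seconds
# (so in-flight requests can finish) and then forwards SIGTERM to the command.
# (hypothetical helper name)
run_with_graceful_stop() {
  drain_seconds="$1"
  shift
  "$@" &
  child=$!
  trap 'sleep "$drain_seconds"; kill -TERM "$child" 2>/dev/null' TERM

  status=0
  wait "$child" || status=$?
  # A trapped signal interrupts wait with a status above 128 while the child
  # may still be shutting down, so wait once more to collect its real status.
  if [ "$status" -gt 128 ] && kill -0 "$child" 2>/dev/null; then
    status=0
    wait "$child" || status=$?
  fi
  trap - TERM
  return "$status"
}
```

A call such as `run_with_graceful_stop 5 nginx -g 'daemon off;'` would keep the server accepting requests for 5 seconds after ECS sends the stop signal. The drain time must stay below the container's stop timeout, otherwise ECS kills the task first.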


During ECS service updates through CloudFormation, I noticed ALB target groups consistently take the full 300 seconds (default deregistration delay) to complete draining, even when no active connections exist. This contradicts AWS documentation stating deregistration should complete immediately when no in-flight requests exist.

To isolate the issue, I created a minimal test environment:

Resources:
  TestService:
    Type: AWS::ECS::Service
    Properties:
      DeploymentConfiguration:
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
        MaximumPercent: 200
        MinimumHealthyPercent: 100
      LoadBalancers:
        - ContainerName: "web"
          ContainerPort: 80
          TargetGroupArn: !Ref TargetGroup
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: "ENABLED"
          SecurityGroups:
            - !Ref SecurityGroup
          Subnets: !Split [",", !ImportValue "PrivateSubnets"]

Through CloudWatch Metrics and ALB access logs, I identified three potential culprits:

  • Health check connections being counted as active
  • TCP keep-alive from ALB to targets
  • ECS task networking cleanup latency
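
To check the access logs for requests that still reached a draining target, a short filter is enough. The sketch below assumes the standard ALB access log layout, where field 2 is the ISO 8601 timestamp and field 5 is `target:port`; the function name is hypothetical.

```shell
# Count access-log entries routed to one target at or after a given time.
# Usage: count_requests_to_target <target_ip:port> <since_iso8601> < access.log
# Timestamps are compared as strings, which works because ALB logs use a
# fixed-width ISO 8601 UTC format.
# (hypothetical helper name)
count_requests_to_target() {
  awk -v target="$1" -v since="$2" '
    $5 == target && $2 >= since { n++ }
    END { print n + 0 }
  '
}
```

ALB log files are gzip-compressed in S3, so a typical call pipes `gunzip -c` output into the function, with the deregistration time as the second argument. A count of zero supports the claim that no real traffic was in flight.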

Here's the working configuration that solved the issue:

TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    HealthCheckIntervalSeconds: 10
    HealthCheckTimeoutSeconds: 6
    HealthyThresholdCount: 2
    UnhealthyThresholdCount: 2
    TargetType: ip
    Port: 80
    Protocol: HTTP
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: "30" # Reduced from default 300
    VpcId: !ImportValue "VpcId"

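For HTTP/HTTPS services, set connection timeouts in your application so idle connections are closed promptly. Keep in mind that a keep-alive timeout shorter than the ALB idle timeout (60 seconds by default) can cause intermittent 502 errors, so lower the ALB idle timeout to match if you use values like these: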

# Nginx configuration example
keepalive_timeout 10s;
keepalive_requests 100;
send_timeout 60s;

Use these AWS CLI commands to monitor deregistration:

# Check target health
aws elbv2 describe-target-health \
  --target-group-arn [TARGET_GROUP_ARN] \
  --query "TargetHealthDescriptions[?TargetHealth.State=='draining']"

# Check network connections (requires SSM access)
aws ssm send-command \
  --instance-ids [INSTANCE_ID] \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["netstat -anp | grep ESTABLISHED"]'
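
To measure how long draining really takes, the health query can be wrapped in a polling loop. This is a rough sketch assuming the AWS CLI is configured; the function names `draining_count` and `time_drain` are hypothetical, and the result is only as precise as the polling interval.

```shell
# Print the number of targets currently in the "draining" state.
# Usage: draining_count <target-group-arn>
draining_count() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query "length(TargetHealthDescriptions[?TargetHealth.State=='draining'])" \
    --output text
}

# Poll until no target is draining, then print the elapsed seconds.
# Usage: time_drain <target-group-arn> [poll_interval_seconds]
time_drain() {
  tg_arn="$1"
  interval="${2:-5}"
  start=$(date +%s)
  while [ "$(draining_count "$tg_arn")" != "0" ]; do
    sleep "$interval"
  done
  echo $(( $(date +%s) - start ))
}
```

Started right after a deployment begins, `time_drain` shows whether targets leave the draining state early or only after the full configured delay.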