Automating AMI Rotation in AWS Auto Scaling Groups with Zero Downtime



When managing production infrastructure with AWS Auto Scaling Groups (ASGs), one common pain point is updating the underlying Amazon Machine Images (AMIs) while maintaining availability. The current manual process of scaling up/down works but introduces operational overhead and potential downtime windows.

Here are proven approaches to automate AMI rotation:

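CloudFormation example using UpdatePolicy, which replaces instances in batches whenever the group's launch template or launch configuration changes (the group's Properties are omitted for brevity):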
"MyASG": {
  "Type": "AWS::AutoScaling::AutoScalingGroup",
  "UpdatePolicy": {
    "AutoScalingRollingUpdate": {
      "MaxBatchSize": "2",
      "MinInstancesInService": "1",
      "PauseTime": "PT5M",
      "WaitOnResourceSignals": "true"
    }
  }
}
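
With WaitOnResourceSignals enabled, CloudFormation waits up to PauseTime for each new instance to report success before it moves on to the next batch. Instances usually do this with the cfn-signal helper script; the sketch below shows the equivalent API call from Python. It assumes a boto3 CloudFormation client is passed in as `cfn` and a hypothetical local health endpoint; the function names, URL, and stack name are illustrative and not part of the template above.

```python
import urllib.request


def app_is_healthy(url, timeout=2.0):
    """Return True if the (hypothetical) local health endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors and HTTP error responses
        return False


def signal_cloudformation(cfn, stack_name, instance_id, healthy,
                          logical_id="MyASG"):
    """Report this instance's state to the rolling update.

    `cfn` is expected to be boto3.client("cloudformation").
    """
    status = "SUCCESS" if healthy else "FAILURE"
    cfn.signal_resource(
        StackName=stack_name,
        LogicalResourceId=logical_id,  # the ASG's logical ID in the template
        UniqueId=instance_id,          # one signal per instance
        Status=status,
    )
    return status
```

On an instance this might be called at the end of the boot script as `signal_cloudformation(boto3.client("cloudformation"), "web-stack", instance_id, app_is_healthy("http://localhost:8080/health"))`, where the stack name and URL are placeholders.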

For critical production systems, consider creating a parallel ASG with the new AMI:

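  1. Create a new launch template version with the updated AMI
  2. Stand up a new ASG behind the same load balancer
  3. Gradually shift traffic, for example with weighted target groups on an Application Load Balancer listener or with Route 53 weighted records
  4. Decommission the old ASG after validation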
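
The traffic shift in step 3 can be sketched with weighted target groups on an Application Load Balancer. This is a minimal illustration, assuming each ASG is attached to its own target group and that a boto3 `elbv2` client is passed in; the function names, the step percentages, and the pause length are assumptions chosen for the example.

```python
import time


def forward_action(old_tg_arn, new_tg_arn, new_weight):
    """Build an ALB forward action sending `new_weight` percent to the new group."""
    if not 0 <= new_weight <= 100:
        raise ValueError("new_weight must be between 0 and 100")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": old_tg_arn, "Weight": 100 - new_weight},
                {"TargetGroupArn": new_tg_arn, "Weight": new_weight},
            ]
        },
    }


def shift_traffic(elbv2, listener_arn, old_tg_arn, new_tg_arn,
                  steps=(10, 50, 100), pause_seconds=300, sleep=time.sleep):
    """Move traffic to the new target group in stages.

    `elbv2` is expected to be boto3.client("elbv2"). Validation between
    steps (alarms, smoke tests) is left out of this sketch.
    """
    for weight in steps:
        elbv2.modify_listener(
            ListenerArn=listener_arn,
            DefaultActions=[forward_action(old_tg_arn, new_tg_arn, weight)],
        )
        if weight < 100:
            sleep(pause_seconds)  # give metrics time to reflect the new split
```

In practice each pause would be followed by a check of error rates or alarms, and the old ASG would only be removed once the new group has carried 100 percent of traffic for a while.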

SSM Automation can handle the image-building part of the process. The AWS-managed AWS-UpdateLinuxAmi runbook launches an instance from the source AMI, applies updates, and creates a new AMI from it:

aws ssm start-automation-execution \
  --document-name "AWS-UpdateLinuxAmi" \
  --parameters "AutomationAssumeRole=arn:aws:iam::123456789012:role/AutomationServiceRole,SourceAmiId=ami-12345678,IamInstanceProfileName=MyInstanceProfile,TargetAmiName=web-app-{{timestamp}}"
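
The same automation can be started and awaited from Python, which is convenient when the resulting AMI feeds the next pipeline step. The sketch below assumes a boto3 SSM client is passed in as `ssm`; the helper names, the polling interval, and the set of terminal states (a subset that covers plain automations without approvals) are assumptions made for illustration.

```python
import time

# Terminal states for a simple automation; approval and change-calendar
# workflows have additional states that this sketch does not handle.
TERMINAL_STATES = {"Success", "Failed", "Cancelled", "TimedOut"}


def to_ssm_parameters(params):
    """SSM expects every parameter value as a list of strings."""
    return {key: [str(value)] for key, value in params.items()}


def build_ami(ssm, source_ami_id, assume_role_arn,
              poll_seconds=30, max_polls=240, sleep=time.sleep):
    """Run AWS-UpdateLinuxAmi and return the final execution record.

    `ssm` is expected to be boto3.client("ssm").
    """
    execution_id = ssm.start_automation_execution(
        DocumentName="AWS-UpdateLinuxAmi",
        Parameters=to_ssm_parameters({
            "SourceAmiId": source_ami_id,
            "AutomationAssumeRole": assume_role_arn,
        }),
    )["AutomationExecutionId"]

    for _ in range(max_polls):
        execution = ssm.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]
        if execution["AutomationExecutionStatus"] in TERMINAL_STATES:
            return execution
        sleep(poll_seconds)
    raise TimeoutError(f"automation {execution_id} did not finish in time")
```

The returned record includes the execution's `Outputs`, which is where the new image ID can be read once the status is `Success`.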

For GitOps workflows, integrate AMI updates into your CI/CD pipeline:

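// Sample Jenkins pipeline stage
// (assumes the AWS CLI and credentials are available on the build agent)
stage('Update ASG') {
  steps {
    script {
      // Create a new launch template version from version 1, overriding only the AMI
      def newVersion = sh(
        returnStdout: true,
        script: """
          aws ec2 create-launch-template-version \\
            --launch-template-id lt-0123456789abcdef \\
            --source-version 1 \\
            --launch-template-data '{"ImageId":"${params.AMI_ID}"}' \\
            --query 'LaunchTemplateVersion.VersionNumber' --output text
        """
      ).trim()
      // Point the ASG at the new version
      sh """
        aws autoscaling update-auto-scaling-group \\
          --auto-scaling-group-name web-app-asg \\
          --launch-template LaunchTemplateId=lt-0123456789abcdef,Version=${newVersion}
      """
    }
  }
}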

Whichever approach you choose:

  • Always test new AMIs in staging first
  • Monitor health checks during rotation
  • Consider canary deployments for major changes
  • Implement proper rollback procedures

The following methods automate AMI rotation in your Auto Scaling Groups (ASGs) end to end:

1. Using AWS Systems Manager (SSM) Automation

This native AWS solution provides the most integrated approach. Create an SSM Automation document that:

  • Creates a new launch template version with the updated AMI
  • Gradually replaces instances using rolling updates
  • Verifies health checks before proceeding

# Sample AWS CLI command to start the automation.
# "UpdateAutoScalingGroup" is a placeholder for your own Automation document;
# the parameter names must match the ones that document defines.
aws ssm start-automation-execution \
  --document-name "UpdateAutoScalingGroup" \
  --parameters '{
    "AutoScalingGroupName":["your-asg-name"],
    "LaunchTemplateName":["your-launch-template"],
    "LaunchTemplateVersion":["$LATEST"],
    "MinHealthyPercentage":["90"],
    "WaitOnResourceSignals":["false"]
  }'

2. AWS CodePipeline Integration

For CI/CD pipelines, you can trigger AMI updates through CodePipeline:


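# CloudFormation snippet for Pipeline configuration.
# CodePipeline has no built-in Auto Scaling deploy action, so the last stage
# invokes a Lambda function that updates the ASG. The role ARN, bucket, and
# function name are placeholders.
Resources:
  AMIUpdatePipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      RoleArn: arn:aws:iam::123456789012:role/your-pipeline-role
      ArtifactStore:
        Type: S3
        Location: your-artifact-bucket
      Stages:
        - Name: Source
          Actions:
            - Name: SourceAction
              ActionTypeId:
                Category: Source
                Owner: AWS
                Provider: CodeCommit
                Version: '1'
              Configuration:
                RepositoryName: your-repo
                BranchName: main
              OutputArtifacts:
                - Name: SourceOutput
        - Name: Build
          Actions:
            - Name: BuildAMIAction
              ActionTypeId:
                Category: Build
                Owner: AWS
                Provider: CodeBuild
                Version: '1'
              Configuration:
                ProjectName: your-build-project
              InputArtifacts:
                - Name: SourceOutput
        - Name: Deploy
          Actions:
            - Name: UpdateASG
              ActionTypeId:
                Category: Invoke
                Owner: AWS
                Provider: Lambda
                Version: '1'
              Configuration:
                FunctionName: your-asg-update-function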
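
One way to run the ASG update from a pipeline stage is a Lambda invoke action. A function invoked that way has to report its result back to CodePipeline, otherwise the action waits until it times out. The sketch below shows only that reporting logic; it assumes a boto3 CodePipeline client is passed in, and `rotate_ami` is a hypothetical callable standing in for the actual update.

```python
def handle_pipeline_job(event, codepipeline, rotate_ami):
    """Run `rotate_ami` and report the outcome back to CodePipeline.

    `codepipeline` is expected to be boto3.client("codepipeline");
    `rotate_ami` is a hypothetical callable that performs the ASG update.
    """
    # CodePipeline passes the job ID in the invocation event
    job_id = event["CodePipeline.job"]["id"]
    try:
        rotate_ami()
    except Exception as exc:  # report any failure instead of timing out
        codepipeline.put_job_failure_result(
            jobId=job_id,
            failureDetails={"type": "JobFailed", "message": str(exc)[:500]},
        )
        return "failed"
    codepipeline.put_job_success_result(jobId=job_id)
    return "succeeded"
```

A real handler would wrap this as `lambda_handler(event, context)` and create the client with `boto3.client("codepipeline")`.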

3. Custom Lambda Function Solution

For maximum control, implement a Lambda function triggered by Amazon EventBridge (formerly CloudWatch Events), for example on a schedule or when a new AMI becomes available:


import boto3

def lambda_handler(event, context):
    autoscaling = boto3.client('autoscaling')
    ec2 = boto3.client('ec2')
    
    # Create new launch template version with updated AMI.
    # Settings not listed here are inherited from the source version.
    new_launch_template = ec2.create_launch_template_version(
        LaunchTemplateName='your-template',
        SourceVersion='$LATEST',
        LaunchTemplateData={
            'ImageId': 'ami-1234567890abcdef0'
        }
    )
    
    # Update ASG with new launch template. Capacity settings are left
    # untouched so a concurrent scaling activity is not overwritten.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName='your-asg-name',
        LaunchTemplate={
            'LaunchTemplateName': 'your-template',
            'Version': str(new_launch_template['LaunchTemplateVersion']['VersionNumber'])
        }
    )
    
    # Implement instance refresh
    refresh = autoscaling.start_instance_refresh(
        AutoScalingGroupName='your-asg-name',
        Preferences={
            'MinHealthyPercentage': 90,
            'InstanceWarmup': 300
        }
    )
    
    return {
        'statusCode': 200,
        'body': f"Instance refresh initiated: {refresh['InstanceRefreshId']}"
    }

Best practices for any rotation:

  • Always test new AMIs in a staging environment first
  • Implement health checks that accurately reflect application state
  • Use canary deployments when possible (gradual rollout)
  • Monitor CloudWatch metrics during rotation
  • Set appropriate instance warm-up times
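
Monitoring a rotation also means knowing when the instance refresh itself has finished. The sketch below polls the refresh status until it reaches a final state. It assumes a boto3 Auto Scaling client is passed in; the function name, polling interval, and poll limit are illustrative. Because a refresh can outlast the Lambda timeout, a loop like this fits better in a pipeline step or script than in the function that started the refresh.

```python
import time

# Statuses after which an instance refresh no longer changes
FINISHED = {"Successful", "Failed", "Cancelled",
            "RollbackSuccessful", "RollbackFailed"}


def wait_for_refresh(autoscaling, asg_name, refresh_id,
                     poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll an instance refresh and return its final status.

    `autoscaling` is expected to be boto3.client("autoscaling").
    """
    for _ in range(max_polls):
        refresh = autoscaling.describe_instance_refreshes(
            AutoScalingGroupName=asg_name,
            InstanceRefreshIds=[refresh_id],
        )["InstanceRefreshes"][0]
        if refresh["Status"] in FINISHED:
            return refresh["Status"]
        sleep(poll_seconds)
    raise TimeoutError(f"instance refresh {refresh_id} is still running")
```

A caller would typically treat anything other than `Successful` as a reason to alert or roll back.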

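Create a CloudWatch alarm to detect a drop in in-service instances during rotation. The GroupInServiceInstances metric requires group metrics collection to be enabled on the ASG:


aws cloudwatch put-metric-alarm \
  --alarm-name "ASG-HealthCheck-Failures" \
  --metric-name "GroupInServiceInstances" \
  --namespace "AWS/AutoScaling" \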
  --statistic "Average" \
  --period 60 \
  --threshold 2 \
  --comparison-operator "LessThanThreshold" \
  --dimensions "Name=AutoScalingGroupName,Value=your-asg-name" \
  --evaluation-periods 2 \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:your-sns-topic"

For rollback scenarios, maintain previous launch template versions and implement automation to revert if alarms trigger.
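
A rollback along those lines can be sketched as follows: find the launch template version just below the one that was rolled out, point the ASG back at it, and start another instance refresh. The function names are hypothetical, boto3 Auto Scaling and EC2 clients are assumed to be passed in, and the sketch ignores pagination of template versions. It also assumes no refresh is currently running; an in-progress refresh would have to be cancelled first.

```python
def previous_version(version_numbers, current):
    """Return the highest launch template version below `current`, or None."""
    older = [v for v in version_numbers if v < current]
    return max(older) if older else None


def roll_back(autoscaling, ec2, asg_name, template_name, current_version):
    """Point the ASG at the previous template version and refresh again.

    `autoscaling` and `ec2` are expected to be boto3 clients.
    """
    versions = ec2.describe_launch_template_versions(
        LaunchTemplateName=template_name
    )["LaunchTemplateVersions"]
    target = previous_version(
        [v["VersionNumber"] for v in versions], current_version
    )
    if target is None:
        raise RuntimeError("no earlier launch template version to roll back to")

    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        LaunchTemplate={
            "LaunchTemplateName": template_name,
            "Version": str(target),
        },
    )
    refresh = autoscaling.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Preferences={"MinHealthyPercentage": 90, "InstanceWarmup": 300},
    )
    return target, refresh["InstanceRefreshId"]
```

Wiring this to the alarm could be done by subscribing a Lambda function to the SNS topic used in the alarm action, with the current version number stored wherever the rollout recorded it.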