Recently while automating EC2 instance initialization with User-Data scripts, I encountered a peculiar AWS CLI behavior where credentials would mysteriously vanish after successful S3 operations. Here's my deep dive into troubleshooting this issue.
The issue manifests in this specific sequence:
# First S3 operation works fine
aws s3 cp s3://bucket-1/large-file.zip ./large-file.zip
# Subsequent operations fail after some time
aws s3 cp s3://bucket-2/config.json ./config.json
# Error: Unable to locate credentials
After extensive testing, I identified several contributing factors:
- Metadata Service Timeout: The EC2 instance metadata service (IMDS) becomes temporarily unavailable
- Credential Cache Timing: AWS CLI credential caching doesn't properly handle IMDS interruptions
- Instance Profile Propagation Delay: IAM role permissions take time to fully initialize
Here are three working approaches I've validated:
1. Explicit Credential Refresh
Force credential refresh before critical operations:
# Clear any cached credentials
rm -rf ~/.aws/cli/cache
# Add retry logic with credential refresh
for i in {1..5}; do
aws s3 cp s3://bucket-2/config.json ./config.json && break
sleep $((i * 2))
done
2. IMDSv2 Implementation
Upgrade to IMDSv2 which is more reliable:
# First get IMDSv2 token
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 3600")
# Then use it for metadata requests
INSTANCE_ID=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/instance-id)
3. CLI Configuration Tweaks
Modify AWS CLI configuration for better resilience:
# ~/.aws/config
[default]
retry_mode = standard
max_attempts = 10
metadata_service_timeout = 5
metadata_service_num_attempts = 10
- Implement proper error handling in User-Data scripts
- Add health checks for IMDS availability
- Consider using AWS Systems Manager Parameter Store for critical configurations
- Monitor EC2 instance metadata service metrics
This issue highlights the importance of building resilient cloud automation that accounts for temporary service interruptions. The solutions presented here have proven effective across hundreds of instance launches in production environments.
You're seeing this perplexing behavior where the AWS CLI works initially but fails on subsequent calls with the classic credential error, despite having proper IAM roles attached to your EC2 instance. The key observations:
First command (success):
aws s3 cp s3://bucket-1/file.zip ./file.zip
Later command (failure):
aws s3 cp s3://bucket-2/file.json ./output.json
Unable to locate credentials
The AWS credentials chain has multiple authentication sources. For EC2 instances with IAM roles, the flow looks like this:
- CLI checks ~/.aws/credentials
- Falls back to instance metadata service (IMDS) at 169.254.169.254
The intermittent failures suggest IMDS is timing out. This happens because:
1. The first request triggers IMDS credential retrieval
2. Subsequent requests may hit IMDS while it's refreshing tokens
3. Default CLI timeout is just 1 second
Solution 1: Increase Metadata Service Timeout
# Add to your user-data script or /etc/environment
export AWS_METADATA_SERVICE_TIMEOUT=5
export AWS_METADATA_SERVICE_NUM_ATTEMPTS=3
Solution 2: Use Credential Caching
# Install session manager plugin
sudo yum install -y session-manager-plugin
# Configure CLI to use instance profile with caching
aws configure set credential_source Ec2InstanceMetadata
aws configure set role_arn arn:aws:iam::123456789012:role/YourEC2Role
aws configure set role_session_name EC2Session
Solution 3: Implement Retry Logic
# Python example with exponential backoff
import boto3
from botocore.config import Config
from time import sleep
config = Config(
retries={
'max_attempts': 5,
'mode': 'adaptive'
}
)
s3 = boto3.client('s3', config=config)
def safe_download(bucket, key):
for attempt in range(5):
try:
return s3.download_file(bucket, key, 'output')
except Exception as e:
if "credentials" in str(e).lower():
sleep(2 ** attempt)
continue
raise
For mission-critical systems, combine all three approaches:
# In your EC2 user-data
echo "export AWS_METADATA_SERVICE_TIMEOUT=10" >> /etc/profile
echo "export AWS_METADATA_SERVICE_NUM_ATTEMPTS=5" >> /etc/profile
# Install credential helper
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --update
aws configure set cli_auto_prompt on