That moment when your EC2 instance becomes unresponsive and SSH connections time out is every sysadmin's nightmare. The Amazon monitoring graph clearly shows CPU pegged at 100%, but without process-level visibility, you're left guessing what caused the spike.
While the system is still barely responsive, try these quick commands before resorting to a reboot:
# Quick process snapshot (works even with high load)
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 20
# Alternative when ps is too slow
top -b -n 1 | head -n 20
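When a workload forks many short-lived workers, no single PID stands out in the listing above, so it can help to add up CPU by command name. The following is a minimal sketch, not a definitive tool: the function name `sum_cpu_by_command` is hypothetical, and it assumes each input line ends with the %CPU value and everything before it is the command name.

```shell
#!/bin/bash
# Sum %CPU per command name from lines of the form "<command> <%cpu>" on stdin.
# Output is "<total>\t<command>", heaviest command first.
# (sum_cpu_by_command is an illustrative name, not part of any package.)
sum_cpu_by_command() {
    awk 'NF >= 2 {
             pct = $NF                 # last field is the CPU percentage
             $NF = ""                  # drop it; the rest is the command name
             sub(/[ \t]+$/, "")
             cpu[$0] += pct
         }
         END { for (c in cpu) printf "%.1f\t%s\n", cpu[c], c }' |
        LC_ALL=C sort -nr
}

# Typical use on a live system (the trailing "=" suppresses the header):
#   ps -e -o comm= -o %cpu= | sum_cpu_by_command | head -n 5
```

Separate `-o` options are used in the usage line because combining empty headers in one comma-separated list is interpreted differently across ps implementations.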
Once you've regained access, dig deeper with these forensic tools:
# Install sysstat if not present
sudo yum install sysstat -y # For Amazon Linux
sudo apt-get install sysstat # For Ubuntu/Debian
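# Sample system-wide CPU usage live (3 samples, 1 second apart); sar does not report per-process figures
sar -u ALL 1 3
# Check for CPU steal (common in virtualized environments): keep the header and the Average line, then read the %steal column
sar -u ALL 1 3 | grep -iE 'steal|average'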
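If sysstat is not installed yet, steal time can also be derived from the kernel counters in /proc/stat. The sketch below assumes the standard field order of the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal, then the guest fields); the function name `steal_pct` is hypothetical.

```shell
#!/bin/bash
# Percentage of CPU time stolen by the hypervisor between two samples of the
# aggregate "cpu" line from /proc/stat.
# (steal_pct is an illustrative name; arguments are the two sampled lines.)
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal (guest is already counted in user)
            steal = $9
            if (NR == 1) { t0 = total; s0 = steal }
            else         { dt = total - t0; ds = steal - s0 }
        }
        END { if (dt > 0) printf "%.1f\n", 100 * ds / dt; else print "0.0" }'
}

# Usage on a live system:
#   before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
#   steal_pct "$before" "$after"
```

A steal figure that stays well above zero suggests the instance is being throttled or is competing for the physical host, rather than a runaway process inside the guest.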
Prevent future blindspots by configuring these monitoring solutions:
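# Configure sysstat to keep 30 days of history
# (path shown is for Amazon Linux/RHEL; Debian/Ubuntu keep HISTORY in /etc/sysstat/sysstat and need ENABLED="true" in /etc/default/sysstat)
sudo sed -i 's/^HISTORY=.*/HISTORY=30/' /etc/sysconfig/sysstat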
sudo systemctl enable sysstat
sudo systemctl start sysstat
# Install and configure atop for advanced process accounting
sudo yum install atop -y # May require the EPEL repository on Amazon Linux
sudo systemctl enable atop
sudo systemctl start atop
Create this watchdog script to catch CPU spikes before they become critical:
#!/bin/bash
THRESHOLD=90
ALERT_EMAIL="admin@example.com"
while true; do
    # Extract the idle percentage from top and convert it to overall usage
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
    if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
        PROCESS_LIST=$(ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 10)
        echo -e "High CPU usage detected: ${CPU_USAGE}%\n\nTop processes:\n${PROCESS_LIST}" | \
            mail -s "CPU Alert on $(hostname)" "$ALERT_EMAIL"
    fi
    sleep 60
done
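The watchdog above depends on the text layout of top's summary line, which differs between versions and locales. One alternative, sketched here under the assumption that the standard /proc/stat field order applies, is to compute usage from two samples of the kernel counters. The function name `cpu_busy_pct` is hypothetical.

```shell
#!/bin/bash
# Overall CPU busy percentage between two samples of the aggregate "cpu" line
# from /proc/stat. Idle and iowait are treated as "not busy".
# (cpu_busy_pct is an illustrative name; arguments are the two sampled lines.)
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            idle = $5 + $6                         # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { dt = total - t0; di = idle - i0 }
        }
        END { if (dt > 0) printf "%.1f\n", 100 * (dt - di) / dt; else print "0.0" }'
}

# Drop-in replacement for the CPU_USAGE line inside the loop:
#   before=$(grep '^cpu ' /proc/stat); sleep 1; after=$(grep '^cpu ' /proc/stat)
#   CPU_USAGE=$(cpu_busy_pct "$before" "$after")
```

Because the result covers the interval between the two samples, a longer sleep gives a smoother figure at the cost of slower detection.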
For AWS environments, combine OS-level tools with CloudWatch metrics:
# Install CloudWatch agent for enhanced metrics
sudo yum install amazon-cloudwatch-agent -y
# Configure custom metrics to monitor specific processes
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "procstat": [
        {
          "pattern": "java|python|node",
          "measurement": [
            "cpu_usage",
            "memory_rss"
          ]
        }
      ]
    }
  }
}
When your EC2 instance hits 100% CPU utilization, it often becomes unresponsive, which is exactly when you need diagnostic tools the most. Here's how to prepare for and investigate these situations:
Install these tools before you encounter issues:
# For historical process tracking
sudo apt-get install sysstat atop
# For real-time monitoring
sudo apt-get install htop glances
When SSH fails, use these AWS-specific approaches:
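# 1. AWS Systems Manager (SSM) Session Manager
# Requires the SSM Agent running on the instance and an instance profile that permits Systems Manager access
# Replace instance-id with the real ID of the affected instance
aws ssm start-session --target instance-id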
# 2. EC2 Serial Console
# Requires IAM permissions and console access
Once you regain access, gather evidence:
Using SAR for Historical Data
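# Show CPU usage history (adjust -f for specific date)
# Data files live in /var/log/sa on Amazon Linux/RHEL and in /var/log/sysstat on Debian/Ubuntu
sar -u -f /var/log/sa/sa$(date +%d -d yesterday)
# Show run queue length and load averages (sar has no per-process view)
sar -q -f /var/log/sa/sa$(date +%d -d yesterday)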
ATOP Process Accounting
# View recorded process activity
atop -r /var/log/atop/atop_$(date +%Y%m%d -d yesterday)
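Because the sar data directory differs between distributions, a small helper can pick whichever file exists before calling sar. This is a sketch with a hypothetical function name, `find_sa_file`; it only handles the `saDD` naming used in the commands above and does not cover setups that name files by full date.

```shell
#!/bin/bash
# Print the first existing sysstat data file for a given two-digit day of month.
# Extra arguments override the directories searched; the defaults are the usual
# RHEL/Amazon Linux and Debian/Ubuntu locations.
# (find_sa_file is an illustrative name, not part of sysstat.)
find_sa_file() {
    local day=$1
    shift
    local dirs=("$@")
    if [ "${#dirs[@]}" -eq 0 ]; then
        dirs=(/var/log/sa /var/log/sysstat)
    fi
    local d
    for d in "${dirs[@]}"; do
        if [ -f "$d/sa$day" ]; then
            printf '%s\n' "$d/sa$day"
            return 0
        fi
    done
    return 1   # nothing recorded for that day
}

# Usage:
#   file=$(find_sa_file "$(date +%d -d yesterday)") && sar -u -f "$file"
```

A non-zero return usually means sysstat collection was not enabled on the day in question, which is worth knowing before spending time on the wrong log.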
When the system is responsive:
# Batch version showing PIDs and commands
ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -n 20
# Interactive version with thread details
top -H -p $(pgrep -d',' -f "high_cpu_process")
Create a monitoring script in /usr/local/bin/monitor_cpu.sh:
#!/bin/bash
THRESHOLD=90
LOG_FILE=/var/log/cpu_monitor.log
PROCESS_LIMIT=5
while true; do
    TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
    # Extract the idle percentage from top and convert it to overall usage
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
    if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
        echo "[$TIMESTAMP] CPU Usage: $CPU_USAGE%" >> "$LOG_FILE"
        # +1 so the header line does not count against the process limit
        ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -n $((PROCESS_LIMIT + 1)) >> "$LOG_FILE"
    fi
    sleep 30
done
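The script relies on `bc` for the floating-point comparison, and `bc` is not always present on minimal images. As a hedged alternative, awk can do the same comparison; the function name `exceeds_threshold` below is hypothetical.

```shell
#!/bin/bash
# Succeed (exit status 0) when the first argument is strictly greater than the
# second. awk performs the floating-point comparison, so bc is not needed.
# (exceeds_threshold is an illustrative name.)
exceeds_threshold() {
    awk -v usage="$1" -v limit="$2" 'BEGIN { exit !(usage + 0 > limit + 0) }'
}

# Replacement for the bc-based test inside the loop:
#   if exceeds_threshold "$CPU_USAGE" "$THRESHOLD"; then
#       echo "[$TIMESTAMP] CPU Usage: $CPU_USAGE%" >> "$LOG_FILE"
#   fi
```

Adding zero to each value forces a numeric comparison, so an empty or malformed reading is treated as 0 and does not trigger a false alert.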
For AWS-specific monitoring:
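# If apt cannot find the package, install the agent from the .deb that AWS publishes for your region
sudo apt-get install amazon-cloudwatch-agent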
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
For deep diagnostics with perf:
# Record CPU usage for 60 seconds
sudo perf record -ag -F 999 -- sleep 60
# Generate report
sudo perf report --sort comm,dso