How to Diagnose and Troubleshoot High CPU Usage in Linux: Identifying Resource-Intensive Processes


That moment when your EC2 instance becomes unresponsive and SSH connections time out is every sysadmin's nightmare. The CloudWatch graph clearly shows CPU pegged at 100%, but without process-level visibility you're left guessing what caused the spike.

When the system is still barely responsive, try these quick commands before resorting to a reboot:


# Quick process snapshot (works even with high load)
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 20

# Alternative when ps is too slow
top -b -n 1 | head -n 20
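
If even those commands are slow to return, the load average is a cheaper first signal, since reading /proc/loadavg does not walk the process table. The sketch below assumes a Linux /proc filesystem; load_per_core is a hypothetical helper name, not something from the tools above. It divides the 1-minute load average by the core count, so a value well above 1.00 suggests more runnable work than the CPUs can serve.

```shell
#!/usr/bin/env bash
# Sketch: normalize the 1-minute load average by the number of cores.
# load_per_core is a hypothetical helper (an assumption for illustration).

load_per_core() {
    local loadavg_line="$1" cores="$2"
    awk -v l="$loadavg_line" -v c="$cores" 'BEGIN {
        split(l, f, " ")          # f[1] is the 1-minute load average
        if (c <= 0) c = 1         # guard against a bad core count
        printf "%.2f\n", f[1] / c
    }'
}

# Usage on a live Linux host:
#   load_per_core "$(cat /proc/loadavg)" "$(nproc)"
```

Load average also counts tasks blocked on disk I/O, so a high value here is a prompt to look closer rather than proof of a CPU-bound process.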

Once you've regained access, dig deeper with these forensic tools:


# Install sysstat if not present
sudo yum install sysstat -y  # For Amazon Linux
sudo apt-get install sysstat # For Ubuntu/Debian

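# Sample system-wide CPU usage live (3 samples, 1 second apart); this is not per-process
sar -u ALL 1 3

# Check for CPU steal (common in virtualized environments): read the %steal column.
# Piping through "grep -i steal" would print only the header rows, not the values.
sar -u 1 3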
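
To pull out the steal value itself rather than reading it off the table, the column can be located by its header name. This is a minimal sketch, assuming sar's usual layout (a header row containing %steal and a final "Average:" row); steal_from_sar is a hypothetical helper name. It counts the column's distance from the end of the line, because the header row may carry an extra AM/PM field that the Average row lacks.

```shell
#!/usr/bin/env bash
# Sketch: extract the average %steal from sar output read on stdin.
# steal_from_sar is a hypothetical helper (an assumption for illustration).

steal_from_sar() {
    awk '
        # Header row: remember how far %steal sits from the last column
        /%steal/ {
            for (i = 1; i <= NF; i++) if ($i == "%steal") off = NF - i
            next
        }
        # Summary row: print the value found at that same offset
        /^Average:/ && off != "" { print $(NF - off) }
    '
}

# Usage on a live host (LC_ALL=C keeps the decimal separator a dot):
#   LC_ALL=C sar -u ALL 1 3 | steal_from_sar
```

A steal value that stays in double digits usually points at contention on the hypervisor rather than at any process inside the instance.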

Prevent future blind spots by configuring these monitoring solutions:


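# Configure sysstat to keep 30 days of history
# RHEL/Amazon Linux path shown; on Debian/Ubuntu, HISTORY lives in /etc/sysstat/sysstat
# and collection is switched on with ENABLED="true" in /etc/default/sysstat
sudo sed -i 's/^HISTORY=.*/HISTORY=30/' /etc/sysconfig/sysstat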
sudo systemctl enable sysstat
sudo systemctl start sysstat

# Install and configure atop for advanced process accounting
sudo yum install atop -y
sudo systemctl enable atop
sudo systemctl start atop

Create this watchdog script to catch CPU spikes before they become critical:


#!/bin/bash
THRESHOLD=90
ALERT_EMAIL="admin@example.com"

while true; do
    # Capture the idle percentage from top's summary line and subtract it from 100
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)%* id.*/\1/' | awk '{print 100 - $1}')
    if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
        PROCESS_LIST=$(ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 10)
        echo -e "High CPU usage detected: ${CPU_USAGE}%\n\nTop processes:\n${PROCESS_LIST}" | \
        mail -s "CPU Alert on $(hostname)" "$ALERT_EMAIL"
    fi
    sleep 60
done
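
Parsing top's summary line is fragile, because its layout differs between procps versions and locales. A sketch of an alternative, assuming a Linux /proc/stat, is to take two samples of the aggregate "cpu" line and compute the busy share from the counter deltas. The names cpu_busy_pct and current_cpu_busy are hypothetical helpers, not part of the script above.

```shell
#!/usr/bin/env bash
# Sketch: CPU busy percentage from two samples of the first /proc/stat line.
# Field layout: cpu user nice system idle iowait irq softirq steal [guest guest_nice]
# cpu_busy_pct and current_cpu_busy are hypothetical helpers (assumptions).

cpu_busy_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, " "); split(b, y, " ")
        # Sum the deltas of user..steal; guest time is already counted in user
        for (i = 2; i <= 9 && i <= n; i++) total += y[i] - x[i]
        idle = (y[5] - x[5]) + (y[6] - x[6])   # idle + iowait
        if (total <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * (total - idle) / total
    }'
}

current_cpu_busy() {
    local before after
    read -r before < /proc/stat      # first line is the aggregate "cpu" row
    sleep "${1:-1}"                  # sampling interval in seconds
    read -r after < /proc/stat
    cpu_busy_pct "$before" "$after"
}

# Usage: CPU_USAGE=$(current_cpu_busy 1)
```

Because the result comes from counter deltas over a known interval, it reflects the whole interval rather than a single point-in-time refresh.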

For AWS environments, combine OS-level tools with CloudWatch metrics:


# Install CloudWatch agent for enhanced metrics
sudo yum install amazon-cloudwatch-agent -y

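# Configure custom metrics to monitor specific processes.
# The JSON below is the agent's configuration file, not a shell command; "pattern" is a regex matched against each process's command line.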
{
    "metrics": {
        "append_dimensions": {
            "InstanceId": "${aws:InstanceId}"
        },
        "metrics_collected": {
            "procstat": [
                {
                    "pattern": "java|python|node",
                    "measurement": [
                        "cpu_usage",
                        "memory_rss"
                    ]
                }
            ]
        }
    }
}

When your EC2 instance hits 100% CPU utilization, it often becomes unresponsive, which is exactly when you need diagnostic tools the most. Here's how to prepare for and investigate these situations:

Install these tools before you encounter issues:


# For historical process tracking
sudo apt-get install sysstat atop
# For real-time monitoring
sudo apt-get install htop glances

When SSH fails, use these AWS-specific approaches:


# 1. AWS Systems Manager (SSM)
# Replace instance-id with the instance's ID; the SSM agent must be running on the instance
aws ssm start-session --target instance-id

# 2. EC2 Serial Console
# Requires IAM permissions and console access

Once you regain access, gather evidence:

Using SAR for Historical Data


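# Show CPU usage history (adjust -f for a specific date)
# RHEL/Amazon Linux path shown; Debian/Ubuntu store these files in /var/log/sysstat/
sar -u -f /var/log/sa/sa$(date +%d -d yesterday)

# Show run-queue length and load averages (sar does not record per-process data; use atop for that)
sar -q -f /var/log/sa/sa$(date +%d -d yesterday)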
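
A full day of sar output is long, so it helps to jump straight to the worst interval. This is a minimal sketch, assuming the usual sar -u layout in which %idle is the last column and the aggregate rows are labelled "all"; sar_peak is a hypothetical helper name. It reports the timestamp of the row with the lowest idle percentage.

```shell
#!/usr/bin/env bash
# Sketch: find the busiest interval in sar -u output read on stdin.
# sar_peak is a hypothetical helper (an assumption for illustration).

sar_peak() {
    awk '
        # Data rows only: aggregate "all" rows whose last field (%idle) is numeric
        $NF ~ /^[0-9.]+$/ && !/^Average/ && / all / {
            if (min == "" || $NF + 0 < min) { min = $NF + 0; when = $1 }
        }
        END { if (min != "") printf "%s busy=%.2f%%\n", when, 100 - min }
    '
}

# Usage on a live host:
#   LC_ALL=C sar -u -f /var/log/sa/sa$(date +%d -d yesterday) | sar_peak
```

The timestamp it prints is a starting point for the atop recording, which can then show which processes were running at that time.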

ATOP Process Accounting


# View recorded process activity
atop -r /var/log/atop/atop_$(date +%Y%m%d -d yesterday)

When the system is responsive:


# Batch version showing PIDs and commands
ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -n 20

# Interactive version with thread details
top -H -p $(pgrep -d',' -f "high_cpu_process")

Create a monitoring script in /usr/local/bin/monitor_cpu.sh:


#!/bin/bash
THRESHOLD=90
LOG_FILE=/var/log/cpu_monitor.log
PROCESS_LIMIT=5

while true; do
    TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
    # Capture the idle percentage from top's summary line and subtract it from 100
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)%* id.*/\1/' | awk '{print 100 - $1}')
    
    if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
        echo "[$TIMESTAMP] CPU Usage: $CPU_USAGE%" >> "$LOG_FILE"
        ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -n "$PROCESS_LIMIT" >> "$LOG_FILE"
    fi
    
    sleep 30
done

For AWS-specific monitoring:


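# On Ubuntu/Debian the agent is usually installed from the .deb that AWS publishes (amd64 shown),
# since it is not in the default apt repositories
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
# Load the JSON configuration and start the agent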
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

For deep diagnostics with perf:


# Record call stacks on all CPUs for 60 seconds (sampling at 999 Hz)
sudo perf record -ag -F 999 -- sleep 60

# Generate report
sudo perf report --sort comm,dso