How to Diagnose and Troubleshoot High CPU Usage in Linux: Identifying Resource-Intensive Processes



That moment when your EC2 instance becomes unresponsive and SSH connections time out is every sysadmin's nightmare. The Amazon monitoring graph clearly shows CPU pegged at 100%, but without process-level visibility, you're left guessing what caused the spike.

While the system is still barely responsive, try these quick commands before resorting to a reboot:


# Quick process snapshot (works even with high load)
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 20

# Alternative when ps is too slow
top -b -n 1 | head -n 20
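
If even top is too slow to start, the kernel counters in /proc/stat can be read with shell builtins alone. The following is a minimal sketch, not a replacement for the tools above: the function name cpu_busy_pct is hypothetical, and it assumes the standard field order of the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/bash
# Hypothetical helper: busy CPU percentage between two aggregate "cpu" lines
# taken from /proc/stat at different times (integer result, rounded down).
cpu_busy_pct() {
    local -a a b
    read -r -a a <<< "$1"
    read -r -a b <<< "$2"
    local total_a=0 total_b=0 i
    # Indexes 1-8: user nice system idle iowait irq softirq steal
    for i in 1 2 3 4 5 6 7 8; do
        total_a=$(( total_a + ${a[i]:-0} ))
        total_b=$(( total_b + ${b[i]:-0} ))
    done
    # Treat idle + iowait as "not busy"
    local idle_a=$(( ${a[4]:-0} + ${a[5]:-0} ))
    local idle_b=$(( ${b[4]:-0} + ${b[5]:-0} ))
    local dtotal=$(( total_b - total_a ))
    local didle=$(( idle_b - idle_a ))
    if (( dtotal <= 0 )); then
        echo 0
        return
    fi
    echo $(( 100 * (dtotal - didle) / dtotal ))
}

# Usage sketch on a live system:
#   read -r first < /proc/stat; sleep 1; read -r second < /proc/stat
#   echo "CPU busy: $(cpu_busy_pct "$first" "$second")%"
```

Because it needs no external binaries between the two reads, this tends to keep working when forking new processes is already painfully slow.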

Once you've regained access, dig deeper with these forensic tools:


# Install sysstat if not present
sudo yum install sysstat -y  # For Amazon Linux
sudo apt-get install sysstat # For Ubuntu/Debian

# Sample live CPU usage (3 samples, 1 second apart); system-wide, not per process
sar -u ALL 1 3

# Check for CPU steal (common in virtualized environments)
# Matching only "steal" would print just the header row, so keep the "all" data rows too
sar -u ALL 1 3 | grep -E 'steal|all'
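
To pull out only the %steal values, the column can be located by its header name rather than by position. This is a sketch under the assumption that the sar output contains a header row with a %steal field; the function name steal_from_sar is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the %steal value from each data row of sar output.
# The column is found by name and counted from the END of the line, so it also
# works on "Average:" rows, which can have fewer leading fields than timed rows.
steal_from_sar() {
    awk '
        /%steal/ {
            for (i = 1; i <= NF; i++)
                if ($i == "%steal") { off = NF - i; seen = 1 }
            next
        }
        seen && NF > off && $(NF - off) ~ /^[0-9.]+$/ { print $(NF - off) }
    '
}

# Usage sketch:
#   sar -u ALL 1 3 | steal_from_sar
```

Sustained nonzero steal suggests the hypervisor is withholding CPU time, which points away from any single process on the instance.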

Prevent future blindspots by configuring these monitoring solutions:


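# Keep 30 days of sysstat history (path below is for RHEL/Amazon Linux;
# on Debian/Ubuntu, HISTORY is set in /etc/sysstat/sysstat and collection
# is switched on with ENABLED="true" in /etc/default/sysstat)
sudo sed -i 's/^HISTORY=.*/HISTORY=30/' /etc/sysconfig/sysstat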
sudo systemctl enable sysstat
sudo systemctl start sysstat

# Install and configure atop for advanced process accounting
sudo yum install atop -y
sudo systemctl enable atop
sudo systemctl start atop

Create this watchdog script to catch CPU spikes before they become critical:


#!/bin/bash
THRESHOLD=90
ALERT_EMAIL="admin@example.com"

while true; do
    # Extract the idle percentage from top's summary line and convert it to busy percentage
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
    if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
        PROCESS_LIST=$(ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 10)
        echo -e "High CPU usage detected: ${CPU_USAGE}%\n\nTop processes:\n${PROCESS_LIST}" | \
        mail -s "CPU Alert on $(hostname)" "$ALERT_EMAIL"
    fi
    sleep 60
done
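
As written, the watchdog sends one email per minute for as long as the CPU stays above the threshold. A cooldown check is one way to limit that. The sketch below is an assumption about how it could be added, not part of the original script; should_alert, LAST_ALERT, and COOLDOWN are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) only when at least COOLDOWN
# seconds have passed since the previous alert was sent.
should_alert() {
    local now=$1 last=$2 cooldown=$3
    (( now - last >= cooldown ))
}

# Usage sketch inside the watchdog loop:
#   COOLDOWN=900      # seconds between emails
#   LAST_ALERT=0      # epoch time of the last email
#   NOW=$(date +%s)
#   if should_alert "$NOW" "$LAST_ALERT" "$COOLDOWN"; then
#       ... send the mail as above ...
#       LAST_ALERT=$NOW
#   fi
```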

For AWS environments, combine OS-level tools with CloudWatch metrics:


# Install CloudWatch agent for enhanced metrics
sudo yum install amazon-cloudwatch-agent -y

# Agent configuration file (JSON): collect CPU and memory metrics for processes matching the pattern
{
    "metrics": {
        "append_dimensions": {
            "InstanceId": "${aws:InstanceId}"
        },
        "metrics_collected": {
            "procstat": [
                {
                    "pattern": "java|python|node",
                    "measurement": [
                        "cpu_usage",
                        "memory_rss"
                    ]
                }
            ]
        }
    }
}

When your EC2 instance hits 100% CPU utilization, it often becomes unresponsive - exactly when you need diagnostic tools the most. Here's how to prepare for and investigate these situations:

Install these tools before you encounter issues:


# For historical process tracking
sudo apt-get install sysstat atop
# For real-time monitoring
sudo apt-get install htop glances

When SSH fails, use these AWS-specific approaches:


# 1. AWS Systems Manager (SSM)
aws ssm start-session --target instance-id

# 2. EC2 Serial Console
# Requires IAM permissions and console access

Once you regain access, gather evidence:

Using SAR for Historical Data


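# Show CPU usage history (adjust -f for a specific date;
# Debian/Ubuntu store these files under /var/log/sysstat/ instead of /var/log/sa/)
sar -u -f /var/log/sa/sa$(date +%d -d yesterday)

# Show run-queue length and load averages (sar keeps no per-process data)
sar -q -f /var/log/sa/sa$(date +%d -d yesterday)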

ATOP Process Accounting


# View recorded process activity
atop -r /var/log/atop/atop_$(date +%Y%m%d -d yesterday)

When the system is responsive:


# Batch version showing PIDs and commands
ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -n 20

# Interactive version with thread details ("high_cpu_process" is a placeholder for the process name)
top -H -p $(pgrep -d',' -f "high_cpu_process")
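
When the load comes from many worker processes of the same program, no single row in the ps output stands out. The following sketch sums CPU per command name. The function name cpu_by_command is hypothetical, and it assumes input in the format produced by `ps -eo pcpu,comm` (one header row, then percentage followed by command name).

```shell
#!/bin/bash
# Hypothetical helper: sum %CPU per command name from "ps -eo pcpu,comm" output
# read on stdin, highest total first.
cpu_by_command() {
    awk '
        NR > 1 {
            pct = $1
            $1 = ""                  # drop the percentage, keep the rest as the name
            sub(/^[ \t]+/, "")       # command names may contain spaces
            total[$0] += pct
        }
        END { for (name in total) printf "%.1f %s\n", total[name], name }
    ' | sort -rn
}

# Usage sketch:
#   ps -eo pcpu,comm | cpu_by_command | head -n 5
```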

Create a monitoring script in /usr/local/bin/monitor_cpu.sh:


#!/bin/bash
THRESHOLD=90
LOG_FILE=/var/log/cpu_monitor.log
PROCESS_LIMIT=5

while true; do
    TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
    # Extract the idle percentage from top's summary line and convert it to busy percentage
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
    
    if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
        echo "[$TIMESTAMP] CPU Usage: $CPU_USAGE%" >> "$LOG_FILE"
        ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -n "$PROCESS_LIMIT" >> "$LOG_FILE"
    fi
    
    sleep 30
done
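
The parsing and comparison steps in this script depend on sed and bc, and bc is not installed on every minimal image. An awk-only variant is sketched below. Both function names are hypothetical, and the parser assumes a procps-style summary line in which the idle value is followed by "id" and fields are comma-separated (running top under LC_ALL=C avoids comma decimal separators).

```shell
#!/bin/bash
# Hypothetical helper: busy percentage from a top "Cpu(s)" summary line.
# Splitting on commas also handles lines where top prints "ni,100.0 id"
# with no space after the comma.
cpu_busy_from_top_line() {
    awk -F',' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /id[ \t]*$/) { printf "%.1f\n", 100 - ($i + 0); exit }
    }' <<< "$1"
}

# Hypothetical helper: succeed when value > threshold (floating point, no bc needed).
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !((v + 0) > (t + 0)) }'
}

# Usage sketch:
#   LINE=$(LC_ALL=C top -bn1 | grep "Cpu(s)")
#   CPU_USAGE=$(cpu_busy_from_top_line "$LINE")
#   if exceeds "$CPU_USAGE" "$THRESHOLD"; then ... fi
```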

For AWS-specific monitoring:


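# On Ubuntu/Debian the agent is normally installed from the amazon-cloudwatch-agent.deb
# package that AWS publishes (it is not in the default apt repositories)
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb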
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

For deep diagnostics with perf:


# Record CPU usage for 60 seconds
sudo perf record -ag -F 999 -- sleep 60

# Generate report
sudo perf report --sort comm,dso