Resolving CPU Usage Discrepancy Between Linux top Command and AWS CloudWatch Metrics


When troubleshooting CPU utilization differences between system-level tools and cloud monitoring services, we need to examine their fundamental measurement approaches:

// Illustrative shape of the EC2 CPU metric as you would query it (simplified; not CloudWatch's internal collection logic)
const cloudWatchMetric = {
  metricName: 'CPUUtilization',
  namespace: 'AWS/EC2',
  dimensions: [ { Name: 'InstanceId', Value: 'i-1234567890abcdef0' } ],
  statistics: ['Average'],
  period: 300, // 5-minute intervals
  unit: 'Percent'
};

The discrepancy primarily stems from these technical aspects:

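  • Sampling Frequency: With basic monitoring, CloudWatch publishes EC2 metrics at 5-minute intervals (1-minute with detailed monitoring), while top refreshes every few seconds
  • Measurement Scope: CloudWatch measures CPUUtilization from outside the guest OS, as the physical CPU time EC2 spends running the instance, so virtualization effects such as steal time appear differently than they do in top
  • Calculation Method: CloudWatch reports one percentage of the instance's total CPU capacity, whereas top's per-process figures are relative to a single core by default and can exceed 100% on multi-core instances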
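
The sampling effect is easy to underestimate, so here is a minimal sketch of the arithmetic. The per-second readings are invented for illustration (270 quiet seconds followed by a 30-second spike), and `window_average` and `peak` are hypothetical helper names, not part of any AWS or Linux tool.

```python
from statistics import mean

# Hypothetical per-second CPU readings (percent) over one 5-minute window:
# 270 quiet seconds at 10% followed by a 30-second spike at 100%.
samples = [10.0] * 270 + [100.0] * 30


def window_average(readings):
    """Average all readings in the window, as a period-based metric would."""
    return mean(readings)


def peak(readings):
    """Highest single reading, closer to what a live view shows mid-spike."""
    return max(readings)


if __name__ == "__main__":
    print(f"5-minute average: {window_average(samples):.1f}%")
    print(f"peak reading:     {peak(samples):.1f}%")
```

With these made-up numbers the window averages 19% even though the instance spent 30 seconds at 100%, which is the kind of gap that shows up when a live top reading is compared against a 5-minute datapoint.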

To verify the actual CPU usage, we can use more precise Linux tools:

# Get detailed CPU metrics (sample every second for 5 iterations)
sar -u 1 5

# Alternative method using mpstat
mpstat -P ALL 1 5

# Compare with CloudWatch CLI query (uses GNU date; 60-second periods only return data when detailed monitoring is enabled)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ" --date '-5 minutes') \
  --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --period 60 \
  --statistics Average \
  --output json

In AWS EC2 environments, several virtualization factors affect CPU measurements:

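# Check for CPU steal time (relevant for virtualized instances)
# /proc/stat columns: cpu user nice system idle iowait irq softirq steal guest guest_nice
grep -E '^cpu ' /proc/stat | awk '{print $1,$2,$3,$4,$5,$6,$9,$10}'

# Expected output format (cumulative jiffies since boot):
# cpu user nice system idle iowait steal guest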

For deeper analysis, consider these approaches:

  1. Custom CloudWatch Metrics: Push more granular system metrics using the CloudWatch agent
  2. Unified Monitoring: Configure the CloudWatch agent to collect system-level metrics at higher frequency
  3. Baseline Comparison: Compare top metrics with CloudWatch during known load patterns

Example CloudWatch agent configuration for detailed CPU metrics:

{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_user",
          "cpu_usage_system",
          "cpu_usage_iowait",
          "cpu_usage_steal"
        ],
        "metrics_collection_interval": 60,
        "resources": ["*"],
        "totalcpu": true
      }
    }
  }
}

When tuning applications based on CPU metrics:

  • For burstable instance types (T-series), monitor CPU credits alongside utilization
  • Consider EC2 instance right-sizing if consistent discrepancies indicate misalignment
  • Evaluate whether your application is more sensitive to system-level or hypervisor-level CPU constraints
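
For burstable instances, the credit arithmetic can be sketched as follows. It relies on the documented definition that one CPU credit equals one vCPU running at 100% for one minute, and it assumes the utilization figure is an average across all vCPUs. The earn rate is passed in as a parameter; the value 12 used in the example is illustrative only and should be checked against the documentation for your instance type. Both function names are hypothetical.

```python
def credits_spent_per_hour(avg_utilization_pct, vcpus):
    """Credits consumed in one hour.

    One credit = one vCPU at 100% utilization for one minute, so an hour at
    the given average utilization costs vcpus * utilization * 60 credits.
    """
    return vcpus * (avg_utilization_pct / 100.0) * 60


def net_credits_per_hour(avg_utilization_pct, vcpus, earned_per_hour):
    """Positive means the credit balance grows; negative means it drains."""
    return earned_per_hour - credits_spent_per_hour(avg_utilization_pct, vcpus)


if __name__ == "__main__":
    # Illustrative inputs: 2 vCPUs at 25% average, earning 12 credits/hour.
    print(net_credits_per_hour(25, 2, 12))
```

A negative result suggests the balance is draining at that load, which is worth checking against the CPUCreditBalance metric before drawing conclusions from utilization alone.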

The primary reason for the discrepancy lies in how these tools measure CPU utilization:

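// Guest-side approximation of a "busy" percentage built from /proc/stat counters
// (CloudWatch itself measures outside the guest, so treat this only as an approximation)
total_cpu_used = 100 * (user_time + system_time + nice_time) / (user_time + system_time + nice_time + idle_time + iowait_time + irq_time + softirq_time + steal_time)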

In contrast, top typically uses a simpler calculation:

// Simplified top calculation (everything that is not idle counts as busy)
cpu_usage = 100 * (1 - idle_time / total_time)
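
To see how far the two formulas above can drift apart, here is a small sketch that applies both to the same set of counter deltas. The numbers are invented, and the function names are hypothetical; the point is only that iowait, irq, softirq and steal land in the gap between the two results.

```python
def busy_user_system_nice(d):
    """100 * (user + system + nice) / total, as in the first formula."""
    total = sum(d.values())
    if not total:
        return 0.0
    return 100.0 * (d["user"] + d["system"] + d["nice"]) / total


def busy_non_idle(d):
    """100 * (1 - idle / total), as in the simplified top formula."""
    total = sum(d.values())
    if not total:
        return 0.0
    return 100.0 * (1 - d["idle"] / total)


# Made-up counter deltas (jiffies) between two /proc/stat samples.
deltas = {"user": 300, "nice": 0, "system": 150, "idle": 500,
          "iowait": 20, "irq": 0, "softirq": 10, "steal": 20}

if __name__ == "__main__":
    a = busy_user_system_nice(deltas)
    b = busy_non_idle(deltas)
    print(f"user+system+nice: {a:.1f}%  non-idle: {b:.1f}%  gap: {b - a:.1f}")
```

With these inputs the first formula gives 45% and the second 50%; the 5-point gap is exactly the share of time spent in iowait, softirq and steal.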

Several technical elements contribute to the variance:

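  • Steal Time: On virtualized hosts the hypervisor may run other work while a vCPU is ready to run; the guest reports this as steal (st in top)
  • I/O Wait Handling: The guest reports iowait (wa in top) separately from idle, so a "non-idle" calculation counts it as busy even though a vCPU waiting on I/O typically consumes little physical CPU
  • Sampling Interval: CloudWatch publishes 5-minute averages by default (1-minute with detailed monitoring) while top shows short-interval snapshots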

For more accurate comparisons, consider using these commands:

# mpstat (shows detailed CPU breakdown)
mpstat -P ALL 1 5

# Alternative using /proc/stat (cumulative counters since boot)
grep '^cpu ' /proc/stat

CloudWatch has some EC2-specific behaviors:

# Example CloudWatch API call for CPU metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2023-10-01T00:00:00Z \
  --end-time 2023-10-01T23:59:59Z \
  --period 300 \
  --statistics Average

To align your understanding:

  1. Enable detailed CloudWatch monitoring (1-minute granularity)
  2. Compare with mpstat averages over the same period
  3. Account for steal time in your analysis

Example reconciliation script:

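#!/bin/bash
# Compare CloudWatch and system CPU metrics.
# Assumes GNU date, the sysstat package (mpstat), and configured AWS CLI credentials.
set -euo pipefail

# Fetch the instance ID via IMDSv2 (token-based; also works when IMDSv1 is disabled)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ" -d "5 minutes ago")
END_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

# Datapoints are not returned in chronological order, so sort before taking the newest.
# Without detailed monitoring the window may contain no datapoint and this prints "None".
CLOUDWATCH_CPU=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions "Name=InstanceId,Value=$INSTANCE_ID" \
  --start-time "$START_TIME" \
  --end-time "$END_TIME" \
  --period 60 \
  --statistics Average \
  --query 'sort_by(Datapoints, &Timestamp)[-1].Average' \
  --output text)

# Use only the "Average:" summary line; LC_ALL=C keeps a dot as the decimal separator.
SYSTEM_CPU=$(LC_ALL=C mpstat 1 5 | awk '/^Average:/ && $2 == "all" {print 100 - $NF}')

echo "CloudWatch CPU: ${CLOUDWATCH_CPU}%"
echo "System CPU: ${SYSTEM_CPU}%"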
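
If you would rather post-process the CLI's JSON output than rely on a query expression, the following sketch picks the newest datapoint explicitly, since get-metric-statistics does not return datapoints in chronological order. The sample response is fabricated for illustration, and `latest_average` is a hypothetical helper; the timestamp handling assumes the ISO 8601 strings that the AWS CLI prints.

```python
import json
from datetime import datetime


def latest_average(response_json):
    """Return (timestamp, average) of the newest datapoint, or None if empty."""
    datapoints = json.loads(response_json).get("Datapoints", [])
    if not datapoints:
        return None

    def parsed_timestamp(dp):
        # Normalise a trailing 'Z' so datetime.fromisoformat accepts it.
        return datetime.fromisoformat(dp["Timestamp"].replace("Z", "+00:00"))

    newest = max(datapoints, key=parsed_timestamp)
    return newest["Timestamp"], newest["Average"]


# Fabricated response, deliberately out of order.
sample = json.dumps({
    "Label": "CPUUtilization",
    "Datapoints": [
        {"Timestamp": "2023-10-01T00:05:00Z", "Average": 12.5, "Unit": "Percent"},
        {"Timestamp": "2023-10-01T00:10:00Z", "Average": 30.0, "Unit": "Percent"},
        {"Timestamp": "2023-10-01T00:00:00Z", "Average": 8.0, "Unit": "Percent"},
    ],
})

if __name__ == "__main__":
    print(latest_average(sample))
```

Returning None for an empty result makes the "no data yet" case explicit, which is common when detailed monitoring is off or the window is very recent.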