When troubleshooting CPU utilization differences between system-level tools and cloud monitoring services, we need to examine their fundamental measurement approaches:
// Shape of the CloudWatch CPU metric as it is queried (simplified; this describes the metric, not the collection mechanism)
const cloudWatchMetric = {
  metricName: 'CPUUtilization',
  namespace: 'AWS/EC2',
  dimensions: [ { Name: 'InstanceId', Value: 'i-1234567890abcdef0' } ],
  statistics: ['Average'],
  period: 300, // 5-minute intervals (basic monitoring)
  unit: 'Percent'
};
The discrepancy primarily stems from these technical aspects:
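- Sampling Frequency: With basic monitoring, CloudWatch publishes one datapoint every 5 minutes (every minute with detailed monitoring), while top refreshes every few seconds
- Measurement Scope: CloudWatch measures CPUUtilization at the hypervisor, outside the guest OS, so it does not see the guest's own split into user, system, iowait, and steal time
- Calculation Method: CloudWatch reports CPU usage as a percentage of the CPU capacity allocated to the instance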
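To make the sampling point concrete, here is a minimal Python sketch using made-up per-second readings (the numbers are hypothetical, not measured data). It shows how a short spike that an interactive tool would display is flattened once the samples are averaged over a 5-minute period.

```python
# Hypothetical per-second CPU readings over one 5-minute period (300 samples):
# 30 seconds at 90% busy, the remaining 270 seconds at 10% busy.
samples = [90.0] * 30 + [10.0] * 270


def period_average(values):
    """Average of all samples in one period, as a period-based Average statistic would report."""
    return sum(values) / len(values)


peak = max(samples)                # what an interactive tool could show during the spike
average = period_average(samples)  # what a 5-minute Average would show

print(f"peak sample: {peak:.1f}%")
print(f"5-minute average: {average:.1f}%")
```

With these inputs the spike reaches 90% while the period average is 18%, so both tools can be correct and still disagree.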
To verify the actual CPU usage, we can use more precise Linux tools:
# Get detailed CPU metrics (sample every second for 5 iterations)
sar -u 1 5
# Alternative method using mpstat
mpstat -P ALL 1 5
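# Compare with CloudWatch CLI query (with basic monitoring, datapoints exist only every 5 minutes, so this 5-minute window may return a single datapoint or none)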
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ" --date '-5 minutes') \
--end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--period 60 \
--statistics Average \
--output json
In AWS EC2 environments, several virtualization factors affect CPU measurements:
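# Check for CPU steal time (relevant for virtualized instances)
# /proc/stat columns: $2 user, $3 nice, $4 system, $5 idle, $6 iowait, $7 irq, $8 softirq, $9 steal, $10 guest
awk '/^cpu / {print $1,$2,$3,$4,$5,$6,$9,$10}' /proc/stat
# Output columns (cumulative ticks since boot):
# cpu user nice system idle iowait steal guest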
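Because the counters in /proc/stat are cumulative, a steal percentage only makes sense as the difference between two snapshots. The following Python sketch shows one way to compute it; the snapshot strings are hypothetical values, and the helper names `parse_cpu_line` and `steal_percent` are introduced here for illustration only. It assumes a kernel that exposes at least the first eight counters.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:])))


def steal_percent(before, after):
    """Share of the CPU ticks elapsed between two snapshots that was stolen."""
    # guest and guest_nice are already included in user and nice, so skip them.
    keys = FIELDS[:8]
    delta = {k: after[k] - before[k] for k in keys}
    total = sum(delta.values())
    return 100.0 * delta["steal"] / total if total else 0.0


# Hypothetical snapshots taken a few seconds apart (illustrative values).
first = parse_cpu_line("cpu 1000 0 500 8000 100 0 20 50 0 0")
second = parse_cpu_line("cpu 1300 0 600 8500 120 0 30 100 0 0")
print(f"steal: {steal_percent(first, second):.1f}%")
```

On a real instance the two lines would be read from /proc/stat with a short sleep in between.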
For deeper analysis, consider these approaches:
- Custom CloudWatch Metrics: Publish guest-level CPU metrics (user, system, iowait, steal) using the CloudWatch agent
- Unified Monitoring: Configure the agent's collection interval so guest-level metrics and CPUUtilization can be compared on the same timescale
- Baseline Comparison: Compare top readings with CloudWatch during known load patterns
# Example CloudWatch agent configuration for detailed CPU metrics
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_user",
          "cpu_usage_system",
          "cpu_usage_iowait",
          "cpu_usage_steal"
        ],
        "metrics_collection_interval": 60,
        "resources": ["*"],
        "totalcpu": true
      }
    }
  }
}
When tuning applications based on CPU metrics:
- For burstable instance types (T-series), monitor CPU credits alongside utilization
- Consider EC2 instance right-sizing if consistent discrepancies indicate misalignment
- Evaluate whether your application is more sensitive to system-level or hypervisor-level CPU constraints
The primary reason for the discrepancy lies in how these tools measure CPU utilization:
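// Approximate model of what a hypervisor-level metric such as CloudWatch counts as used: time spent executing (an assumed model, not a published formula)
total_cpu_used = 100% * (user_time + system_time + nice_time) / (user_time + system_time + nice_time + idle_time + iowait_time + irq_time + softirq_time + steal_time)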
In contrast, top typically uses a simpler calculation:
// Simplified top calculation (everything that is not idle counts as used)
cpu_usage = 100% * (1 - idle_time / total_time)
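As a worked comparison, the following Python sketch applies both formulas above to one set of hypothetical tick deltas. The numbers are illustrative only; the point is that the two definitions diverge whenever iowait, interrupt, or steal time is nonzero.

```python
# Hypothetical tick deltas over one sampling interval (not measured data).
ticks = {"user": 300, "nice": 0, "system": 100, "idle": 500,
         "iowait": 50, "irq": 0, "softirq": 10, "steal": 40}

total = sum(ticks.values())

# First formula: only user, system and nice time count as "used".
busy_only = 100.0 * (ticks["user"] + ticks["system"] + ticks["nice"]) / total

# Second formula: everything that is not idle counts as "used".
non_idle = 100.0 * (1 - ticks["idle"] / total)

print(f"user+system+nice: {busy_only:.1f}%")
print(f"100% - idle:      {non_idle:.1f}%")
```

Here the same interval reads as 40% under the first definition and 50% under the second, with the 10-point gap made up of iowait, softirq, and steal ticks.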
Several technical elements contribute to the variance:
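- Steal Time (EC2 specific): Virtual machines may experience CPU contention, which the guest reports as steal time (st in top)
- I/O Wait Handling: top reports I/O wait as a separate wa field, and a "100% - idle" reading counts it as busy even though the CPU is not executing instructions during that time
- Sampling Interval: CloudWatch defaults to 5-minute averages (1-minute with detailed monitoring), while top shows short-interval snapshots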
For more accurate comparisons, consider using these commands:
# mpstat (shows detailed CPU breakdown)
mpstat -P ALL 1 5
# Alternative using /proc/stat (cumulative ticks since boot)
grep '^cpu ' /proc/stat
CloudWatch has some EC2-specific behaviors:
# Example CloudWatch CLI call for CPU metrics (5-minute periods over one day)
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time 2023-10-01T00:00:00Z \
--end-time 2023-10-01T23:59:59Z \
--period 300 \
--statistics Average
To align your understanding:
- Enable detailed CloudWatch monitoring (1-minute granularity)
- Compare with mpstat averages over the same period
- Account for steal time in your analysis
Example reconciliation script:
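#!/bin/bash
# Compare CloudWatch and system CPU metrics
# Note: mpstat samples 1 second here, while CloudWatch averages a full period, so some difference is expected
set -euo pipefail
# IMDSv2: request a session token first (needed when IMDSv1 is disabled)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ" -d "5 minutes ago")
END_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Datapoints are not returned in chronological order, so sort and take the newest
CLOUDWATCH_CPU=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions "Name=InstanceId,Value=$INSTANCE_ID" \
  --start-time "$START_TIME" \
  --end-time "$END_TIME" \
  --period 60 \
  --statistics Average \
  --query 'sort_by(Datapoints, &Timestamp)[-1].Average' \
  --output text)
# Use only the Average line; a plain /all/ match also catches the per-sample line
SYSTEM_CPU=$(LC_ALL=C mpstat 1 1 | awk '/^Average:/ && /all/ {print 100 - $NF}')
echo "CloudWatch CPU: ${CLOUDWATCH_CPU}%"
echo "System CPU: ${SYSTEM_CPU}%"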
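If the CLI response is processed outside the shell, note that get-metric-statistics does not return datapoints in chronological order, so the first element is not necessarily the newest. The Python sketch below picks the latest datapoint from a JSON response; the response body is a hypothetical sample and `latest_average` is a helper name introduced here for illustration.

```python
import json
from datetime import datetime


def latest_average(cli_json):
    """Return (timestamp, average) of the newest datapoint, or None if there are none."""
    datapoints = json.loads(cli_json).get("Datapoints", [])
    if not datapoints:
        return None

    def when(dp):
        # Accept both a trailing 'Z' and an explicit '+00:00' offset.
        return datetime.fromisoformat(dp["Timestamp"].replace("Z", "+00:00"))

    newest = max(datapoints, key=when)
    return when(newest), newest["Average"]


# Hypothetical response body (illustrative values, deliberately out of order).
sample = '''{"Label": "CPUUtilization", "Datapoints": [
  {"Timestamp": "2023-10-01T00:10:00Z", "Average": 12.5, "Unit": "Percent"},
  {"Timestamp": "2023-10-01T00:00:00Z", "Average": 30.0, "Unit": "Percent"},
  {"Timestamp": "2023-10-01T00:05:00Z", "Average": 18.0, "Unit": "Percent"}
]}'''

timestamp, average = latest_average(sample)
print(timestamp.isoformat(), average)
```

Returning None for an empty list makes the "no data yet" case explicit, which is common right after an instance starts or when the window is shorter than the monitoring period.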