Discrepancy in CPU Monitoring: AWS EC2 Shows 100% While Top Reports 20% During Python Database Operations



When working with EC2 instances, it's crucial to recognize that AWS CloudWatch measures CPU utilization at the hypervisor level, while tools like top or htop report CPU usage from the guest OS perspective. This fundamental difference explains why you might see conflicting metrics.

The 80% gap in your observation likely stems from these sources:

  • Steal Time (ST): When other VMs on the same physical host demand resources
  • Network I/O Wait: Particularly important for database operations between instances
  • Hypervisor Overhead: AWS's virtualization layer consumption
  • Interrupt Handling: Network and storage interrupts not fully accounted for in top
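Because top's default per-process view hides most of these categories, it can help to read the aggregate counters from /proc/stat directly. A minimal sketch, assuming the standard layout of the first "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal):

# Sample /proc/stat twice and print the whole-machine CPU-time breakdown,
# including iowait and steal, which a single process's %CPU in top never shows.
import time

FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

def read_cpu_times():
    with open("/proc/stat") as f:
        # First line is the aggregate "cpu" row; values are cumulative clock ticks.
        return [int(v) for v in f.readline().split()[1:1 + len(FIELDS)]]

before = read_cpu_times()
time.sleep(5)                    # sample while the insertion script is running
after = read_cpu_times()

deltas = [b - a for a, b in zip(before, after)]
total = sum(deltas) or 1
for name, delta in zip(FIELDS, deltas):
    print(f"{name:8s} {100 * delta / total:5.1f}%")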

To pinpoint the exact cause, run these commands while your script executes:

# Check for steal time
vmstat 1 5

# Detailed CPU breakdown
mpstat -P ALL 1

# Network-specific metrics
ifstat -t 1

# Alternative process viewer
sudo apt install sysstat
pidstat -u 1

For database operations causing high I/O wait, consider batching your inserts:

import psycopg2
from psycopg2 import sql

def batch_insert(records, batch_size=1000):
    """Insert a list of (col1, col2) tuples in batches of batch_size rows."""
    conn = psycopg2.connect("dbname=test user=postgres")
    cur = conn.cursor()

    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        # psycopg2 adapts each tuple to a parenthesised value list, so one
        # statement (and one commit) covers the whole batch.
        query = sql.SQL("INSERT INTO my_table (col1, col2) VALUES {}").format(
            sql.SQL(',').join(map(sql.Literal, batch))
        )
        try:
            cur.execute(query)
            conn.commit()
        except Exception as e:
            conn.rollback()
            print(f"Batch failed: {e}")

    cur.close()
    conn.close()
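psycopg2's extras module also ships a helper that does the same batching with less SQL composition. A minimal sketch, assuming the same my_table layout as above:

import psycopg2
from psycopg2.extras import execute_values

records = [("a", 1), ("b", 2)]   # same shape as the tuples passed to batch_insert

conn = psycopg2.connect("dbname=test user=postgres")
with conn, conn.cursor() as cur:
    # execute_values expands the single %s into multi-row VALUES lists,
    # sending page_size rows per statement; the connection context commits.
    execute_values(
        cur,
        "INSERT INTO my_table (col1, col2) VALUES %s",
        records,
        page_size=1000,
    )
conn.close()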

For more accurate metrics, set up the CloudWatch agent with custom metrics:

# Install CloudWatch agent
sudo yum install amazon-cloudwatch-agent

# Configure custom metrics
{
    "metrics": {
        "append_dimensions": {
            "InstanceId": "${aws:InstanceId}"
        },
        "metrics_collected": {
            "cpu": {
                "measurement": [
                    "usage_steal",
                    "usage_iowait",
                    "usage_user",
                    "usage_system"
                ],
                "resources": ["*"],
                "totalcpu": true
            },
            "net": {
                "measurement": [
                    "bytes_sent",
                    "bytes_recv"
                ]
            }
        }
    }
}
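Once the agent is running, its per-category CPU metrics land in CloudWatch (by default under the CWAgent namespace), where you can line them up against the standard hypervisor-side figure. A sketch of pulling the latter with boto3; the region and instance ID are placeholders:

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder region
now = datetime.datetime.utcnow()

# Hypervisor-side view: the standard AWS/EC2 CPUUtilization metric for the instance.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=now - datetime.timedelta(minutes=30),
    EndTime=now,
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])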

Persistent high steal time (>10%) points to noisy neighbors on the shared physical host. Consider these alternatives:

  • Switch to a Dedicated Host (expensive but predictable)
  • Use compute-optimized instances (C5/C6 family)
  • Try burstable instances (T3/T4g) with unlimited mode

When running a Python database insertion script between EC2 instances, you're seeing conflicting CPU metrics:

EC2 CloudWatch: 100% CPU utilization
top command:   20% CPU for Python process

This discrepancy stems from fundamental monitoring differences:

  • CloudWatch measures total vCPU consumption as seen by the hypervisor, across all cores
  • top's per-process figure counts only the Python process's own time, as a percentage of a single core

The rest of the instance's capacity is typically consumed outside that process by:
1. Network Stack Processing (softirqs)
2. Disk I/O Wait States
3. Hypervisor Overhead
4. Interrupt Handling
5. Kernel Workers (kworker threads)
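A quick way to see these categories alongside the insertion process's own share is psutil (a third-party package, pip install psutil); a minimal sketch:

# System-wide CPU-time categories vs. one process's share, over the same window.
import os
import psutil

proc = psutil.Process(os.getpid())   # substitute the PID of the insertion script
proc.cpu_percent(None)               # prime the per-process counter

system = psutil.cpu_times_percent(interval=1.0)   # blocks for the 1 s sample
process = proc.cpu_percent(None)                  # % of one core over that window

print(f"process      {process:5.1f}% of one core")
print(f"user         {system.user:5.1f}%")
print(f"system       {system.system:5.1f}%")
print(f"iowait       {system.iowait:5.1f}%")
print(f"irq+softirq  {system.irq + system.softirq:5.1f}%")
print(f"steal        {system.steal:5.1f}%")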

Run these to identify hidden CPU consumers:

# Show all CPU usage (user + system + steal)
mpstat -P ALL 1 5

# Check for I/O wait
vmstat 1 5

# Monitor network interrupts (the interface may be ens5 rather than eth0 on Nitro instances)
grep -E 'eth|ens' /proc/interrupts

# Kernel worker threads
ps aux | grep kworker

For database insert workloads, consider these patterns:

# Bulk insert instead of single-row transactions
with connection.cursor() as cur:
    # mogrify returns bytes; decode before splicing into the statement
    args_str = b','.join(cur.mogrify("(%s,%s)", x) for x in data).decode()
    cur.execute(f"INSERT INTO my_table (col1, col2) VALUES {args_str}")
connection.commit()

# Enlarge socket buffers for network-heavy workloads (psycopg2 manages its own
# socket, so the sysctl settings below are usually the more practical lever)
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1 << 20)  # 1 MiB, capped by net.core.wmem_max

Adjust these parameters in /etc/sysctl.conf and apply them with sudo sysctl -p:

# Network stack optimization
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

For comprehensive visibility:

# Install and run htop for threaded view
sudo apt install htop
htop

# Kernel-level monitoring
sudo apt install sysstat
pidstat -urd -h 1