How to Monitor GPU Utilization on AWS EC2 Ubuntu Instances for AI/ML Workloads



When running GPU-intensive workloads like deep learning training or CUDA-accelerated applications on AWS EC2 instances (P3, P4, G4/G5 series), verifying actual GPU utilization is crucial. Here are the most effective methods:

nvidia-smi is the primary tool for monitoring NVIDIA GPUs on Linux instances. First, ensure the NVIDIA driver utilities are installed:

sudo apt update
sudo apt install -y nvidia-utils-535   # replace 535 with the driver branch installed on your instance

Then run the monitoring command:

watch -n 1 nvidia-smi

This displays real-time metrics including:

  • GPU utilization percentage
  • Memory usage
  • Active processes
  • Power draw and temperature
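
If you also want to see which processes are responsible for the load, nvidia-smi can report per-process GPU memory directly; a minimal sketch using standard query flags:

# Show which processes currently hold GPU memory, and how much
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv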

For production environments, NVIDIA's Data Center GPU Manager (DCGM) provides deeper insights:

sudo apt install -y datacenter-gpu-manager   # provided by NVIDIA's apt repository; add it if apt cannot find this package
sudo systemctl --now enable nvidia-dcgm
dcgmi discovery -l                           # list GPUs visible to DCGM
dcgmi dmon -e 203,204,1001 -c 5              # sample GPU utilization, memory-copy utilization, and graphics-engine activity
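
For ad-hoc logging without extra tooling, that dmon sampling can be wrapped in a small shell loop; a sketch (the log path /tmp/dcgm-gpu-util.log is just an example):

# Append one DCGM sample per minute to a log file for later review
while true; do
  dcgmi dmon -e 203,204,1001 -c 1 >> /tmp/dcgm-gpu-util.log
  sleep 60
done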

AWS provides native GPU monitoring through CloudWatch. Attach a policy like the following to your instance's IAM role so the agent can publish metrics:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "ec2:DescribeTags"
      ],
      "Resource": "*"
    }
  ]
}
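
If the policy is not yet attached, one option is an inline policy via the AWS CLI; a sketch assuming a role named MyEC2GpuRole and the JSON saved as gpu-cw-policy.json (both names are placeholders):

# Attach the permissions above as an inline policy on the instance role
aws iam put-role-policy \
  --role-name MyEC2GpuRole \
  --policy-name CloudWatchGpuMetrics \
  --policy-document file://gpu-cw-policy.json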

Then install the CloudWatch agent:

# Not in Ubuntu's default repositories; install the .deb that AWS publishes
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:AmazonCloudWatch-linux -s   # uses an agent config stored in SSM Parameter Store
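
To confirm the agent actually started and is publishing, you can query its status and the CWAgent namespace; a quick sketch:

# Check the agent's status on the instance
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status

# List metrics the agent has published (default namespace is CWAgent)
aws cloudwatch list-metrics --namespace CWAgent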

To confirm your ML framework is actually using the GPU:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.debugging.set_log_device_placement(True)

# Create a simple operation to trigger GPU usage
with tf.device('/GPU:0'):
  a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
  b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
  c = tf.matmul(a, b)
  print(c)
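
The same visibility check can be run as a shell one-liner while watch -n 1 nvidia-smi runs in a second terminal, assuming TensorFlow is installed in the active Python environment:

# Quick device-visibility check without writing a script
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"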

If you're not seeing expected GPU utilization, work through these checks (a combined script follows the list):

  • Verify CUDA version compatibility: nvcc --version
  • Check GPU-enabled kernel modules: lsmod | grep nvidia
  • Review application logs for CUDA errors
  • Confirm proper NVIDIA driver installation: dpkg -l | grep nvidia
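
These checks can be bundled into a single diagnostic pass; a minimal sketch that only reuses the commands listed above:

# Collect toolkit, kernel-module, and driver-package information in one pass
echo "== CUDA toolkit version =="
nvcc --version
echo "== Loaded NVIDIA kernel modules =="
lsmod | grep nvidia
echo "== Installed NVIDIA driver packages =="
dpkg -l | grep nvidia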

When running compute-intensive workloads on AWS EC2 GPU instances (P3, P4, or G4/G5; Inferentia-based instances use AWS Neuron tooling rather than the NVIDIA tools below), verifying GPU utilization is crucial for:

  • Validating CUDA/cuDNN acceleration
  • Identifying performance bottlenecks
  • Optimizing cost-efficiency of GPU resources

Ubuntu provides several built-in and third-party tools for GPU monitoring:

# Install the NVIDIA System Management Interface (nvidia-smi ships with the driver utilities, not as its own package)
sudo apt update
sudo apt install -y nvidia-utils-535   # replace 535 with your driver branch
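
Before moving on, it is worth confirming that the driver is loaded and nvidia-smi responds; a quick sketch using standard query flags:

# Print GPU model and driver version; an error here usually means the driver is not loaded
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader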

The most direct method for NVIDIA GPUs:

# Basic GPU stats (single snapshot)
nvidia-smi

# Continuous monitoring (refresh every 1 second)
watch -n 1 nvidia-smi

# CSV output for logging
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory --format=csv -l 1

Metric        Description                      Optimal Range
GPU-Util      Percentage of GPU cores in use   70-100% for full utilization
Memory Usage  VRAM allocation/usage            Depends on model size
Temperature   GPU core temperature             <85°C for sustained loads

For long-term analysis:

# Install DCGM for detailed metrics
sudo apt install -y datacenter-gpu-manager
sudo systemctl --now enable nvidia-dcgm

# Prometheus exporter setup (gpu-monitoring-tools is archived; dcgm-exporter is its maintained successor)
git clone https://github.com/NVIDIA/dcgm-exporter
cd dcgm-exporter
make binary
sudo make install
dcgm-exporter
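
Once the exporter is up, you can check that GPU metrics are actually being served; this sketch assumes dcgm-exporter's default listen port of 9400:

# Confirm Prometheus-format GPU metrics are exposed
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL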

If nvidia-smi shows 0% utilization:

  1. Verify the CUDA toolkit is installed: nvcc --version (the driver version is reported by nvidia-smi)
  2. Check that the application actually creates a GPU context (i.e., the model and tensors are placed on the device)
  3. Confirm no ECC errors: nvidia-smi -q

For AWS-native monitoring:

# Install the CloudWatch agent (use the .deb from AWS shown earlier if it is not already installed)
sudo dpkg -i amazon-cloudwatch-agent.deb

# Configure GPU metrics (the agent collects them via nvidia-smi)
sudo nano /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "nvidia_gpu": {
        "measurement": [
          "utilization_gpu",
          "memory_used",
          "power_draw"
        ]
      }
    }
  }
}
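
After saving the file, the agent needs to be told to load it; a sketch using the fetch-config invocation pointed at the local path edited above:

# Apply the local configuration file and (re)start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s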