When running GPU-intensive workloads like deep learning training or CUDA-accelerated applications on AWS EC2 instances (P3, P4, G4/G5 series), verifying actual GPU utilization is crucial. Here are the most effective methods:
nvidia-smi is the primary tool for monitoring NVIDIA GPUs on Linux instances. First, ensure the NVIDIA drivers and utilities are installed:
sudo apt update
sudo apt install -y nvidia-utils-535   # match your installed driver branch, e.g. 535
Then run the monitoring command:
watch -n 1 nvidia-smi
This displays real-time metrics including:
- GPU utilization percentage
- Memory usage
- Active processes
- Power draw and temperature
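The same counters can also be read programmatically through NVML, the library behind nvidia-smi. A minimal polling sketch, assuming the nvidia-ml-py bindings are installed (pip install nvidia-ml-py):

# Minimal NVML polling sketch (assumes: pip install nvidia-ml-py)
import time
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates, nvmlDeviceGetMemoryInfo,
    nvmlDeviceGetPowerUsage, nvmlDeviceGetTemperature, NVML_TEMPERATURE_GPU,
)

nvmlInit()
try:
    for _ in range(5):  # five one-second samples
        for i in range(nvmlDeviceGetCount()):
            h = nvmlDeviceGetHandleByIndex(i)
            util = nvmlDeviceGetUtilizationRates(h)  # .gpu and .memory are percentages
            mem = nvmlDeviceGetMemoryInfo(h)         # bytes
            print(f"GPU{i}: {util.gpu}% util, "
                  f"{mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB, "
                  f"{nvmlDeviceGetPowerUsage(h) / 1000:.0f} W, "
                  f"{nvmlDeviceGetTemperature(h, NVML_TEMPERATURE_GPU)} °C")
        time.sleep(1)
finally:
    nvmlShutdown()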
For production environments, NVIDIA's Data Center GPU Manager (DCGM) provides deeper insights:
sudo apt install -y datacenter-gpu-manager   # packaged in NVIDIA's CUDA apt repository
sudo systemctl --now enable nvidia-dcgm
dcgmi discovery -l                # list the GPUs DCGM can see
dcgmi dmon -e 203,204,1001 -c 5   # sample GPU utilization (203), memory-copy utilization (204), and graphics engine activity (1001) five times
AWS provides native GPU monitoring through CloudWatch. First, allow the instance to publish metrics by attaching a policy like this to its IAM role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "ec2:DescribeTags"
      ],
      "Resource": "*"
    }
  ]
}
Then install the CloudWatch agent:
# The CloudWatch agent is not in Ubuntu's default repositories; install the .deb from AWS
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:AmazonCloudWatch-linux -s
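If you would rather push metrics yourself than run the agent, the cloudwatch:PutMetricData permission above is enough for a small publisher. A rough sketch combining boto3 with the NVML bindings shown earlier; the Custom/GPU namespace and metric name are illustrative choices, not an AWS convention:

# Illustrative custom publisher: read GPU utilization via NVML and push it to CloudWatch.
# Credentials and region come from the instance profile and standard AWS configuration.
import boto3
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetUtilizationRates

nvmlInit()
util = nvmlDeviceGetUtilizationRates(nvmlDeviceGetHandleByIndex(0)).gpu

boto3.client("cloudwatch").put_metric_data(
    Namespace="Custom/GPU",            # illustrative namespace
    MetricData=[{
        "MetricName": "GPUUtilization",
        "Unit": "Percent",
        "Value": float(util),
    }],
)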
To confirm your ML framework is actually using the GPU:
import tensorflow as tf

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.debugging.set_log_device_placement(True)

# Create a simple operation to trigger GPU usage
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)
    print(c)
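With device placement logging enabled, TensorFlow prints the device each op runs on (the MatMul line should end in device:GPU:0), and the Python process should briefly appear in the nvidia-smi process list while the script runs.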
If you're not seeing expected GPU utilization:
- Verify CUDA version compatibility:
nvcc --version
- Check that the NVIDIA kernel modules are loaded:
lsmod | grep nvidia
- Review application logs for CUDA errors
- Confirm proper NVIDIA driver installation:
dpkg -l | grep nvidia
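These checks are easy to script; a rough health-check sketch that shells out to the same commands listed above (purely illustrative, adjust to your setup):

# Rough GPU health-check: run the diagnostics above and report which ones fail.
import subprocess

CHECKS = {
    "CUDA toolkit (nvcc)": "nvcc --version",
    "NVIDIA kernel modules loaded": "lsmod | grep nvidia",
    "NVIDIA driver packages installed": "dpkg -l | grep nvidia",
    "nvidia-smi reachable": "nvidia-smi",
}

for name, cmd in CHECKS.items():
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    status = "OK" if result.returncode == 0 else "FAILED"
    print(f"[{status}] {name} ({cmd})")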
When running compute-intensive workloads on AWS EC2 GPU instances (P3, P4, G4/G5; Inferentia instances use AWS Neuron tooling rather than the NVIDIA stack covered here), verifying GPU utilization is crucial for:
- Validating CUDA/cuDNN acceleration
- Identifying performance bottlenecks
- Optimizing cost-efficiency of GPU resources
Ubuntu provides several built-in and third-party tools for GPU monitoring:
# Install the NVIDIA system management interface (nvidia-smi ships with the driver utilities)
sudo apt update
sudo apt install -y nvidia-utils-535   # match your installed driver branch
The most direct method for NVIDIA GPUs:
# Basic GPU stats (single snapshot)
nvidia-smi
# Continuous monitoring (refresh every 1 second)
watch -n 1 nvidia-smi
# CSV output for logging
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory --format=csv -l 1
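If you redirect that stream to a file and add noheader,nounits to the format string, a few lines of Python summarize the log; gpu_log.csv and the column order are assumptions matching the query above:

# Summarize a log produced by, e.g.:
#   nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory \
#              --format=csv,noheader,nounits -l 1 > gpu_log.csv   (illustrative filename)
import csv

gpu_util = []
with open("gpu_log.csv", newline="") as f:
    for row in csv.reader(f):
        if len(row) >= 3:
            gpu_util.append(float(row[2]))  # third queried column: utilization.gpu [%]

if gpu_util:
    print(f"samples: {len(gpu_util)}")
    print(f"average GPU utilization: {sum(gpu_util) / len(gpu_util):.1f}%")
    print(f"peak GPU utilization:    {max(gpu_util):.1f}%")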
| Metric | Description | Optimal Range |
|---|---|---|
| GPU-Util | Percent of time one or more kernels were executing over the sample period | 70-100% for full utilization |
| Memory Usage | VRAM allocation/usage | Depends on model size |
| Temperature | GPU core temperature | <85°C for sustained loads |
For long-term analysis:
# Install DCGM for detailed metrics
sudo apt install -y datacenter-gpu-manager
sudo systemctl --now enable nvidia-dcgm
# Prometheus exporter setup (the gpu-monitoring-tools repo is archived;
# NVIDIA now maintains the exporter at https://github.com/NVIDIA/dcgm-exporter)
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
make binary && sudo make install
dcgm-exporter &   # serves metrics on http://localhost:9400/metrics
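To spot-check the exporter without a full Prometheus stack, you can scrape its endpoint directly; a quick sketch assuming the default port 9400 and the standard DCGM_FI_DEV_GPU_UTIL counter being enabled:

# Quick sanity check: fetch the exporter's metrics page and print GPU-utilization samples.
from urllib.request import urlopen

with urlopen("http://localhost:9400/metrics", timeout=5) as resp:
    text = resp.read().decode()

for line in text.splitlines():
    if line.startswith("DCGM_FI_DEV_GPU_UTIL"):
        print(line)  # e.g. DCGM_FI_DEV_GPU_UTIL{gpu="0",...} 97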
If nvidia-smi shows 0% utilization:
- Verify the CUDA toolkit is installed:
nvcc --version
- Check that the application actually creates a GPU context (e.g. tensors and models are placed on the device)
- Confirm no ECC errors:
nvidia-smi -q
For AWS-native monitoring:
# Install the CloudWatch agent (not in Ubuntu's default repositories)
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb
# Configure GPU metrics
sudo nano /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "nvidia_gpu": {
        "measurement": [
          "utilization_gpu",
          "memory_used",
          "power_draw"
        ]
      }
    }
  }
}
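After saving the file, load it with the agent's amazon-cloudwatch-agent-ctl fetch-config action pointed at file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json instead of an ssm: source. The GPU metrics then appear in CloudWatch, typically under the CWAgent namespace with an nvidia_smi_ prefix (for example nvidia_smi_utilization_gpu).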