When your 4-core server shows a load average of 10 while CPU utilization remains relatively low, you're likely dealing with I/O-bound processes. The Linux load average counts not just processes demanding CPU but also processes blocked in uninterruptible sleep, typically waiting on disk I/O, a network filesystem such as NFS, or certain kernel locks.
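A quick sanity check is to compare the load average against the core count and look at where CPU time is actually going; if %iowait is high while %idle is also substantial, the box is not CPU-bound (mpstat comes with the sysstat package):
# Compare the 1/5/15-minute load to the number of cores
$ uptime
$ nproc
# Per-CPU breakdown; high %iowait alongside high %idle points to an I/O bottleneck
$ mpstat -P ALL 1 3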
Here are powerful tools to identify the real culprits:
# Check disk I/O wait
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1      0 250000  50000 800000    0    0  1500   200  500 1000 10  5 70 15  0
Key indicators:
- High 'wa' (I/O wait) percentage
- High 'b' (uninterruptible sleep) processes
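To watch just those two columns over time, a minimal sketch (field positions assume the standard procps vmstat layout shown above):
# Print the b (blocked) and wa (I/O wait) columns once per second
$ vmstat 1 | awk 'NR > 2 { print "blocked=" $2, "iowait=" $16 }'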
Use iotop for real-time disk I/O monitoring:
# Install if needed
$ sudo apt-get install iotop
# Run with elevated privileges
$ sudo iotop -oPa
This shows processes actually performing disk I/O. The '-o' flag limits output to processes currently doing I/O, '-P' aggregates by process rather than thread, and '-a' shows I/O accumulated since iotop started instead of instantaneous bandwidth.
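Once iotop points at a PID, you can check which files that process has open (replace <PID> with the process ID iotop reports):
# List open files for the suspect process
$ sudo lsof -p <PID>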
For deeper analysis, use pidstat to monitor individual process I/O:
# Monitor disk I/O per process
$ pidstat -d 1
# Monitor context switches (a high voluntary rate, cswch/s, usually means processes sleeping on I/O)
$ pidstat -w 1
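You can also read a process's cumulative I/O counters straight from /proc (again, <PID> is a placeholder):
# read_bytes/write_bytes show actual disk traffic since the process started
$ sudo cat /proc/<PID>/io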
Common high-load, low-CPU scenarios:
- Database Servers: Heavy queries causing disk seeks
- Log Processing: Applications writing massive log files
- Backup Jobs: rsync or tar operations scanning filesystems
- Container Orchestration: Docker/K8s pulling images
Once identified, consider these solutions:
# For MySQL servers showing high I/O wait
SET GLOBAL innodb_io_capacity = 2000;
SET GLOBAL innodb_flush_neighbors = 0;
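Note that SET GLOBAL changes are lost on restart, so persist them in my.cnf as well. To see whether writes are still backing up after the change, watch InnoDB's pending I/O counters (a quick check via the mysql client):
# Sustained non-zero pending reads/writes/fsyncs mean storage still can't keep up
$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_data_pending%';"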
Other optimizations:
- Implement rate limiting for logging
- Use faster storage (SSD/NVMe)
- Adjust the kernel I/O scheduler (mq-deadline or none for SSDs on modern multi-queue kernels, deadline/noop on older ones; see the check after this list)
- Implement caching layers
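To check and change the scheduler mentioned above (sda is a placeholder device):
# The active scheduler is shown in brackets
$ cat /sys/block/sda/queue/scheduler
# Switch at runtime (not persistent across reboots)
$ echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler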
Create a cron job (run as root, since iotop needs elevated privileges) to log I/O offenders:
#!/bin/bash
# Append a timestamped snapshot of the heaviest I/O processes
LOG_FILE=/var/log/io_offenders.log
echo "$(date) -- Top I/O Processes" >> "$LOG_FILE"
# Batch mode, active processes only, kB units, three one-second samples
iotop -botqqqk --iter=3 | head -n 5 >> "$LOG_FILE"
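Then schedule it, for example every 10 minutes via root's crontab (the script path here is just an example):
# Add with `sudo crontab -e`
*/10 * * * * /usr/local/bin/log_io_offenders.sh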
Load averages represent the number of processes either:
- Currently executing on CPU
- Waiting for CPU time (runnable)
- Blocked in uninterruptible sleep, usually on disk or a network filesystem such as NFS
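A quick count of processes by state shows how much of the current load is made up of those D-state tasks:
# A large 'D' count explains high load with idle CPUs
ps -eo state= | sort | uniq -c | sort -rn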
When you see high load with low CPU utilization, the bottleneck is typically I/O waits. Here's how to identify the culprits:
# 1. Check overall I/O wait
vmstat 1 5
# 2. Identify disk-intensive processes
iotop -oP
# 3. View process states
ps -eo pid,user,state,cmd | awk '$3 ~ /^D/'   # D = uninterruptible sleep
For a 4-core server showing load average of 10:
- First confirm I/O wait percentage:
# Sample vmstat output showing 30% I/O wait
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  4      0 123456 78900 456789    0    0  1024   128  345  567 10  5 55 30  0
- Then pinpoint specific processes:
# iotop output showing PostgreSQL causing heavy writes
Total DISK READ: 15.34 M/s | Total DISK WRITE: 128.45 M/s
 PID  PRIO  USER      DISK READ   DISK WRITE  SWAPIN    IO>      COMMAND
4567  be/4  postgres   0.00 B/s   112.34 M/s  0.00 %   85.23 %   postgres: writer process
For containerized environments:
# Per-cgroup I/O statistics (cgroup v1 blkio controller)
cat /sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes
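On a cgroup v2 host (the unified hierarchy) the equivalent counters live in io.stat instead; a sketch, assuming systemd with the io controller enabled:
# Per-cgroup read/write bytes and I/O counts on cgroup v2
cat /sys/fs/cgroup/system.slice/io.stat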
# Trace specific process I/O
strace -p 4567 -e trace=file 2>&1 | grep -v ENOENT
For database servers, consider adding these metrics to your monitoring:
# PostgreSQL specific
SELECT * FROM pg_stat_activity WHERE wait_event_type = 'IO';
SELECT * FROM pg_stat_bgwriter;
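A lightweight way to poll the first query from the shell (user and connection details are placeholders):
# Count backends currently waiting on I/O, refreshed every 5 seconds
watch -n 5 "psql -U postgres -XAtc \"SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'IO'\""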
Implement these in your observability stack:
- Prometheus node_exporter metrics: node_disk_io_time_seconds_total
- Grafana dashboard tracking: load15 vs. cpu_usage vs. disk_io
- AlertManager rules for sustained high load + low CPU
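To confirm the node_exporter metric listed above is actually exposed (9100 is the default port):
# Raw per-device I/O time counters
curl -s localhost:9100/metrics | grep '^node_disk_io_time_seconds_total'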
Remember: High load with low CPU often indicates either storage issues (check disk health with smartctl) or network filesystem bottlenecks (NFS timeouts).
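For those two final checks (sda is a placeholder device; nfsiostat ships with the NFS client utilities):
# Disk health, reallocated sectors, and error counters
sudo smartctl -a /dev/sda
# Per-mount NFS operation rates and round-trip times
nfsiostat 2 5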