When your server shows extreme load averages (67-79) but minimal CPU utilization (3.9% user, 94.5% idle), you're likely dealing with an I/O bottleneck rather than CPU starvation. The vmstat output reveals the smoking gun:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 110 795604 12328 3980 46676 0 0 0 0 0 0 4 1 95 1
The critical columns to watch:
- b: 110 processes blocked waiting for I/O
- wa: CPU time spent waiting for I/O (hitting 97% in your output)
- so: memory swapped out to disk (values of 2985-3151 per second in your output - heavy swap-out activity)
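To confirm the pattern over a window of time rather than a single snapshot, you can sample vmstat and pull out just those two columns; a minimal sketch, assuming the column layout shown above (b is field 2, wa is field 16):
# 30 samples, 2 seconds apart; print only blocked tasks (b) and I/O wait (wa)
vmstat 2 30 | awk 'NR > 2 {printf "blocked=%s  iowait=%s%%\n", $2, $16}'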
To pinpoint the exact culprits:
# Show only processes that are actually doing I/O
iotop -oP
# Per-process disk I/O statistics: 5 samples at 2-second intervals
pidstat -d 2 5
# Per-device utilization and latency (iostat and pidstat come from the sysstat package)
sudo apt-get install sysstat
iostat -xmdz 2
# Check for memory pressure
free -h && vmstat -s | grep -i "swap"
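If iotop isn't installed, the same per-process numbers can be read directly from /proc/<pid>/io; a rough sketch (run as root, since those files are not readable for other users' processes):
# Rank processes by cumulative bytes written to disk (write_bytes)
for pid in /proc/[0-9]*; do
    [ -r "$pid/io" ] || continue
    wb=$(awk '/^write_bytes/ {print $2}' "$pid/io")
    cmd=$(tr '\0' ' ' < "$pid/cmdline")
    echo "$wb ${pid#/proc/} ${cmd:-[kernel thread]}"
done | sort -rn | head -n 10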
Your system is thrashing - constantly swapping memory pages to disk. Key indicators:
Mem: 1034784k total, 1021256k used, 13528k free
Swap: 1023960k total, 635752k used
With RAM essentially exhausted (13 MB free out of 1 GB, 620 MB already pushed to swap), the kernel is forced to page constantly, and every page fault becomes a disk read. That feedback loop - more faults, more blocked processes, more I/O - is what drives the load average up while the CPU sits idle.
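To see which processes actually own the pages sitting in swap, the per-process VmSwap counter in /proc (available on reasonably recent kernels) can be summarized; a minimal sketch:
# Ten biggest swap consumers (VmSwap is reported in kB)
for f in /proc/[0-9]*/status; do
    awk '/^Name:/ {name=$2} /^VmSwap:/ {swap=$2} END {if (swap+0 > 0) print swap, name}' "$f" 2>/dev/null
done | sort -rn | head -n 10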
When caught in this situation:
# Emergency relief: drop clean page/dentry/inode caches (run as root;
# this only frees cache, it does not fix the underlying RAM shortage)
sync; echo 3 > /proc/sys/vm/drop_caches
# Identify memory hogs
ps aux --sort=-%mem | head -n 10
# Lower the kernel's tendency to swap (takes effect immediately, lost on reboot)
sysctl vm.swappiness=10
To prevent recurrence:
- Upgrade physical RAM if possible
- Configure proper swappiness values in /etc/sysctl.conf:
  vm.swappiness=10
  vm.vfs_cache_pressure=50
- Implement process resource limits using cgroups (see the sketch after this list)
- Consider using zRAM instead of disk swap on modern kernels
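For the cgroup item, one low-friction route on systemd-based distributions is a transient scope or a service property; the script path, unit name, and limits below are placeholders, and MemoryMax/MemorySwapMax assume a reasonably recent systemd with cgroup v2:
# Run a memory-hungry job inside a transient cgroup capped at 256 MB RAM and no swap
systemd-run --scope -p MemoryMax=256M -p MemorySwapMax=0 /opt/jobs/import.sh
# Or cap an existing service (placeholder name) persistently
systemctl set-property myapp.service MemoryMax=512M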
Here's a script I use to catch I/O bottlenecks in production:
#!/bin/bash
# Lightweight I/O-bottleneck watcher: prints load, D/R-state processes,
# and per-device disk stats every 5 seconds. iostat requires sysstat.
while true; do
    echo -e "\n$(date)"
    echo "Load: $(cat /proc/loadavg)"
    # Processes in uninterruptible sleep (D, usually waiting on I/O) or running (R)
    echo -e "\nBlocked/running processes:"
    ps -eo stat,pid,user,cmd | grep -E "^D|^R"
    # Per-device utilization and latency (the second sample is the meaningful one)
    echo -e "\nDisk activity:"
    iostat -xmdz 1 2 | tail -n +4
    sleep 5
done
Run this during high load periods to capture real-time diagnostics.
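A typical way to run it (file name and log path are just examples): make it executable, start it in the background, and review the log once the incident has passed:
chmod +x io_watch.sh
nohup ./io_watch.sh >> /var/log/io_watch.log 2>&1 &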
When your Linux server shows high load averages (67.93, 70.63, 79.85) but low CPU utilization (94.5% idle), you're likely facing an I/O bottleneck. The vmstat output reveals blocked processes (column 'b' shows 110-121) waiting for I/O operations, while iostat reports heavy disk activity (7128.00 Blk_read/s).
# Check current disk I/O with iotop
sudo iotop -o -P
# Alternative method using pidstat
pidstat -d 2 5
# Quick snapshot of the busiest processes by CPU and memory (ps has no I/O column - use iotop/pidstat above for actual I/O)
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 15
Database Issues: MySQL/PostgreSQL without proper indexing can cause heavy disk reads. Check with:
# For MySQL
SHOW PROCESSLIST;
SHOW ENGINE INNODB STATUS;
# For Postgres
SELECT * FROM pg_stat_activity;
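Two follow-up checks worth running, assuming you have client access (the credentials and the PostgreSQL 9.2+ pg_stat_activity columns are assumptions here): compare MySQL's InnoDB buffer pool with the 1 GB of RAM, and list the longest-running active PostgreSQL queries:
# MySQL: a badly sized buffer pool on a 1 GB box forces constant disk reads
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Select_scan';"
# PostgreSQL: longest-running non-idle queries
psql -c "SELECT pid, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 5;"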
Log File Rotation: Large log files being rotated can cause spikes:
# Look for unusually large log files
ls -lhS /var/log | head
# Scan logs for recent errors pointing at the noisy service (note: this scan itself generates disk reads)
grep -r "error" /var/log/
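Two quick extras that often explain log-driven spikes, assuming logrotate and lsof are installed: a dry run of logrotate, and a check for deleted log files that a process still holds open (and keeps writing to):
# Dry run: show what logrotate would do without rotating anything
logrotate -d /etc/logrotate.conf
# Deleted-but-still-open files (their disk space is not released until the process restarts)
lsof +L1 2>/dev/null | grep -i log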
Adjust the kernel's writeback parameters (test these against your workload - higher dirty ratios batch more writes in memory, lower ones flush earlier and more smoothly):
# Temporary adjustment (as root; reverts on reboot)
echo 50 > /proc/sys/vm/dirty_ratio
echo 10 > /proc/sys/vm/dirty_background_ratio
# Permanent, in /etc/sysctl.conf
vm.dirty_ratio = 50
vm.dirty_background_ratio = 10
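After adding the lines to /etc/sysctl.conf, they can be applied and verified without a reboot:
# Reload /etc/sysctl.conf and confirm the live values
sysctl -p
sysctl vm.dirty_ratio vm.dirty_background_ratio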
Create a monitoring script to catch future issues:
#!/bin/bash
# Flag the classic I/O-bound signature: high 1-minute load while the CPU is mostly idle.
# Needs sysstat (mpstat), bc, and root (for iotop and /var/log).
LOAD=$(cut -d' ' -f1 /proc/loadavg)
CPU_IDLE=$(mpstat 1 1 | awk '/Average:/ {print $NF}')   # %idle from the Average line
THRESHOLD=5   # Adjust based on core count
if (( $(echo "$LOAD > $THRESHOLD" | bc -l) )) && (( $(echo "$CPU_IDLE > 80" | bc -l) )); then
    echo "I/O wait detected at $(date)" >> /var/log/io_warn.log
    iotop -n 2 -b -o >> /var/log/io_warn.log
fi
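To run the check automatically, a cron entry every few minutes is enough; the script path below is an example:
# /etc/cron.d/io-watch (cron.d format includes the user column)
*/5 * * * * root /usr/local/bin/io_check.sh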
Other useful tools:
- atop - Advanced system monitoring
- dstat - Combined resource statistics
- blktrace - Block layer I/O tracing
- strace - System call tracing for specific processes