Diagnosing High Load Average with Low CPU Usage: I/O Wait Bottleneck Analysis for Linux Servers


When your server shows extreme load averages (67-79) but minimal CPU utilization (3.9% user, 94.5% idle), you're likely dealing with an I/O bottleneck rather than CPU starvation. The vmstat output reveals the smoking gun:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 110 795604  12328   3980  46676    0    0     0     0    0     0  4  1 95  1

The critical columns to watch (the line above is a single sample; these values swing between samples):

  • b: 110 processes blocked waiting for I/O
  • wa: CPU time spent waiting for I/O, reported as high as 97% in other samples
  • so: heavy swap-out activity, peaking at 2985-3151 pages/sec
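
A quick way to watch just those columns over time (the field positions below assume the default vmstat layout shown above, with wa as the last column):

# Print blocked processes, swap activity and I/O wait every 2 seconds
stdbuf -oL vmstat 2 | awk '$1 ~ /^[0-9]+$/ {printf "blocked=%-4s si=%-6s so=%-6s wa=%s\n", $2, $7, $8, $16; fflush()}'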

To pinpoint the exact culprits:

# Show which processes are actually doing I/O (needs root)
sudo iotop -oP

# Per-process disk read/write rates: 5 samples at 2-second intervals
pidstat -d 2 5

# Per-device utilization and latency (await, %util); iostat and pidstat ship with sysstat
sudo apt-get install sysstat
iostat -xmdz 2

# Check for memory pressure and current swap usage
free -h && vmstat -s | grep -i "swap"

Your system is thrashing - constantly swapping memory pages to disk. Key indicators:

Mem:   1034784k total,  1021256k used,    13528k free
Swap:  1023960k total,   635752k used

When RAM is exhausted, the kernel pushes memory pages out to swap, and every subsequent page fault becomes a slow disk access. The disk turns into the bottleneck, processes pile up in the D (uninterruptible sleep) state, and the load average climbs even though the CPUs sit almost idle.
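
To see which processes are actually sitting in swap, the VmSwap field in /proc/<pid>/status can be read per process (a quick sketch; VmSwap is available on any reasonably modern kernel):

# Top 10 swap users by VmSwap (kB), largest first
for f in /proc/[0-9]*/status; do
  awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$f"
done 2>/dev/null | sort -rn | head -n 10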

When caught in this situation:

# Emergency relief: drop clean page cache, dentries and inodes
# (this only frees cache; it does not pull anything back out of swap)
sync; echo 3 > /proc/sys/vm/drop_caches

# Identify memory hogs
ps aux --sort=-%mem | head -n 10

# Make the kernel less eager to swap (takes effect immediately, not persistent)
sysctl -w vm.swappiness=10
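
If one process stands out in that listing, stopping it is usually the fastest relief; <PID> below is just a placeholder for whatever the ps output shows:

# <PID> is a placeholder taken from the ps listing above
kill -TERM <PID>      # ask the process to exit cleanly
# kill -KILL <PID>    # last resort if it ignores SIGTERM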

To prevent recurrence:

  1. Upgrade physical RAM if possible
  2. Configure sensible swappiness values in /etc/sysctl.conf:
    vm.swappiness=10
    vm.vfs_cache_pressure=50

  3. Implement per-process resource limits using cgroups (see the sketch after this list)
  4. Consider using zram instead of disk-backed swap on modern kernels
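
For item 3, here's a minimal sketch of a cgroup-based memory cap via systemd, assuming cgroup v2 and a hypothetical unit name myapp.service:

# Cap memory for a hypothetical unit "myapp.service" (cgroup v2 assumed);
# set-property writes a drop-in, so the limit survives reboots
sudo systemctl set-property myapp.service MemoryHigh=400M MemoryMax=512M

# Use --runtime if the cap should only last until the next reboot
sudo systemctl set-property --runtime myapp.service MemoryMax=512M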

Here's a script I use to catch I/O bottlenecks in production:

#!/bin/bash
while true; do
  echo -e "\n$(date)"
  echo "Load: $(cat /proc/loadavg)"
  
  # Processes in uninterruptible (D) sleep - almost always stuck waiting on I/O
  echo -e "\nBlocked processes:"
  ps -eo stat,pid,user,cmd | grep -E "^D"
  
  # Disk latency
  echo -e "\nDisk latency:"
  iostat -xmdz 1 2 | tail -n +4
  
  sleep 5
done

Run this during high load periods to capture real-time diagnostics.


The same pattern shows up in the raw numbers: load averages of 67.93, 70.63, 79.85 against 94.5% idle CPU, vmstat reporting 110-121 blocked processes in the 'b' column, and iostat showing heavy disk activity (7128.00 Blk_read/s). The next step is identifying which processes are driving that I/O:

# Check current disk I/O with iotop
sudo iotop -o -P

# Alternative method using pidstat
pidstat -d 2 5

# Quick overview of top CPU and memory consumers (not I/O-specific, but good for spotting an obvious runaway process)
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 15

Database Issues: MySQL/PostgreSQL without proper indexing can cause heavy disk reads. Check with:

# For MySQL
SHOW PROCESSLIST;
SHOW ENGINE INNODB STATUS;

# For Postgres
SELECT * FROM pg_stat_activity;
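
If it's easier to check from the shell, something along these lines lists the longest-running MySQL statements (assumes local socket auth; adjust credentials for your setup):

# Longest-running active MySQL queries, via information_schema
mysql -e "SELECT id, user, time, state, LEFT(info, 80) AS query
          FROM information_schema.processlist
          WHERE command <> 'Sleep'
          ORDER BY time DESC LIMIT 10;"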

Log File Rotation: Large log files being rotated (or growing unchecked) can cause I/O spikes:

# Look for unusually large logs, biggest first
ls -lhS /var/log | head -n 20

# Scan logs for recent errors (note: this can itself generate heavy reads on a large /var/log)
grep -r "error" /var/log/
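
To see which processes are actually holding large files open under /var/log (requires lsof; the 100 MB cutoff is arbitrary):

# Open files under /var/log bigger than ~100 MB (SIZE/OFF is column 7)
sudo lsof +D /var/log 2>/dev/null | awk 'NR==1 || $7 > 104857600'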

Adjusting writeback-related kernel parameters can also help (run as root; treat these values as starting points rather than universal defaults):

# Temporary adjustment
echo 50 > /proc/sys/vm/dirty_ratio
echo 10 > /proc/sys/vm/dirty_background_ratio

# Permanent in /etc/sysctl.conf
vm.dirty_ratio = 50
vm.dirty_background_ratio = 10
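
After editing /etc/sysctl.conf, reload it and confirm the values took effect:

# Reload /etc/sysctl.conf and verify
sudo sysctl -p
sysctl vm.dirty_ratio vm.dirty_background_ratio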

Create a monitoring script to catch future issues:

#!/bin/bash
# Monitor load vs CPU usage; requires sysstat (mpstat), bc and iotop,
# and should run as root so iotop can read per-process I/O stats
LOAD=$(cut -d' ' -f1 /proc/loadavg)
CPU_IDLE=$(mpstat 1 1 | awk '/Average:/ {print $NF}')
THRESHOLD=5 # Adjust based on core count

if (( $(echo "$LOAD > $THRESHOLD" | bc -l) )) && (( $(echo "$CPU_IDLE > 80" | bc -l) )); then
    echo "I/O Wait detected at $(date)" >> /var/log/io_warn.log
    iotop -n 2 -b -o >> /var/log/io_warn.log
fi
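
To run this check automatically, a crontab entry along these lines works; /usr/local/sbin/io_watch.sh is a placeholder path for wherever you save the script:

# Hypothetical root crontab entry (crontab -e): run the check every minute
* * * * * /usr/local/sbin/io_watch.sh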

Other tools worth keeping on hand for deeper analysis:

  • atop - Advanced system monitoring
  • dstat - Combined resource statistics (example below)
  • blktrace - Block layer I/O tracing
  • strace - System call tracing for specific processes
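
For instance, dstat (if installed) gives a compact rolling view that pairs well with the symptoms above:

# CPU, disk, paging, interrupts/context switches and load in one view,
# plus the processes doing the most I/O (5-second intervals)
dstat -cdgyl --top-io --top-bio 5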