Diagnosing High Server Load When CPU Usage is Low: Identifying I/O-Bound Processes


When your 4-core server shows a load average of 10 while CPU utilization stays low, you are most likely dealing with I/O-bound processes. On Linux, the load average counts not only processes that want CPU time but also processes in uninterruptible sleep, which are typically blocked on disk I/O, network filesystems such as NFS, or certain kernel locks.
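
A quick first check is to compare the current load against the number of processes stuck in uninterruptible sleep (a minimal sketch; 'D' is the state code ps uses for that):

# Show the 1/5/15-minute load averages
$ uptime

# Count processes currently in uninterruptible sleep (state D)
$ ps -eo state= | grep -c '^D'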

Here are powerful tools to identify the real culprits:

# Check disk I/O wait
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1      0 250000  50000 800000    0    0  1500   200  500 1000 10  5 70 15  0

Key indicators:
- A sustained high 'wa' (I/O wait) percentage
- A non-zero 'b' column (processes blocked in uninterruptible sleep)
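
If you want to watch just those two columns, a small awk filter over vmstat works (the field numbers assume the column layout shown above; -n prints the header only once):

# Print blocked processes (b) and I/O wait percentage (wa) once per second
$ vmstat -n 1 | awk 'NR > 2 { print "b=" $2, "wa=" $16; fflush() }'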

Use iotop for real-time disk I/O monitoring:

# Install if needed
$ sudo apt-get install iotop

# Run with elevated privileges
$ sudo iotop -oPa

This shows the processes actually performing disk I/O. The '-o' flag limits output to processes doing I/O, '-P' aggregates per process rather than per thread, and '-a' displays accumulated totals rather than current bandwidth.

For deeper analysis, use pidstat to monitor individual process I/O:

# Monitor disk I/O per process
$ pidstat -d 1

# Monitor context switches (a high voluntary rate, cswch/s, often means the process keeps blocking on I/O)
$ pidstat -w 1
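
For a one-shot ranking instead of a rolling display, you can sort pidstat's averaged disk report by the write-rate column (the -k5 field index assumes sysstat's usual "UID PID kB_rd/s kB_wr/s ..." layout, which can shift between versions):

# Take five 1-second samples, then list the heaviest writers from the Average lines
$ pidstat -d 1 5 | grep '^Average:' | sort -k5 -nr | head -n 10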

Common high-load, low-CPU scenarios:

  1. Database Servers: Heavy queries causing disk seeks
  2. Log Processing: Applications writing massive log files
  3. Backup Jobs: rsync or tar operations scanning filesystems
  4. Container Orchestration: Docker/K8s pulling images

Once identified, consider these solutions:

# For MySQL servers showing high I/O wait
SET GLOBAL innodb_io_capacity = 2000;
SET GLOBAL innodb_flush_neighbors = 0;
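
Note that SET GLOBAL changes do not survive a restart, so persist them in my.cnf once validated. To check the current values from the shell (assuming the mysql client is installed and credentials are configured):

# Show the current settings before and after changing them
$ mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('innodb_io_capacity','innodb_flush_neighbors');"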

Other optimizations:
- Implement rate limiting for logging
- Use faster storage (SSD/NVMe)
- Adjust the kernel I/O scheduler (deadline/noop, or mq-deadline/none on newer multi-queue kernels, for SSDs; see the example after this list)
- Implement caching layers
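
As an example of the scheduler change mentioned above (assuming the device is /dev/sda; modern multi-queue kernels expose none/mq-deadline/kyber/bfq rather than the legacy noop/deadline names):

# Show available schedulers; the active one is in brackets
$ cat /sys/block/sda/queue/scheduler

# Switch to mq-deadline (or none for fast NVMe) until the next reboot
$ echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler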

Create a cron job to log I/O offenders:

#!/bin/bash
# Requires root, since iotop needs elevated privileges
LOG_FILE=/var/log/io_offenders.log
echo "$(date) -- Top I/O Processes" >> "$LOG_FILE"
# -b batch mode, -o active I/O only, -t timestamps, -qqq suppress headers, -k kB/s, 3 iterations
iotop -botqqqk --iter=3 | head -n 5 >> "$LOG_FILE"
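
A matching crontab entry could look like this (running as root because iotop needs elevated privileges; the script path and five-minute interval are just placeholders):

# /etc/cron.d/io-offenders -- log top I/O processes every 5 minutes
*/5 * * * * root /usr/local/bin/log_io_offenders.sh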

Load averages represent the number of processes either:

  • Currently executing on CPU
  • Waiting for CPU time (runnable)
  • Blocked in uninterruptible sleep, usually on disk I/O or a network filesystem such as NFS

When you see high load with low CPU utilization, the bottleneck is typically I/O waits. Here's how to identify the culprits:

# 1. Check overall I/O wait
vmstat 1 5

# 2. Identify disk-intensive processes
iotop -oP

# 3. View process states
ps -eo pid,user,state,cmd | awk '$3 ~ /^D/'  # D = uninterruptible sleep
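
To get a hint at what those D-state processes are actually blocked on, ps can also show the kernel wait channel:

# wchan = kernel function the process is sleeping in (e.g. an NFS or block-layer wait)
ps -eo pid,state,wchan:32,cmd | awk '$2 ~ /^D/'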

For a 4-core server showing load average of 10:

  1. First confirm I/O wait percentage:
    # Sample vmstat output showing 30% I/O wait
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     1  4      0 123456  78900 456789    0    0  1024   128  345  567 10  5 55 30  0
  2. Then pinpoint specific processes:
    # iotop output showing PostgreSQL causing heavy writes
    Total DISK READ: 15.34 M/s | Total DISK WRITE: 128.45 M/s
      PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
     4567 be/4 postgres    0.00 B/s  112.34 M/s  0.00 % 85.23 % postgres: writer process
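
Once you have a PID from the iotop output, /proc/<pid>/io confirms the cumulative byte counts straight from the kernel (4567 here is just the example PID from above):

# Cumulative read/write counters for the suspect process
sudo grep -E '^(read|write)_bytes' /proc/4567/io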

For containerized environments:

# Check cgroup I/O accounting (cgroup v1 path shown; on cgroup v2 read io.stat inside the container's cgroup directory)
cat /sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes

# Trace specific process I/O
strace -p 4567 -e trace=file 2>&1 | grep -v ENOENT
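
If the containers run under Docker, docker stats exposes the same cgroup counters as a convenient per-container view:

# Cumulative block I/O per running container
docker stats --no-stream --format "table {{.Name}}\t{{.BlockIO}}"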

For database servers, consider adding these metrics to your monitoring:

# PostgreSQL specific
SELECT * FROM pg_stat_activity WHERE wait_event_type = 'IO';
SELECT * FROM pg_stat_bgwriter;
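
To sample the first query from the shell (assuming local access as the postgres user), something like:

# Number of backends currently waiting on I/O
sudo -u postgres psql -Atc "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'IO';"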

Implement these in your observability stack:

  • Prometheus node_exporter metrics: node_disk_io_time_seconds_total
  • Grafana dashboard tracking: load15 vs. cpu_usage vs. disk_io
  • AlertManager rules for sustained high load + low CPU
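
A quick sanity check that the disk metric is actually being exported (assuming node_exporter is listening on its default port 9100):

# Confirm node_exporter exposes per-device I/O time counters
curl -s localhost:9100/metrics | grep '^node_disk_io_time_seconds_total'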

Remember: High load with low CPU often indicates either storage issues (check disk health with smartctl) or network filesystem bottlenecks (NFS timeouts).
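
For those two final checks (assuming the device is /dev/sda and that smartmontools and the NFS client utilities are installed):

# Quick disk health verdict
sudo smartctl -H /dev/sda

# Per-mount NFS latency and throughput, refreshed every 5 seconds
nfsiostat 5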