When your Linux server shows a high load average but CPU usage looks normal and there is no swapping activity, disk I/O contention is often the culprit. The challenge lies in identifying which processes are actually blocked waiting for I/O operations to complete.
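Before digging into individual processes, it helps to confirm the pattern. A quick sanity check with standard tools (the column names refer to vmstat's default output):
# Compare the load average against the CPU count
uptime; nproc
# Watch "b" (processes in uninterruptible sleep) and "wa" (I/O wait CPU %),
# and confirm "si"/"so" (swap-in/out) stay at 0
vmstat 1 5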
While iotop shows active I/O operations, we need different tools to reveal blocked processes:
# 1. Using dstat for comprehensive I/O metrics
dstat -td --disk-util --top-bio
# 2. Checking process states with ps
ps -eo stat,pid,user,command | awk '$1 ~ /D/ {print $0}'
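Because a process can bounce in and out of D state, a single snapshot can be misleading. A small sampling loop (a sketch; the one-second interval and ten iterations are arbitrary) makes persistent blocking obvious:
# Snapshot D-state processes once per second for ten seconds
for i in $(seq 1 10); do
    date '+%T'
    ps -eo stat,pid,user,comm | awk '$1 ~ /^D/'
    sleep 1
done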
For advanced analysis, SystemTap can trace block-layer activity per process. A minimal sketch using the ioblock.request probe from the standard ioblock tapset (it logs the process behind every submitted block I/O request):
# SystemTap script: log PID, process name, device, and read/write flag
# for each block I/O request submitted
probe ioblock.request {
    printf("%d %s %s %d\n", pid(), execname(), devname, rw)
}
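To run it (assuming the script is saved as ioblock.stp; the filename is arbitrary, and the systemtap package plus matching kernel debuginfo must be installed):
sudo stap -v ioblock.stp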
Let's examine a real MySQL server case:
# First identify D-state processes
$ ps -eo state,pid,cmd | grep '^D'
D 29843 /usr/sbin/mysqld
# Then check I/O wait with pidstat
$ pidstat -d -p 29843 1 5
Linux 5.4.0-91-generic (db-server) 02/20/2023 _x86_64_ (8 CPU)
02:30:01 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s Command
02:30:02 PM 112 29843 0.00 1024.00 0.00 mysqld
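pidstat shows the write rate, but not where the task is actually stuck. For that, /proc is useful (29843 is the PID from the example above; reading /proc/&lt;pid&gt;/stack requires root):
# Kernel function the blocked task is sleeping in
cat /proc/29843/wchan; echo
# Full kernel stack of the blocked task (root only)
sudo cat /proc/29843/stack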
Other useful techniques include:
- vmstat 1 to check wa CPU time
- iostat -x 1 for device-level metrics
- biosnoop from the BCC tools for low-level tracing (example below)
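A quick biosnoop run prints one line per block request with the issuing PID, process name, disk, and latency, so slow requests stand out immediately. The binary name and path depend on the distribution:
# Debian/Ubuntu (bpfcc-tools package)
sudo biosnoop-bpfcc
# RHEL/CentOS/Fedora (bcc-tools package)
sudo /usr/share/bcc/tools/biosnoop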
Key indicators of I/O wait bottlenecks:
# High await time in iostat output
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 5.00 10.00 150.00 80.00 1200.00 16.00 2.50 15.00 5.00 16.00 6.00 96.00
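High await combined with %util approaching 100 marks a saturated device. A small filter over iostat output can flag this automatically; this is a sketch in which the 10 ms and 90% thresholds are arbitrary and the field positions follow the older sysstat layout shown above (await in column 10, %util last), so adjust for newer iostat versions:
# Flag any device whose await exceeds 10 ms or whose %util exceeds 90 (Ctrl-C to stop)
iostat -dx 1 | awk '$1 != "" && $1 != "Device:" && NF >= 14 && ($10 > 10 || $NF > 90) {print "saturated:", $1, "await=" $10, "util=" $NF}'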
When your Linux server shows high load averages without corresponding CPU or memory pressure, disk I/O contention is often the culprit. Unlike CPU-bound processes that show up clearly in top or htop, I/O wait states can be more elusive to diagnose.
While iotop shows active I/O operations, these alternatives reveal processes waiting for I/O:
# 1. Using dstat for real-time monitoring
dstat -ta --top-io
# 2. Checking process states with ps
ps -eo state,pid,cmd | grep "^D"
# 3. Comprehensive view with pidstat
pidstat -d 1
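To tie the two views together, you can feed the D-state PIDs straight into pidstat (a sketch; it only covers processes that happen to be blocked at the moment the snapshot is taken):
# Per-process I/O rates for every currently D-state process (one 1-second sample each)
for pid in $(ps -eo state,pid | awk '$1 == "D" {print $2}'); do
    pidstat -d -p "$pid" 1 1
done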
Processes in "D" state (uninterruptible sleep) in ps
output are typically waiting for I/O. Common scenarios include:
- Database transactions waiting on slow storage
- Log rotation processes blocked on large file operations
- Backup jobs competing for disk access
For deeper investigation, combine these approaches:
# Monitor per-process I/O wait with perf
sudo perf top -e sched:sched_stat_iowait
# Check I/O scheduler queues
cat /sys/block/sdX/queue/nr_requests
# Identify contended files with lsof +D
sudo lsof +D /var/lib/mysql
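It is also worth checking which I/O scheduler the device is using, since the default is not always appropriate for the workload (replace sdX with the actual device, as above):
# The active scheduler is shown in brackets, e.g. [mq-deadline]
cat /sys/block/sdX/queue/scheduler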
Consider this troubleshooting session for a stuck backup process:
$ ps -eo state,pid,cmd | grep "^D"
D 29847 /usr/bin/rsync -avz /data backup-server:/backups
$ sudo strace -p 29847
[...]
read(4, 0x7ffd3a1f2000, 131072) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[...]
$ sudo lsof -p 29847
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
rsync 29847 root 4r REG 8,16 4294967296 123456 /data/large_vm_image.qcow2
This reveals the process is blocked trying to read a massive VM image file.
When you identify I/O-bound processes:
- Prioritize critical processes with ionice (see the example after this list)
- Consider filesystem tuning (noatime, data=writeback)
- Implement rate limiting for non-critical jobs
- Evaluate storage upgrade options
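For instance, an I/O-hungry but non-urgent job can be demoted to the idle I/O class, and backups can be throttled. A sketch using the rsync PID from the session above; the 10 MB/s limit is an arbitrary illustration:
# Put the running rsync into the idle I/O class so it only gets disk time when nothing else wants it
sudo ionice -c 3 -p 29847
# Or launch future backups pre-throttled (best-effort class, lowest priority, ~10 MB/s)
ionice -c 2 -n 7 rsync -avz --bwlimit=10240 /data backup-server:/backups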