Linux load averages represent the average number of processes that are either runnable or in uninterruptible sleep (D state), sampled over 1, 5, and 15 minutes. A load of 19.79 on your 16-core system would normally suggest heavy contention, yet CPU utilization shows 95% idle. That discrepancy points to blocked processes waiting on something other than CPU cycles.
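As a quick sanity check, you can compare the 1-minute load against the core count and count the processes currently in uninterruptible sleep; a minimal sketch using standard /proc interfaces:

```
#!/bin/sh
# Compare the 1-minute load average with the number of CPU cores.
load=$(cut -d ' ' -f1 /proc/loadavg)
cores=$(nproc)
echo "1-minute load: $load on $cores cores"

# Processes in uninterruptible sleep (state D) count toward the load
# average even though they consume no CPU time.
echo "D-state processes: $(ps -eo stat= | grep -c '^D')"
```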
First, check for disk I/O bottlenecks:
```
# Check disk latency
iostat -xm 2

# Detailed per-process I/O (accumulated, active processes only)
iotop -oPa
```
For network-related waits:
```
# Network socket statistics
ss -tulpn

# Count TCP sockets by state (field 4 of /proc/net/tcp is the hex state code)
cat /proc/net/tcp | awk 'NR > 1 {print $4}' | sort | uniq -c
```
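If the raw hex state codes are hard to read, ss can produce the same per-state breakdown with readable names; a sketch (output columns vary slightly between iproute2 versions):

```
# Count TCP sockets by state without decoding hex codes by hand.
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
```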
Your memory shows:

```
Mem:   3940MB total, 3910MB used
-/+ buffers/cache: 2445MB used
Swap:  4110MB total, 236MB used
```
Although memory appears nearly full, Linux aggressively caches file data, and the buffers/cache line shows how much of that usage is reclaimable. The key metric here is swap usage, which is minimal. Check for memory pressure with:
```
# Page fault statistics
vmstat -s

# Slab memory usage
slabtop -o
```
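On kernels with pressure stall information (PSI, 4.20 and later), the /proc/pressure files answer more directly whether tasks are stalling on memory, I/O, or CPU; a minimal sketch, assuming PSI is enabled in your kernel:

```
# "some" = % of time at least one task was stalled on the resource;
# "full" = % of time all non-idle tasks were stalled.
for res in cpu memory io; do
    echo "== $res =="
    cat /proc/pressure/$res
done
```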
Your `ps aux` output reveals several kernel threads with accumulated CPU time:

- `kswapd1`: 94:48 of CPU time (memory page reclaim)
- Multiple `kblockd` threads (block device operations)
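To confirm whether kswapd is actively reclaiming right now (rather than having accumulated that CPU time over a long uptime), watch the reclaim counters grow between samples; counter names vary slightly between kernel versions, so treat this as a sketch:

```
# Rising pgscan_kswapd*/pgsteal_kswapd* values between the two samples mean
# the kernel is actively scanning and reclaiming pages.
grep -E 'pgscan_kswapd|pgsteal_kswapd' /proc/vmstat
sleep 10
grep -E 'pgscan_kswapd|pgsteal_kswapd' /proc/vmstat
```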
Check for storage issues with:
```
# Disk health check
smartctl -a /dev/sda

# Filesystem errors
dmesg | grep -i error
```
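Within the full smartctl report, a handful of attributes are the usual early-failure indicators; a sketch that assumes an ATA/SATA disk (attribute names differ on NVMe and across vendors):

```
# Non-zero reallocated or pending sector counts usually mean the disk is failing.
smartctl -A /dev/sda | grep -Ei 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|Reported_Uncorrect|UDMA_CRC'
```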
For deeper analysis:
```
# System-wide tracing
perf record -a -g sleep 10
perf report

# Process-specific waits
strace -p [PID] -c
```
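For a process stuck in D state, strace will simply hang along with it; reading its kernel wait channel and stack is often more revealing. The PID below is a placeholder, and /proc/PID/stack normally requires root:

```
PID=1234                      # hypothetical PID of a D-state process
cat /proc/$PID/wchan; echo    # kernel function it is sleeping in
sudo cat /proc/$PID/stack     # full kernel stack of the blocked task
```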
Verify system limits that might cause process queuing:
```
# Process limits
sysctl kernel.pid_max

# File handles
cat /proc/sys/fs/file-nr

# I/O scheduler
cat /sys/block/sda/queue/scheduler
```
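To turn the raw file-nr numbers into a quick verdict, compare allocated handles against the system-wide maximum (the three fields are allocated, free, and max):

```
# /proc/sys/fs/file-nr: <allocated> <free> <max>
read allocated free max < /proc/sys/fs/file-nr
echo "file handles: $allocated allocated of $max (limit)"
```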
In one case, high load was caused by NFS timeouts. The solution was:
```
# /etc/fstab adjustment
nas:/exports /mnt/nfs nfs rw,hard,intr,timeo=600,retrans=2 0 0

# Sysctl tuning
echo "sunrpc.tcp_slot_table_entries=128" >> /etc/sysctl.conf
sysctl -p
```
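Before and after a change like this, the NFS client statistics show whether retransmissions were really the problem; a sketch assuming nfs-utils is installed:

```
# RPC call/retransmission counters; a growing retrans count points at timeouts.
nfsstat -r

# Mount options actually in effect for each NFS mount.
nfsstat -m
```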
Always measure changes with before/after comparisons using `sar -q` or similar tools.
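For example, a before/after run-queue comparison with sar might look like this (sampling interval and file paths are arbitrary):

```
# Record run-queue length and load averages: 30 samples, 10 seconds apart.
sar -q -o /tmp/before.sar 10 30 >/dev/null
# ... apply the tuning change ...
sar -q -o /tmp/after.sar 10 30 >/dev/null

# Replay both recordings for comparison.
sar -q -f /tmp/before.sar
sar -q -f /tmp/after.sar
```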
When your Linux system reports load averages of 19-21 (as shown in the top output) while displaying 95% CPU idle time and minimal memory pressure, you're facing a classic I/O-bound scenario. The system isn't CPU-starved - it's waiting on something else.
```
# top output highlights
load average: 19.79, 21.25, 18.87
Cpu(s): 95.0% id, 0.6% wa    # wa = I/O wait percentage
Tasks: 253 sleeping
```
The 0.6% iowait might seem low, but combined with the high load average, it suggests processes are blocked waiting for I/O. Let's examine disk performance:
```
# Check disk latency with: iostat -x 2
Device: rrqm/s wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await svctm %util
sda       0.00   2.00  0.00  3.00   0.00  16.00    10.67     0.03 10.00  5.00  1.50
```
1. Identify blocked processes (see the sampling loop after this list):

   ```
   ps -eo state,pid,cmd | grep "^D"   # look for D state (uninterruptible sleep)
   ```

2. Check filesystem mounts for noatime:

   ```
   mount | grep -E 'ext[34]|xfs'
   ```

3. Monitor disk I/O in real time:

   ```
   iotop -oP   # show only processes with active I/O
   ```
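As referenced in item 1, a simple sampling loop makes intermittent blockers easier to catch, and the wchan column shows which kernel function each process is sleeping in (the interval is arbitrary):

```
# Print D-state processes every 2 seconds together with their kernel wait channel.
while true; do
    date
    ps -eo state,pid,wchan:32,cmd | grep '^D'
    sleep 2
done
```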
Journaling overhead: for ext4 filesystems you can remove the journal entirely, but only on an unmounted filesystem and at the cost of crash recovery (a gentler alternative follows below):

```
tune2fs -O ^has_journal /dev/sdX
```
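Since removing the journal sacrifices crash recovery, a gentler first step is to cut metadata write traffic with mount options; the device and mount point below are placeholders:

```
# Reduce atime update traffic on an existing mount without touching the journal.
mount -o remount,noatime /data        # /data is a placeholder mount point

# Equivalent persistent setting in /etc/fstab (placeholder device):
# /dev/sdX1  /data  ext4  defaults,noatime  0 2
```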
Swappiness: even with free memory the kernel may swap proactively; consider lowering it:

```
sysctl vm.swappiness=10
```
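To keep the value across reboots, drop it into a sysctl configuration file (the file name is arbitrary):

```
echo 'vm.swappiness=10' > /etc/sysctl.d/99-swappiness.conf
sysctl --system            # reload all sysctl configuration files
```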
Use SystemTap to trace which processes are generating block I/O:

```
# SystemTap: log every process that submits block I/O
probe kernel.function("submit_bio") {
    printf("%d %s submitting I/O\n", pid(), execname())
}
```
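To run the probe (assuming systemtap and the matching kernel debuginfo packages are installed; the file name is arbitrary):

```
# Save the probe as bio_trace.stp, then run it; Ctrl-C stops the trace.
stap -v bio_trace.stp
```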
Adjust vm.dirty_ratio and vm.dirty_background_ratio to prevent write spikes:

```
echo 20 > /proc/sys/vm/dirty_ratio
echo 10 > /proc/sys/vm/dirty_background_ratio
```
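It is worth recording the current values first so the change can be reverted, and note that raw echo settings do not survive a reboot:

```
# Current writeback thresholds, for reference before changing them.
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs
```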
A MySQL server showed similar symptoms. The solution was to set the following InnoDB options:

```
innodb_flush_method=O_DIRECT
innodb_io_capacity=2000
```
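After restarting MySQL you can confirm the values took effect; a sketch assuming a local client login with sufficient privileges:

```
mysql -e "SHOW VARIABLES LIKE 'innodb_flush_method'; SHOW VARIABLES LIKE 'innodb_io_capacity';"
```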