Diagnosing High Linux Load Averages Despite Low CPU/Memory Utilization: A Developer’s Guide to System Bottlenecks


Linux load averages count processes that are either runnable or in uninterruptible (D-state) sleep, averaged over 1, 5, and 15 minutes. A load of 19.79 on your 16-core system suggests significant contention, yet CPU utilization shows 95% idle. This discrepancy indicates blocked processes waiting on resources other than CPU cycles.
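
To put those numbers in context, compare the 1-minute load with the number of online cores; a quick check using only /proc/loadavg and nproc:

# First field of /proc/loadavg is the 1-minute load average
awk '{print "1-min load:", $1}' /proc/loadavg
# Number of online cores; sustained load well above this means queued or blocked tasks
nproc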

First, verify disk I/O bottlenecks with:

# Check disk latency
iostat -xm 2
# Detailed process I/O
iotop -oPa

For network-related waits:

# Network socket statistics
ss -tulpn
# Count TCP sockets by state (field 4 of /proc/net/tcp is the state in hex)
cat /proc/net/tcp | awk '{print $4}' | sort | uniq -c
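
The hex codes in field 4 map to TCP states (01 = ESTABLISHED, 06 = TIME_WAIT, 0A = LISTEN); if you prefer readable names, ss can produce the same breakdown directly:

# Count TCP sockets grouped by state name
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c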

Your memory shows:

Mem:  3940MB total, 3910MB used
-/+ buffers/cache: 2445MB used
Swap: 4110MB total, 236MB used

Memory only appears full because Linux aggressively caches file data. The key metric is swap usage, which is minimal here. Check for memory pressure with:

# Page fault statistics
vmstat -s
# Slab memory usage
slabtop -o
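
To confirm the box is caching rather than thrashing, watch swap activity over a few samples; sustained non-zero si/so values would indicate real memory pressure:

# Sample every 2 seconds, 5 times: si/so = swap in/out (KB/s), b = blocked processes
vmstat 2 5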

Your ps aux output reveals several kernel threads with accumulated CPU time:

  • kswapd1: 94:48 CPU time - memory reclaim
  • Multiple kblockd threads: Block device operations
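
To reproduce those figures, list the children of kthreadd (PID 2) sorted by accumulated CPU time; a minimal sketch using procps ps:

# Kernel threads with the most accumulated CPU time
ps --ppid 2 -o pid,comm,cputime --sort=-cputime | head -15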

Check for storage issues with:

# Disk health check
smartctl -a /dev/sda
# Filesystem errors
dmesg | grep -i error

For deeper analysis:

# System-wide tracing
perf record -a -g sleep 10
perf report

# Process-specific waits
strace -p [PID] -c
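
If the waits are spread across many processes, per-process I/O accounting can narrow things down; this assumes the sysstat package (which also provides sar and iostat) is installed:

# Per-process disk reads/writes, sampled every 2 seconds, 5 times
pidstat -d 2 5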

Verify system limits that might cause process queuing:

# Process limits
sysctl kernel.pid_max
# File handles
cat /proc/sys/fs/file-nr

# IO scheduler
cat /sys/block/sda/queue/scheduler
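
As a sanity check against those limits, compare the live counts with the configured maximums; a rough sketch:

# Thread count vs. the PID limit
echo "threads: $(ps -eL --no-headers | wc -l) / pid_max: $(sysctl -n kernel.pid_max)"
# Allocated vs. maximum file handles (first and third fields of file-nr)
awk '{print "file handles:", $1, "/", $3}' /proc/sys/fs/file-nr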

In one case, high load was caused by NFS timeouts. The solution was:

# /etc/fstab adjustment
nas:/exports /mnt/nfs nfs rw,hard,intr,timeo=600,retrans=2 0 0

# Sysctl tuning
echo "sunrpc.tcp_slot_table_entries=128" >> /etc/sysctl.conf
sysctl -p
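
To verify that NFS retransmissions were really the culprit (and that the tuning helped), the RPC client counters make a good before/after metric; this assumes the nfs-utils client tools are installed:

# RPC calls vs. retransmissions for the NFS client
nfsstat -rc
# Per-mount NFS latency, sampled every 2 seconds
nfsiostat 2 5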

Always measure changes with before/after comparisons using sar -q or similar tools.
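
For example, sampling the run queue for a minute before and after a change gives a directly comparable baseline (sar is part of sysstat):

# runq-sz, blocked tasks and load averages every 5 seconds, 12 samples
sar -q 5 12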


When your Linux system reports load averages of 19-21 (as shown in the top output) while displaying 95% CPU idle time and minimal memory pressure, you're facing a classic I/O-bound scenario. The system isn't CPU-starved - it's waiting on something else.

# Top output highlights:
Load average: 19.79, 21.25, 18.87
Cpu(s): 95.0%id, 0.6%wa   # id = idle, wa = I/O wait
253 sleeping processes

The 0.6% iowait might seem low, but combined with the high load average, it suggests processes are blocked waiting for I/O. Let's examine disk performance:

# Check disk latency with iostat -x 2
Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    2.00  0.00  3.00   0.00  16.00     10.67      0.03  10.00   5.00   1.50

1. Identify blocked processes (a sampling loop follows this list):

ps -eo state,pid,cmd | grep "^D"  # Look for D state (uninterruptible sleep)

2. Check filesystem mounts for noatime:

mount | grep -E 'ext[34]|xfs'  # look for noatime/relatime in the mount options

3. Monitor disk I/O in real-time:

iotop -oP  # Show only active I/O operations
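
Building on step 1, a small loop that samples the D-state count shows whether the blocking is sustained or just a burst:

# Print the number of uninterruptible (D state) processes every 2 seconds; stop with Ctrl-C
while true; do echo "$(date +%T)  D-state: $(ps -eo state= | grep -c '^D')"; sleep 2; done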

Journaling overhead: On ext4, the journal itself adds write traffic. For scratch or easily rebuilt filesystems it can be removed entirely:

tune2fs -O ^has_journal /dev/sdX  # removes the journal; run on an unmounted filesystem

Swappiness: Even with free memory, the kernel may swap anonymous pages too eagerly. Check the current value and consider lowering it:

sysctl vm.swappiness         # default is 60
sysctl -w vm.swappiness=10   # reduce the tendency to swap

Use SystemTap to see which processes are generating block I/O:

# SystemTap: log each block I/O submission with the submitting process
probe kernel.function("submit_bio") {
    printf("%d %s submitted block I/O\n", pid(), execname())
}
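
Assuming SystemTap and matching kernel debuginfo are installed, save the probe to a file (the name below is arbitrary) and run it with stap; stop it with Ctrl-C:

stap -v submit_bio_trace.stp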

Adjust the dirty-page thresholds to smooth out write spikes:

echo 20 > /proc/sys/vm/dirty_ratio              # max % of RAM in dirty pages before writers block
echo 10 > /proc/sys/vm/dirty_background_ratio   # % at which background writeback kicks in
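
Writes to /proc only last until reboot; to persist the values, use the same sysctl.conf mechanism as the NFS tuning above:

echo "vm.dirty_ratio=20" >> /etc/sysctl.conf
echo "vm.dirty_background_ratio=10" >> /etc/sysctl.conf
sysctl -p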

A MySQL server showed similar symptoms; the fix was to have InnoDB bypass the page cache and raise its I/O capacity in my.cnf:

innodb_flush_method=O_DIRECT
innodb_io_capacity=2000
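
These settings belong in the [mysqld] section of my.cnf (innodb_flush_method only takes effect after a restart); confirm they are active with:

mysql -e "SHOW VARIABLES LIKE 'innodb_flush_method'"
mysql -e "SHOW VARIABLES LIKE 'innodb_io_capacity'"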