This is a classic case where system monitoring tools tell conflicting stories. The load average (shown as 28.14 in your screenshot) indicates significant system stress, while top shows most processes consuming minimal CPU. Here are the key possibilities:
$ uptime
16:32:45 up 12 days, 3:45, 2 users, load average: 28.14, 25.67, 20.33
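A load average this high with idle CPUs usually means tasks are piling up in uninterruptible sleep rather than actually running. A quick sanity check (my own addition, not something from your screenshot) is to count tasks in state D:
$ ps -eo stat,comm | awk '$1 ~ /^D/ {print; count++} END {print count+0, "tasks in D state"}'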
The standard top command might not reveal the full picture. Try these alternatives:
1. Use htop with thread view
$ sudo htop
# Press F2 → Display options → enable "Tree view" and "Show custom thread names"
# Watch the S column for threads in state "D" (uninterruptible sleep)
2. Check for Disk I/O Bottlenecks
$ sudo iotop -oP
$ sudo dstat -td --disk-util --top-bio
3. Examine Kernel Threads
$ ps -eLf | grep "\["                  # kernel threads show up in square brackets
$ top -H -p $(pgrep -d, kworker)       # top wants a comma-separated PID list
Based on the screenshot and common scenarios:
- Disk I/O Wait: Check %wa in top's CPU summary line
- Kernel Workers: Look for kworker processes consuming CPU
- Interrupt Storm: Check watch -n1 "cat /proc/interrupts"
- Memory Pressure: Examine vmstat 1 and free -h (the PSI check below gives the same signal more directly)
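On kernels new enough to expose pressure stall information (4.20+ with PSI enabled), the /proc/pressure files tell you directly whether tasks are stalling on CPU, I/O, or memory, which is a quick way to confirm which of the scenarios above you're in:
$ cat /proc/pressure/cpu
$ cat /proc/pressure/io
$ cat /proc/pressure/memory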
Perf Tool Analysis
$ sudo perf top
$ sudo perf stat -a sleep 10
$ sudo perf record -a -g -- sleep 30
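After the perf record run above, browse the captured call graphs with perf report; when userspace looks idle, sorting by command, DSO, and symbol usually points at the kernel paths that are busy:
$ sudo perf report -g --sort comm,dso,symbol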
SystemTap Script Example
probe kernel.function("*@fs/*.c") {
    if (execname() == "java") {
        printf("%s %s\n", probefunc(), execname())
    }
}
BCC Tools Investigation
$ sudo /usr/share/bcc/tools/cpudist
$ sudo /usr/share/bcc/tools/offcputime -K
$ sudo /usr/share/bcc/tools/runqlat
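These tools also accept a duration, and offcputime takes a PID filter, which keeps overhead and output manageable on a busy box (the java lookup here is just illustrative; substitute your own suspect):
$ sudo /usr/share/bcc/tools/runqlat 1 10
$ sudo /usr/share/bcc/tools/offcputime -K -p $(pgrep -nx java) 30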
In one production incident, we discovered:
- High load (35+) with idle CPUs
- vmstat showed 100% wa
- iotop revealed no processes doing I/O
- Solution: a bad sector on the SSD was causing kernel retries
For persistent monitoring, consider this Prometheus exporter configuration:
- job_name: 'node_advanced'
  static_configs:
    - targets: ['localhost:9100']
  params:
    collect[]:
      - cpu
      - diskstats
      - interrupts
      - softnet
      - pressure
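To actually get paged on this pattern, a rule along these lines fires when the 5-minute load is well above the core count while the CPUs sit mostly idle. This is only a sketch: the metric names assume a stock node_exporter, and the thresholds, group name, and alert name are my own choices.
groups:
  - name: load-anomalies
    rules:
      - alert: HighLoadButIdleCpu
        expr: |
          (node_load5 > on(instance) 2 * count by (instance) (node_cpu_seconds_total{mode="idle"}))
          and on(instance)
          (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.7)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Load is high but CPUs are idle; suspect I/O wait or a D-state pile-up"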
When you see high load averages (e.g., 15-20) but individual process CPU usage in top shows minimal utilization (0-2%), this typically indicates one of these scenarios:
# Quick diagnostic command sequence
watch -n 1 "uptime; ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 10"
From my experience troubleshooting production servers, these are the most likely causes:
1. Zombie Processes Accumulation
Zombie processes (state 'Z') don't consume CPU and don't count toward the load average themselves, but a pile-up of them usually means a parent process is misbehaving and may be what's driving the load. Check with:
ps aux | awk '{print $8}' | grep -c Z
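If that count is non-trivial, the more useful question is which parent is leaking them. Something like this (purely illustrative) groups zombies by parent PID:
ps -eo ppid,stat | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head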
2. Disk I/O Bottlenecks
When processes are blocked waiting for I/O (state 'D'), they count toward the load average but show minimal %CPU. Check with:
iotop -oP
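Pair that with per-device latency from sysstat to see whether a single disk is the choke point (-x gives extended stats, -z skips idle devices):
iostat -xz 1 5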
3. Short-lived Processes
Processes that spawn and die too quickly for top to catch. Monitor with:
sudo perf top -g
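If you already have BCC installed, execsnoop is purpose-built for this: it prints every exec() as it happens, so processes that live for milliseconds still show up:
sudo /usr/share/bcc/tools/execsnoop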
Here's my go-to toolkit for these situations:
# Install necessary tools first
sudo apt-get install sysstat dstat strace ltrace linux-tools-generic   # perf ships in linux-tools-* (linux-perf on Debian), not a standalone 'perf' package
# Real-time monitoring
dstat -tlacdn --top-cpu --top-mem --top-io
Using SystemTap for Deep Inspection
For production systems where you need minimal overhead:
# SystemTap script to catch processes too short-lived for top to display
global sched_count

probe begin {
    printf("Monitoring scheduler activity...\n")
}

# Count how often each PID is scheduled onto a CPU
probe scheduler.cpu_on {
    sched_count[pid()]++
}

# Every 10 seconds, report the most frequently scheduled PIDs and reset
probe timer.s(10) {
    foreach (pid in sched_count- limit 20) {
        printf("PID %d was scheduled %d times in the last 10s\n", pid, sched_count[pid])
    }
    delete sched_count
}
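To run it, save the script to a file and launch it with stap; the filename is arbitrary, and you'll need the systemtap package plus kernel debuginfo matching your running kernel:
sudo stap sched-activity.stp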
When userspace tools don't reveal the issue:
# List threads running outside the default TS scheduling class (real-time threads can quietly monopolize a core)
ps -eLo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm | awk '$3 != "TS"'
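To look specifically at kernel threads (children of kthreadd, PID 2) sorted by CPU use:
ps --ppid 2 -o pid,stat,pcpu,wchan:14,comm --sort=-pcpu | head -15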
Interrupt Analysis
High interrupt rates can manifest as CPU load:
watch -n 1 'cat /proc/interrupts | sort -rnk 4 | head -20'
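Softirqs (network RX, block I/O completion, timers) don't appear in /proc/interrupts, so it's worth watching them separately:
watch -n 1 'cat /proc/softirqs'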
Recently diagnosed a server with 18.75 load average but minimal process CPU:
- Ran pidstat 1 to single out the process with unusually high system time
- Used strace -c -p [PID] on it, which showed frequent mmap() and temp-file system calls
- Discovered a misconfigured logging service creating thousands of temp files
The fix was simple:
# Adjusted sysctl parameters
echo "vm.dirty_ratio = 10" >> /etc/sysctl.conf
echo "vm.dirty_background_ratio = 5" >> /etc/sysctl.conf
sysctl -p
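To confirm the change is having an effect, watch the amount of dirty and writeback memory before and after:
grep -E 'Dirty|Writeback' /proc/meminfo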
Implement these monitoring solutions:
# Sample Nagios check for zombie processes
define command {
    command_name    check_zombies
    command_line    /usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
}
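Since D-state processes are what usually drive this kind of load, a companion check along the same lines is worth adding (the command name and thresholds here are my own choices):
define command {
    command_name    check_dstate_procs
    command_line    /usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s D
}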