Debugging High CPU Load When Top Processes Show 0% Utilization in Linux Systems


This is a classic case where system monitoring tools tell conflicting stories. The load average (shown as 28.14 in your screenshot) indicates significant system stress, while top shows most processes consuming minimal CPU. Here are the key possibilities:


$ uptime
 16:32:45 up 12 days,  3:45,  2 users,  load average: 28.14, 25.67, 20.33
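
The load average counts tasks that are runnable or stuck in uninterruptible sleep (state D), not just tasks actively burning CPU, so it helps to see how many tasks fall into each bucket right now:

$ grep -E '^procs_(running|blocked)' /proc/stat
$ ps -eo state= | sort | uniq -c | sort -rn

A large procs_blocked count combined with idle CPUs points straight at I/O or a stuck driver rather than a compute-hungry process.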

The standard top command might not reveal the full picture. Try these alternatives:

1. Use htop with thread view


$ sudo htop
# Press F2 → Display options → enable "Tree view" and "Show custom thread names"
# Watch the S column for threads stuck in state "D" (uninterruptible sleep)

2. Check for Disk I/O Bottlenecks


$ sudo iotop -oP
$ sudo dstat -td --disk-util --top-bio

3. Examine Kernel Threads


$ ps -eLf | grep "\["                                  # kernel threads show up in square brackets
$ top -H -p "$(pgrep -d, kworker | cut -d, -f1-20)"    # top accepts at most 20 PIDs

Based on the screenshot and common scenarios, these are the usual suspects (a combined quick check follows the list):

  • Disk I/O Wait: Check %wa in top's CPU summary line
  • Kernel Workers: Look for kworker processes consuming CPU
  • Interrupt Storm: Check watch -n1 "cat /proc/interrupts"
  • Memory Pressure: Examine vmstat 1 and free -h
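
A quick sweep over those four suspects, as a minimal sketch (the PSI pressure files need kernel 4.20+):

$ vmstat 1 5                                                # wa = I/O wait, b = tasks blocked in D state
$ ps -eo pid,pcpu,stat,comm --sort=-pcpu | head -15         # any kworker/ksoftirqd near the top?
$ free -h                                                   # memory pressure / swap activity
$ cat /proc/pressure/io /proc/pressure/memory 2>/dev/null   # PSI stall counters, if available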

Perf Tool Analysis


$ sudo perf top
$ sudo perf stat -a sleep 10
$ sudo perf record -a -g -- sleep 30
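
Once the perf record run finishes, the report shows whether the samples landed in kernel or user space (it reads the perf.data file written to the current directory):

$ sudo perf report --stdio --sort=comm,dso | head -40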

SystemTap Script Example


# Trace every filesystem-layer kernel function hit by a given process
# (here "java"); high call volume pinpoints the hot path. Note that this
# probe is broad and adds noticeable overhead on a busy box.
probe kernel.function("*@fs/*.c") {
    if (execname() == "java") {
        printf("%s %s\n", probefunc(), execname())
    }
}

BCC Tools Investigation


$ sudo /usr/share/bcc/tools/cpudist
$ sudo /usr/share/bcc/tools/offcputime -K
$ sudo /usr/share/bcc/tools/runqlat

In one production incident, we discovered:

  1. High load (35+) with idle CPUs
  2. vmstat showed 100% wa
  3. iotop revealed no processes doing I/O
  4. Solution: Bad sector on SSD causing kernel retries
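
If you suspect similar hardware trouble, the kernel log and SMART data are the quickest confirmation (/dev/sda below is just a placeholder for the affected device):

$ dmesg -T | grep -iE 'i/o error|blk_update_request|ata[0-9]' | tail -20
$ sudo smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'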

For persistent monitoring, consider this Prometheus scrape config for node_exporter (the collect[] params limit which collectors are pulled on each scrape):


- job_name: 'node_advanced'
  static_configs:
    - targets: ['localhost:9100']
  params:
    collect[]:
      - cpu
      - diskstats
      - interrupts
      - softnet
      - pressure
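
Before wiring that into Prometheus you can check that node_exporter honours the collect[] filter (this assumes the exporter is on the default port 9100; -g stops curl from glob-expanding the brackets):

curl -sg 'http://localhost:9100/metrics?collect[]=cpu&collect[]=pressure' | head -20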

When you see high load averages (e.g., 15-20) but individual process CPU usage in top is minimal (0-2%), the load is coming from tasks that are waiting rather than computing. Start with a quick snapshot:

# Quick diagnostic command sequence
watch -n 1 "uptime; ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 10"

From my experience troubleshooting production servers, these are the most likely causes:

1. Zombie Process Accumulation

Zombie processes (state 'Z') don't burn CPU themselves, but a growing zombie count usually means a parent process is stuck or misbehaving, and it often shows up alongside load problems. Count them with:

ps -eo stat= | grep -c '^Z'

2. Disk I/O Bottlenecks

When processes are blocked waiting for I/O (state 'D'), they count toward the load average while showing near-zero %CPU. Check with:

iotop -oP
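
To see exactly which tasks are sitting in uninterruptible sleep, and what they are waiting on, a quick sketch:

ps -eo state,pid,wchan:20,cmd | awk '$1 == "D"'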

3. Short-lived Processes

Processes that spawn/die too quickly for top to catch. Monitor with:

sudo perf top -g
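
If the BCC tools mentioned earlier are installed, tracing exec() directly is usually clearer than sampling; the path below assumes the same upstream bcc install location used above:

sudo /usr/share/bcc/tools/execsnoop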

Here's my go-to toolkit for these situations:

# Install necessary tools first (on Debian/Ubuntu, perf comes from the linux-tools / linux-perf packages)
sudo apt-get install sysstat dstat strace ltrace linux-tools-$(uname -r)

# Real-time monitoring
dstat -tlacdn --top-cpu --top-mem --top-io

Using SystemTap for Deep Inspection

For production systems where you need minimal overhead:

# SystemTap script to spot process churn: count how often each command
# is scheduled onto a CPU; a command racking up huge counts from
# ever-changing PIDs is spawning short-lived processes
global sched_count

probe begin {
    printf("Monitoring scheduling activity...\n")
}

probe scheduler.cpu_on {
    sched_count[execname()]++
}

probe timer.s(10) {
    foreach (name in sched_count- limit 20) {
        printf("%s scheduled %d times in the last 10s\n", name, sched_count[name])
    }
    delete sched_count
}
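
Save the script (the filename here is arbitrary) and run it with stap; -v just prints the compilation passes:

sudo stap -v catch-short-lived.stp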

When userspace tools don't reveal the issue:

# List threads running outside the normal TS scheduling class (real-time and kernel helpers)
ps -eLo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm | awk '$3 != "TS"'
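
If a specific kworker looks suspicious, its kernel stack shows what it is actually doing (replace the pgrep expression with the PID you care about; needs root):

sudo cat /proc/$(pgrep -o kworker)/stack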

Interrupt Analysis

High interrupt rates can manifest as CPU load:

# Field 2 is the CPU0 counter; adjust the column (or sum across CPUs with awk) as needed
watch -n 1 'sort -rnk 2 /proc/interrupts | head -20'
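
Softirq activity can masquerade as load in the same way; /proc/softirqs gives the equivalent per-CPU view:

watch -n 1 'grep -E "NET_RX|NET_TX|TIMER|RCU|SCHED" /proc/softirqs'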

I recently diagnosed a server with an 18.75 load average but minimal per-process CPU:

  1. Ran pidstat 1, which flagged a process spending nearly all of its time in system CPU
  2. Attached strace -c -p [PID] to that process; the syscall summary was dominated by mmap()
  3. Discovered a misconfigured logging service creating thousands of temp files

The fix was simple:

# Adjusted sysctl parameters
echo "vm.dirty_ratio = 10" >> /etc/sysctl.conf
echo "vm.dirty_background_ratio = 5" >> /etc/sysctl.conf
sysctl -p
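
To confirm the new values are live and watch the dirty-page backlog drain:

sysctl vm.dirty_ratio vm.dirty_background_ratio
grep -E '^(Dirty|Writeback):' /proc/meminfo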

To catch a recurrence early, wire the check into your monitoring, for example as a Nagios command definition:

# Sample Nagios check for zombie processes
define command {
    command_name check_zombies
    command_line /usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
}
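
The plugin can be dry-run by hand to confirm the thresholds before Nagios picks it up:

/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z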