Diagnosing High Server Load with Low CPU Usage: NFS vs Disk I/O Bottlenecks


When your server shows load averages spiking to 20-30 while the CPU reports itself as 98% idle, you're dealing with a classic I/O wait scenario. The vmstat output reveals the telltale signs:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1      0 1298952      0      0    0    0     0     0    0 9268  7  5 70 19  0

Key observations from your data:

  • wa (wait) percentage frequently hits double digits (13-19%)
  • bi (blocks in) shows periodic spikes (240 at maximum)
  • Load spikes correlate with I/O operations
  • Context switches (cs) increase dramatically during high-load periods
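
A quick way to catch these moments as they happen is to let awk flag only the samples where iowait climbs, instead of eyeballing the whole stream. This is a minimal sketch assuming the standard 17-column vmstat layout shown above (wa is field 16):

# Run vmstat for a minute and print only samples where iowait ('wa') exceeds 10%;
# NR > 3 skips the two header lines and the initial since-boot average
vmstat 1 60 | awk 'NR > 3 && $16 > 10 {print "high iowait sample:", $0}'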

While NFS could certainly be the culprit, let's gather more evidence before concluding. Try these diagnostic commands:

# Check NFS statistics (client side, then server side)
nfsstat -c
nfsstat -s

# Monitor NFS operations in real-time
mountstats /mount/point

# Alternative: check per-mount NFS stats
cat /proc/fs/nfsfs/volumes
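
If NFS is struggling, the client-side retransmission counters usually give it away first. A rough watcher, assuming the standard nfsstat from nfs-utils:

# Print client RPC counters every 5 seconds; a 'retrans' figure that keeps
# climbing points at a slow or overloaded NFS server, or the network path to it
while true; do
    date
    nfsstat -rc        # -r: RPC counters only, -c: client side
    sleep 5
done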

For SAN/FC storage issues:

# Check block device latency
iostat -x 1

# Identify processes causing I/O
iotop -oPa

# Check SCSI layer for errors
dmesg | grep -i scsi
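
It also helps to see whether requests are actually piling up at the block layer. On reasonably recent kernels each disk exposes an inflight counter in sysfs; persistently non-zero values while the CPU sits idle mean the device (or the SAN behind it) is what everyone is waiting on. A small loop, assuming sd* device names:

# Reads and writes currently outstanding per SCSI disk
for dev in /sys/block/sd*; do
    printf '%s inflight (reads writes): ' "${dev##*/}"
    cat "$dev/inflight"
done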

Since this is a VPS on FC SAN:

  • Check for hypervisor-level contention (requires access to the host itself): virsh nodecpustats
  • Verify SAN queue depth: cat /sys/block/sdX/queue/nr_requests
  • Monitor multipath I/O if applicable: multipath -ll
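
The queue-depth check in that list is quick to run across every disk in one pass, which helps when the VPS exposes several SAN LUNs. A small loop, again assuming sd* naming:

# Queue depth and elevator per disk; shallow queues plus a busy FC fabric
# tend to serialize I/O and inflate load without using any CPU
for dev in /sys/block/sd*; do
    printf '%-6s nr_requests=%-5s scheduler=%s\n' \
        "${dev##*/}" \
        "$(cat "$dev/queue/nr_requests")" \
        "$(cat "$dev/queue/scheduler")"
done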

Potential fixes to implement and measure:

# For NFS: adjust mount options
mount -o remount,rsize=32768,wsize=32768,async,intr,tcp /nfs/mount
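
Two caveats worth knowing: intr has been a no-op since kernel 2.6.25, and rsize/wsize changes are often ignored on a plain remount, so verify what the kernel actually applied:

# Show the NFS mounts and the options the kernel is really using
grep ' nfs' /proc/mounts
nfsstat -m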

# For general I/O: tune kernel parameters
echo 100 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/dirty_background_ratio
echo 5000 > /proc/sys/vm/dirty_expire_centisecs
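
The same knobs can be set through sysctl, which is easier to revert and to persist once a value has proven itself. Note that these particular ratios are aggressive (they let more dirty data accumulate before flushing), so treat them as experiment points rather than recommendations:

# Equivalent to the echo commands above; add to /etc/sysctl.conf (or a file
# in /etc/sysctl.d/) only after measuring the effect
sysctl -w vm.dirty_ratio=100
sysctl -w vm.dirty_background_ratio=50
sysctl -w vm.dirty_expire_centisecs=5000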

Remember to baseline performance before/after changes using:

vmstat 1 10
iostat -x 1 10
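
To keep the comparison honest, capture each run to a timestamped file; the file names below are just a suggested convention:

# Snapshot current behaviour before touching anything, repeat after each change
ts=$(date +%Y%m%d-%H%M%S)
vmstat 1 10     > "vmstat-$ts.log"
iostat -x 1 10  > "iostat-$ts.log"
nfsstat -c      > "nfsstat-$ts.log"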

High load averages (20-30) combined with a mostly idle CPU (98%) typically indicate I/O wait issues. The vmstat output clearly shows significant time spent in the 'wa' state (up to 19%) coinciding with increased I/O operations.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1      0 1298952      0      0    0    0     0     0    0 9268  7  5 70 19  0

Notice how the 'wa' (wait) column spikes during I/O operations. The blocked processes ('b' column) also correlate with these events.

While your VPS sits on a Fibre Channel SAN, NFS can still introduce latency due to:

  • Network round-trip time
  • NFS server processing time
  • NFS protocol overhead (especially with sync writes)
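
The sync-write point in particular is easy to measure directly: force small synchronous writes onto the NFS mount and onto local storage and compare. The paths below are placeholders for your own mount points:

# Each 4 KiB write must be committed before the next one starts, so NFS
# commit latency shows up immediately in dd's reported throughput
dd if=/dev/zero of=/nfs/mount/dd-sync-test bs=4k count=1000 oflag=sync
dd if=/dev/zero of=/tmp/dd-sync-test       bs=4k count=1000 oflag=sync
rm -f /nfs/mount/dd-sync-test /tmp/dd-sync-test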

To isolate NFS issues, try these commands:

# Check NFS server response times
nfsiostat -d 2 5

# Identify processes waiting on I/O
iotop -o

# Check disk latency
iostat -x 1 5

If NFS is the bottleneck:

# Consider these mount options for performance:
mount -o rsize=65536,wsize=65536,hard,intr,noatime,nodiratime,tcp,nolock [server]:/[export] /mnt
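
Keep in mind that intr is ignored on kernels newer than 2.6.25, so it can be dropped. If the options help, a hypothetical /etc/fstab entry (server and paths are placeholders) keeps them across reboots:

server:/export  /mnt/nfs  nfs  rsize=65536,wsize=65536,hard,noatime,nodiratime,tcp,nolock  0 0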

For SAN-related issues:

# Check queue depths and device mapper settings
cat /sys/block/sdX/queue/nr_requests
dmsetup status

Don't overlook other possibilities:

  • Check for memory pressure (even with free memory shown)
  • Verify if the hypervisor is throttling I/O
  • Test with local storage to establish baseline performance
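
For the local baseline in the last point, even a crude dd run is enough to show whether the wait is NFS-specific or hits the SAN-backed disks too (the path below is a placeholder on local storage):

# Sequential write with a final fdatasync so the figure includes flush time;
# run the identical command against a directory on the NFS mount to compare
dd if=/dev/zero of=/var/tmp/baseline-test bs=1M count=256 conv=fdatasync
rm -f /var/tmp/baseline-test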

Remember that vmstat's 'wa' includes all I/O wait, not just disk. Network filesystems can trigger this through different mechanisms than local storage.
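
One way to tell the two apart on a live system is to look at what the blocked processes are actually sleeping on: kernel wait channels containing nfs or rpc point at the NFS client, while scsi/blk/dm names point at the block layer. A minimal check:

# Tasks in uninterruptible sleep ('D') and the kernel function they wait in
ps -eo state,pid,comm,wchan:32 | awk '$1 == "D"'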