When load averages exceed the CPU core count while CPU idle stays healthy (65-70%) and I/O wait is negligible (0.02-0.03%), we're dealing with one of Linux's more subtle performance puzzles. The usual explanation, tasks stuck in uninterruptible sleep, doesn't apply here: ps shows no processes in D (uninterruptible sleep) state.
# Quick load check
uptime
15:42:30 up 3 days, 2:15, 1 user, load average: 4.61, 4.46, 4.48
# Process state distribution
ps -eo state | sort | uniq -c
1 R
173 S
2 Z
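Since Linux's load average counts both runnable (R) and uninterruptible (D) tasks, it helps to see which threads are actually runnable at each sampling instant. A minimal sampling loop, assuming /tmp/runnable.log is an acceptable scratch path:
# Log runnable and uninterruptible threads every 2 seconds
# (the ps process itself shows up as R in every sample)
while true; do
    date '+%H:%M:%S' >> /tmp/runnable.log
    ps -eLo state,pid,comm | awk '$1 == "R" || $1 == "D"' >> /tmp/runnable.log
    sleep 2
done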
The combination of Ruby 1.9's Global VM Lock and Unicorn's forking model creates unique scheduling behaviors:
# Trace a sample Unicorn worker to see where it spends its time blocking
strace -p $(pgrep -f 'unicorn.*worker' | head -1) -T -tt -o /tmp/unicorn-trace.log
# Common patterns found:
# 1. Contended mutexes in Ruby land
# 2. PGconn blocking calls (despite fast DB response)
# 3. Garbage collection pauses
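Because -T annotates each line with the syscall's wall-clock duration in angle brackets, the slow calls can be pulled straight out of the log. The 100 ms threshold below is an arbitrary cut-off, and the strace -c variant assumes a 30-second window is representative:
# Print any traced syscall that took longer than 100 ms
awk '{ d = $NF; gsub(/[<>]/, "", d); if (d + 0 > 0.1) print }' /tmp/unicorn-trace.log
# Aggregate per-syscall time for 30 seconds, then detach and print the summary
timeout 30 strace -c -f -p $(pgrep -f 'unicorn.*worker' | head -1)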
ESXi scheduling can introduce hidden latency spikes. Check for CPU ready time:
# ESXTOP metrics to monitor (host-side)
CPU: %RDY - Should be < 5%
CPU: %MLMTD - Should be 0%
MEM: %ACTV - Compare with guest view
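If you have shell access to the ESXi host, esxtop's batch mode captures these counters over time for offline review; the 5-second interval, 60-sample window, and output path here are arbitrary choices:
# On the ESXi host: record 60 samples at 5-second intervals to a CSV for later charting
esxtop -b -d 5 -n 60 > /tmp/esxtop-batch.csv
# Chart the guest's per-vCPU "% Ready" column (esxplot or a spreadsheet works)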
Despite modest network throughput (200KB/s), TCP-level issues such as connection churn or handshake latency can still stall workers and show up as load:
# Socket allocation status
ss -s
Total: 174 (kernel 199)
TCP: 154 (estab 72, closed 66, orphaned 0, synrecv 0, timewait 65/0), ports 0
# Connection latency measurement
tcpping -C -x 5 postgres-host 5432
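If tcpping isn't installed, bash's /dev/tcp pseudo-device gives a rough connect-time measurement with no extra tooling. This sketch assumes bash with /dev/tcp support and GNU date (for %N nanoseconds); postgres-host is the same placeholder as above:
# Rough TCP connect latency to PostgreSQL without tcpping
for i in 1 2 3 4 5; do
    t0=$(date +%s%N)
    timeout 2 bash -c 'exec 3<>/dev/tcp/postgres-host/5432' 2>/dev/null
    t1=$(date +%s%N)
    echo "connect attempt $i: $(( (t1 - t0) / 1000000 )) ms"
    sleep 1
done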
Here's an actionable checklist to isolate the issue:
#!/bin/bash
# Log timestamp, 1/5/15-minute load averages, and the vmstat runnable count every 5 seconds
while true; do
    echo "$(date '+%H:%M:%S') $(uptime | awk -F'[a-z]:' '{print $2}') \
$(vmstat 1 2 | tail -1 | awk '{print $1}')" >> /tmp/load_r.txt
    sleep 5
done
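Once a few hours of samples have accumulated, the interesting rows are the ones where the runnable count in the last column exceeded the two vCPUs; a quick filter over the log written above:
# Show recent samples where the vmstat run queue exceeded the 2 vCPUs
awk '$NF > 2' /tmp/load_r.txt | tail -20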
# Generate flame graphs (stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph repo and must be on PATH)
perf record -F 99 -p $(pgrep -f unicorn | paste -sd,) -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > unicorn.svg
Tune these Unicorn and OS parameters in your environment:
# /etc/sysctl.conf additions
kernel.sched_latency_ns = 10000000
kernel.sched_migration_cost_ns = 5000000
vm.dirty_ratio = 10
vm.swappiness = 1
# unicorn.rb adjustments
worker_processes 3   # modest over-provision for a 2-vCPU VM whose workers block on the network
preload_app true
# copy_on_write_friendly= exists only on REE; the respond_to? guard makes this a no-op on MRI 1.9
GC.respond_to?(:copy_on_write_friendly=) &&
  GC.copy_on_write_friendly = true
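After editing /etc/sysctl.conf, apply the changes and confirm the scheduler knobs took effect; note that on some older kernels the migration key is kernel.sched_migration_cost (without the _ns suffix):
# Apply and verify the sysctl settings
sudo sysctl -p /etc/sysctl.conf
sysctl kernel.sched_latency_ns kernel.sched_migration_cost_ns vm.dirty_ratio vm.swappiness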
Even with fast queries, connection setup can cause queuing:
# pgbouncer.ini snippet ([databases] section)
[databases]
maindb = host=primary-db port=5432 dbname=app_prod pool_mode=transaction
# Connection pool check
SELECT client_addr, state, count(*)
FROM pg_stat_activity
GROUP BY 1,2;
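Once PgBouncer is in place, its admin console shows whether clients are queuing for a server connection. This assumes the default listen port of 6432 and an admin user (here "pgbouncer") listed in admin_users:
# PgBouncer admin console: a non-zero cl_waiting column means clients are queuing for a backend
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'SHOW POOLS;'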
When examining Linux server performance metrics, we typically expect load average to correlate with CPU utilization. However, in this case we observe load averages consistently exceeding the 2-core capacity (peaking at 6.50) while CPU remains at 55-70% idle. The absence of significant I/O wait (0.02-0.03% iowait) or disk utilization (0.2-0.37% disk util) makes this particularly puzzling.
# Sample sar output showing the discrepancy (CPU utilization from sar -u alongside ldavg-5 from sar -q)
12:45:01 PM all 33.44% CPU ldavg-5: 3.31
01:25:01 PM all 41.60% CPU ldavg-5: 6.50
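The same comparison can be reproduced from the sysstat archives by pulling the CPU and run-queue reports for the identical window. The archive path varies by distribution (/var/log/sysstat on Debian/Ubuntu, /var/log/sa on RHEL), and sa15 stands in for the day of month:
# CPU utilization and run-queue/load from the same sysstat archive, same time window
sar -u -f /var/log/sysstat/sa15 -s 12:40:00 -e 13:30:00
sar -q -f /var/log/sysstat/sa15 -s 12:40:00 -e 13:30:00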
The Ruby 1.9 + Unicorn combination suggests several investigation paths:
# Check Unicorn worker status
$ ps aux | grep unicorn
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
app 12345 2.3 4.7 1023456 245678 ? S 09:00 1:23 unicorn_rails worker[1]
app 12346 2.1 4.6 1022156 244321 ? S 09:00 1:19 unicorn_rails worker[2]
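Watching the workers' scheduler state and kernel wait channel over time shows whether they are genuinely runnable or sleeping inside a specific kernel function (wchan names the function a sleeping task is blocked in):
# Poll worker state and kernel wait channel every 5 seconds
watch -n 5 "ps -o pid,stat,wchan:25,pcpu,etime,cmd -p \$(pgrep -f 'unicorn.*worker' | paste -sd,)"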
While network throughput appears normal (200KB/s), latency between application and PostgreSQL could cause workers to stall:
# Measure PostgreSQL connection latency
require 'benchmark'
Benchmark.measure { ActiveRecord::Base.connection.execute("SELECT 1") }
# => 0.000500 0.000020 0.000520 ( 0.510520)
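The same check from the shell separates connection setup and authentication cost from query execution. It assumes psql is installed on the app server and credentials are available (for example via a .pgpass entry); postgres-host and app_prod are the placeholders used above:
# Full connect + authenticate + query round trip, to compare with the in-app number above
time psql -h postgres-host -d app_prod -c 'SELECT 1' > /dev/null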
The ESXi 5.1 virtualization layer introduces potential CPU scheduling issues:
# Check for CPU steal time ("st" is the last column in vmstat, or field 9 of the cpu line in /proc/stat)
$ vmstat 1 3
$ awk '/^cpu /{print "steal jiffies:", $9}' /proc/stat
# Note: VMware does not expose steal time to the guest, so a zero value here does not rule
# out contention; the authoritative counter is %RDY in esxtop on the host.
When standard metrics don't reveal the issue, we need deeper tools:
# Install and run perf for CPU profiling
$ sudo apt-get install linux-tools-common linux-tools-generic
$ sudo perf top -p $(pgrep -f unicorn | paste -sd,)
# Sample output showing Ruby GC activity
23.45% ruby [.] gc_mark_children
18.76% ruby [.] rb_vm_exec
12.34% ruby [.] rb_yield_0
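Beyond sampling hot functions, perf stat can quantify scheduling churn (context switches and CPU migrations) across the workers. The 30-second window is arbitrary; -p takes a comma-separated PID list:
# Context switches, migrations, and on-CPU time for all Unicorn processes over 30 seconds
sudo perf stat -e context-switches,cpu-migrations,task-clock \
    -p $(pgrep -f unicorn | paste -sd,) -- sleep 30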
Tuning Unicorn and Ruby parameters can help mitigate load issues:
# unicorn.rb configuration adjustments
worker_processes 2 # Match vCPU count
timeout 30 # Reduce from default 60
preload_app true # Enable copy-on-write
# Ruby 1.9 GC tuning: these are environment variables read at interpreter startup, so export
# them before the Unicorn master boots (e.g. in the init script); setting ENV[] inside
# unicorn.rb is too late. RUBY_GC_HEAP_GROWTH_FACTOR only exists from Ruby 2.1 onward;
# 1.9 honours RUBY_GC_MALLOC_LIMIT, RUBY_HEAP_MIN_SLOTS, and RUBY_FREE_MIN.
export RUBY_GC_MALLOC_LIMIT=40000000
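After restarting Unicorn, it's worth confirming the variables actually reached the worker processes by reading a worker's environment out of /proc (run as the app user or root; /proc/PID/environ is not world-readable):
# Confirm the GC variables are present in a running worker's environment
tr '\0' '\n' < /proc/$(pgrep -f 'unicorn.*worker' | head -1)/environ | grep RUBY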
Traditional tools might miss Ruby-specific bottlenecks:
# Install rbtrace for live process inspection; the application must already require the
# rbtrace gem so the tracer can attach to a running worker
$ gem install rbtrace
$ rbtrace -p UNICORN_PID -e 'GC.stat'
# Sample output showing memory pressure (exact GC.stat keys vary by Ruby version)
{
:count=>23,
:heap_allocated_pages=>100,
:heap_sorted_length=>132,
:total_allocated_objects=>4523567,
:malloc_increase_bytes=>1203456
}
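To see how the heap evolves under load rather than at a single instant, the same eval can be polled on an interval. A minimal sketch, again assuming the application already loads rbtrace and using an arbitrary log path:
# Poll GC.stat from the first worker every 30 seconds with a timestamp
while true; do
    echo "=== $(date '+%H:%M:%S') ===" >> /tmp/gc_stat.log
    rbtrace -p $(pgrep -f 'unicorn.*worker' | head -1) -e 'GC.stat' >> /tmp/gc_stat.log 2>&1
    sleep 30
done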