Diagnosing High Load Average with Low CPU Usage in Java WebLogic Environments on Solaris


When your Solaris server shows high load averages (54-63) while CPU utilization remains under 5%, you're facing a classic performance puzzle. The key indicators from your diagnostic outputs reveal:

# prstat -Z shows
Total: 135 processes, 3167 lwps, load averages: 54.48, 62.50, 63.11

Yet vmstat reports 92-98% idle time. This discrepancy points to resource contention that isn't CPU-bound.

Examining your Java processes (WebLogic domains) shows interesting thread patterns:

java/225  # One domain with 225 threads
java/209  # Another with 209 threads
...
# Total LWPs (light-weight processes): 3167

A thread count this high points to one or more of the following:

  • Thread pool saturation
  • Lock contention
  • Blocked I/O operations
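
To get a first-order read on which of these it is before wading through thread dumps, the JVM's own threading MXBean can count threads by state and flag monitor deadlocks. This is a minimal in-process sketch (class name and output format are my own, not from your environment); it could be wired into a servlet or scheduled task in each domain:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Counts threads by state inside the running JVM and reports monitor deadlocks.
public class ThreadStateSummary {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> byState =
                new EnumMap<Thread.State, Integer>(Thread.State.class);
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info == null) continue;               // thread exited between calls
            Thread.State state = info.getThreadState();
            Integer count = byState.get(state);
            byState.put(state, count == null ? 1 : count + 1);
        }
        System.out.println("Threads by state: " + byState);

        long[] deadlocked = threads.findMonitorDeadlockedThreads();
        if (deadlocked != null) {
            System.out.println(deadlocked.length + " threads deadlocked on monitors");
        }
    }
}

A large BLOCKED or WAITING population here lines up with the LCK/SLP columns you will see later from prstat -mL.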

Your netstat -i output shows healthy network throughput without errors:

    input   aggr26    output       input  (Total)    output
packets errs  packets errs  colls  packets errs  packets errs  colls
1500233798 0     1489316495 0     0      3608008314 0     3586173708 0     0

This rules out network bottlenecks as the primary issue.

To pinpoint Java-level issues, run these commands for each WebLogic domain:

# Get thread dumps (run 3-5 times at 10s intervals)
kill -3 <java_pid>

# Check GC behavior
jstat -gcutil <java_pid> 1000 10

Example output to watch for:

S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
0.00  96.88  63.78  41.63  99.99   1309   75.021    5    3.104   78.125
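
In that sample the permanent generation (P) is at 99.99% and five full GCs have already run, so GC pressure is worth ruling out. The same numbers can be cross-checked from inside each JVM via the standard management beans, which is convenient if you already have a monitoring hook in the domain. A small sketch using only java.lang.management (nothing WebLogic-specific):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Prints cumulative GC counts/times and old-generation occupancy for this JVM.
public class GcSnapshot {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": collections=" + gc.getCollectionCount()
                    + " totalMs=" + gc.getCollectionTime());
        }
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Old-gen pool names vary by collector: "CMS Old Gen", "PS Old Gen", "Tenured Gen"
            if (pool.getName().contains("Old") || pool.getName().contains("Tenured")) {
                MemoryUsage usage = pool.getUsage();
                if (usage.getMax() > 0) {
                    System.out.println(pool.getName() + ": "
                            + (100L * usage.getUsed() / usage.getMax()) + "% in use");
                }
            }
        }
    }
}

If the full GC count climbs steadily while old-gen occupancy stays high, GC is a real suspect even without visible CPU spikes.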

Common configuration issues in WebLogic 10.3.5 that could cause this:

  1. Work manager max-threads-constraint:
    <max-threads-constraint>
      <name>MaxThreadsConstraint</name>
      <count>100</count>
    </max-threads-constraint>
  2. Stuck Thread Detection:
    <stuck-thread-max-time>600</stuck-thread-max-time>
    <stuck-thread-timer-interval>60</stuck-thread-timer-interval>

Since your apps connect to Oracle, check connection pool usage:

# JDBCDataSourceRuntimeMBean attributes to watch (via the console's data source Monitoring tab, or JMX):
JDBCDataSourceRuntimeMBean.getWaitingForConnectionCurrentCount()
JDBCDataSourceRuntimeMBean.getWaitingForConnectionHighCount()

Sample problematic pattern:

# High wait counts indicate pool saturation
WaitingForConnectionCurrentCount: 142
WaitingForConnectionHighCount: 150
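
If you would rather poll these counters than watch the console, you can connect to each server's Runtime MBean Server over JMX. This is a sketch only: host, port, credentials, and data source name are placeholders, it assumes the WebLogic JMX client jars (wljmxclient.jar/wlclient.jar) are on the classpath, and it queries the JDBCDataSourceRuntime MBeans referenced above by pattern because the exact ObjectName keys vary:

import java.util.Hashtable;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import javax.naming.Context;

// Connects to a WebLogic Runtime MBean Server and reads JDBC pool wait counters.
public class PoolWaitCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL("t3", "yourhost", 7001,
                "/jndi/weblogic.management.mbeanservers.runtime");
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.SECURITY_PRINCIPAL, "weblogic");        // placeholder user
        env.put(Context.SECURITY_CREDENTIALS, "password");      // placeholder password
        env.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote");

        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Pattern query: safer than hard-coding the full canonical ObjectName
            ObjectName pattern = new ObjectName("com.bea:Type=JDBCDataSourceRuntime,Name=YOUR_POOL,*");
            for (ObjectName name : conn.queryNames(pattern, null)) {
                System.out.println(name
                        + " waiting=" + conn.getAttribute(name, "WaitingForConnectionCurrentCount")
                        + " active=" + conn.getAttribute(name, "ActiveConnectionsCurrentCount"));
            }
        } finally {
            connector.close();
        }
    }
}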

Solaris also enforces kernel- and project-level limits on processes, LWPs, and file descriptors that a box running thousands of JVM threads can bump into. Consider these tunables:

# /etc/system parameters
set max_nprocs=30000
set maxuprc=15000
set rlim_fd_max=8192
set rlim_fd_cur=4096

Also verify project limits for WebLogic users:

prctl -n project.max-lwps -i project <project_id>

For deeper analysis, use these Solaris tools:

# Show system call activity
dtrace -n 'syscall:::entry { @[execname] = count(); }'

# Monitor thread state distribution
prstat -mLc -p <java_pid> 5

Example thread state breakdown:

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
3836 ducm0101 0.1 0.0 0.0 0.0 0.0 2.1 0.0 97.8 15k 5.2k 0 0 java/1
3836 ducm0101 0.0 0.0 0.0 0.0 0.0 98.2 0.0 1.8 2.1k 1.3k 0 0 java/225
In this example java/225 spends 98.2% of its time in LCK (waiting on user-level locks), which points squarely at lock contention inside the JVM rather than CPU starvation. Recommended next steps:

  1. Capture 5 thread dumps at 10-second intervals (an in-process alternative is sketched after this list)
  2. Verify JDBC connection pool sizing
  3. Check for filesystem waits (NFS mounts?)
  4. Monitor Solaris kernel dispatcher activity
  5. Review WebLogic work manager configurations
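
For item 1, kill -3 (shown earlier) is the standard route. If it is more convenient to capture dumps from inside the application itself, a rough in-process equivalent using Thread.getAllStackTraces is sketched below; the file path and timing are arbitrary choices, not requirements:

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Map;

// Writes 5 stack-trace snapshots of the current JVM, 10 seconds apart.
public class InProcessThreadDumper {
    public static void main(String[] args) throws Exception {
        for (int i = 1; i <= 5; i++) {
            PrintWriter out = new PrintWriter(new FileWriter("/tmp/tdump-" + i + ".txt"));
            try {
                for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                    Thread t = e.getKey();
                    out.println("\"" + t.getName() + "\" state=" + t.getState());
                    for (StackTraceElement frame : e.getValue()) {
                        out.println("    at " + frame);
                    }
                    out.println();
                }
            } finally {
                out.close();
            }
            Thread.sleep(10000L);    // 10-second interval between snapshots
        }
    }
}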

When a Solaris system shows load averages spiking to 50-60 while CPU usage remains below 5%, you're looking at resource contention that isn't immediately visible through standard monitoring tools. The key indicators from your diagnostic outputs:

# prstat shows multiple Java processes with hundreds of threads
PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
3836 ducm0101 2119M 2074M cpu348  58    0   8:41:56 0.5% java/225
24196 ducm0101 1974M 1910M sleep   59    0   4:04:33 0.4% java/209

With 8 WebLogic domains running Java applications, thread synchronization becomes the prime suspect. The high load average indicates threads waiting on:

  • Database connection pools (despite being on a separate server)
  • JVM garbage collection pauses (even without visible CPU spikes)
  • Application-level locks or synchronized blocks
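
The last item is the one thread dumps expose most directly. As an illustration of the pattern to look for (not code from your applications), a single coarse synchronized block around a slow remote call will stack hundreds of executor threads in the BLOCKED state while the CPUs stay idle:

// Illustrative anti-pattern: one coarse lock around a slow remote operation.
// In thread dumps this shows up as many threads BLOCKED on the same monitor,
// all "waiting to lock" the object that a single RUNNABLE thread holds.
public class LookupCache {
    private final Object lock = new Object();

    public String lookup(String key) {
        synchronized (lock) {              // every caller serializes here
            return slowDatabaseCall(key);  // seconds per call under load
        }
    }

    private String slowDatabaseCall(String key) {
        try {
            Thread.sleep(2000L);           // stand-in for a slow JDBC round trip
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "value-for-" + key;
    }
}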

Standard Linux tools won't reveal the full picture. Try these Solaris-specific approaches:

# Sample user-level Java stacks (the profile provider shows where on-CPU time goes)
dtrace -n 'profile-97 /pid == $target/ { @[ustack()] = count(); }' -p <java_pid>

# Per-processor activity every 15s (watch smtx lock spins and icsw involuntary context switches)
mpstat -a 15

Add these JMX queries to your monitoring:

// WebLogic JDBC pool monitoring MBean (mbeanServer is an MBeanServerConnection
// to the server's Runtime MBean Server, as in the connection sketch earlier)
ObjectName poolName = new ObjectName("com.bea:ServerRuntime=AdminServer,Name=YOUR_POOL,Type=JDBCConnectionPoolRuntime");
Integer waiting = (Integer) mbeanServer.getAttribute(poolName, "WaitingForConnectionCurrentCount");
Integer active = (Integer) mbeanServer.getAttribute(poolName, "ActiveConnectionsCurrentCount");
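
Once those attributes are readable, a simple polling loop is usually enough to catch saturation as it happens. A sketch (interval and threshold are arbitrary choices):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Polls the pool every 15s and warns whenever requests are queuing for connections.
public class PoolWaitPoller {
    public static void poll(MBeanServerConnection mbeanServer, ObjectName poolName) throws Exception {
        while (true) {
            Integer waiting = (Integer) mbeanServer.getAttribute(poolName, "WaitingForConnectionCurrentCount");
            Integer active = (Integer) mbeanServer.getAttribute(poolName, "ActiveConnectionsCurrentCount");
            if (waiting != null && waiting.intValue() > 0) {
                System.out.println("WARN: " + waiting + " threads waiting for a connection (active=" + active + ")");
            }
            Thread.sleep(15000L);   // 15-second polling interval
        }
    }
}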

Your Java 1.6 setup might benefit from these JVM flags:

-XX:+UseConcMarkSweepGC 
-XX:+CMSParallelRemarkEnabled 
-XX:+UseCMSInitiatingOccupancyOnly 
-XX:CMSInitiatingOccupancyFraction=70 
-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps 
-XX:+PrintGCDateStamps 
-XX:+PrintTenuringDistribution

Despite netstat showing no errors, add these checks:

# Check for TCP retransmissions (counters such as tcpRetransSegs)
netstat -s -P tcp | grep -i retrans

# Top SQL by average elapsed time per execution (run against the Oracle instance)
SELECT sql_id, elapsed_time/executions/1000 "ms_per_exec" 
FROM v$sqlarea 
WHERE executions > 1000 
ORDER BY 2 DESC;
Longer term, consider these changes:

  1. Reduce the WebLogic domain count on this server from 8 to 4-5
  2. Implement request timeouts in your Java code (see the sketch after this list)
  3. Add connection pool validation queries
  4. Enable WebLogic work manager constraints
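
For items 2 and 3, the simplest place to start is JDBC-level timeouts on the hot call paths, so a stalled database response cannot pin a WebLogic execute thread indefinitely. A hedged sketch using plain JDBC (the JNDI name, table, and timeout value are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.naming.InitialContext;
import javax.sql.DataSource;

// Looks up a WebLogic data source and runs a query with an explicit timeout,
// so a slow database call fails fast instead of holding the thread.
public class TimedQuery {
    public String fetchStatus(String orderId) throws Exception {
        DataSource ds = (DataSource) new InitialContext().lookup("jdbc/YourDataSource"); // placeholder JNDI name
        Connection conn = ds.getConnection();
        try {
            PreparedStatement ps = conn.prepareStatement("SELECT status FROM orders WHERE id = ?");
            try {
                ps.setQueryTimeout(10);          // seconds; the driver cancels the statement after this
                ps.setString(1, orderId);
                ResultSet rs = ps.executeQuery();
                return rs.next() ? rs.getString(1) : null;
            } finally {
                ps.close();
            }
        } finally {
            conn.close();
        }
    }
}

Connection pool validation itself is configured on the data source rather than in code: enable Test Connections On Reserve with a lightweight test query such as SELECT 1 FROM DUAL.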