Diagnosing and Troubleshooting Nonpaged Pool Memory Leaks in Windows Server Applications


When dealing with nonpaged pool memory leaks on Windows servers, the challenge often lies in identifying which specific application or driver is responsible. In this case, we've observed HTTP 503 errors and connection refusals traced back to nonpaged pool exhaustion, with Poolmon.exe indicating significant allocations under the 'Even' tag.

Beyond basic Poolmon usage, here are more sophisticated approaches:

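# PowerShell: stream processes charged with more than 1 MB of nonpaged pool
# (this counter covers pool charged to processes; driver allocations show up by tag in Poolmon)
Get-Counter '\Process(*)\Pool Nonpaged Bytes' -Continuous |
    ForEach-Object { $_.CounterSamples } |
    Where-Object { $_.InstanceName -ne '_total' -and $_.CookedValue -gt 1MB } |
    Select-Object Timestamp, InstanceName, CookedValue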

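Poolmon's tag lookup describes Even only as [<unknown> Event objects], which does not name a driver. WinDbg can dig further (the standalone Debugging Tools package is enough; a full SDK installation is not needed):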

$$ Summarize pool usage by tag, sorted by nonpaged bytes
!poolused 2
$$ List individual nonpaged allocations that carry our tag
!poolfind Even

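Capture pool allocations, with call stacks, in an ETW kernel trace:

rem Start kernel pool tracing; -PoolTag limits events to the tag under investigation
xperf -on PROC_THREAD+LOADER+POOL -stackwalk PoolAlloc+PoolFree -PoolTag Even -BufferSize 1024 -MinBuffers 128 -MaxBuffers 256
rem Let the pool grow for a while, then stop the kernel logger and write the trace
xperf -d PoolTrace.etl
rem Dump the raw events as text (open the .etl in WPA for views grouped by stack)
xperf -i PoolTrace.etl -o PoolTrace.txt -a dumper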

Consider these specialized tools:

  • RAMMap from Sysinternals
  • WPA (Windows Performance Analyzer)
  • Driver Verifier with pool tracking enabled (special pool targets corruption rather than leaks)

For live servers where process termination isn't an option:

  1. Set up performance counters to log pool usage over time
  2. Use kernel debugging remotely
  3. Implement controlled service restarts during maintenance windows
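
A counter log collected in step 1 can be post-processed offline. The following is a minimal sketch, assuming the counter was logged to a CSV in the PDH-CSV layout written by typeperf and Performance Monitor (a timestamp column followed by one column per counter). The host name WEB01 and all values in the sample are made up for illustration, and read_counter_log is a hypothetical helper name.

```python
import csv
import io
from datetime import datetime

# Hypothetical excerpt of a counter log (made-up host name and values).
SAMPLE_LOG = r'''"(PDH-CSV 4.0)","\\WEB01\Memory\Pool Nonpaged Bytes"
"03/01/2024 00:00:00.000","100000000"
"03/01/2024 01:00:00.000","104000000"
"03/01/2024 02:00:00.000"," "
"03/01/2024 03:00:00.000","112000000"
'''


def read_counter_log(text):
    """Return (timestamp, value) pairs, skipping blank or unparsable samples."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # the header row holds the counter path, not data
    samples = []
    for row in rows:
        if len(row) < 2:
            continue
        try:
            stamp = datetime.strptime(row[0], "%m/%d/%Y %H:%M:%S.%f")
            value = float(row[1])
        except ValueError:
            continue  # a missed sample is assumed to be logged as a blank field
        samples.append((stamp, value))
    return samples


samples = read_counter_log(SAMPLE_LOG)
growth = samples[-1][1] - samples[0][1]
print(f"{len(samples)} samples, growth {growth:,.0f} bytes")
```

Skipping unparsable rows rather than failing keeps a long log usable even when the collector missed a few intervals.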

When analyzing the 'Even' tag specifically:

Metric      Value         Interpretation
Allocs      51,231,806    High allocation frequency
Diff        684,922       Allocations still outstanding; a leak if this keeps growing
Per Alloc   48 bytes      Small, frequent allocations

When your production server starts throwing 503 errors due to nonpaged pool exhaustion, you know you're in for some serious debugging. The process isn't straightforward, but through systematic investigation we can identify the culprit.

In our case, the smoking gun was found in httperr.log showing numerous Connections_Refused errors. Poolmon.exe revealed the problematic memory allocation tag:

Memory Tag Analysis:
Tag   Type   Allocs       Frees        Diff      Bytes        Per Alloc
Even  Nonp   51,231,806   50,633,533   684,922   32,878,688   48
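
The relationship between the columns can be checked with a few lines of Python. This is only a sketch: it assumes a whitespace-separated row with the tag and type first and no spaces inside the tag, and PoolRow and parse_row are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class PoolRow:
    tag: str
    pool_type: str
    allocs: int
    frees: int
    diff: int
    bytes_used: int


def _to_int(field):
    """Convert a Poolmon-style number such as '51,231,806' to an int."""
    return int(field.replace(",", ""))


def parse_row(line):
    """Parse one row: Tag Type Allocs Frees Diff Bytes [Per Alloc]."""
    tag, pool_type, allocs, frees, diff, bytes_used, *_ = line.split()
    return PoolRow(tag, pool_type, _to_int(allocs), _to_int(frees),
                   _to_int(diff), _to_int(bytes_used))


row = parse_row("Even  Nonp  51,231,806   50,633,533   684,922   32,878,688      48")

# Bytes divided by outstanding allocations reproduces the Per Alloc column.
per_alloc = row.bytes_used / row.diff
print(f"{row.tag}: {row.diff:,} outstanding allocations, "
      f"about {per_alloc:.0f} bytes each, "
      f"{row.bytes_used / 2**20:.1f} MiB held")
```

Running the same parse on rows captured at different times makes it easy to see whether Diff and Bytes keep climbing, which is what distinguishes a leak from a busy but stable tag.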

Beyond Poolmon, these tools provide crucial insights:

# PowerShell command to track nonpaged pool usage
Get-Counter '\Memory\Pool Nonpaged Bytes' -Continuous

# Process Explorer alternative approach:
# 1. Download from Sysinternals
# 2. Add "Nonpaged Pool" column
# 3. Sort processes by that column

When basic tools aren't enough, these methods can help:

$$ Kernel debugger approach (Debugging Tools for Windows, available through the Windows SDK)
!poolused 2
!poolfind Even

rem Driver Verifier with pool tracking (flag 0x8); takes effect after a reboot
rem suspect.sys is a placeholder for the driver under investigation
verifier /flags 0x8 /driver suspect.sys

Based on our experience resolving this issue:

  1. Monitor Pool Nonpaged Bytes over time (minimum 48 hours)
  2. Check for driver updates, especially network-related
  3. Review recent system changes/updates
  4. Consider third-party driver memory analyzers
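
The monitoring in step 1 is most useful when it is turned into a growth rate. The sketch below fits a least-squares line through (hours, bytes) samples and estimates how long until a chosen threshold is reached. The readings and the 256,000,000-byte threshold are made-up illustration values, not measurements from this incident, and leak_rate and hours_until are hypothetical helper names.

```python
def leak_rate(samples):
    """Least-squares slope of (hours, bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    num = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


def hours_until(limit_bytes, samples):
    """Hours from the last sample until limit_bytes, or None if not growing."""
    rate = leak_rate(samples)
    if rate <= 0:
        return None
    latest = samples[-1][1]
    return max(0.0, (limit_bytes - latest) / rate)


# Made-up readings: (hours since start, Pool Nonpaged Bytes)
readings = [
    (0, 100_000_000),
    (12, 106_000_000),
    (24, 112_000_000),
    (36, 118_000_000),
    (48, 124_000_000),
]

rate = leak_rate(readings)
remaining = hours_until(256_000_000, readings)
print(f"growth: {rate:,.0f} bytes/hour, threshold in about {remaining:.0f} hours")
```

A fitted slope is less sensitive to a single noisy reading than comparing only the first and last samples, which is one reason to collect data for the full 48 hours.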

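As a stopgap while the leak is being tracked down, the nonpaged pool limit can be raised. This only postpones exhaustion; it does not stop the leak:

# Registry tweak to increase nonpaged pool size (the default of 0 lets Windows choose the size; a reboot is required)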
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" -Name "NonPagedPoolSize" -Value 0xFFFFFFFF

Remember that rebooting (as we initially did) only provides temporary relief. The real solution requires identifying and fixing the underlying leak source.