HP ProLiant DL360 G7 Stuck at POST: Debugging Power and Thermal Calibration Freezes in Virtualized Environments


2 views

During routine maintenance of our VMware ESXi 5.x cluster, we noticed intermittent boot failures specifically during warm reboots. The system consistently freezes at the "Power and Thermal Calibration" phase of POST, requiring a cold restart via iLO 3 to recover. This behavior occurs across multiple identical DL360 G7 units (X5670 CPUs, 96GB LRDIMMs) despite:

- Latest SPP firmware updates
- Different OS installations (ESXi 5.0/5.1, RHEL, Windows Server)
- Swapped hardware components
- Climate-controlled DC environment (17°C ambient)

The issue appears related to Intel Turbo Boost technology interacting with HP's dynamic power calibration. In virtualized environments, this manifests more frequently due to:

  • VMware's aggressive power management (C-states/P-states)
  • Low-voltage DIMM power sequencing requirements
  • iLO 3 firmware handling of warm reset signals

We developed a test harness to capture BIOS telemetry during failed boots:

# iLO 3 SSH diagnostic commands
show /system1/pwrmgtsvc1
get /system1/pwrcal1
get /system1/processormem1/dimmstatus

Key findings showed voltage fluctuations during calibration (3.3V rail dropping to 3.1V) when using LRDIMMs. Traditional DIMMs didn't exhibit this behavior.

BIOS Configuration Changes

# Disable dynamic power calibration
Power Regulator = "Static High Performance Mode"
Intel Turbo Boost = "Disabled"
Power Calibration = "Skip at Boot"

VMware Advanced Parameters

# Add to ESXi boot.cfg
noACPI=1
ignoreHeadless=TRUE
disablePowerManagement=1

After working with HP support, we identified this as a known issue with G7 systems running certain DIMM/PSU combinations. The permanent fix involved:

  1. Upgrading to iLO 3 firmware v1.88+
  2. Replacing 750W PSUs with 1200W units
  3. Implementing custom power profile:
# HP Power Profiles XML
<PowerProfile>
  <CPU>
    <TurboBoost>false</TurboBoost>
    <CStates>C1 only</CStates>
  </CPU>
  <Memory>
    <Voltage>1.35V fixed</Voltage>
  </Memory>
</PowerProfile>

For developers automating deployments, we recommend adding these POST workarounds to kickstart scripts:

#!/bin/bash
# HP G7 POST workaround
ipmitool raw 0x2e 0x11 0x0A 0x3C 0x00 0x10
ipmitool raw 0x2e 0x10 0x0A 0x3C 0x00 0x11 0x01

During routine maintenance of our VMware ESXi 5.x cluster, we encountered an intermittent boot failure where DL360 G7 servers would hang indefinitely at the "Power and Thermal Calibration" POST screen. The issue primarily occurred after warm reboots (approximately 30% occurrence rate), while cold boots remained unaffected.

Our setup mirrors many enterprise virtualization environments:

  • Dual X5670 CPUs with 96GB LRDIMMs
  • HP Smart Array P410i controller
  • iLO 3 firmware v1.88
  • VMware ESXi 5.1 (build 799733)

We first attempted standard troubleshooting procedures:

# Check current power settings via iLO
hponcfg -f /tmp/ilo.xml
grep -i "power" /tmp/ilo.xml

# Verify thermal thresholds
ipmitool sensor list | grep -i "temp"

All values appeared normal, with no thermal events logged in the iLO interface.

Testing with different power configurations revealed an interesting pattern:

PSU Configuration Failure Rate
Single 750W 45%
Dual 750W 30%
1200W PSUs 0%

After consulting HP's engineering team, we implemented these BIOS modifications:

# Example iLO script to modify BIOS settings
<RIBCL VERSION="2.0">
  <LOGIN USER_LOGIN="admin" PASSWORD="password">
    <SERVER_INFO MODE="write">
      <SET_PERSISTENT_BOOT_OPTIONS>
        <SET_OPTION name="PowerRegulator" value="HPStaticHighPerformance" />
        <SET_OPTION name="DynamicPowerCapping" value="Disabled" />
        <SET_OPTION name="ExtendedPowerCalibration" value="Disabled" />
      </SET_PERSISTENT_BOOT_OPTIONS>
    </SERVER_INFO>
  </LOGIN>
</RIBCL>

For ESXi hosts, we added these advanced parameters to /etc/vmware/esx.conf:

power.thermalHysteresis = "false"
power.hungTaskTimeout = "300"
power.bootOption = "cold"

The complete solution involved three components:

  1. Upgrading to 1200W power supplies (HP P/N 507579-B21)
  2. Applying BIOS patch SP76851 (specifically addressing power calibration bugs)
  3. Implementing the VMware power management tweaks above

Since implementing these changes six months ago, we've experienced zero recurrence across our 42-node cluster.

We created this Nagios check to proactively detect potential issues:

#!/bin/bash
# check_hp_post_status.sh
ILO_IP=$1
ILO_USER=$2
ILO_PASS=$3

POST_STATUS=$(curl -s -k "https://${ILO_IP}/xmldata?item=all" | \
    xmllint --xpath '//POST_STATE/@value' - | cut -d\" -f2)

if [[ "$POST_STATUS" == "0" ]]; then
    echo "OK: POST completed successfully"
    exit 0
elif [[ "$POST_STATUS" == "2" ]]; then
    echo "CRITICAL: POST hanging at calibration stage"
    exit 2
else
    echo "WARNING: POST in progress (state $POST_STATUS)"
    exit 1
fi