Diagnosing Asymmetric CPU Temperatures in Dual-Opteron Virtualization Servers: Thermal Analysis & Linux lm-sensors Fixes


When monitoring my dual-Opteron KGPE-D16 server running KVM/libvirt, I noticed a disturbing thermal pattern:

# sensors output
k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +2.0°C (high = +70.0°C)

k10temp-pci-00cb
Adapter: PCI adapter
Tdie: +13.0°C (high = +70.0°C)

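Under load the gap widens (69°C vs 15°C), which points to either a hardware problem or a sensor reporting issue. An idle reading of +2.0°C is below any plausible ambient temperature, so at least part of the gap is probably a reporting artifact: on these Opteron generations k10temp exposes the control temperature (Tctl), which sits on a relative scale rather than being an absolute die temperature.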

First, verify whether the readings are accurate or whether the lm-sensors configuration needs adjusting. Cross-check them as follows:

# Check raw thermal data (values are in millidegrees Celsius, so 13000 means 13.0°C)
cat /sys/class/hwmon/hwmon*/temp*_input

# Alternative monitoring tools
sudo apt install psensor
# --debug takes a verbosity level
psensor --debug=3
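The raw sysfs check above can also be scripted so that each value is labelled with its driver name and converted to degrees. The following is a minimal sketch, assuming the standard hwmon layout (a name file plus temp*_input files per hwmonN directory); the function name read_hwmon_temps is illustrative and not part of lm-sensors.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {("hwmonN:driver", sensor_file): degrees_celsius} for every temp*_input."""
    readings = {}
    for hwmon in sorted(Path(root).glob("hwmon*")):
        name_file = hwmon / "name"
        driver = name_file.read_text().strip() if name_file.exists() else "unknown"
        for temp_file in sorted(hwmon.glob("temp*_input")):
            try:
                millidegrees = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that fail to read or return non-numeric data
            # Both sockets use the same driver name, so keep the hwmon index in the key.
            readings[(f"{hwmon.name}:{driver}", temp_file.name)] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for (chip, sensor), celsius in read_hwmon_temps().items():
        print(f"{chip:24s} {sensor:14s} {celsius:6.1f} °C")
```

If the values printed here match what sensors shows, the odd numbers come from the driver or hardware rather than from the lm-sensors configuration.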

The ASUS KGPE-D16 motherboard has known thermal reporting quirks. Key checks:

  • Verify both Noctua NH-U9DO coolers are properly seated
  • Check thermal paste application (recommend Arctic MX-4)
  • Inspect airflow path for obstructions

Use virsh and numastat to check CPU affinity and NUMA placement:

# List VM CPU pinning
virsh vcpuinfo [VM_NAME] | grep -i affinity

# Check per-node memory distribution of QEMU processes
# (the process may be named qemu-system-x86_64 on some distributions)
numastat -c qemu-kvm
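To see at a glance which host CPUs each vCPU may run on, the virsh vcpuinfo output can be parsed instead of grepped. This is a rough sketch; the SAMPLE text is a hypothetical illustration of the usual key/value layout, not output captured from this server.

```python
def parse_vcpuinfo(text):
    """Parse `virsh vcpuinfo` output into a list of dicts, one per vCPU."""
    vcpus, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line separates one vCPU record from the next.
            if current:
                vcpus.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        vcpus.append(current)
    return vcpus


def allowed_cpus(affinity):
    """Turn an affinity mask such as 'yy--' into the host CPU indices it allows."""
    return [index for index, flag in enumerate(affinity) if flag == "y"]


# Hypothetical sample for illustration only.
SAMPLE = """\
VCPU:           0
CPU:            2
State:          running
CPU time:       12.3s
CPU Affinity:   yyyy----

VCPU:           1
CPU:            5
State:          running
CPU time:       9.8s
CPU Affinity:   ----yyyy
"""

if __name__ == "__main__":
    for vcpu in parse_vcpuinfo(SAMPLE):
        print(vcpu["VCPU"], "runs on", vcpu["CPU"],
              "allowed:", allowed_cpus(vcpu["CPU Affinity"]))
```

A mask that is all y means the vCPU is not pinned and the scheduler may place it on either socket.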

The KGPE-D16 BIOS (version 0702+) contains thermal management settings:

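# Show BIOS vendor, version, and release date
# (dmidecode does not list setup options; those must be checked in the BIOS setup screen)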
sudo dmidecode -t bios

# Recommended adjustments:
# - Disable "CPU Spread Spectrum"
# - Set "CPU Power Duty Control" to T.Probe
# - Enable "CPU Fan Full Speed Mode"

For continuous monitoring, deploy this Python script:

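#!/usr/bin/env python3
"""Warn when the two CPU temperatures reported by `sensors` diverge."""
import re
import subprocess
from time import sleep

ALERT_DELTA_C = 10.0  # alert threshold in °C
# Matches the current value on a "Tdie:" line, including negative readings.
TDIE_RE = re.compile(r"^Tdie:\s*([+-]?\d+(?:\.\d+)?)", re.MULTILINE)

def get_cpu_temps():
    """Return every Tdie reading from `sensors`, in output order."""
    output = subprocess.check_output(["sensors"], text=True)
    return [float(value) for value in TDIE_RE.findall(output)]

while True:
    temps = get_cpu_temps()
    if len(temps) == 2:
        t1, t2 = temps
        if abs(t1 - t2) > ALERT_DELTA_C:
            print(f"WARNING: Thermal imbalance detected! CPU1: {t1}°C, CPU2: {t2}°C")
    else:
        # Avoid crashing on unpacking when a sensor disappears or is relabelled.
        print(f"Expected 2 Tdie readings, got {len(temps)}")
    sleep(10)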


Under load, lm-sensors on the same AMD Opteron-based virtualization server shows the asymmetry clearly:

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +69.0°C  (high = +70.0°C, crit = +95.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Core 0:       +15.0°C  (high = +70.0°C, crit = +95.0°C)

Again, confirm that these readings aren't sensor artifacts. The KGPE-D16's known thermal reporting quirks make a second tool worth the effort:

# Cross-validate with an alternative tool (psensor is a GUI; launch it and compare readings)
sudo apt install psensor
psensor

After verification, the temperature delta persists across monitoring tools, suggesting a genuine hardware-level issue rather than sensor misreporting.

Use mpstat to check per-CPU utilization patterns:

mpstat -P ALL 1 5
# Sample output:
05:43:27 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
05:43:28 PM  all   38.12    0.00   12.50    0.00    0.00    1.25    0.00   48.12    0.00    0.00
05:43:28 PM    0   76.00    0.00   24.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:43:28 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00

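In this sample, logical CPU 0 is saturated with host user and system work while CPU 1 spends all its time in guest mode, so neither is idle. Note that mpstat lists logical CPUs, not sockets: load explains the thermal difference only if the busy CPUs all belong to the hot package, which lscpu -e (SOCKET column) will show.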
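Because the thermal question is per socket while mpstat reports per logical CPU, it can help to aggregate utilization by physical package. The sketch below assumes the usual Linux sysfs topology files; the busy percentages in the example are hypothetical values (100 minus %idle), not measurements from this system.

```python
from collections import defaultdict
from pathlib import Path


def cpu_to_package(root="/sys/devices/system/cpu"):
    """Map logical CPU index -> physical package (socket) id from sysfs topology."""
    mapping = {}
    for topo in Path(root).glob("cpu[0-9]*/topology/physical_package_id"):
        cpu_index = int(topo.parent.parent.name[3:])  # "cpu12" -> 12
        mapping[cpu_index] = int(topo.read_text().strip())
    return mapping


def busy_per_package(busy_pct, packages):
    """Average the per-CPU busy percentage within each socket."""
    grouped = defaultdict(list)
    for cpu, pct in busy_pct.items():
        grouped[packages[cpu]].append(pct)
    return {pkg: sum(values) / len(values) for pkg, values in grouped.items()}


if __name__ == "__main__":
    # Hypothetical four-CPU example: CPUs 0-1 on socket 0, CPUs 2-3 on socket 1.
    example_busy = {0: 100.0, 1: 100.0, 2: 0.0, 3: 10.0}
    example_packages = {0: 0, 1: 0, 2: 1, 3: 1}
    print(busy_per_package(example_busy, example_packages))
```

On the real host, pass cpu_to_package() as the second argument and feed in busy values derived from the mpstat %idle column.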

On dual-socket Opteron systems, proper NUMA configuration is crucial. Check current settings:

numactl --hardware
# And for VMs:
virsh vcpuinfo [vm_name] | grep -i affinity
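The node-to-CPU mapping that numactl --hardware prints is what any pinning decision depends on, so it is worth extracting in a reusable form. A minimal sketch follows; the SAMPLE text is a hypothetical two-node layout, not the output of this server.

```python
import re

NODE_CPUS_RE = re.compile(r"^node (\d+) cpus:([ \d]*)$", re.MULTILINE)


def parse_node_cpus(text):
    """Return {node_id: [cpu, ...]} from `numactl --hardware` output."""
    return {
        int(node): [int(cpu) for cpu in cpus.split()]
        for node, cpus in NODE_CPUS_RE.findall(text)
    }


# Hypothetical sample for illustration only.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 16384 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
"""

if __name__ == "__main__":
    for node, cpus in parse_node_cpus(SAMPLE).items():
        print(f"node {node}: CPUs {cpus}")
```

Comparing this mapping against the affinity masks from virsh vcpuinfo shows whether a VM's vCPUs are concentrated on one node.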

For Noctua NH-U9DO coolers, ensure proper mounting:

  1. Power down and physically inspect both heatsink installations
  2. Check thermal paste application (recommend Arctic MX-4 for Opterons)
  3. Verify fan rotation speed matches between both units:
sensors | grep -i fan
# Should show similar RPM values for both CPU fans
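The fan comparison in step 3 can be automated with a small parser. This is only a sketch: fan labels differ between boards and sensor chips, so the labels fan1 and fan2 and the 15% tolerance below are assumptions to adapt, and the SAMPLE text is hypothetical.

```python
import re

FAN_RE = re.compile(
    r"^\s*(?P<label>[^:\n]*fan[^:\n]*):\s+(?P<rpm>\d+)\s+RPM",
    re.IGNORECASE | re.MULTILINE,
)


def fan_speeds(sensors_output):
    """Return {label: rpm} for every fan line in `sensors` output."""
    return {m.group("label").strip(): int(m.group("rpm"))
            for m in FAN_RE.finditer(sensors_output)}


def rpm_mismatch(speeds, first, second, tolerance=0.15):
    """True when the two fans differ by more than `tolerance` of the faster one."""
    a, b = speeds[first], speeds[second]
    fastest = max(a, b)
    if fastest == 0:
        return False  # both stopped: not a mismatch, though worth checking separately
    return abs(a - b) / fastest > tolerance


# Hypothetical sample for illustration only.
SAMPLE = """\
fan1:        1350 RPM  (min =  600 RPM)
fan2:         720 RPM  (min =  600 RPM)
"""

if __name__ == "__main__":
    speeds = fan_speeds(SAMPLE)
    print(speeds, "mismatch:", rpm_mismatch(speeds, "fan1", "fan2"))
```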

Access the KGPE-D16 BIOS and verify:

  • Power distribution settings (try disabling C-states if enabled)
  • VRM phase control configuration
  • Individual CPU fan control curves

Edit the domain XML (virsh edit [vm_name]) so that the vCPU pinning and the memory nodeset refer to the same NUMA node:

<vcpu placement='static'>4</vcpu>
<cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <!-- Additional vCPUs -->
</cputune>
<numatune>
    <memory mode='strict' nodeset='0'/>
</numatune>
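A pinning fragment like the one above is easy to get subtly wrong, for example by pinning a vCPU to a CPU on one node while memory is bound to another. The following sketch checks each vcpupin against a node's CPU list; the helper names and the node CPU lists are hypothetical, and the cpuset parser handles only plain numbers and ranges, not libvirt's ^ exclusions.

```python
import xml.etree.ElementTree as ET


def parse_cpuset(spec):
    """Expand a cpuset such as '0-3,8' into a set of CPU indices."""
    cpus = set()
    for part in spec.split(","):
        low, _, high = part.strip().partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus


def pins_outside_node(domain_xml, node_cpus):
    """Return {vcpu: cpuset} for every vcpupin that strays outside node_cpus."""
    root = ET.fromstring(domain_xml)
    stray = {}
    for pin in root.iter("vcpupin"):
        if not parse_cpuset(pin.get("cpuset")) <= set(node_cpus):
            stray[int(pin.get("vcpu"))] = pin.get("cpuset")
    return stray


# The fragment from the text, wrapped in a <domain> root so it parses on its own.
DOMAIN_XML = """\
<domain>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
</domain>
"""

if __name__ == "__main__":
    # Assumes node 0 owns CPUs 0-3; take the real list from numactl --hardware.
    print(pins_outside_node(DOMAIN_XML, [0, 1, 2, 3]))
```

An empty result means every pinned vCPU stays on the node that also holds the VM's memory.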

After implementing these changes, monitor temperatures over 24 hours to verify improvement.