When monitoring my dual-Opteron KGPE-D16 server running KVM/libvirt, I noticed a disturbing thermal pattern:
# sensors output
k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +2.0°C (high = +70.0°C)
k10temp-pci-00cb
Adapter: PCI adapter
Tdie: +13.0°C (high = +70.0°C)
Under load, this difference becomes more pronounced (69°C vs 15°C), suggesting either a hardware issue or sensor misconfiguration.
First, we need to verify if the readings are accurate or if lm-sensors needs recalibration. Here's how to cross-check:
# Check raw thermal data (values are in millidegrees Celsius, so 69000 means 69.0°C)
cat /sys/class/hwmon/hwmon*/temp*_input
# Alternative monitoring tools
sudo apt install psensor
psensor --debug=2
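The raw hwmon values printed by the cat command above are easier to compare against the sensors output once they are converted to degrees and labelled by chip. Here is a minimal sketch, assuming the usual hwmon sysfs layout where each hwmonN directory has a name file next to its tempN_input files. The function names are my own, not part of lm-sensors.

```python
#!/usr/bin/env python3
"""Read raw hwmon temperatures and convert them to degrees Celsius."""
from pathlib import Path


def millidegrees_to_celsius(raw):
    """hwmon temp*_input files hold an integer in millidegrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return a list of (chip name, input file name, temperature in °C)."""
    readings = []
    for temp_file in sorted(Path(root).glob("hwmon*/temp*_input")):
        name_file = temp_file.parent / "name"
        try:
            # Fall back to the directory name if the chip has no name file.
            chip = name_file.read_text().strip() if name_file.exists() else temp_file.parent.name
            celsius = millidegrees_to_celsius(temp_file.read_text())
        except (OSError, ValueError):
            # Some inputs return an I/O error or an empty value; skip them.
            continue
        readings.append((chip, temp_file.name, celsius))
    return readings


if __name__ == "__main__":
    for chip, sensor, celsius in read_hwmon_temps():
        print(f"{chip:12s} {sensor:14s} {celsius:6.1f} °C")
```

If the numbers printed here match what sensors reports, lm-sensors is passing the driver's values through unchanged and any error lies in the sensor or the driver rather than in the lm-sensors configuration.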
The ASUS KGPE-D16 motherboard has known thermal reporting quirks. Key checks:
- Verify both Noctua NH-U9DO coolers are properly seated
- Check thermal paste application (recommend Arctic MX-4)
- Inspect airflow path for obstructions
Use virsh and numastat to check CPU affinity and per-node memory placement:
# List VM CPU pinning
virsh vcpuinfo [VM_NAME] | grep -i affinity
# Check NUMA node distribution
numastat -c qemu-kvm
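The affinity line from virsh vcpuinfo is a mask with one character per host CPU ('y' means allowed, '-' means not allowed), which is hard to read at a glance on a machine with many cores. The sketch below turns it into explicit host CPU lists. The sample text is illustrative only and the function name is my own; in practice you would feed it the real command output.

```python
#!/usr/bin/env python3
"""Turn `virsh vcpuinfo` affinity masks into lists of host CPU numbers."""

# Illustrative sample, not captured from the server discussed here.
SAMPLE_VCPUINFO = """\
VCPU:           0
CPU:            3
State:          running
CPU time:       12.3s
CPU Affinity:   yyyy----

VCPU:           1
CPU:            6
State:          running
CPU time:       9.8s
CPU Affinity:   ----yyyy
"""


def parse_vcpu_affinity(vcpuinfo_text):
    """Map each vCPU number to the host CPUs it is allowed to run on."""
    affinity = {}
    current_vcpu = None
    for line in vcpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "VCPU":
            current_vcpu = int(value)
        elif key == "CPU Affinity" and current_vcpu is not None:
            # Position i in the mask corresponds to host CPU i.
            affinity[current_vcpu] = [i for i, flag in enumerate(value) if flag == "y"]
    return affinity


if __name__ == "__main__":
    for vcpu, host_cpus in parse_vcpu_affinity(SAMPLE_VCPUINFO).items():
        print(f"vCPU {vcpu} may run on host CPUs {host_cpus}")
```

If every vCPU of every VM ends up allowed only on cores belonging to one socket, that alone could account for one package running much hotter than the other.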
The KGPE-D16 BIOS (version 0702+) contains thermal management settings:
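# Show the installed BIOS vendor, version, and release date
# (dmidecode does not read the setup options themselves)
sudo dmidecode -t bios
# Recommended adjustments, made in the BIOS setup utility: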
# - Disable "CPU Spread Spectrum"
# - Set "CPU Power Duty Control" to T.Probe
# - Enable "CPU Fan Full Speed Mode"
For continuous monitoring, deploy this Python script:
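#!/usr/bin/env python3
"""Warn when the two CPU temperatures reported by lm-sensors diverge."""
import re
import subprocess
from time import sleep

ALERT_DELTA_C = 10  # Alert threshold in °C

def get_cpu_temps():
    """Return every Tdie reading from `sensors`, in the order reported."""
    output = subprocess.check_output(['sensors']).decode()
    # The first number after "Tdie:" is the current reading; later ones are limits.
    return [float(m.group(1))
            for m in re.finditer(r'^Tdie:\s*([+-]?\d+(?:\.\d+)?)', output, re.MULTILINE)]

while True:
    temps = get_cpu_temps()
    if len(temps) == 2:
        t1, t2 = temps
        if abs(t1 - t2) > ALERT_DELTA_C:
            print(f"WARNING: Thermal imbalance detected! CPU1: {t1}°C, CPU2: {t2}°C")
    else:
        print(f"Expected 2 Tdie readings, got {len(temps)}")
    sleep(10)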
When monitoring my AMD Opteron-based virtualization server using lm-sensors, I noticed a significant thermal asymmetry:
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +69.0°C (high = +70.0°C, crit = +95.0°C)
coretemp-isa-0001
Adapter: ISA adapter
Core 0: +15.0°C (high = +70.0°C, crit = +95.0°C)
First, we need to confirm these readings aren't sensor artifacts. The ASUS KGPE-D16 motherboard has known quirks with thermal reporting:
# Cross-validate with alternative tools
sudo apt install psensor
psensor  # launches the GUI; compare its readings with the sensors output
After verification, the temperature delta persists across monitoring tools, suggesting a genuine hardware-level issue rather than sensor misreporting.
Use mpstat to check per-CPU utilization:
mpstat -P ALL 1 5
# Sample output:
05:43:27 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
05:43:28 PM  all   38.12    0.00   12.50    0.00    0.00    1.25    0.00   48.12    0.00    0.00
05:43:28 PM    0   76.00    0.00   24.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:43:28 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00
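The output shows the load is uneven: CPU 0 is saturated with host-side work (%usr plus %sys), while CPU 1 runs nothing but guest code (%guest), so neither core is actually idle. Because mpstat lists logical CPUs rather than sockets, map each CPU number to its socket (for example with lscpu) before attributing the thermal difference to load distribution.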
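To compare cores without eyeballing the columns, the mpstat rows can be parsed into numbers. This is a rough sketch that uses the sample rows from above as input; the function name is my own. Note that %guest (time spent running virtual CPUs) is reported separately from %idle, so a core serving a VM shows up under %guest.

```python
#!/usr/bin/env python3
"""Parse `mpstat -P ALL` rows into per-CPU dictionaries of percentages."""

# Sample rows copied from the mpstat output shown above.
SAMPLE_MPSTAT = """\
05:43:27 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
05:43:28 PM all 38.12 0.00 12.50 0.00 0.00 1.25 0.00 48.12 0.00 0.00
05:43:28 PM 0 76.00 0.00 24.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:28 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00
"""


def parse_mpstat(text):
    """Return {cpu label: {column name: value}} for every data row."""
    header = None
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if "CPU" in fields and "%idle" in fields:
            # Header row: the column names follow the "CPU" field.
            header = fields[fields.index("CPU") + 1:]
            continue
        if header is None or len(fields) < len(header) + 1:
            continue
        # Count from the right so the timestamp format does not matter.
        label = fields[-len(header) - 1]
        try:
            rows[label] = dict(zip(header, map(float, fields[-len(header):])))
        except ValueError:
            continue  # Not a data row.
    return rows


if __name__ == "__main__":
    for cpu, cols in parse_mpstat(SAMPLE_MPSTAT).items():
        host = cols["%usr"] + cols["%sys"]
        print(f"CPU {cpu:>3}: host {host:6.2f}%  guest {cols['%guest']:6.2f}%  idle {cols['%idle']:6.2f}%")
```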
On dual-socket Opteron systems, proper NUMA configuration is crucial. Check current settings:
numactl --hardware
# And for VMs:
virsh vcpuinfo [vm_name] | grep -i affinity
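Once you know which host CPUs belong to which NUMA node, it is straightforward to check whether a VM's allowed CPUs stay within one node or straddle several. The sketch below parses the "node N cpus:" lines from numactl --hardware. The sample text is illustrative only (the real node count and CPU numbering on this board will differ), and both function names are my own.

```python
#!/usr/bin/env python3
"""Map NUMA nodes to host CPUs using `numactl --hardware` output."""
import re

# Illustrative sample; real output depends on the CPUs installed.
SAMPLE_NUMACTL = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 16384 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
"""


def parse_node_cpus(text):
    """Return {node number: [host CPU numbers]}."""
    nodes = {}
    for match in re.finditer(r"^node (\d+) cpus:(.*)$", text, re.MULTILINE):
        nodes[int(match.group(1))] = [int(cpu) for cpu in match.group(2).split()]
    return nodes


def nodes_spanned(cpus, node_map):
    """Return the sorted list of nodes that contain any of the given CPUs."""
    wanted = set(cpus)
    return sorted(node for node, members in node_map.items() if wanted & set(members))


if __name__ == "__main__":
    node_map = parse_node_cpus(SAMPLE_NUMACTL)
    print("Node map:", node_map)
    print("CPUs 2, 3, 4 span nodes:", nodes_spanned([2, 3, 4], node_map))
```

A VM whose vCPUs span more than one node pays for remote memory access, while VMs that all land on the same node concentrate the load, and the heat, on one socket.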
For Noctua NH-U9DO coolers, ensure proper mounting:
- Power down and physically inspect both heatsink installations
- Check thermal paste application (recommend Arctic MX-4 for Opterons)
- Verify fan rotation speed matches between both units:
sensors | grep -i fan
# Should show similar RPM values for both CPU fans
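"Similar RPM" is easier to judge with a number attached. The following sketch pulls the fan readings out of sensors output and flags a large gap. The sample lines, the fan labels, and the 15% tolerance are all assumptions of mine; which fan label corresponds to which CPU header depends on the board's sensor chip and configuration.

```python
#!/usr/bin/env python3
"""Compare fan speeds reported by `sensors`."""
import re

# Hypothetical sample; labels and values vary with the sensor chip and config.
SAMPLE_SENSORS = """\
fan1:        1350 RPM  (min =  600 RPM)
fan2:        1320 RPM  (min =  600 RPM)
"""


def parse_fan_rpms(text):
    """Return {fan label: current RPM} for every line with an RPM reading."""
    return {m.group(1).strip(): int(m.group(2))
            for m in re.finditer(r"^([^:\n]+):\s+(\d+)\s+RPM", text, re.MULTILINE)}


def fans_mismatched(rpms, tolerance=0.15):
    """True if the slowest fan is more than `tolerance` below the fastest."""
    values = list(rpms.values())
    if len(values) < 2 or max(values) == 0:
        return False
    return (max(values) - min(values)) / max(values) > tolerance


if __name__ == "__main__":
    rpms = parse_fan_rpms(SAMPLE_SENSORS)
    print(rpms, "mismatch" if fans_mismatched(rpms) else "within tolerance")
```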
Access the KGPE-D16 BIOS and verify:
- Power distribution settings (try disabling C-states if enabled)
- VRM phase control configuration
- Individual CPU fan control curves
Modify libvirt XML to enforce proper CPU pinning:
<vcpu placement='static'>4</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <!-- Additional vCPUs -->
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>
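Writing one vcpupin line per vCPU by hand gets tedious across several VMs, so the fragment can be generated instead. This is a small sketch using the standard library's ElementTree; the function name and the example CPU list are my own, and the result still has to be pasted into the domain definition with virsh edit.

```python
#!/usr/bin/env python3
"""Generate the vcpu, cputune, and numatune elements for a libvirt domain."""
import xml.etree.ElementTree as ET


def build_pinning_xml(host_cpus, numa_node):
    """Pin vCPU i to host_cpus[i] and bind guest memory to one NUMA node."""
    vcpu = ET.Element("vcpu", placement="static")
    vcpu.text = str(len(host_cpus))

    cputune = ET.Element("cputune")
    for vcpu_index, host_cpu in enumerate(host_cpus):
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu_index), cpuset=str(host_cpu))

    numatune = ET.Element("numatune")
    ET.SubElement(numatune, "memory", mode="strict", nodeset=str(numa_node))

    return "\n".join(ET.tostring(elem, encoding="unicode")
                     for elem in (vcpu, cputune, numatune))


if __name__ == "__main__":
    # Example: four vCPUs pinned to host CPUs 0-3, memory taken from node 0.
    print(build_pinning_xml([0, 1, 2, 3], numa_node=0))
```

To spread heat across both sockets, give different VMs host CPU lists from different nodes, and keep each VM's nodeset matched to the node its pinned CPUs belong to.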
After implementing these changes, monitor temperatures over 24 hours to verify improvement.