Debugging “NMI received for unknown reason 31” Kernel Panic on Debian 7 Wheezy

When working with Debian 7 Wheezy (3.2.0-4-amd64 kernel), you might encounter this critical system message:

kernel:[81927.464687] Uhhuh. NMI received for unknown reason 31 on CPU 3.
kernel:[81927.464743] Do you have a strange power saving mode enabled?
kernel:[81927.464791] Dazed and confused, but trying to continue

Non-Maskable Interrupts (NMIs) are hardware-level signals that the kernel cannot mask. The "unknown reason" message means that no handler claimed the NMI. Common underlying causes include:

Hardware failure (CPU/motherboard)
Firmware bugs
Overheating issues
Power supply problems

First, check your system logs for patterns:

# View recent kernel messages
dmesg | grep -i nmi

# Check for hardware errors in system logs
grep -i error /var/log/syslog
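
To see whether the messages cluster on one core, the log lines can be summarized per CPU. The following is a minimal sketch; the function name nmi_per_cpu is hypothetical, and it assumes the message format shown above.

```shell
# Hypothetical helper: count "NMI received" messages per CPU.
# Reads kernel log text on stdin and prints "<count> <cpu>" pairs.
nmi_per_cpu() {
    grep 'NMI received for unknown reason' \
        | sed -n 's/.* on CPU \([0-9][0-9]*\).*/\1/p' \
        | sort -n | uniq -c
}

# Example usage:
#   dmesg | nmi_per_cpu
#   nmi_per_cpu < /var/log/kern.log
```

If every event lands on the same CPU number, that points more toward a single core or socket than toward a board-wide power or firmware problem.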

If you suspect the kernel's own NMI watchdog, disable it with the nmi_watchdog=0 boot parameter in /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nmi_watchdog=0"

Update GRUB and reboot:

update-grub
reboot
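
After the reboot it is worth confirming that the parameter actually reached the kernel. This is a small sketch under the assumption that /proc/cmdline reflects the active boot parameters; cmdline_has is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if parameter $1 appears as a whole word
# in the kernel command line string $2.
cmdline_has() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example usage:
#   cmdline_has nmi_watchdog=0 "$(cat /proc/cmdline)" && echo "watchdog disabled at boot"
```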

For production servers, consider these measures:

# Install debug symbols for the running kernel
apt-get install linux-image-$(uname -r)-dbg

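# Install Intel CPU microcode updates (the package is in the non-free section)
apt-get install intel-microcode
# Check which microcode revision the kernel loaded
dmesg | grep -i microcode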
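
As a cross-check, the revision reported per core can be compared. This sketch assumes your kernel exposes a "microcode" field in /proc/cpuinfo, which may not be the case on every build; microcode_revisions is a hypothetical helper name.

```shell
# Hypothetical helper: print the distinct microcode revisions found in
# cpuinfo-style text on stdin (one line per distinct revision).
microcode_revisions() {
    awk -F': *' '/^microcode/ { print $2 }' | sort -u
}

# Example usage:
#   microcode_revisions < /proc/cpuinfo
```

More than one line of output would mean the cores are not all running the same revision, which is worth investigating before blaming other hardware.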

Run hardware diagnostics:

# Install the stress testing tool (mprime/Prime95 is not packaged in Debian and must be downloaded separately)
apt-get install stress

# Test CPU stability (run as root)
stress --cpu 4 --timeout 3600
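
To tell whether the stress run provoked additional NMIs, the NMI row of /proc/interrupts can be summed before and after. A minimal sketch, with nmi_total as a hypothetical helper name; note that if the NMI watchdog is enabled, its own periodic NMIs are counted in the same row, so the comparison is most meaningful with the watchdog off.

```shell
# Hypothetical helper: sum the per-CPU counters on the "NMI:" row of
# /proc/interrupts-style text read from stdin.
nmi_total() {
    awk '$1 == "NMI:" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) sum += $i
         }
         END { print sum + 0 }'
}

# Example usage:
#   before=$(nmi_total < /proc/interrupts)
#   stress --cpu 4 --timeout 3600
#   after=$(nmi_total < /proc/interrupts)
#   echo "NMIs during test: $((after - before))"
```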

In my case on Debian Wheezy (Linux 3.2.0-4-amd64), this NMI error was inevitably followed by a system reboot, creating significant instability.

Non-Maskable Interrupts are hardware-level signals that the kernel cannot mask. They typically indicate severe hardware issues or critical system states. The kernel prints the "unknown reason" message from arch/x86/kernel/nmi.c when no registered handler claims the NMI and the reason byte does not flag a PCI SERR or I/O check error. The "31" is that reason byte, printed in hexadecimal.
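
As a rough illustration of why 31 ends up in the "unknown" category, the sketch below classifies a reason byte by the two bits the kernel checks. The helper name nmi_reason_class is hypothetical. The bit values (0x80 for PCI SERR, 0x40 for I/O check) follow the kernel's NMI_REASON_SERR and NMI_REASON_IOCHK definitions, and the value in the message is assumed to be hexadecimal.

```shell
# Hypothetical helper: classify the hex reason byte from the kernel message.
# Runs in a subshell so it leaves no variables behind.
nmi_reason_class() (
    reason=$(printf '%d' "0x$1")            # "31" (hex) -> 49 (decimal)
    if [ $(( reason & 128 )) -ne 0 ]; then   # bit 7: PCI SERR
        echo "PCI SERR"
    elif [ $(( reason & 64 )) -ne 0 ]; then  # bit 6: I/O check
        echo "I/O check"
    else
        echo "unknown"
    fi
)

nmi_reason_class 31
```

For 0x31 neither bit is set, so the kernel has no specific handler to blame and falls back to the generic message.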

Key scenarios that can trigger NMIs:

Hardware watchdog timeouts
Memory parity errors
CPU thermal throttling
Power supply issues

First, let's collect system information:

# Check CPU temperature
sensors

# Verify power supply status (if supported)
cat /sys/class/power_supply/*/status

# Examine kernel ring buffer
dmesg | grep -i nmi

# Check for machine check events recorded by the mcelog daemon (the daemon must be running)
mcelog --client
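
Reading the sensors output by eye gets tedious on a machine you check repeatedly, so a threshold filter can help. This is a sketch that assumes coretemp-style lines such as "Core 0:  +45.0°C  (high = ...)"; other sensor chips label their readings differently. hot_cores is a hypothetical helper name and the 80 degree limit in the usage line is an arbitrary example.

```shell
# Hypothetical helper: print cores whose temperature is at or above the
# limit in degrees Celsius given as $1. Reads `sensors` output on stdin.
hot_cores() {
    awk -v limit="$1" -F: '/^Core [0-9]+:/ {
        temp = $2
        sub(/^[ \t]*[+]?/, "", temp)    # strip leading blanks and plus sign
        if (temp + 0 >= limit) print $1 ": " (temp + 0)
    }'
}

# Example usage:
#   sensors | hot_cores 80
```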

1. Disabling Power Saving Features:

# Temporarily disable CPU idle states (the per-state "disable" file may be missing on older kernels)
for i in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
    [ -w "$i" ] && echo 1 > "$i"
done
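
To confirm what the loop above changed, the idle states can be listed together with their disable flag. A minimal sketch, assuming the usual cpuidle sysfs layout; show_cpuidle_states is a hypothetical helper, and the optional argument exists only so the function can be pointed at a different directory.

```shell
# Hypothetical helper: list each CPU idle state with its name and
# disable flag ("n/a" when the kernel does not provide the file).
show_cpuidle_states() {
    base=${1:-/sys/devices/system/cpu}
    for state in "$base"/cpu*/cpuidle/state*; do
        [ -r "$state/name" ] || continue
        cpu=${state#"$base"/}
        cpu=${cpu%%/*}
        if [ -r "$state/disable" ]; then
            flag=$(cat "$state/disable")
        else
            flag="n/a"
        fi
        echo "$cpu ${state##*/} $(cat "$state/name") disable=$flag"
    done
}

# Example usage:
#   show_cpuidle_states
```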

2. Adjusting Watchdog Settings:

# Check active watchdogs
ls -l /dev/watchdog*

# Try disabling the kernel's NMI watchdog (hard lockup detector) at runtime
echo 0 > /proc/sys/kernel/nmi_watchdog

3. BIOS-Level Fixes:

Disable C-states in BIOS
Update to latest BIOS version
Disable Turbo Boost if unstable

Append these parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub and reboot:

# Prevent CPU from entering deep C-states
processor.max_cstate=1 intel_idle.max_cstate=0

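# Disable idle power management entirely and turn off the NMI watchdog
# (idle=poll keeps every core busy, which raises power draw and temperature)
idle=poll nmi_watchdog=0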
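
Editing the GRUB defaults by hand is fine for one machine; for several, a small helper can add a parameter without duplicating it. This is a sketch, not a hardened tool: add_grub_param is a hypothetical name, it assumes a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line with no "|" or "&" characters in it, and it relies on GNU sed's -i option. Back up the file first.

```shell
# Hypothetical helper: append parameter $2 to GRUB_CMDLINE_LINUX_DEFAULT
# in file $1, unless it is already present.
add_grub_param() {
    file=$1
    param=$2
    line=$(grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file") || return 1
    # Strip the variable name and surrounding quotes
    current=${line#GRUB_CMDLINE_LINUX_DEFAULT=\"}
    current=${current%\"}
    case " $current " in
        *" $param "*) return 0 ;;    # already present, nothing to do
    esac
    new="${current:+$current }$param"
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=.*|GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"|" "$file"
}

# Example usage (as root):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_grub_param /etc/default/grub processor.max_cstate=1
#   add_grub_param /etc/default/grub intel_idle.max_cstate=0
#   update-grub
```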

Create a simple monitoring script:

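#!/bin/bash
# Log a snapshot each time a new NMI message appears in the kernel ring buffer.
LOG_FILE="/var/log/nmi_monitor.log"
last_count=0
while true; do
    # Count NMI messages currently in the ring buffer
    count=$(dmesg | grep -c "NMI received")
    # Log only when the count has grown, so one event is not logged every minute
    if [ "$count" -gt "$last_count" ]; then
        {
            echo "[$(date)] NMI detected ($count in ring buffer)"
            lscpu
            sensors
        } >> "$LOG_FILE"
    fi
    last_count=$count
    sleep 60
done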

If software solutions fail, consider:

Testing with different RAM modules
Checking CPU socket connections
Verifying power supply voltage stability
Testing with a different motherboard
