Troubleshooting Random Shutdowns in Debian 6 on Xeon 55XX Server with RAID 10 SSDs


When dealing with unexpected server shutdowns, the first step is gathering all available data points. In this case, we have:

  • Debian 6 running on Xeon 55XX hardware
  • 4 SSDs configured in RAID 10
  • Two unexplained shutdowns within two weeks
  • No apparent correlation with load (average load ~1)
  • No indications of power outages from colocation facility

Start with these essential log files in /var/log:

# Check system messages
sudo grep -i -E "error|fail|shut|power|thermal" /var/log/messages

# Kernel logs are particularly important
sudo grep -i -E "panic|oops|thermal|temperature|hard reset" /var/log/kern.log

# Authentication logs (in case of remote access issues)
sudo grep -i -E "session|shutdown|poweroff" /var/log/auth.log

For Xeon 55XX servers with SSD RAID arrays:

# Check RAID status
sudo mdadm --detail /dev/md0

# SSD health (assuming your SSDs are /dev/sd[a-d])
for disk in /dev/sd[a-d]; do
    sudo smartctl -a "$disk" | grep -i -E "error|fail|reallocated|temperature"
done

# Current CPU temperatures (if lm-sensors is installed; this is a live reading, not a history)
sensors | grep Core
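
Grepping the full SMART report is fine for a first look, but comparing one attribute across all four SSDs is easier with a single number per disk. The following is a minimal sketch, assuming the classic ATA attribute table printed by `smartctl -A`, where the raw value is the tenth column. The function name `smart_raw_value` is made up for this example, and attribute names vary between SSD vendors.

```shell
# Print the RAW_VALUE column for one SMART attribute.
# Reads `smartctl -A` output on stdin; the attribute name is the only argument.
# Assumption: the ATA attribute table layout, where RAW_VALUE is column 10.
# Only the first token of a multi-word raw value is printed.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Hypothetical usage (device names are placeholders):
#   for disk in /dev/sd[a-d]; do
#       printf '%s reallocated=%s\n' "$disk" \
#           "$(sudo smartctl -A "$disk" | smart_raw_value Reallocated_Sector_Ct)"
#   done
```

A value that climbs between two runs on the same disk is usually more telling than the absolute number.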

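Debian 6 uses rsyslog by default, and rotated logs are gzip-compressed. To find shutdown-related entries across current and rotated files:

# Search for shutdown events in syslog, including rotated .gz files
sudo zcat -f /var/log/syslog* | grep -i "shut" | sort -k1,1M -k2,2n -k3,3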

Look for patterns like:

  • Hardware errors (EDAC messages for memory)
  • Thermal shutdown triggers
  • Filesystem errors that might cause panics
  • ACPI power events (especially important for Xeon systems)
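
To get a quick sense of which of these categories actually appears in a log, the lines can be counted per category. This is a rough sketch: the function name `count_shutdown_clues` and the regular expressions are illustrative assumptions, not an exhaustive list of kernel messages, and a single line may count toward more than one category.

```shell
# Count log lines per category. Reads log text on stdin.
# The patterns below are illustrative guesses and should be tuned to your logs.
count_shutdown_clues() {
    awk '
        { line = tolower($0) }
        line ~ /edac|machine check|mce:/               { n["memory"]++ }
        line ~ /thermal|temperature|throttl/           { n["thermal"]++ }
        line ~ /ext[234]-fs error|i.o error/           { n["filesystem"]++ }
        line ~ /acpi.*(error|critical)|power button/   { n["acpi"]++ }
        END {
            printf "memory=%d thermal=%d filesystem=%d acpi=%d\n",
                   n["memory"], n["thermal"], n["filesystem"], n["acpi"]
        }'
}

# Hypothetical usage:
#   sudo zcat -f /var/log/kern.log* | count_shutdown_clues
```

A category with zero hits does not rule that cause out, since an abrupt power loss often leaves no log entry at all.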

For deeper investigation, these commands can help:

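# Check dmesg for hardware errors (the -T timestamp flag is not available in the older util-linux on Debian 6)
dmesg | grep -i -E "error|fail|critical|temperature"

# Watch ACPI events live (requires acpid; leave it running ahead of a suspected shutdown window)
acpi_listen

# Check for kernel panics (if system reboots), including rotated .gz logs
sudo zcat -f /var/log/syslog* | grep -i "kernel panic"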

# Memory diagnostics (run when system is stable)
sudo memtester 2G 1
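
memtester only exercises memory the kernel is willing to hand out, so on ECC systems it is also worth reading the error counters the kernel keeps. The sketch below assumes an EDAC driver is loaded for the memory controller, in which case the counters appear under /sys/devices/system/edac/mc. The function name `edac_totals` is made up for this example.

```shell
# Sum corrected (ce_count) and uncorrected (ue_count) ECC error counters.
# The optional argument is the EDAC sysfs directory; the default is where
# the kernel exposes it when an EDAC driver is loaded (an assumption here).
edac_totals() {
    local base="${1:-/sys/devices/system/edac/mc}"
    local ce=0 ue=0 mc
    for mc in "$base"/mc*; do
        [ -r "$mc/ce_count" ] && ce=$((ce + $(cat "$mc/ce_count")))
        [ -r "$mc/ue_count" ] && ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "corrected=$ce uncorrected=$ue"
}

# Hypothetical usage:
#   edac_totals
```

Both totals reset at boot, so a reading of zero right after an unexplained restart says little. A non-zero uncorrected count is a strong hint toward a failing DIMM.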

Based on similar cases with Xeon 55XX servers, the most likely causes are:

  1. SSD firmware issues causing controller resets
  2. Memory errors (on ECC systems these usually surface as EDAC or machine-check messages)
  3. Power supply voltage fluctuations
  4. Kernel bugs with specific RAID controllers
  5. ACPI power management misconfiguration

If logs don't reveal clear answers, try these proactive measures:

# Disable aggressive power saving (for example deep C-states) in the BIOS setup.
# dmidecode cannot read those BIOS settings; it only lists power-related DMI fields:
sudo dmidecode | grep -i "power"

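# Update SSD firmware (example for Intel SSDs, assuming the Intel MAS tool runs on your system)
sudo intelmas show -intelssd
# Loads the firmware bundled with the tool onto drive index 0; back up first
sudo intelmas load -intelssd 0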

# Modify kernel parameters temporarily
sudo nano /etc/default/grub
# Add ONE of these to GRUB_CMDLINE_LINUX at a time: noapic, nolapic, acpi=off
sudo update-grub   # regenerate grub.cfg, then reboot to apply
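
Because the parameters should be tested one at a time, the edit to /etc/default/grub gets repeated several times, and hand edits are easy to get wrong. Here is a minimal sketch of a helper, assuming GNU sed and a GRUB_CMDLINE_LINUX value that sits double-quoted on a single line, as in Debian's stock file. `add_kernel_param` is a hypothetical name, and the presence check is a simple substring match.

```shell
# Append one kernel parameter to GRUB_CMDLINE_LINUX in a GRUB defaults file.
# Usage: add_kernel_param FILE PARAM   (a backup is kept as FILE.bak)
# Assumptions: GNU sed; the value is double-quoted on one line; PARAM has no "|".
add_kernel_param() {
    local file="$1" param="$2"
    # Skip the edit if the parameter already appears in the value.
    if grep -q "^GRUB_CMDLINE_LINUX=\".*${param}" "$file"; then
        return 0
    fi
    # First append the parameter, then drop the leading space left by an empty value.
    sed -i.bak \
        -e "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 ${param}\"|" \
        -e "s|^GRUB_CMDLINE_LINUX=\" |GRUB_CMDLINE_LINUX=\"|" \
        "$file"
}

# Hypothetical usage from a root shell:
#   add_kernel_param /etc/default/grub noapic && update-grub
```

The pattern deliberately does not touch GRUB_CMDLINE_LINUX_DEFAULT, and the .bak copy makes it easy to revert before trying the next parameter.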

When investigating unexpected shutdowns, start with these critical log files:

# Check system messages
sudo grep -i "error\|fail\|shut" /var/log/messages

# Kernel logs are goldmines
sudo grep -i "panic\|oops\|thermal" /var/log/kern.log

# Authentication logs (in case of remote access issues)
sudo grep -i "session\|ssh" /var/log/auth.log

For Xeon 55XX servers with RAID 10 SSD arrays:

# SSD health (SMART data) for every array member, not just the first disk
for disk in /dev/sd[a-d]; do
    sudo smartctl -a "$disk" | grep -i "reallocated\|pending\|uncorrectable"
done

# Memory diagnostics
sudo dmidecode -t memory | grep -i "size\|speed\|manufacturer"
sudo memtester 1G 1

Even without reported power outages, check the hardware event log and thermal readings:

# Check last power events
sudo ipmitool sel list | grep -i "power"

# Thermal thresholds
sudo sensors | grep -i "temp\|fan"

# Kernel ring buffer for thermal events
sudo dmesg | grep -i "thermal\|throttl"
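
It also helps to establish whether each restart was preceded by an orderly shutdown at all. `last -x` prints both "reboot" and "shutdown" records from wtmp, and a boot with no shutdown record before it points to a power cut, hard reset, or crash rather than a software-initiated halt. The sketch below labels each boot accordingly. `classify_boots` is a made-up name, and the history only reaches back as far as the current wtmp file.

```shell
# Label each boot as clean or unclean. Reads `last -x` output on stdin
# (newest first, as `last` prints it) and walks it oldest first.
classify_boots() {
    tac | awk '
        $1 == "shutdown" { prev = "shutdown"; next }
        $1 == "reboot" {
            if (prev == "shutdown")    label = "clean"
            else if (prev == "reboot") label = "unclean"
            else                       label = "unknown"   # no earlier record in wtmp
            print label ": " $0
            prev = "reboot"
        }'
}

# Hypothetical usage:
#   last -x | classify_boots
```

If both incidents show up as unclean, the logs are unlikely to contain a shutdown message, and the hardware event log and crash dumps become the main sources of evidence.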

Configure kdump for future incidents:

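# Install kdump tools (linux-crashdump is an Ubuntu metapackage; on Debian use these, if your release provides them)
sudo apt-get install kexec-tools kdump-tools

# Configure crash kernel memory
sudo nano /etc/default/grub
# Add to GRUB_CMDLINE_LINUX: crashkernel=128M
sudo update-grub

# After reboot, verify
grep -o "crashkernel=[^ ]*" /proc/cmdline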
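
Seeing crashkernel= on the command line only shows that memory was requested. Whether a capture kernel is actually loaded is reported by /sys/kernel/kexec_crash_loaded on kernels built with kexec support. A small check combining both, as a sketch (`kdump_ready` is a hypothetical name; the two optional arguments exist so the paths can be overridden):

```shell
# Report whether kdump looks ready to capture a crash.
# Usage: kdump_ready [CMDLINE_FILE] [LOADED_FILE]
kdump_ready() {
    local cmdline_file="${1:-/proc/cmdline}"
    local loaded_file="${2:-/sys/kernel/kexec_crash_loaded}"
    if ! grep -q 'crashkernel=' "$cmdline_file"; then
        echo "no crashkernel= on kernel command line"
        return 1
    fi
    # The kernel writes 1 here once a crash kernel has been loaded via kexec.
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi
    echo "kdump ready"
}

# Hypothetical usage:
#   kdump_ready || echo "fix kdump before the next incident"
```

Running this once after the reboot avoids discovering, after the next crash, that no dump was ever armed.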

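Create a periodic snapshot script (run as root) to capture pre-crash state:

#!/bin/bash
# Append a system snapshot to the log every 5 minutes.
LOGFILE=/var/log/system_watchdog.log

while true; do
    TIMESTAMP=$(date +"%Y-%m-%d %T")
    LOAD=$(cat /proc/loadavg)
    MEM=$(free -m)
    DISK=$(df -h)
    RAID=$(mdadm --detail /dev/md0 2>&1)   # keep error text if the array is unavailable
    TEMP=$(sensors 2>&1)

    {
        echo "===== ${TIMESTAMP} ====="
        printf 'Load:\n%s\n\nMemory:\n%s\n\nDisk:\n%s\n\nRAID:\n%s\n\nTemp:\n%s\n' \
            "$LOAD" "$MEM" "$DISK" "$RAID" "$TEMP"
    } >> "$LOGFILE"
    sync   # flush to disk so the last snapshot survives an abrupt power loss
    sleep 300
done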
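
After an incident, the interesting part of that log is the final snapshot written before the machine went down. A minimal sketch for pulling it out, assuming the "===== timestamp =====" header format used by the script above (`last_snapshot` is a made-up name):

```shell
# Print the final snapshot: everything from the last "===== ... =====" header to EOF.
# Usage: last_snapshot [LOGFILE]
last_snapshot() {
    awk '
        /^===== .* =====$/ { buf = "" }     # a new header starts a fresh snapshot
        { buf = buf $0 "\n" }
        END { printf "%s", buf }
    ' "${1:-/var/log/system_watchdog.log}"
}

# Hypothetical usage:
#   last_snapshot | less
```

Comparing the timestamp in that header with the boot time reported by `last -x` bounds the failure to within one sampling interval.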

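For colocated servers:

# Check SSH connection stability (sshd logs to auth.log on Debian)
sudo grep -i "sshd" /var/log/auth.log

# Network interface errors
sudo ethtool -S eth0 | grep -i "error\|drop"

# Configure serial console as backup (Debian 6 uses sysvinit, not systemd)
# Uncomment or add this line in /etc/inittab:
#   T0:23:respawn:/sbin/getty -L ttyS0 9600 vt100
sudo telinit q   # make init re-read /etc/inittab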