When dealing with unexpected server shutdowns, the first step is gathering all available data points. In this case, we have:
- Debian 6 running on Xeon 55XX hardware
- 4 SSDs configured in RAID 10
- Two unexplained shutdowns within two weeks
- No apparent correlation with load (average load ~1)
- No indications of power outages from colocation facility
Start with these essential log files:
# Check system messages
sudo grep -i -E "error|fail|shut|power|thermal" /var/log/messages
# Kernel logs are particularly important
sudo grep -i -E "panic|oops|thermal|temperature|hard reset" /var/log/kern.log
# Authentication logs (in case of remote access issues)
sudo grep -i -E "session|shutdown|poweroff" /var/log/auth.log
For Xeon 55XX servers with SSD RAID arrays:
# Check RAID status
sudo mdadm --detail /dev/md0
# SSD health (assuming your SSDs are /dev/sd[a-d])
for disk in /dev/sd[a-d]; do
  echo "== $disk =="
  sudo smartctl -a "$disk" | grep -i -E "error|fail|reallocated|temperature"
done
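When comparing several disks, the attribute that matters most is the reallocated-sector count. A small awk filter can flag it automatically; the sample row below is fabricated in the style of a `smartctl -A` attribute line, so on the server you would pipe real smartctl output into `check_realloc` instead:

```shell
#!/bin/bash
# Flag drives whose Reallocated_Sector_Ct raw value is nonzero.
# smartctl -A rows put the attribute name in field 2 and the raw
# value in field 10; the here-doc below is a made-up example row.
check_realloc() {
    awk '$2 == "Reallocated_Sector_Ct" && $10 > 0 {
        print "WARNING: " $10 " reallocated sectors"
        found = 1
    }
    END { if (!found) print "OK: no reallocated sectors" }'
}

check_realloc <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
EOF
```

A nonzero and, worse, a growing raw value on one member of the array is a strong hint that drive is resetting the controller.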
# Current CPU temperatures (requires lm-sensors, configured with sensors-detect)
sensors | grep Core
Debian 6 uses syslog by default. To find shutdown-related entries across the current and rotated logs:
# Search for shutdown events (zgrep also reads the rotated .gz logs;
# -h drops the filename prefix so the output sorts cleanly by date)
sudo zgrep -ih "shut" /var/log/syslog* | sort -k1M -k2n
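Once the search above has produced timestamps for both shutdowns, it is worth computing the interval between them to see whether the failures cluster. A minimal sketch; the two timestamps below are invented examples in syslog's "Mon DD HH:MM:SS" format, so substitute the real entries:

```shell
#!/bin/bash
# Days elapsed between two logged shutdown events. The timestamps
# are made-up samples; GNU date parses the syslog prefix and
# assumes the current year.
first="Mar  3 02:14:07"
second="Mar 16 05:41:22"

t1=$(date -d "$first" +%s)
t2=$(date -d "$second" +%s)
echo "Interval: $(( (t2 - t1) / 86400 )) days"   # prints "Interval: 13 days"
```

A regular interval points toward a scheduled job or environmental cycle; random intervals point toward hardware.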
Look for patterns like:
- Hardware errors (EDAC messages for memory)
- Thermal shutdown triggers
- Filesystem errors that might cause panics
- ACPI power events (especially important for Xeon systems)
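The four categories above can be folded into a single triage pass over a log file. A sketch, run here against a fabricated three-line excerpt so the matching rules are visible; on the server, call `classify` with /var/log/kern.log or /var/log/syslog instead:

```shell
#!/bin/bash
# Tag log lines by the failure category they suggest.
# The sample lines are invented for illustration only.
classify() {
    grep -iE "edac"                           "$1" | sed 's/^/[MEMORY]  /'
    grep -iE "thermal|throttl"                "$1" | sed 's/^/[THERMAL] /'
    grep -iE "ext[34]-fs error|journal abort" "$1" | sed 's/^/[FS]      /'
    grep -iE "acpi.*(power|button|event)"     "$1" | sed 's/^/[ACPI]    /'
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
kernel: EDAC MC0: CE row 2, channel 0
kernel: CPU1: Core temperature above threshold, cpu clock throttled
kernel: ACPI: Power Button [PWRB]
EOF
classify "$sample"
rm -f "$sample"
```

The exact regexes are starting points, not an exhaustive match list; extend them with whatever message patterns your kernel version emits.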
For deeper investigation, these commands can help:
# Check the kernel ring buffer for hardware errors
# (dmesg -T for human-readable timestamps needs util-linux >= 2.20,
# newer than what Debian 6 ships; plain dmesg works everywhere)
dmesg | grep -i -E "error|fail|critical|temperature"
# Watch ACPI events live (acpi_listen ships with the acpid package)
acpi_listen # leave this running ahead of the next potential shutdown window
# Check for kernel panics (if system reboots)
sudo grep -i "kernel panic" /var/log/syslog*
# Memory diagnostics (run when system is stable)
sudo memtester 2G 1
Based on similar cases with Xeon 55XX servers, the most common culprits are:
- SSD firmware issues causing controller resets
- Memory errors (especially with ECC memory)
- Power supply voltage fluctuations
- Kernel bugs with specific RAID controllers
- ACPI power management misconfiguration
If logs don't reveal clear answers, try these proactive measures:
# Disable aggressive power saving in the BIOS setup; from the OS,
# dmidecode at least shows what the firmware reports about the power supply:
sudo dmidecode | grep -i "power"
# Update SSD firmware (example for Intel SSDs)
sudo intelmas show -intelssd
sudo intelmas load -intelssd 0 -f firmware_image.bin
# Modify kernel parameters temporarily
sudo nano /etc/default/grub
# Add ONE of these to GRUB_CMDLINE_LINUX and test each separately:
#   noapic, nolapic, acpi=off  (acpi=off also disables power management)
sudo update-grub # regenerate grub.cfg, then reboot to test
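Editing GRUB_CMDLINE_LINUX by hand is easy to get wrong, and the one-parameter-at-a-time rule is easy to break. A small helper keeps it honest; it is demonstrated here against a scratch copy of the file, and on the server you would point it at /etc/default/grub and follow with update-grub:

```shell
#!/bin/bash
# Append exactly one test parameter inside the GRUB_CMDLINE_LINUX
# quotes. Demonstrated on a temporary copy, not the real file.
add_param() {
    local file=$1 param=$2
    sed -i "s/^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"/\1 $param\"/" "$file"
}

grub=$(mktemp)
echo 'GRUB_CMDLINE_LINUX="quiet"' > "$grub"
add_param "$grub" "noapic"
cat "$grub"   # prints: GRUB_CMDLINE_LINUX="quiet noapic"
rm -f "$grub"
```

Revert the line between tests so only one variable changes per reboot.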
Also record the installed memory modules; a marginal or mismatched DIMM is a common cause of unexplained resets:
# Memory module inventory
sudo dmidecode -t memory | grep -i -E "size|speed|manufacturer"
Even without reported power outages:
# Check the BMC event log for power events (requires ipmitool and a BMC)
sudo ipmitool sel list | grep -i "power"
# Thermal readings and fan speeds
sudo sensors | grep -i -E "temp|fan"
# Kernel ring buffer for thermal events
sudo dmesg | grep -i -E "thermal|throttl"
Configure kdump for future incidents:
# Install kdump tools
sudo apt-get install linux-crashdump kdump-tools
# Reserve memory for the crash kernel
sudo nano /etc/default/grub
# Add to GRUB_CMDLINE_LINUX: crashkernel=128M
sudo update-grub
# After reboot, verify the reservation took effect
grep crashkernel /proc/cmdline
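The verification step can be scripted so monitoring catches a missing reservation. A sketch that parses a command-line string; the string below is a made-up example, and on the server you would read it with cmdline=$(cat /proc/cmdline):

```shell
#!/bin/bash
# Extract the crashkernel reservation from a kernel command line.
# The cmdline value here is an invented sample for illustration.
cmdline="BOOT_IMAGE=/vmlinuz-2.6.32-5-amd64 root=/dev/md0 ro quiet crashkernel=128M"

reserved=$(echo "$cmdline" | grep -o 'crashkernel=[^ ]*' | cut -d= -f2)
if [ -n "$reserved" ]; then
    echo "crash kernel reserved: $reserved"   # prints: crash kernel reserved: 128M
else
    echo "no crashkernel parameter - kdump cannot capture a dump"
fi
```

Without the reservation, kdump silently does nothing when the next panic hits.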
Create a watchdog script to capture pre-crash state:
#!/bin/bash
LOGFILE=/var/log/system_watchdog.log
while true; do
  TIMESTAMP=$(date +"%Y-%m-%d %T")
  LOAD=$(cat /proc/loadavg)
  MEM=$(free -m)
  DISK=$(df -h)
  RAID=$(mdadm --detail /dev/md0)
  TEMP=$(sensors)
  echo "===== ${TIMESTAMP} =====" >> "$LOGFILE"
  echo -e "Load:\n${LOAD}\n\nMemory:\n${MEM}\n\nDisk:\n${DISK}\n\nRAID:\n${RAID}\n\nTemp:\n${TEMP}" >> "$LOGFILE"
  sleep 300
done
For colocated servers:
# Check SSH connection stability
sudo grep -i "sshd" /var/log/syslog
# Network interface errors
sudo ethtool -S eth0 | grep -i -E "error|drop"
# Configure a serial console as backup access
# (Debian 6 uses sysvinit, not systemd: add this line to /etc/inittab)
#   T0:23:respawn:/sbin/getty -L ttyS0 115200 vt100
sudo init q # make init re-read /etc/inittab