Debugging Unexplained CentOS 7 Server Crashes: RAID, Kernel Logs, and System Diagnostics


Server crashes without logs are like crime scenes without fingerprints. From your description, the system fails so hard that even kernel logging stops mid-stream. The /var/log/messages cutoff suggests either a kernel panic or hardware failure that prevents disk writes.
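
To see exactly where logging stopped, it can help to print the last few lines written before each boot banner. This is a minimal sketch, assuming the default rsyslog format and that every boot logs a "kernel: Linux version" line; the function name last_lines_before_boot is made up for illustration.

```shell
#!/bin/bash
# Print the N lines logged immediately before every boot banner in a syslog file.
# Assumption: each boot writes a "kernel: Linux version ..." line.
last_lines_before_boot() {
    local logfile="$1" count="${2:-3}"
    awk -v n="$count" '
        /kernel: Linux version/ {
            if (NR > 1) {
                print "--- before boot at line " NR " ---"
                start = NR - n; if (start < 1) start = 1
                # Replay the ring buffer in original order.
                for (i = start; i < NR; i++) print buf[i % n]
            }
        }
        { buf[NR % n] = $0 }   # keep only the most recent n lines
    ' "$logfile"
}
# Usage (as root): last_lines_before_boot /var/log/messages 5
```

If the lines before a boot banner look like ordinary activity with no shutdown messages, the machine most likely went down without the kernel getting a chance to log anything.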

Your dmesg output shows these filesystem messages for md1 during boot:

[    3.624786] EXT3-fs (md1): error: couldn't mount because of unsupported optional features (240)
[    3.627095] EXT2-fs (md1): error: couldn't mount because of unsupported optional features (244)
[    3.630284] EXT4-fs (md1): INFO: recovery required on readonly filesystem

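These lines are less alarming than they look. The EXT3 and EXT2 errors appear because the kernel tries the older drivers against an ext4 filesystem before the ext4 driver claims it, and they are harmless. The EXT4 line means the journal had to be replayed because the filesystem was not cleanly unmounted, so it is a symptom of the previous crash rather than proof of corruption. If mdadm also reports the array as "not clean", that points the same way: improper shutdowns, which are a symptom of the crashes and, if they keep happening, a possible cause of further damage.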
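
On an ext4 filesystem, the EXT3-fs and EXT2-fs "unsupported optional features" lines are normally just the kernel trying the older drivers first. A small filter can hide them so that the messages worth reading stand out. This is only a sketch; filter_fs_probe_noise is a hypothetical helper name.

```shell
#!/bin/bash
# Drop the ext2/ext3 probe messages from dmesg-style input on stdin,
# leaving other filesystem messages (such as ext4 journal recovery) visible.
filter_fs_probe_noise() {
    grep -Ev "EXT[23]-fs \([^)]+\): error: couldn't mount because of unsupported optional features"
}
# Usage: dmesg | filter_fs_probe_noise | grep -E 'EXT[234]-fs'
```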

Install these packages immediately:

sudo yum install -y sysstat crash kernel-devel mcelog smartmontools

Enable kdump for capturing crash context:

sudo yum install -y kexec-tools
sudo systemctl enable kdump
sudo systemctl start kdump

Configure /etc/kdump.conf:

path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
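
The -d 31 in the core_collector line is a bitmask of page types that makedumpfile leaves out of the dump to keep it small. The sketch below decodes a dump level using the bit values listed in the makedumpfile man page; decode_dump_level is a hypothetical helper name.

```shell
#!/bin/bash
# Decode a makedumpfile dump level (-d N) into the page types it excludes.
decode_dump_level() {
    local level="$1"
    [ $((level & 1))  -ne 0 ] && echo "zero pages"
    [ $((level & 2))  -ne 0 ] && echo "non-private cache pages"
    [ $((level & 4))  -ne 0 ] && echo "private cache pages"
    [ $((level & 8))  -ne 0 ] && echo "user process data pages"
    [ $((level & 16)) -ne 0 ] && echo "free pages"
    return 0
}
# Usage: decode_dump_level 31
```

Level 31 sets all five bits, so the dump keeps essentially only kernel memory, which is usually what a crash analysis needs.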

Create a monitoring script (monitor.sh):

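#!/bin/bash
# Append memory, kernel-log, and RAID snapshots every 30 seconds (run as root).
while true; do
    echo "$(date) - $(grep MemFree /proc/meminfo)" >> /var/log/mem_monitor.log
    { date; dmesg -T | tail -n 20; } >> /var/log/dmesg_monitor.log
    { date; mdadm --detail /dev/md[0-9]*; } >> /var/log/raid_status.log
    sync    # flush to disk so the last samples survive a hard crash
    sleep 30
done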

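Check for hardware errors. mcelog --ascii only decodes text fed to it on standard input, so query the running mcelog daemon and the system log instead, then check the SMART data of each disk:

sudo mcelog --client
sudo grep -iE 'mce|machine check' /var/log/messages
sudo smartctl -a /dev/sda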
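
smartctl prints a lot of output; the attributes most often associated with failing sectors are 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable). The sketch below filters them out of smartctl -A output. It assumes ATA disks whose attribute table uses the usual ten-column layout with the raw value last; smart_warning_attrs is a hypothetical helper name, and attribute meanings can vary by vendor.

```shell
#!/bin/bash
# Read `smartctl -A` output on stdin and print sector-health attributes
# whose raw value (column 10) is greater than zero.
smart_warning_attrs() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { if ($10 + 0 > 0) print $2 "=" $10 }'
}
# Usage (as root), with the disk names assumed from this thread:
#   for d in /dev/sd[a-c]; do echo "$d"; smartctl -A "$d" | smart_warning_attrs; done
```

No output for a disk means none of the three counters is above zero, which is a good sign but not a guarantee of a healthy drive.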

For kernel module issues:

sudo lsmod | grep md_mod
sudo modinfo md_mod

Check array consistency:

sudo mdadm --detail --scan
sudo mdadm --examine /dev/sd[a-c]1

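Run a consistency check, and repair only if it reports mismatches. mdadm's --manage mode has no --action option, so use the md sysfs interface instead:

echo check | sudo tee /sys/block/md1/md/sync_action
cat /proc/mdstat
cat /sys/block/md1/md/mismatch_cnt
echo repair | sudo tee /sys/block/md1/md/sync_action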
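
After a consistency check has finished, the kernel exposes the number of mismatched sectors per array in /sys/block/mdN/md/mismatch_cnt. The sketch below collects that counter for every array. The sysfs root is a parameter only so the function can be tried against a test directory; md_mismatch_report is a hypothetical helper name.

```shell
#!/bin/bash
# Print "<array> <mismatch_cnt>" for every md array under the given sysfs root.
md_mismatch_report() {
    local root="${1:-/sys/block}" f array
    for f in "$root"/md*/md/mismatch_cnt; do
        [ -r "$f" ] || continue          # no arrays, or glob did not match
        array=${f#"$root"/}              # strip the sysfs root
        array=${array%%/*}               # keep only the array name
        echo "$array $(cat "$f")"
    done
}
# Usage: md_mismatch_report
```

A count of 0 after a completed check means the members agree. A non-zero count is the case where a repair pass is worth considering.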

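Append these parameters to the existing GRUB_CMDLINE_LINUX line in /etc/default/grub rather than replacing the line, so that the root and LVM settings already there are kept. Leave the NMI watchdog enabled (nmi_watchdog=1), because it is what detects hard lockups; disabling it hides exactly the kind of hang you are trying to catch:

GRUB_CMDLINE_LINUX="<existing parameters> raid=noautodetect crashkernel=auto nmi_watchdog=1 softlockup_panic=1"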
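
Editing GRUB_CMDLINE_LINUX by hand makes it easy to drop a parameter that was already there. The sketch below appends one parameter only if it is missing and leaves the rest of the line alone. It assumes the value is on a single double-quoted line and that the parameter contains no "|" or "&" characters; add_grub_arg is a hypothetical helper name. Try it on a copy of the file first.

```shell
#!/bin/bash
# Append a kernel parameter to GRUB_CMDLINE_LINUX in the given file
# unless that exact parameter is already present.
add_grub_arg() {
    local file="$1" arg="$2" current
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $arg "*) return 0 ;;          # already there, nothing to do
    esac
    sed "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"\$|GRUB_CMDLINE_LINUX=\"\1 $arg\"|" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}
# Usage:
#   cp /etc/default/grub /tmp/grub.test
#   add_grub_arg /tmp/grub.test softlockup_panic=1
```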

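Regenerate the GRUB configuration and reboot for the parameters to take effect (on UEFI systems the output file is /boot/efi/EFI/centos/grub.cfg instead):

sudo grub2-mkconfig -o /boot/grub2/grub.cfg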

The dmesg lines quoted above all concern md1, so verify the state of the RAID arrays first:

cat /proc/mdstat
mdadm --detail /dev/md1
mdadm --detail /dev/md2
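
In /proc/mdstat each array has a status field such as [UU], where every U is a working member and every underscore is a missing or failed one. The sketch below lists the arrays that have at least one underscore. It assumes the usual mdstat layout, with the array name on one line and the status field on a following line; degraded_arrays is a hypothetical helper name.

```shell
#!/bin/bash
# Print the name of every md array whose member status (e.g. [U_]) shows a gap.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }                 # remember the current array
        match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    ' "${1:-/proc/mdstat}"
}
# Usage: degraded_arrays
```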

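kdump can only capture a crash if the crashkernel= memory reservation is active, which requires a reboot after any change to the kernel command line. If kexec-tools is not yet installed and enabled, do that first:

yum install kexec-tools
systemctl enable kdump.service
systemctl start kdump.service

After rebooting, verify that the service is armed: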

kdumpctl status
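
Two things have to be true for kdump to work: the running kernel was booted with a crashkernel= parameter, and a capture kernel has been loaded. The sketch below checks both, using /proc/cmdline and /sys/kernel/kexec_crash_loaded. The function names crashkernel_param and kdump_armed are hypothetical.

```shell
#!/bin/bash
# Print the crashkernel= value from a kernel command line, or nothing if absent.
crashkernel_param() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}

# Succeed only if the running kernel has a crashkernel= parameter
# and a capture kernel is currently loaded.
kdump_armed() {
    [ -n "$(crashkernel_param "$(cat /proc/cmdline)")" ] &&
    [ "$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)" = "1" ]
}
# Usage: kdump_armed && echo "kdump ready" || echo "kdump NOT ready"
```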

Install and configure sysstat for historical data:

yum install sysstat
sed -i 's/^HISTORY=.*/HISTORY=28/' /etc/sysconfig/sysstat
systemctl enable sysstat
systemctl start sysstat

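For the EXT4 recovery message shown in your logs, run a full filesystem check. fsck must never run on a mounted filesystem. If md1 is a data volume with an entry in /etc/fstab, unmount it first:

umount /dev/md1
fsck.ext4 -f /dev/md1
mount /dev/md1

If md1 holds the root filesystem, it cannot be unmounted while the system is running. Boot from rescue media and run the same fsck.ext4 command from there.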

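Add these to /etc/sysctl.conf and apply them with sysctl -p. They do not prevent crashes; they make the kernel panic on an out-of-memory condition instead of limping on, reboot automatically 10 seconds after a panic, and enable the magic SysRq keys. Note that vm.panic_on_oom = 1 takes the whole system down on any out-of-memory event, so remove it once the investigation is over: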

vm.panic_on_oom = 1
kernel.panic = 10
kernel.sysrq = 1

Configure journald for persistent logs:

mkdir /var/log/journal
chown root:systemd-journal /var/log/journal
chmod 2755 /var/log/journal
systemctl restart systemd-journald

Install lm_sensors for hardware monitoring:

yum install lm_sensors
sensors-detect
systemctl start lm_sensors
systemctl enable lm_sensors

Create a monitoring script at /usr/local/bin/raid_monitor.sh:

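#!/bin/bash
# Log a warning when md1 is degraded. "clean" and "active" are both healthy states,
# so look for problem keywords instead of comparing against "clean".
RAID_STATE=$(mdadm --detail /dev/md1 | awk -F' : ' '/State :/ {print $2}')

if echo "$RAID_STATE" | grep -Eqi 'degraded|failed|inactive'; then
    logger -t RAID "md1 state is '$RAID_STATE'; inspect /proc/mdstat and SMART data"
    # Re-add a member by hand only after confirming the disk is healthy, for example:
    #   mdadm --manage /dev/md1 --add /dev/sdX1
fi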

Make it executable and add to cron:

chmod +x /usr/local/bin/raid_monitor.sh
(crontab -l 2>/dev/null; echo "*/15 * * * * /usr/local/bin/raid_monitor.sh") | crontab -