Debugging Unexplained CentOS 7 Server Crashes: RAID, Kernel Logs, and System Diagnostics


Server crashes without logs are like crime scenes without fingerprints. From your description, the system fails so hard that even kernel logging stops mid-stream. The /var/log/messages cutoff suggests either a kernel panic or hardware failure that prevents disk writes.

Your dmesg output shows the kernel's attempts to mount md1 during boot:

[    3.624786] EXT3-fs (md1): error: couldn't mount because of unsupported optional features (240)
[    3.627095] EXT2-fs (md1): error: couldn't mount because of unsupported optional features (244)
[    3.630284] EXT4-fs (md1): INFO: recovery required on readonly filesystem

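The EXT3 and EXT2 lines are usually harmless: with no explicit root filesystem type, the kernel tries each driver in turn, and the ext3 and ext2 attempts fail because md1 is an ext4 filesystem. The third line is the meaningful one. "recovery required" means the ext4 journal had to be replayed, so the previous shutdown was not clean. That is consistent with a crash, but it does not by itself prove that the filesystem or the RAID array is corrupt. A "not clean" array status likewise points to improper shutdowns, which are a symptom of the crashes and, if they keep happening, a risk to the data.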
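One way to triage these boot messages, as a minimal sketch. It assumes the usual kernel behavior in which the EXT3/EXT2 "unsupported optional features" lines are failed probes of an ext4 volume, while "recovery required" marks a journal replay after an unclean shutdown. The function name classify_mount_msgs and its three labels are hypothetical.

```shell
#!/bin/bash
# Sketch: label dmesg-style lines read from stdin.
# classify_mount_msgs and the label names are illustrative, not a standard tool.
classify_mount_msgs() {
    while IFS= read -r line; do
        case "$line" in
            *"unsupported optional features"*)
                # ext2/ext3 drivers probing an ext4 filesystem
                echo "probe-noise: $line" ;;
            *"recovery required"*|*"recovery complete"*)
                # journal replay, i.e. the last shutdown was not clean
                echo "unclean-shutdown: $line" ;;
            *"EXT4-fs error"*|*"I/O error"*)
                # runtime filesystem or block-layer errors
                echo "real-error: $line" ;;
        esac
    done
}

# Usage on a live system (left commented so the sketch has no side effects):
# dmesg | classify_mount_msgs
```

Lines tagged real-error deserve attention first; probe-noise lines can normally be ignored.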

Install these packages immediately:

sudo yum install -y sysstat crash kernel-devel mcelog smartmontools

Enable kdump for capturing crash context:

sudo yum install -y kexec-tools
sudo systemctl enable kdump
sudo systemctl start kdump

Configure /etc/kdump.conf:

path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
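Before relying on kdump, it may be worth confirming that a capture kernel is actually loaded. The sketch below reads /sys/kernel/kexec_crash_loaded, which the kernel sets to 1 once a crash kernel has been loaded. The helper name kdump_armed is made up for illustration, and the optional path argument exists only so the check can be exercised against a test file.

```shell
#!/bin/bash
# Sketch: succeed (exit 0) only if a crash kernel is loaded.
# kdump_armed is a hypothetical helper name.
kdump_armed() {
    local flag="${1:-/sys/kernel/kexec_crash_loaded}"
    [ "$(cat "$flag" 2>/dev/null)" = "1" ]
}

# Usage (commented out):
# if kdump_armed; then echo "kdump ready"; else echo "kdump NOT loaded"; fi
```

A controlled end-to-end test is possible with echo c > /proc/sysrq-trigger as root. That crashes the machine on purpose, so it belongs in a maintenance window.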

Create a monitoring script (monitor.sh):

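#!/bin/bash
# Append periodic snapshots so the last entries before a crash survive.
while true; do
    echo "$(date) - $(grep MemFree /proc/meminfo)" >> /var/log/mem_monitor.log
    { date; dmesg -T | tail -n 20; } >> /var/log/dmesg_monitor.log
    { date; mdadm --detail /dev/md[0-9]*; } >> /var/log/raid_status.log
    sync    # flush to disk; a hard crash loses data still in the page cache
    sleep 30
done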

Check for hardware errors:

# Query the running mcelog daemon (--ascii only decodes text fed on stdin)
sudo mcelog --client
sudo smartctl -a /dev/sda
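The full smartctl report is long, and only a handful of counters usually matter for crash hunting. As a rough sketch, the filter below reads the attribute table from smartctl -A and prints the failure-related counters whose raw value is above zero. The helper name smart_nonzero_counters and the choice of four attributes are assumptions, and attribute names can vary by drive vendor.

```shell
#!/bin/bash
# Sketch: read `smartctl -A` output on stdin and print NAME=RAW for
# selected counters with a non-zero raw value (column 10 of the table).
# smart_nonzero_counters is a hypothetical helper name.
smart_nonzero_counters() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 + 0 > 0 {
            print $2 "=" $10
        }
    '
}

# Usage (commented out):
# sudo smartctl -A /dev/sda | smart_nonzero_counters
```

Non-zero reallocated or pending sectors point at the disk itself, while UDMA CRC errors more often implicate cabling or the controller.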

For kernel module issues:

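# On stock CentOS 7 kernels the md core is typically built in, so look for
# the RAID personality modules; substitute your level (raid1, raid456, ...)
sudo lsmod | grep -E '^(raid|md_mod)'
sudo modinfo raid1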

Check array consistency:

sudo mdadm --detail --scan
sudo mdadm --examine /dev/sd[a-c][1-9]

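Start a consistency check if needed. mdadm --manage has no resync action; the md sysfs interface accepts check (read-only scan) and repair:

echo check | sudo tee /sys/block/md1/md/sync_action
cat /proc/mdstat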
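Once a check or resync has finished, the md driver reports its state and the number of inconsistent sectors it found through sysfs. The following is a small sketch for reading both values. The helper name md_check_summary is hypothetical, and the directory argument defaults to md1 only because that is the array discussed here.

```shell
#!/bin/bash
# Sketch: print the current sync action and mismatch count for one array.
# md_check_summary is a hypothetical helper name.
md_check_summary() {
    local md_dir="${1:-/sys/block/md1/md}"
    local action mismatches
    action=$(cat "$md_dir/sync_action") || return 1
    mismatches=$(cat "$md_dir/mismatch_cnt") || return 1
    echo "sync_action=$action mismatch_cnt=$mismatches"
}

# Usage (commented out):
# md_check_summary /sys/block/md1/md
```

A sync_action of idle means no check is running. A mismatch_cnt that stays above zero after a completed check suggests the members disagree and is worth investigating together with the SMART data.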

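Append these parameters to the existing GRUB_CMDLINE_LINUX value in /etc/default/grub, keeping whatever is already there (the root and rd.* entries in particular):

GRUB_CMDLINE_LINUX="<existing parameters> raid=noautodetect crashkernel=auto nmi_watchdog=1 softlockup_panic=1"

crashkernel=auto reserves memory for kdump, nmi_watchdog=1 keeps hard-lockup detection enabled (setting it to 0 would hide exactly the kind of hang you are chasing), and softlockup_panic=1 turns a soft lockup into a panic that kdump can capture.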

Remember to regenerate the GRUB configuration and reboot. On UEFI systems, write to /boot/efi/EFI/centos/grub.cfg instead:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
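After rebooting, it can help to confirm that the parameters actually reached the running kernel, since an edit to /etc/default/grub has no effect until the configuration is regenerated. This is a minimal sketch: missing_boot_params is a hypothetical helper that takes a command-line string plus a list of parameter names and prints the ones that are absent.

```shell
#!/bin/bash
# Sketch: print each wanted parameter that is missing from a kernel
# command line. Matches both bare flags ("quiet") and key=value forms.
# missing_boot_params is a hypothetical helper name.
missing_boot_params() {
    local cmdline="$1"
    shift
    local want
    for want in "$@"; do
        case " $cmdline " in
            *" $want "*|*" $want="*) ;;   # present, nothing to report
            *) echo "$want" ;;
        esac
    done
}

# Usage (commented out):
# missing_boot_params "$(cat /proc/cmdline)" crashkernel nmi_watchdog softlockup_panic
```

Empty output means every listed parameter is on the live command line.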

Verify the status of each RAID array with these commands:

cat /proc/mdstat
mdadm --detail /dev/md1
mdadm --detail /dev/md2
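In /proc/mdstat, each array has a status string such as [UU], where an underscore marks a missing or failed member. The sketch below scans that format and prints the names of degraded arrays. The helper name degraded_arrays is an assumption; the parsing relies only on the standard mdstat layout.

```shell
#!/bin/bash
# Sketch: read /proc/mdstat-style text on stdin and print the name of
# every array whose member map (e.g. [U_]) contains an underscore.
# degraded_arrays is a hypothetical helper name.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }           # remember the current array
        match($0, /\[[U_]+\]/) {              # member map such as [UU] or [U_]
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    '
}

# Usage (commented out):
# degraded_arrays < /proc/mdstat
```

No output means every array has all of its members.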

With kdump enabled as described earlier, confirm that the service is operational. The crashkernel reservation only takes effect after a reboot with the updated kernel command line:

kdumpctl status

Install and configure sysstat for historical data:

yum install sysstat
sed -i 's/^HISTORY=.*/HISTORY=28/' /etc/sysconfig/sysstat
systemctl enable sysstat
systemctl start sysstat
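Once sysstat has collected data, the minutes before a crash can be replayed from the daily binary files, which CentOS 7 stores as /var/log/sa/saDD. This sketch maps a date to that file name. The helper name sa_file_for_date and the example date and times are hypothetical, and it assumes GNU date.

```shell
#!/bin/bash
# Sketch: turn a calendar date into the sysstat data file for that day.
# sa_file_for_date is a hypothetical helper name; requires GNU date.
sa_file_for_date() {
    local day
    day=$(date -d "$1" +%d) || return 1
    echo "/var/log/sa/sa${day}"
}

# Usage (commented out; date and time window are placeholders):
# memory (-r), CPU (-u) and load (-q) for the half hour before a crash
# sar -r -u -q -f "$(sa_file_for_date 2024-03-05)" -s 02:00:00 -e 02:30:00
```

A steady drop in free memory or a climbing run queue just before the data stops would narrow the search considerably.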

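For the EXT4 recovery message shown in your logs: md1 is mounted during early boot, which suggests it holds the root filesystem, and a mounted root cannot be unmounted or safely checked in place. Boot from rescue media (or add fsck.mode=force to the kernel command line for one boot) and run the check while md1 is not mounted:

fsck.ext4 -f /dev/md1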

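Add these to /etc/sysctl.conf and apply them with sysctl -p. They do not make the system more stable; they make failures visible. vm.panic_on_oom=1 panics on out-of-memory instead of killing a process, kernel.panic=10 reboots 10 seconds after a panic, and kernel.sysrq=1 enables all SysRq functions for a hung console: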

vm.panic_on_oom = 1
kernel.panic = 10
kernel.sysrq = 1

Configure journald for persistent logs:

mkdir /var/log/journal
chown root:systemd-journal /var/log/journal
chmod 2755 /var/log/journal
systemctl restart systemd-journald
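With persistent storage in place, the journal of the boot that crashed remains readable after the restart. The wrappers below are a small sketch around standard journalctl options (-b -1 selects the previous boot, -k restricts output to kernel messages). The function names are made up for illustration.

```shell
#!/bin/bash
# Sketch: show the last lines logged before the previous boot ended.
# previous_boot_tail and previous_boot_kernel_tail are hypothetical names.
previous_boot_tail() {
    local lines="${1:-100}"
    journalctl -b -1 -n "$lines" --no-pager
}

previous_boot_kernel_tail() {
    local lines="${1:-100}"
    journalctl -k -b -1 -n "$lines" --no-pager
}

# Usage (commented out):
# journalctl --list-boots        # confirm that earlier boots are retained
# previous_boot_kernel_tail 50   # last 50 kernel messages before the crash
```

If the final entries are routine messages with no panic or error, that points toward a hard hang or power-level fault rather than a software panic.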

Install lm_sensors for hardware monitoring:

yum install lm_sensors
sensors-detect
systemctl start lm_sensors
systemctl enable lm_sensors

Create a monitoring script at /usr/local/bin/raid_monitor.sh:

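#!/bin/bash
# Log when md1 is degraded. Re-adding members is left to the admin:
# blindly adding disks can pull a failing drive back into the array,
# and "active" is a normal state that must not trigger a repair.
RAID_STATE=$(mdadm --detail /dev/md1 | awk -F' : ' '/State :/ {print $2}')

case "$RAID_STATE" in
    *degraded*|*FAILED*)
        logger -t RAID "md1 state: $RAID_STATE; check /proc/mdstat and SMART data"
        # After replacing or verifying the disk, re-add it manually, e.g.:
        # mdadm --manage /dev/md1 --add /dev/sdX1
        ;;
esac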

Make it executable and add it to root's crontab (2>/dev/null keeps the command quiet when no crontab exists yet):

chmod +x /usr/local/bin/raid_monitor.sh
(crontab -l 2>/dev/null; echo "*/15 * * * * /usr/local/bin/raid_monitor.sh") | crontab -