When working with ECC (Error-Correcting Code) memory in Linux systems, error reporting primarily comes through these channels:
1. Kernel ring buffer (dmesg)
2. mcelog daemon processing
3. EDAC (Error Detection and Correction) subsystem
4. Hardware-specific interfaces like IPMI
For basic monitoring, check your kernel messages for EDAC-related entries:
    # Check recent ECC events
    dmesg | grep -i -e "ECC" -e "memory" -e "corrected"

    # Persistent logging (Ubuntu/Debian)
    grep -i -e "EDAC" -e "ECC" /var/log/syslog
Install and configure the mcelog daemon for comprehensive error tracking:
    # Installation
    sudo apt install mcelog

    # Configuration (edit /etc/mcelog/mcelog.conf)
    daemon = yes
    filter = yes
    syslog = yes

    # Enable the service and start it now
    sudo systemctl enable --now mcelog
The Linux kernel's EDAC subsystem provides detailed memory error information. Check your modules:
    # List loaded EDAC modules (the name depends on your chipset, e.g. sb_edac)
    lsmod | grep edac

    # View per-channel corrected (ce) and uncorrected (ue) error counters
    grep . /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
    grep . /sys/devices/system/edac/mc/mc*/csrow*/ch*_ue_count
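The counters can also be totalled in a script. The following is a minimal sketch, assuming the usual EDAC sysfs layout in which each controller directory (mc0, mc1, ...) exposes `ce_count` and `ue_count` files. The function name `sum_edac_counter` and its optional root-directory argument are illustrative and not part of any tool.

```shell
#!/bin/bash
# Sum one EDAC counter (ce_count or ue_count) across all memory controllers.
# sum_edac_counter is a hypothetical helper. The optional second argument
# overrides the sysfs root, which makes the function easy to test.
# The ( ... ) body runs in a subshell, so its variables stay local.
sum_edac_counter() (
    counter="$1"
    root="${2:-/sys/devices/system/edac/mc}"
    total=0

    for file in "$root"/mc*/"$counter"; do
        [ -r "$file" ] || continue          # glob did not match, or unreadable
        read -r value < "$file" || true
        case "$value" in
            ''|*[!0-9]*) continue ;;        # ignore anything that is not a plain integer
        esac
        total=$((total + value))
    done

    echo "$total"
)

echo "corrected:   $(sum_edac_counter ce_count)"
echo "uncorrected: $(sum_edac_counter ue_count)"
```

On a machine without a loaded EDAC driver both totals come out as 0, so a zero here does not by itself prove that ECC reporting works.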
For production monitoring, create a custom check:
    #!/bin/bash
    # nagios_check_ecc.sh
    # -w matches whole words only, so "corrected" does not also count "uncorrected" lines
    CE_COUNT=$(grep -cw "corrected memory errors" /var/log/syslog)
    UE_COUNT=$(grep -cw "uncorrected memory errors" /var/log/syslog)

    if [ "$UE_COUNT" -gt 0 ]; then
        echo "CRITICAL: $UE_COUNT uncorrected ECC errors detected"
        exit 2
    elif [ "$CE_COUNT" -gt 10 ]; then
        echo "WARNING: $CE_COUNT corrected ECC errors detected"
        exit 1
    else
        echo "OK: No critical ECC errors detected"
        exit 0
    fi
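If you want the warning threshold to be configurable, the status decision can be separated from the counting. This is only a sketch: `ecc_status` and its argument order are made up for illustration, and where the counts come from (syslog, sysfs, or another tool) is left to the caller.

```shell
#!/bin/bash
# Map error counts to a Nagios-style status line and return code.
# ecc_status is a hypothetical helper, not part of any Nagios package.
# Usage: ecc_status UE_COUNT CE_COUNT [CE_WARN_THRESHOLD]
ecc_status() {
    if [ "$1" -gt 0 ]; then
        echo "CRITICAL: $1 uncorrected ECC errors detected"
        return 2
    elif [ "$2" -gt "${3:-10}" ]; then
        echo "WARNING: $2 corrected ECC errors detected"
        return 1
    fi
    echo "OK: No critical ECC errors detected"
    return 0
}

# Example: no uncorrected errors, 3 corrected errors, warn above 10
ecc_status 0 3 10
```

In a real plugin the last line would be followed by `exit $?` so that Nagios sees the return code.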
For Supermicro X9SCM-F motherboards with Sandy Bridge architecture:
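    # Load the EDAC driver that matches your memory controller.
    # ie31200_edac covers the Xeon E3-1200 family this board takes;
    # sb_edac targets Sandy Bridge-EP (Xeon E5) and i7core_edac targets Nehalem.
    modprobe ie31200_edac

    # Verify EDAC support
    edac-util --status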
For modern systems, consider rasdaemon for reliable error reporting:
    # Installation
    sudo apt install rasdaemon

    # Service management
    systemctl enable --now rasdaemon

    # Query errors
    ras-mc-ctl --errors
    ras-mc-ctl --summary
Create a cron job for regular ECC checks:
    # /etc/cron.d/ecc-monitor
    */5 * * * * root /usr/local/bin/ecc_monitor.sh 2>&1 | mail -s "ECC Report" admin@example.com
Sample monitoring script:
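    #!/bin/bash
    # ecc_monitor.sh
    LOG="/var/log/ecc_errors.log"
    THRESHOLD=5

    # Get the corrected error count by summing the per-controller sysfs counters
    # (edac-util's text output varies, so parsing it with awk is fragile)
    CE_COUNT=$(cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null | awk '{s += $1} END {print s + 0}')

    if [ "$CE_COUNT" -gt "$THRESHOLD" ]; then
        echo "[$(date)] WARNING: $CE_COUNT corrected ECC errors" >> "$LOG"
    fi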
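EDAC counters are cumulative since boot, so a fixed threshold keeps firing on every run once it has been crossed. One way around that is to remember the previous reading and report only the difference. The sketch below assumes a writable state file; the names `ce_delta` and `update_ce_state` and the state-file path in the usage note are hypothetical.

```shell
#!/bin/bash
# Report only the corrected errors that appeared since the previous run.
# ce_delta and update_ce_state are illustrative names, not part of edac-utils.

# Print how many new errors occurred between two cumulative readings.
# A current value below the previous one means the counter was reset
# (for example by a reboot), so everything counted so far is new.
ce_delta() (
    previous="$1"
    current="$2"
    if [ "$current" -lt "$previous" ]; then
        echo "$current"
    else
        echo $((current - previous))
    fi
)

# Read the stored count (0 if there is none), store the new one, print the delta.
update_ce_state() (
    state_file="$1"
    current="$2"
    previous=0
    if [ -r "$state_file" ]; then
        read -r previous < "$state_file" || true
    fi
    echo "$current" > "$state_file"
    ce_delta "${previous:-0}" "$current"
)
```

A monitoring script could then call something like `NEW_CE=$(update_ce_state /var/lib/ecc_monitor/ce_count "$CE_COUNT")` and compare `NEW_CE` against the threshold instead of the raw total.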
For long-term tracking, consider these tools:
1. Prometheus + EDAC exporter
2. Grafana dashboards with EDAC metrics
3. Custom scripts logging to time-series databases
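As a rough illustration of the custom-script option, the sketch below prints the EDAC sysfs counters in the Prometheus text exposition format, which a textfile-style collector can pick up. The metric names and function names are assumptions made for this example. Note that node_exporter ships its own EDAC collector, which may make a custom script unnecessary.

```shell
#!/bin/bash
# Emit EDAC counters in the Prometheus text exposition format.
# The metric names and both function names are assumptions for this sketch.

# Print one metric family: a TYPE line plus one sample per memory controller.
emit_edac_metric() (
    metric="$1"
    counter="$2"
    root="$3"
    echo "# TYPE $metric counter"
    for dir in "$root"/mc*; do
        file="$dir/$counter"
        [ -r "$file" ] || continue
        read -r value < "$file" || true
        # ${dir##*/} strips the path and leaves the controller name, e.g. mc0
        printf '%s{controller="%s"} %s\n' "$metric" "${dir##*/}" "$value"
    done
)

# Print corrected and uncorrected error metrics for every controller.
edac_to_prometheus() (
    root="${1:-/sys/devices/system/edac/mc}"
    emit_edac_metric ecc_corrected_errors_total ce_count "$root"
    emit_edac_metric ecc_uncorrected_errors_total ue_count "$root"
)

edac_to_prometheus
```

Redirecting the output into a `.prom` file inside the directory given to node_exporter's `--collector.textfile.directory` option is one way to get these values scraped.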
When working with server-grade hardware like Supermicro X9SCM-F boards, ECC (Error-Correcting Code) memory errors can be reported through multiple channels. The Linux kernel typically logs these events via the EDAC (Error Detection and Correction) subsystem.
Here are the common log patterns to look for in /var/log/syslog or dmesg:
    # Correctable Error (CE)
    EDAC MC0: CE row 0, channel 0, offset 0, grain 8, syndrome 0x0, label "": Corrected error

    # Uncorrectable Error (UE)
    EDAC MC0: UE row 0, channel 1, offset 0, grain 8, label "": Uncorrected error
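To turn such lines into something a script can act on, you can pull out the controller, error type, row and channel. This is a sketch that assumes exactly the message layout shown above; the wording differs between kernel versions, so the pattern may need adjusting. `parse_edac_line` is a hypothetical helper name.

```shell
#!/bin/bash
# Extract controller, error type, row and channel from an EDAC log line.
# parse_edac_line is a hypothetical helper; it assumes the
# "EDAC MCn: CE|UE row R, channel C, ..." layout and prints nothing
# for lines that do not match.
parse_edac_line() {
    printf '%s\n' "$1" | sed -nE \
        's/.*EDAC MC([0-9]+): (CE|UE) row ([0-9]+), channel ([0-9]+).*/mc=\1 type=\2 row=\3 channel=\4/p'
}

# Example with the uncorrectable-error sample line
parse_edac_line 'EDAC MC0: UE row 0, channel 1, offset 0, grain 8, label "": Uncorrected error'
# prints: mc=0 type=UE row=0 channel=1
```

Feeding it from the kernel log could look like `dmesg | while IFS= read -r line; do parse_edac_line "$line"; done`.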
The most straightforward approach is using the edac-utils package:
    sudo apt-get install edac-utils
    sudo /etc/init.d/edac start
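Configure log monitoring in /etc/edac/logging.conf:
    [Logging]
    log_ue = yes
    log_ce = yes
    log_syslog = yes
    log_threshold = 1
If your edac-utils version does not ship this file, the kernel exposes its own logging switches as edac_mc_log_ce and edac_mc_log_ue under /sys/module/edac_core/parameters/.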
Create a custom check script (/usr/lib/nagios/plugins/check_ecc):
    #!/bin/bash
    CE_COUNT=$(grep -c "Corrected error" /var/log/syslog)
    UE_COUNT=$(grep -c "Uncorrected error" /var/log/syslog)

    if [ "$UE_COUNT" -gt 0 ]; then
        echo "CRITICAL: $UE_COUNT uncorrected ECC errors detected!"
        exit 2
    elif [ "$CE_COUNT" -gt 10 ]; then
        echo "WARNING: $CE_COUNT corrected ECC errors detected"
        exit 1
    else
        echo "OK: No critical ECC errors detected"
        exit 0
    fi
For more detailed reporting, install and configure mcelog:
    sudo apt-get install mcelog
    sudo service mcelog start
Configure triggers in /etc/mcelog/mcelog.conf:
    [trigger]
    directory = /etc/mcelog/triggers

    [filter]
    memory = yes
Add this to your /etc/rsyslog.conf to trigger emails for critical errors:
:msg, contains, "Uncorrected error" ^/path/to/alert_script.sh
Example alert script (/path/to/alert_script.sh):
    #!/bin/bash
    # rsyslog's ^program action passes the log message as the first argument, not on stdin
    msg="$1"
    echo "$msg" | mail -s "CRITICAL ECC ERROR DETECTED" admin@example.com