Monitoring ECC Memory Errors in Linux: Best Practices for Alerts and Logs


When working with ECC (Error-Correcting Code) memory in Linux systems, error reporting primarily comes through these channels:

1. Kernel ring buffer (dmesg)
2. mcelog daemon processing
3. EDAC (Error Detection and Correction) subsystem
4. Hardware-specific interfaces like IPMI

For basic monitoring, check your kernel messages for EDAC-related entries:

# Check recent ECC events (matching on "memory" alone pulls in unrelated boot messages)
dmesg | grep -i -e "EDAC" -e "ECC" -e "corrected"

# Persistent logging (Ubuntu/Debian)
grep -i -e "EDAC" -e "ECC" /var/log/syslog

Where the package is still available (newer Debian and Ubuntu releases favor rasdaemon instead), install and configure the mcelog daemon for comprehensive error tracking:

# Installation
sudo apt install mcelog

# Configuration (edit /etc/mcelog/mcelog.conf)
daemon = yes
filter = yes
syslog = yes

# Enable at boot and start now
sudo systemctl enable --now mcelog

The Linux kernel's EDAC subsystem provides detailed memory error information. Check your modules:

# List loaded EDAC modules
lsmod | grep edac

# View EDAC counters (these sysfs paths are the same for every EDAC driver)
# Corrected errors are counted per channel; uncorrected errors per csrow
grep . /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
grep . /sys/devices/system/edac/mc/mc*/csrow*/ue_count
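
For scripting, the per-controller totals in mc*/ce_count and mc*/ue_count are easier to work with than the per-csrow files. The following is a minimal sketch, not a definitive implementation: the function name sum_edac_counts is made up for this example, and the optional second argument (the sysfs root) exists only so the function can be pointed at a test directory.

```shell
#!/bin/bash
# Sum one kind of EDAC counter ("ce" or "ue") across all memory controllers.
# $1 = counter kind, $2 = sysfs root (optional, defaults to the real path).
sum_edac_counts() {
    local kind="$1"
    local root="${2:-/sys/devices/system/edac/mc}"
    local total=0
    local file value

    for file in "$root"/mc*/"${kind}_count"; do
        [ -r "$file" ] || continue          # glob matched nothing, or unreadable
        value=$(cat "$file" 2>/dev/null)
        case "$value" in
            ''|*[!0-9]*) continue ;;        # skip anything that is not a plain integer
        esac
        total=$((total + value))
    done
    echo "$total"
}
```

Calling sum_edac_counts ce and sum_edac_counts ue prints the two totals. On a machine without a loaded EDAC driver both print 0, so a real check would probably want to treat a missing /sys/devices/system/edac/mc directory as an error of its own.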

For production monitoring, create a custom check:

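#!/bin/bash
# nagios_check_ecc.sh
# Adjust the patterns to the messages your kernel or mcelog actually emits.

LOG=/var/log/syslog

if [ ! -r "$LOG" ]; then
    echo "UNKNOWN: cannot read $LOG"
    exit 3
fi

# -w stops "corrected" from also matching inside "uncorrected"
CE_COUNT=$(grep -cw "corrected memory errors" "$LOG")
UE_COUNT=$(grep -c "uncorrected memory errors" "$LOG")

if [ "$UE_COUNT" -gt 0 ]; then
    echo "CRITICAL: $UE_COUNT uncorrected ECC errors detected"
    exit 2
elif [ "$CE_COUNT" -gt 10 ]; then
    echo "WARNING: $CE_COUNT corrected ECC errors detected"
    exit 1
else
    echo "OK: No critical ECC errors detected"
    exit 0
fi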

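For Supermicro X9SCM-F motherboards (Xeon E3-1200, Sandy Bridge), the kernel usually loads the matching EDAC driver on its own. If lsmod shows none, load it manually:

# ie31200_edac handles the Xeon E3-1200 memory controller;
# sb_edac is for Sandy Bridge-EP (Xeon E5) and i7core_edac for Nehalem
sudo modprobe ie31200_edac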

# Verify EDAC support
edac-util --status

For modern systems, consider rasdaemon for reliable error reporting:

# Installation
sudo apt install rasdaemon

# Service management
sudo systemctl enable --now rasdaemon

# Query errors (reading the rasdaemon database typically requires root)
sudo ras-mc-ctl --errors
sudo ras-mc-ctl --summary
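
ras-mc-ctl also has an --error-count option that prints one row per DIMM label with its CE and UE counts. The sketch below totals those rows for use in a check script. It assumes a header line followed by rows whose last two whitespace-separated fields are the CE and UE numbers; verify that against the output of your rasdaemon version before relying on it. The function name summarize_error_count is hypothetical.

```shell
#!/bin/bash
# Total the CE and UE columns of "ras-mc-ctl --error-count" output read from stdin.
# Assumed input layout (check your version): "<label> <CE> <UE>" per row, after a header.
summarize_error_count() {
    awk '
        NR == 1 && $NF == "UE" { next }     # skip the header row
        NF >= 3 && $(NF-1) ~ /^[0-9]+$/ && $NF ~ /^[0-9]+$/ {
            ce += $(NF-1)                   # second-to-last field: corrected
            ue += $NF                       # last field: uncorrected
        }
        END { printf "ce=%d ue=%d\n", ce, ue }
    '
}
```

Typical use would be ras-mc-ctl --error-count | summarize_error_count, which prints a single line such as ce=3 ue=1 that is easy to compare against thresholds.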

Create a cron job for regular ECC checks:

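# /etc/cron.d/ecc-monitor
# cron mails whatever the job prints to MAILTO, so quiet runs send nothing
MAILTO=admin@example.com
*/5 * * * * root /usr/local/bin/ecc_monitor.sh

Sample monitoring script:

#!/bin/bash
# ecc_monitor.sh

LOG="/var/log/ecc_errors.log"
THRESHOLD=5

# Sum the corrected-error counters of all memory controllers from sysfs
# (cumulative since the EDAC driver was loaded)
CE_COUNT=$(cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null | awk '{ sum += $1 } END { print sum + 0 }')

if [ "$CE_COUNT" -gt "$THRESHOLD" ]; then
    # tee appends to the log and also prints the line so cron can mail it
    echo "[$(date)] WARNING: $CE_COUNT corrected ECC errors" | tee -a "$LOG"
fi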
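
Because EDAC counters are cumulative, a fixed threshold keeps firing on every run once it has been crossed. One way around that is to alert on the increase since the previous run. The sketch below keeps the last seen value in a state file; the function name counter_delta and the state file location are assumptions made for this example.

```shell
#!/bin/bash
# Print how much a cumulative counter grew since the previous call.
# $1 = current counter value, $2 = path of the state file to read and update.
counter_delta() {
    local current="$1"
    local state_file="$2"
    local previous=0

    if [ -r "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    case "$previous" in
        ''|*[!0-9]*) previous=0 ;;          # empty or corrupt state: start from zero
    esac

    printf '%s\n' "$current" > "$state_file"

    if [ "$current" -lt "$previous" ]; then
        # Counter went backwards: a reboot or driver reload reset it,
        # so everything counted so far is new.
        echo "$current"
    else
        echo $((current - previous))
    fi
}
```

In the monitoring script this could be used as NEW_ERRORS=$(counter_delta "$CE_COUNT" /var/tmp/ecc_ce.state), with the threshold then applied to NEW_ERRORS instead of the raw count.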

For long-term tracking, consider these tools:

1. Prometheus with node_exporter, whose edac collector exports the sysfs error counters
2. Grafana dashboards with EDAC metrics
3. Custom scripts logging to time-series databases

On server-grade hardware such as the Supermicro X9SCM-F, the kernel's EDAC subsystem writes ECC events to the kernel log, from where they reach syslog.

Typical patterns to look for in /var/log/syslog or dmesg are shown below. The exact wording varies with kernel version and EDAC driver, so compare against a real message from your own system:

# Correctable Error (CE)
EDAC MC0: CE row 0, channel 0, offset 0, grain 8, syndrome 0x0, label "": Corrected error

# Uncorrectable Error (UE)
EDAC MC0: UE row 0, channel 1, offset 0, grain 8, label "": Uncorrected error
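
To see which memory controller the events come from, the log lines can be tallied per controller and error type. This is a rough sketch that assumes the "EDAC MC<n>: CE ..." / "EDAC MC<n>: UE ..." layout of the two sample lines; kernels that word the message differently need a different pattern. The function name count_edac_events is hypothetical.

```shell
#!/bin/bash
# Count EDAC CE/UE log lines per memory controller in the given log file.
# Prints lines of the form "<controller> <type> <count>", sorted.
count_edac_events() {
    local log_file="$1"
    awk '
        # Find "EDAC MC<n>: CE " or "EDAC MC<n>: UE " anywhere in the line.
        match($0, /EDAC MC[0-9]+: (CE|UE) /) {
            hit = substr($0, RSTART, RLENGTH)   # e.g. "EDAC MC0: CE "
            split(hit, part, /[ :]+/)           # part[2] = "MC0", part[3] = "CE"
            count[part[2] " " part[3]]++
        }
        END { for (key in count) print key, count[key] }
    ' "$log_file" | sort
}
```

Running count_edac_events /var/log/syslog prints one line per controller and error type, for example "MC0 CE 2", and prints nothing when the log holds no matching lines.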

The most straightforward approach is using the edac-utils package:

sudo apt-get install edac-utils
sudo /etc/init.d/edac start

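EDAC logging is controlled by edac_core module parameters rather than by a file under /etc/edac. Enable kernel-log messages for both error types at runtime:

# 1 = write the event to the kernel log, 0 = stay silent
echo 1 | sudo tee /sys/module/edac_core/parameters/edac_mc_log_ce
echo 1 | sudo tee /sys/module/edac_core/parameters/edac_mc_log_ue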

Create a custom check script (/usr/lib/nagios/plugins/check_ecc):

#!/bin/bash
# Patterns match the EDAC log lines shown above. grep is case-sensitive here,
# so "Corrected error" does not also count "Uncorrected error" lines.
CE_COUNT=$(grep -c "Corrected error" /var/log/syslog)
UE_COUNT=$(grep -c "Uncorrected error" /var/log/syslog)

if [ "$UE_COUNT" -gt 0 ]; then
  echo "CRITICAL: $UE_COUNT uncorrected ECC errors detected!"
  exit 2
elif [ "$CE_COUNT" -gt 10 ]; then
  echo "WARNING: $CE_COUNT corrected ECC errors detected"
  exit 1
else
  echo "OK: No critical ECC errors detected"
  exit 0
fi

For more detailed reporting, install and configure mcelog:

sudo apt-get install mcelog
sudo service mcelog start

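Configure triggers in /etc/mcelog/mcelog.conf (section and option names differ between mcelog versions, so check the comments in the file your package ships):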

[trigger]
directory = /etc/mcelog/triggers

[filter]
memory = yes

Add this to your /etc/rsyslog.conf to trigger emails for critical errors:

:msg, contains, "Uncorrected error" ^/path/to/alert_script.sh

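Example alert script (/path/to/alert_script.sh). rsyslog's "^program" action passes the message as the first command-line argument, not on stdin:

#!/bin/bash
msg="$1"
printf '%s\n' "$msg" | mail -s "CRITICAL ECC ERROR DETECTED" admin@example.com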