How to Diagnose and Locate RAM Modules with Correctable ECC Errors (CE) in Linux Kernel Logs


When analyzing memory errors in Linux systems, the kernel's EDAC (Error Detection and Correction) subsystem logs detailed information about memory faults. In your case, the log shows:

kernel: [13291329.657499] EDAC MC0: 48 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

This indicates a Correctable Error (CE) occurring in:

  • Memory Controller 0 (MC0)
  • Channel 2
  • DIMM slot 0

The challenge lies in translating the EDAC channel/slot notation to physical memory module locations. For HP ProLiant servers, the mapping isn't always straightforward.

First, confirm the error location using sysfs:

grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:144648966
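
Before cross-referencing by hand, it is worth checking whether the platform's EDAC driver already populates the DIMM label files in the same sysfs tree; on some systems these hold the silkscreen name directly, on others they are empty:

# Print any DIMM labels the EDAC driver has filled in (files may be empty)
grep -H . /sys/devices/system/edac/mc/mc*/csrow*/ch*_dimm_label 2>/dev/null
# Kernels with the newer per-DIMM sysfs layout expose labels here instead
grep -H . /sys/devices/system/edac/mc/mc*/dimm*/dimm_label 2>/dev/null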

HP servers use a specific numbering scheme that differs from EDAC's channel-based reporting. Here's how to cross-reference:

dmidecode -t memory | grep -A5 "Locator: PROC"
Locator: PROC 1 DIMM 2A
    Size: 4096 MB
    Type: DDR3
    Speed: 1333 MHz
    Manufacturer: HP
    Part Number: 647649-071

For HP ProLiant DL180 G6 servers, use this mapping table:

EDAC Channel | HP Physical Slot
-------------------------------
Channel 0    | DIMM 1D (CPU1) / DIMM 1D (CPU2)
Channel 1    | DIMM 2A (CPU1) / DIMM 2A (CPU2)
Channel 2    | DIMM 3E (CPU1) / DIMM 3E (CPU2)
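
If this table matches your generation of hardware (verify it against the server's maintenance guide, since the layout differs between ProLiant models), a small helper can do the translation automatically; the sketch below simply encodes the table above and is only as accurate as that table:

#!/bin/bash
# Sketch: translate an EDAC channel number into the HP slot label,
# using the DL180 G6 table above. Verify against your hardware guide.
edac_to_hp_slot() {
    case "$1" in
        0) echo "DIMM 1D" ;;
        1) echo "DIMM 2A" ;;
        2) echo "DIMM 3E" ;;
        *) echo "unknown channel $1" ;;
    esac
}
edac_to_hp_slot 2    # prints "DIMM 3E" for the error in this question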

Create this bash script to automate the correlation:

#!/bin/bash
# Get EDAC error counts
echo "EDAC Error Counts:"
grep -H "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

# Get physical DIMM info
echo -e "\nPhysical DIMM Information:"
dmidecode -t memory | awk '/Locator: PROC|Size|Part Number/ {print} /^$/ {print "---"}'

Once the suspect DIMM has been identified, a typical remediation sequence is (a monitoring sketch follows this list):

  1. Reset the error counters: echo 1 | sudo tee /sys/devices/system/edac/mc/mc*/reset_counters (the per-channel ce_count files themselves are read-only)
  2. Monitor for new errors
  3. Physically reseat the suspect DIMM
  4. Run memory tests: memtester 512M 1
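
A minimal way to carry out steps 1 and 2, assuming the standard EDAC sysfs layout shown earlier:

# Zero the EDAC counters, then poll for new correctable errors every minute
echo 1 | sudo tee /sys/devices/system/edac/mc/mc*/reset_counters
watch -n 60 'grep -H "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count'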

For production systems, implement continuous monitoring:

#!/bin/bash
# /etc/cron.hourly/memcheck
ERRORS=$(grep -c "CE error" /var/log/kern.log)
if [ "$ERRORS" -gt 0 ]; then
    echo "Memory errors detected" | mail -s "EDAC Alert" admin@example.com
fi
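
If the edac-utils package is installed, edac-util can report the same counters without parsing sysfs by hand, and the companion edac-ctl tool can register a vendor DIMM label database where one exists:

edac-util --status          # confirm an EDAC driver is loaded
edac-util --report=ce       # correctable-error totals per controller/csrow/channel
edac-ctl --register-labels  # load DIMM labels from the bundled database, if any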

When you see an EDAC error like this in your kernel logs:

kernel: [13291329.657499] EDAC MC0: 48 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

This indicates a Correctable Error (CE) occurred in memory channel 2, DIMM slot 0. The key components are:

  • MC0: Memory Controller 0
  • Channel#2: Memory channel 2
  • DIMM#0: Physical slot 0 on that channel

To confirm which physical DIMM is affected, we need to cross-reference several system sources:

# Check CE error counters
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

# Get physical memory layout
dmidecode -t memory

The challenge comes from vendor-specific channel numbering schemes. For HP ProLiant servers like DL180 G6, the mapping isn't always straightforward.

Here's a script to help map EDAC channels to physical slots:

#!/bin/bash
# Get memory controller info
echo "Memory Controller Layout:"
lspci | grep -i "memory controller"

# Get detailed channel mapping (ch*_ce_count entries are plain files, not directories)
echo -e "\nChannel to DIMM mapping:"
for mc in /sys/devices/system/edac/mc/mc*; do
  echo "MC $(basename "$mc"):"
  for csrow in "$mc"/csrow*; do
    for ce_file in "$csrow"/ch*_ce_count; do
      [ -f "$ce_file" ] || continue
      count=$(cat "$ce_file")
      [ "$count" -gt 0 ] && \
        echo "  $(basename "$csrow")/$(basename "$ce_file" _ce_count) errors: $count"
    done
  done
done

# Cross-reference with dmidecode (requires root)
echo -e "\nPhysical DIMM slots:"
dmidecode -t memory | awk '/Locator: PROC/ {print $2,$3,$4,$5}'

On HP ProLiant servers, the slot naming convention follows:

PROC X DIMM YZ
Where:
X = Processor socket (1 or 2)
Y = Channel number (1-6)
Z = Slot position (A-F)
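
As a quick illustration of that convention, here is a parsing sketch; the locator string is just the example value from the dmidecode output above:

# Split an HP locator string into its parts (single-digit channel assumed)
locator="PROC 1 DIMM 2A"
proc=$(echo "$locator" | awk '{print $2}')
dimm=$(echo "$locator" | awk '{print $4}')
channel="${dimm%?}"       # strip trailing letter -> channel number
slot="${dimm#$channel}"   # strip leading number  -> slot letter
echo "Processor $proc, channel $channel, slot $slot"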

For your case with Channel#2_DIMM#0, this likely maps to:

  • Channel 2 (HP numbering starts at 1, so EDAC channel 2 = HP channel 3)
  • DIMM 0 = First slot in channel (typically marked with letter A)

To pinpoint the exact physical module (an additional cross-check via IPMI follows this list):

  1. Check the server's hardware maintenance guide for exact slot numbering
  2. Physically inspect DIMMs - most have LED indicators for errors
  3. Use HP's proprietary tools:
    # For RedHat/CentOS:
    yum install hp-health -y
    hpasmcli -s "show dimm"
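
Another independent cross-check, assuming the BMC/iLO logs ECC events and ipmitool is installed, is the IPMI system event log, which often names the failing DIMM by its physical label:

# List BMC event-log entries and filter for memory/ECC records
# (exact wording of the entries varies by firmware)
sudo ipmitool sel elist | grep -iE "ecc|memory|dimm"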

For ongoing monitoring, consider this Nagios check script:

#!/bin/bash
# Nagios-style check: sum correctable-error counters across all EDAC channels
WARNING=1
CRITICAL=10

CE_COUNT=$(cat /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count 2>/dev/null | \
           awk '{sum+=$1} END {print sum+0}')

if [ "$CE_COUNT" -ge "$CRITICAL" ]; then
  echo "CRITICAL: $CE_COUNT memory CE errors detected"
  exit 2
elif [ "$CE_COUNT" -ge "$WARNING" ]; then
  echo "WARNING: $CE_COUNT memory CE errors detected"
  exit 1
else
  echo "OK: $CE_COUNT memory CE errors"
  exit 0
fi
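
To wire the check into Nagios via NRPE, something along these lines works; the plugin path, command name, and service name are examples and vary by distribution:

# Install the check and register it with NRPE (paths and names are examples)
sudo install -m 0755 check_edac_ce.sh /usr/lib64/nagios/plugins/check_edac_ce
echo 'command[check_edac_ce]=/usr/lib64/nagios/plugins/check_edac_ce' | \
    sudo tee -a /etc/nagios/nrpe.cfg
sudo systemctl restart nrpe    # service may be named nagios-nrpe-server instead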