How to Diagnose and Locate RAM Modules with Correctable ECC Errors (CE) in Linux Kernel Logs


When analyzing memory errors in Linux systems, the kernel's EDAC (Error Detection and Correction) subsystem logs detailed information about memory faults. In your case, the log shows:

kernel: [13291329.657499] EDAC MC0: 48 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

This indicates a Correctable Error (CE) occurring in:

  • Memory Controller 0 (MC0)
  • Channel 2
  • DIMM slot 0

The challenge lies in translating the EDAC channel/slot notation to physical memory module locations. For HP ProLiant servers, the mapping isn't always straightforward.

First, confirm the error location using sysfs:

grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:144648966
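
On newer kernels the EDAC driver may also expose per-DIMM nodes under mcX/dimmY with human-readable labels and counters; if those files exist on your system, they are often easier to read than the csrow layout. A minimal loop, guarded in case the nodes are absent:

for d in /sys/devices/system/edac/mc/mc*/dimm*; do
  [ -d "$d" ] || continue                        # skip if this kernel has no per-DIMM nodes
  echo "$d: label=$(cat "$d/dimm_label" 2>/dev/null)" \
       "location=$(cat "$d/dimm_location" 2>/dev/null)" \
       "ce=$(cat "$d/dimm_ce_count" 2>/dev/null)"
done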

HP servers use a specific numbering scheme that differs from EDAC's channel-based reporting. Here's how to cross-reference:

dmidecode -t memory | grep -A5 "Locator: PROC"
Locator: PROC 1 DIMM 2A
    Size: 4096 MB
    Type: DDR3
    Speed: 1333 MHz
    Manufacturer: HP
    Part Number: 647649-071

For HP ProLiant DL180 G6 servers, use this mapping table:

EDAC Channel | HP Physical Slot
-------------------------------
Channel 0    | DIMM 1D (CPU1) / DIMM 1D (CPU2)
Channel 1    | DIMM 2A (CPU1) / DIMM 2A (CPU2)
Channel 2    | DIMM 3E (CPU1) / DIMM 3E (CPU2)
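
If the edac-utils package is available for your distribution, it can also push human-readable slot labels into sysfs so that EDAC reports the silkscreen name directly. Whether the bundled labels database covers a DL180 G6 is something to verify, so treat this as a sketch:

yum install edac-utils -y        # RHEL/CentOS (apt-get install edac-utils on Debian/Ubuntu)
edac-ctl --print-labels          # show the labels known for this mainboard, if any
edac-ctl --register-labels       # copy those labels into the sysfs *_label files
edac-util -v                     # per-controller error report using the labels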

Create this bash script to automate the correlation:

#!/bin/bash
# Get EDAC error counts
echo "EDAC Error Counts:"
grep -H "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

# Get physical DIMM info
echo -e "\nPhysical DIMM Information:"
dmidecode -t memory | awk '/Locator: PROC|Size|Part Number/ {print} /^$/ {print "---"}'

Once you have correlated the error with a physical module:

  1. Reset the error counters so new events stand out: echo 1 | sudo tee /sys/devices/system/edac/mc/mc*/reset_counters
  2. Monitor for new errors (a simple approach is sketched below)
  3. Physically reseat the suspect DIMM
  4. Run memory tests: memtester 512M 1
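
For step 2, a simple watch on the sysfs counters is usually enough:

watch -n 60 'grep -H "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count'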

For production systems, implement continuous monitoring:

#!/bin/bash
# /etc/cron.hourly/memcheck - hourly CE check
ERRORS=$(grep -c "CE error" /var/log/kern.log)
if [ "$ERRORS" -gt 0 ]; then
    echo "Memory errors detected" | mail -s "EDAC Alert" admin@example.com
fi
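
Note that the hourly check above counts every CE line ever written to /var/log/kern.log, so it will keep re-alerting on old events. A sketch of a stateful variant that only mails when the count grows (the state-file path is an arbitrary choice):

#!/bin/bash
# /etc/cron.hourly/memcheck - alert only when the CE count increases
STATE=/var/tmp/edac_ce_count.last                 # arbitrary state-file location
CURRENT=$(grep -c "CE error" /var/log/kern.log)
PREVIOUS=$(cat "$STATE" 2>/dev/null || echo 0)
if [ "$CURRENT" -gt "$PREVIOUS" ]; then
    echo "CE count rose from $PREVIOUS to $CURRENT" | mail -s "EDAC Alert" admin@example.com
fi
echo "$CURRENT" > "$STATE"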

The remaining difficulty is the vendor-specific channel numbering. On HP ProLiant servers such as the DL180 G6, EDAC's channel/slot notation does not map one-to-one onto the silkscreen labels, so it helps to dump both views side by side.

Here's a script to help map EDAC channels to physical slots:

#!/bin/bash
# Get memory controller info
echo "Memory Controller Layout:"
lspci | grep -i "memory controller"

# Get detailed channel mapping
echo -e "\nChannel to DIMM mapping:"
for mc in /sys/devices/system/edac/mc/mc*; do
  echo "MC $(basename "$mc"):"
  for csrow in "$mc"/csrow*; do
    # ch*_ce_count are plain files (not directories) inside each csrow
    for ce_file in "$csrow"/ch*_ce_count; do
      count=$(cat "$ce_file")
      [ "$count" -gt 0 ] && echo "  $(basename "$csrow")/$(basename "$ce_file"): $count"
    done
  done
done

# Cross-reference with dmidecode
echo -e "\nPhysical DIMM slots:"
dmidecode -t memory | awk '/Locator: PROC/ {print $2, $3, $4, $5}'

On HP ProLiant servers, the slot naming convention follows:

PROC X DIMM YZ
Where:
X = Processor socket (1 or 2)
Y = Channel number (1-6)
Z = Slot position (A-F)

For your case with Channel#2_DIMM#0, this likely maps to:

  • Channel#2: EDAC numbers channels from 0 while HP numbers them from 1, so EDAC channel 2 corresponds to HP channel 3 (the DIMM 3E slots in the table above)
  • DIMM#0: the first slot on that channel
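
If you want to script that translation, here is a minimal sketch based on the DL180 G6 table above; the labels are taken from that table and should be verified against the server's maintenance guide:

#!/bin/bash
# edac2hp.sh <edac_channel> - print the HP slot label per the DL180 G6 table above
# Example: ./edac2hp.sh 2   ->   DIMM 3E
case "$1" in
  0) echo "DIMM 1D" ;;
  1) echo "DIMM 2A" ;;
  2) echo "DIMM 3E" ;;
  *) echo "unknown EDAC channel: $1" >&2; exit 1 ;;
esac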

To confirm the physical location:

  1. Check the server's hardware maintenance guide for the exact slot numbering
  2. Physically inspect the DIMM slots - many server boards have per-slot fault LEDs
  3. Use HP's management tools:
    # For RedHat/CentOS:
    yum install hp-health -y
    hpasmcli -s "show dimm"

For ongoing monitoring, consider this Nagios check script:

#!/bin/bash
WARNING=1
CRITICAL=10

# Sum the CE counts across all controllers/channels (0 if none are readable)
CE_COUNT=$(cat /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count 2>/dev/null | awk '{sum+=$1} END {print sum+0}')

if [ $CE_COUNT -ge $CRITICAL ]; then
  echo "CRITICAL: $CE_COUNT memory CE errors detected"
  exit 2
elif [ $CE_COUNT -ge $WARNING ]; then
  echo "WARNING: $CE_COUNT memory CE errors detected"
  exit 1
else
  echo "OK: $CE_COUNT memory CE errors"
  exit 0
fi
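
To hook this into Nagios, place the script on the monitored host and expose it through NRPE; the path and command name below are arbitrary examples:

# /etc/nagios/nrpe.cfg on the monitored host
command[check_edac_ce]=/usr/local/lib/nagios/plugins/check_edac_ce.sh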