When analyzing memory errors in Linux systems, the kernel's EDAC (Error Detection and Correction) subsystem logs detailed information about memory faults. In your case, the log shows:
kernel: [13291329.657499] EDAC MC0: 48 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
This indicates a Correctable Error (CE) occurring in:
- Memory Controller 0 (MC0)
- Channel 2
- DIMM slot 0
The challenge lies in translating the EDAC channel/slot notation to physical memory module locations. For HP ProLiant servers, the mapping isn't always straightforward.
First, confirm the error location using sysfs:
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:144648966
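The sysfs path itself encodes the same coordinates as the kernel log line (mc0 / csrow0 / ch2), which a script can pull apart rather than parse the log. A minimal sketch, using the example path above:

```shell
#!/bin/bash
# Decompose a ce_count sysfs path (format as shown above) into its parts.
path="/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count"
file=$(basename "$path")                          # ch2_ce_count
csrow=$(basename "$(dirname "$path")")            # csrow0
mc=$(basename "$(dirname "$(dirname "$path")")")  # mc0
channel=${file%%_*}                               # ch2
echo "$mc $csrow $channel"
```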
HP servers use a specific numbering scheme that differs from EDAC's channel-based reporting. Here's how to cross-reference:
dmidecode -t memory | grep -A5 "Locator: PROC"
Locator: PROC 1 DIMM 2A
Size: 4096 MB
Type: DDR3
Speed: 1333 MHz
Manufacturer: HP
Part Number: 647649-071
For HP ProLiant DL180 G6 servers, use this mapping table:
EDAC Channel | HP Physical Slot
-------------------------------
Channel 0 | DIMM 1D (CPU1) / DIMM 1D (CPU2)
Channel 1 | DIMM 2A (CPU1) / DIMM 2A (CPU2)
Channel 2 | DIMM 3E (CPU1) / DIMM 3E (CPU2)
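The table can be encoded as a small lookup for use in scripts. This sketch simply hard-codes the DL180 G6 mapping above; verify the entries against your own model's maintenance guide before relying on them:

```shell
#!/bin/bash
# EDAC channel -> HP physical slot, per the DL180 G6 table above.
# Adjust these entries for other server models.
declare -A HP_SLOT=(
  [0]="DIMM 1D"
  [1]="DIMM 2A"
  [2]="DIMM 3E"
)
edac_channel=2
echo "EDAC channel $edac_channel -> HP slot ${HP_SLOT[$edac_channel]}"
```

(Associative arrays require bash 4 or later.)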
Create this bash script to automate the correlation:
#!/bin/bash
# Get EDAC error counts
echo "EDAC Error Counts:"
grep -H "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
# Get physical DIMM info
echo -e "\nPhysical DIMM Information:"
dmidecode -t memory | awk '/Locator: PROC|Size|Part Number/ {print} /^$/ {print "---"}'
After identifying the suspect DIMM:
- Reset the error counters (the per-channel ce_count files are read-only; write to the controller's reset_counters attribute instead):
echo 1 | sudo tee /sys/devices/system/edac/mc/mc*/reset_counters
- Monitor for new errors
- Physically reseat the suspect DIMM
- Run memory tests:
memtester 512M 1
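Since memtester locks the requested amount of RAM, sizing the test region from what is actually available avoids pushing a live system into swap. A sketch (the 256 MB headroom is an arbitrary choice, not a memtester requirement):

```shell
#!/bin/bash
# Compute a memtester size that leaves ~256 MB of headroom,
# based on MemAvailable from /proc/meminfo (in kB).
FREE_MB=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
TEST_MB=$(( FREE_MB > 512 ? FREE_MB - 256 : 256 ))
echo "Run: memtester ${TEST_MB}M 1"
```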
For production systems, implement continuous monitoring:
#!/bin/bash
# /etc/cron.hourly/memcheck
# Track the count from the previous run so we only alert on NEW errors,
# not on every hour after the first one is logged.
STATE=/var/tmp/edac_ce_count
CURRENT=$(grep -c "CE error" /var/log/kern.log)
PREVIOUS=$(cat "$STATE" 2>/dev/null || echo 0)
if [ "$CURRENT" -gt "$PREVIOUS" ]; then
echo "New memory CE errors detected ($CURRENT total logged)" | mail -s "EDAC Alert" admin@example.com
fi
echo "$CURRENT" > "$STATE"
To confirm which physical DIMM is affected, we need to cross-reference several system sources:
# Check CE error counters
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
# Get physical memory layout
dmidecode -t memory
Here's a script to help map EDAC channels to physical slots:
#!/bin/bash
# Get memory controller info
echo "Memory Controller Layout:"
lspci | grep -i "memory controller"
# Get detailed channel counts. Note: ch*_ce_count are plain files inside
# each csrow directory, not subdirectories, so we glob the files directly.
echo -e "\nChannel to DIMM mapping:"
for mc in /sys/devices/system/edac/mc/mc*; do
    echo "MC $(basename "$mc"):"
    for csrow in "$mc"/csrow*; do
        for count_file in "$csrow"/ch*_ce_count; do
            count=$(cat "$count_file")
            [ "$count" -gt 0 ] && echo "  $(basename "$csrow") $(basename "$count_file"): $count"
        done
    done
done
# Cross-reference with dmidecode ($2-$5 covers "PROC 1 DIMM 2A")
echo -e "\nPhysical DIMM slots:"
dmidecode -t memory | awk '/Locator: PROC/ {print $2,$3,$4,$5}'
On HP ProLiant servers, the slot naming convention follows:
PROC X DIMM YZ
Where:
X = Processor socket (1 or 2)
Y = Channel number (1-6)
Z = Slot position (A-F)
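Assuming that convention, a locator string from dmidecode can be split into its components in shell. The `locator` value here is the example from the dmidecode output earlier:

```shell
#!/bin/bash
# Split "PROC X DIMM YZ" into socket, channel number, and slot letter,
# following the naming convention described above.
locator="PROC 1 DIMM 2A"
read -r _ socket _ slot <<< "$locator"
number=${slot%[A-F]}      # strip trailing letter -> channel number
letter=${slot#"$number"}  # what remains -> slot position
echo "socket=$socket channel=$number position=$letter"
```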
For your case with Channel#2_DIMM#0, this likely maps to:
- Channel 2 (HP numbering starts at 1, so EDAC channel 2 = HP channel 3)
- DIMM 0 = First slot in channel (typically marked with letter A)
- Check the server's hardware maintenance guide for exact slot numbering
- Physically inspect DIMMs - most have LED indicators for errors
- Use HP's proprietary tools:
# For RedHat/CentOS:
yum install -y hp-health
hpasmcli -s "show dimm"
For ongoing monitoring, consider this Nagios check script:
#!/bin/bash
WARNING=1
CRITICAL=10
# Each ch*_ce_count file contains a bare number, so sum the first field.
# (grep -h would strip filenames, leaving nothing for awk -F: to split on.)
CE_COUNT=$(cat /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count 2>/dev/null | awk '{sum+=$1} END {print sum+0}')
if [ "$CE_COUNT" -ge "$CRITICAL" ]; then
echo "CRITICAL: $CE_COUNT memory CE errors detected"
exit 2
elif [ "$CE_COUNT" -ge "$WARNING" ]; then
echo "WARNING: $CE_COUNT memory CE errors detected"
exit 1
else
echo "OK: $CE_COUNT memory CE errors"
exit 0
fi