How to Diagnose and Replace Faulty RAM Using MCE Logs in Linux Servers

When your Linux server reports Machine Check Exceptions (MCE) in kernel logs, it's crucial to decode the information properly. In your case, we have this critical information:

CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090
TSC 0 ADDR 4a0d75900 MISC 21405cdc86 PROCESSOR 0:206d7 TIME 1428957562 SOCKET 0 APIC 0
EDAC MC0: 1 CE memory read error

Key components to analyze:

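Bank 5: The MCA (Machine Check Architecture) bank that reported the error; it is not a DIMM slot or memory channel number
ADDR 4a0d75900: Physical memory address where the error occurred
SOCKET 0: Points to CPU socket 0 (first processor)
EDAC MC0: Memory controller 0 as named by the EDAC (Error Detection and Correction) driver; CE means Correctable Error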
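
The 64-bit value after "Bank 5:" is the bank's status register, and most of the summary above can be read out of it. The following is a minimal sketch, assuming the IA32_MCi_STATUS bit layout described in the Intel Software Developer's Manual (VAL in bit 63, UC in bit 61, ADDRV in bit 58, a corrected-error count in bits 52:38 on processors that implement it, and the MCA error code in bits 15:0). The function and dictionary names are illustrative, not part of any tool.

```python
import re

# Sample line copied from the log above.
LOG_LINE = "CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090"

# Request types for the MMM field of a memory controller error code.
MEM_REQUESTS = {0: "generic", 1: "memory read", 2: "memory write",
                3: "address/command", 4: "memory scrubbing"}


def decode_mce(line):
    """Extract the bank number and decode the main status register bits."""
    match = re.search(r"Bank (\d+): ([0-9a-fA-F]{16})", line)
    if match is None:
        raise ValueError("no 'Bank N: <status>' field found")
    status = int(match.group(2), 16)
    code = status & 0xFFFF  # bits 15:0, MCA error code
    # Memory controller errors: bit 7 set, bits 15:13 and 11:8 clear
    # (bit 12 is a separate flag and is ignored here).
    is_memory = (code & 0xEF80) == 0x0080
    request = None
    if is_memory:
        request = MEM_REQUESTS.get((code >> 4) & 0x7, "reserved")
    return {
        "bank": int(match.group(1)),
        "valid": bool((status >> 63) & 1),            # VAL
        "uncorrected": bool((status >> 61) & 1),      # UC
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "request": request,
    }


print(decode_mce(LOG_LINE))
```

For this record the decoder reports a valid, corrected memory read error with a usable ADDR field, which agrees with the "1 CE memory read error" line from EDAC.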

For Intel Xeon E5 processors, use this methodology:

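# First install necessary tools (decode-dimms is part of i2c-tools)
sudo apt-get install edac-utils dmidecode i2c-tools

# Decode the SPD data of each installed module (requires an SPD EEPROM kernel driver to be loaded)
sudo decode-dimms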

From your lshw output, we see populated slots for CPU0 (Socket 0):

P1_DIMMA1 (Bank 0)
P1_DIMMB1 (Bank 2)
P1_DIMMC1 (Bank 4)
P1_DIMMD1 (Bank 6)

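The physical address 4a0d75900 still has to be mapped to a DIMM. As a first step, this script reads the EDAC corrected-error counter for channel 0 and the PCI IDs of the host bridge:

#!/bin/bash
# Note: this only gathers data; the address translation itself is done by the EDAC driver.
sudo python3 -c "
# Corrected-error count for csrow0, channel 0 (the sysfs layout depends on the EDAC driver)
with open('/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count') as f:
    print('Channel 0 errors:', f.read().strip())
# First 4 bytes of PCI config space: vendor ID (low 16 bits) and device ID (high 16 bits)
with open('/proc/bus/pci/00/00.0', 'rb') as f:
    print('Memory controller info:', hex(int.from_bytes(f.read(4), 'little')))"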

Execute these commands to gather more diagnostic data:

# Check all memory and machine check errors
sudo grep -iE "machine check|mce|edac|memory" /var/log/kern.log

# View detailed EDAC information
sudo edac-util -v

# Check current memory usage
grep -E 'MemTotal|MemFree|Buffers|Cached' /proc/meminfo
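
The counters that edac-util summarizes can also be read straight from sysfs, which is handy for scripting. The sketch below assumes the csrow-style layout used elsewhere in this article (mc0/ce_count and mc0/csrow0/ch0_ce_count); other EDAC drivers may expose a different tree, and read_ce_counts is a name chosen for illustration.

```python
import glob
import os

# Per-controller totals and per-channel counters (csrow-style layout).
PATTERNS = ("mc*/ce_count", "mc*/csrow*/ch*_ce_count")


def read_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {relative counter path: count} for every non-zero CE counter."""
    counts = {}
    for pattern in PATTERNS:
        for path in sorted(glob.glob(os.path.join(root, pattern))):
            with open(path) as f:
                value = int(f.read().strip())
            if value:
                counts[os.path.relpath(path, root)] = value
    return counts
```

Calling read_ce_counts() with no arguments on the affected server should list only the controllers and channels that have logged corrected errors. The root parameter exists so the function can be pointed at a copy of the tree for testing.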

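MCA Bank 5 identifies the hardware unit that reported the error, not one of the bank numbers shown by lshw, so it cannot be matched against the populated slots directly. For a corrected memory read error on socket 0, the likely causes are:

A failing DIMM on one of the populated slots (possibly DIMMC1 or DIMMD1, pending the EDAC channel data)
Less commonly, a problem with the memory controller or the slot itself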

For Dell/HP/Supermicro servers with similar configurations:

# Before replacing hardware, test with memtester
sudo apt-get install memtester
sudo memtester 1G 5

# If errors persist, follow this replacement sequence:
1. Power down server
2. Remove CPU0 DIMMC1 (slot P1_DIMMC1)
3. Boot with reduced configuration
4. Monitor for errors
5. If errors continue, replace DIMMD1

Remember that in some Intel architectures, memory banks may be logically mapped differently than physically labeled. Consult your motherboard manual for exact channel-to-DIMM mapping.

To recap the log entry above: Bank 5 is the MCA bank that reported the error, ADDR 4a0d75900 is the physical address where it occurred, SOCKET 0 is the first processor, and CE marks a Correctable Error (as opposed to UE, an Uncorrectable Error).

While the bank number in MCE doesn't directly correspond to physical DIMM slots, we can decode the address:

# Convert the hexadecimal physical address to decimal (bc requires uppercase hex digits)
echo "ibase=16; 4A0D75900" | bc
19878336768

# List the installed modules from their SPD data (this does not translate addresses)
sudo decode-dimms
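
The same conversion can be done in Python, which also makes it easy to split the address into a page frame number and an offset. This is a small sketch that assumes 4 KiB pages (the x86-64 default); the page number it produces is the value EDAC reports as "Page" in its last-error line.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages

addr = int("4a0d75900", 16)              # ADDR field from the MCE record
page = addr >> PAGE_SHIFT                # page frame number
offset = addr & ((1 << PAGE_SHIFT) - 1)  # byte offset within that page

print(addr)                              # decimal form of the address
print(hex(page), hex(offset))            # page frame and offset in hex
print(round(addr / 2**30, 2), "GiB into the physical address space")
```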

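For Intel Xeon E5 systems, the address is split into fields along the following lines. Treat this as a simplified illustration only: the real bit positions depend on the socket and channel interleaving configured at boot, which is why the decoding done by the EDAC driver is the more reliable source: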

Bits [28:0] - Cache line address
Bits [31:29] - Channel ID
Bits [33:32] - Slot number (0-3)

Use edac-util to get more detailed DIMM information:

sudo edac-util --verbose
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 1 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: Last Error: Page 0x4a0d75, offset 0x0, grain 8, syndrome 0x0

This indicates the error occurred on:

CPU Socket 0
Channel 3
DIMM slot 0
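
When many such lines need to be processed, the label can be split into its fields mechanically. A minimal sketch, assuming the CPU_SrcID#…_Ha#…_Chan#…_DIMM#… label format shown in the output above; parse_edac_label is an illustrative name.

```python
import re

LABEL_RE = re.compile(r"CPU_SrcID#(\d+)_Ha#(\d+)_Chan#(\d+)_DIMM#(\d+)")


def parse_edac_label(text):
    """Return the socket, home agent, channel and DIMM indices from a label."""
    match = LABEL_RE.search(text)
    if match is None:
        return None
    socket, home_agent, channel, dimm = (int(g) for g in match.groups())
    return {"socket": socket, "home_agent": home_agent,
            "channel": channel, "dimm": dimm}


line = "mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 1 Corrected Errors"
print(parse_edac_label(line))
```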

Cross-reference with physical memory mapping:

sudo dmidecode -t memory | grep -A5 "Memory Device"

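In your case, channel 3 is the fourth channel on socket 0, which should correspond to slot D on this board (assuming channels 0-3 map to DIMMA-DIMMD in order; confirm this against the motherboard manual). Comparing with the lshw output, the affected DIMM is likely: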

*-bank:6
     description: DIMM DDR3 1333 MHz (0,8 ns)
     product: 9965516-048.A
     vendor: Kingston
     physical id: 6
     serial: E7305738
     slot: P1_DIMMD1

For deeper analysis, use these commands:

# Check all MCE records
sudo mcelog --ascii

# Watch package temperature and DRAM power (turbostat does not report memory errors)
sudo turbostat --show CPU,Busy%,Bzy_MHz,PkgTmp,RAMWatt -i 5

# Stress test memory (memtester cannot target a specific DIMM from user space)
sudo memtester 4G 5 | grep "FAIL"

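Mark the suspect DIMM (serial E7305738) as potentially faulty
Run extended memory tests on the affected region. memtester's -p option tests a physical range through /dev/mem; the base address must be page-aligned, and because the test overwrites memory that may be in use, only run it on a system booted for maintenance (kernels that restrict /dev/mem access may refuse the mapping):

sudo memtester -p 0x4a0d75000 8M 10

Monitor for additional errors over 24-48 hours

If errors persist, replace the DIMM following vendor guidelines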
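
The 24-48 hour monitoring step can be automated by polling the EDAC totals and printing only the counters that grew. This is a rough sketch, assuming the mc*/ce_count files under /sys/devices/system/edac/mc; all function names and the hourly default interval are illustrative choices.

```python
import glob
import os
import time


def read_mc_totals(root="/sys/devices/system/edac/mc"):
    """Return {controller name: corrected-error total}, e.g. {'mc0': 1}."""
    totals = {}
    for path in glob.glob(os.path.join(root, "mc*", "ce_count")):
        with open(path) as f:
            totals[os.path.basename(os.path.dirname(path))] = int(f.read())
    return totals


def new_errors(before, after):
    """Return counters that increased between two snapshots, with the delta."""
    return {name: count - before.get(name, 0)
            for name, count in after.items()
            if count > before.get(name, 0)}


def watch(read_counts=read_mc_totals, interval_s=3600, rounds=24):
    """Poll read_counts() and print any counter that grew since the last poll."""
    previous = read_counts()
    for _ in range(rounds):
        time.sleep(interval_s)
        current = read_counts()
        for name, delta in sorted(new_errors(previous, current).items()):
            print(f"{time.ctime()}: {name} +{delta}")
        previous = current
```

Calling watch(rounds=48) covers two days at the default interval. A steadily rising count on the same controller is a stronger signal for replacement than a single isolated corrected error.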

ServerDevWorker
