Decoding MCE Logs: How to Diagnose Memory Controller Read Errors in Linux Systems


When dealing with Machine Check Exception (MCE) logs, it's crucial to understand how the information is organized. The log snippet below shows several key components:

MCE 0
CPU 0 BANK 8
MISC 640738dd0009159c ADDR 96236c6c0
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error

Let's break down the most important fields in the MCE output:

CPU 0 BANK 8 → Indicates memory controller bank 8 on CPU0 is reporting errors
ADDR 96236c6c0 → Physical memory address where error occurred
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR → Specifies a read channel error
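
To confirm which MCA banks the kernel exposes on this machine, the machinecheck sysfs tree (where present) provides one bankN control file per bank and per CPU:

# List the MCA bank control files for CPU 0 (each bankN file holds that
# bank's MCi_CTL enable mask; availability depends on the kernel build)
ls /sys/devices/system/machinecheck/machinecheck0/bank*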

The mcelog daemon provides additional context through its memory error reporting. From the output:

Per page corrected memory statistics:
96236c000: total 20 20 in 24h online triggered
96fb6c000: total 15 15 in 24h online triggered

This shows specific memory pages with high error counts, indicating potential hardware issues.
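
These page addresses tie straight back to the MCE record above: the statistics are kept per 4 KiB page, so masking off the low 12 bits of the reported ADDR gives the page it belongs to:

# ADDR 96236c6c0 falls inside the 4 KiB page 96236c000 listed above
printf 'ADDR 0x96236c6c0 is in page 0x%x\n' $(( 0x96236c6c0 & ~0xfff ))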

Here's a script to automatically parse and analyze MCE logs:

#!/bin/bash
# Parse /var/log/messages for MCE errors logged by mcelog.
# Fields are located by keyword rather than fixed position, since the
# syslog timestamp/hostname prefix shifts the column numbers around.
awk '/mcelog: CPU/   {for (i = 1; i < NF; i++) { if ($i == "CPU") cpu = $(i+1); if ($i == "BANK") bank = $(i+1) }}
     /mcelog:.*ADDR/ {for (i = 1; i < NF; i++) if ($i == "ADDR") addr = $(i+1)}
     /mcelog: MCA:/  {print "Error detected: CPU " cpu " BANK " bank
                      print "Memory address: " addr
                      sub(/.*MCA: */, ""); print "Error type: " $0}' /var/log/messages
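
On systems where classic syslog has been replaced by journald there may be no /var/log/messages; the same records can be pulled from the journal instead (the unit name below assumes a stock mcelog service):

# Kernel machine-check messages from the journal
journalctl -k --no-pager | grep -i 'machine check'
# Messages from the mcelog daemon over the last day (unit name assumed)
journalctl -u mcelog --since "24 hours ago"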

For Intel Xeon processors (like the X5670 in this case), you can decode the CPU model specifics:

grep -E -m 4 'vendor_id|cpu family|^model' /proc/cpuinfo
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU X5670 @ 2.93GHz

To begin correlating MCE errors with specific DIMMs, use this parsing script as a starting point:

#!/usr/bin/python3
import re

with open('/var/log/messages') as f:
    log_data = f.read()
mce_pattern = re.compile(r'CPU (\d+) BANK (\d+).*?ADDR ([0-9a-f]+)', re.DOTALL)

for match in mce_pattern.finditer(log_data):
    cpu, bank, addr = match.groups()
    print(f"Hardware error detected: CPU{cpu} Bank{bank} at 0x{addr}")
    # Additional logic to map address to physical DIMM would go here
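
If the kernel's EDAC driver for this platform is loaded (typically i7core_edac on Nehalem/Westmere-era Xeons), its sysfs counters offer a more direct route to the failing DIMM than mapping addresses by hand. A minimal sketch, assuming the csrow-based EDAC layout is available:

# Dump corrected-error counts per memory controller and csrow
# (requires an EDAC driver for this chipset to be loaded)
for f in /sys/devices/system/edac/mc/mc*/csrow*/ce_count; do
    printf '%s: %s corrected errors\n' "$f" "$(cat "$f")"
done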

Based on the output showing 77 corrected errors and multiple triggered pages in 24 hours, this suggests:

  • Sustained correctable errors indicate memory degradation
  • Multiple triggered pages point to potential DIMM failure
  • The concentration in BANK 8 suggests a specific memory controller issue
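
If a page keeps accumulating corrected errors, the kernel can retire it at runtime. On kernels built with CONFIG_MEMORY_FAILURE, writing a physical address to the soft-offline interface migrates the data away and stops further use of that page; this is essentially what mcelog's page triggering automates. A sketch using the worst page from the statistics above:

# Soft-offline a failing page (run as root; requires CONFIG_MEMORY_FAILURE)
echo 0x96236c000 > /sys/devices/system/memory/soft_offline_page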

When analyzing MCE errors in /var/log/messages, focus on these critical components:

CPU 0 BANK 8
MISC 640738dd0009159c ADDR 96236c6c0
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
STATUS 8c0000400001009f

Here's what each field means in technical terms:

  • BANK 8: The MCA bank reporting the error (on this CPU generation, the integrated memory controller's bank)
  • ADDR field: Physical memory address where error occurred
  • MCA type: Reveals error category (MEMORY CONTROLLER RD_CHANNEL)
  • STATUS code: Raw IA32_MCi_STATUS register holding validity/severity flags and the MCA error code (8c0000400001009f)
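
The STATUS value can also be unpacked by hand using the architectural IA32_MCi_STATUS layout from the Intel SDM: VAL in bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, the corrected-error count in bits 52:38, and the MCA error code in bits 15:0. A minimal bash sketch for the value in this log:

# Rough decode of IA32_MCi_STATUS following the architectural bit layout
status=0x8c0000400001009f
for field in VAL:63 OVER:62 UC:61 EN:60 MISCV:59 ADDRV:58 PCC:57; do
    printf '%-6s %d\n' "${field%%:*}" $(( (status >> ${field##*:}) & 1 ))
done
printf 'Corrected error count: %d\n' $(( (status >> 38) & 0x7fff ))
printf 'MCA error code: 0x%04x\n' $(( status & 0xffff ))

For 8c0000400001009f this reports a valid, corrected (UC=0) error with a valid ADDR and MISC, and MCA error code 0x009f, which matches the MEMORY CONTROLLER RD_CHANNELunspecified_ERR decode shown above.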

Use these commands to gather additional context:

# Check CPU microarchitecture details
grep -m1 'model name' /proc/cpuinfo

# Pull Bank 8 records out of the mcelog logfile (default path on many distributions)
grep -B 2 -A 4 'BANK 8' /var/log/mcelog

# Decode kernel-format MCE records (mcelog --ascii parses the kernel's
# plain-text machine check output, e.g. from dmesg, not a bare status value)
dmesg | mcelog --ascii

The mcelog memory statistics reveal important patterns. The "triggered" flag marks pages whose corrected-error count has crossed the daemon's threshold, so its page trigger has already fired:

96236c000: total 20 20 in 24h online triggered
96fb6c000: total 15 15 in 24h online triggered
9c2edc000: total 15 15 in 24h online triggered
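
The threshold at which a page gets flagged (and what mcelog then does about it, such as soft-offlining) comes from the daemon's configuration. On many distributions this lives in the [page] section of /etc/mcelog/mcelog.conf, with keys along the lines of memory-ce-threshold and memory-ce-action; the exact names and path can vary between mcelog versions:

# Show the page-error policy the daemon was started with
# (path and section name assume a stock mcelog package)
grep -A 5 '^\[page\]' /etc/mcelog/mcelog.conf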

Create a monitoring script like this to catch MCE events:

#!/bin/bash
# MCE monitoring script
LOG=/var/log/mcelog-monitor.log
THRESHOLD=5

check_mce() {
    # Query the running mcelog daemon for its per-page error statistics
    local count=$(mcelog --client | grep -c 'triggered')
    if [[ $count -ge $THRESHOLD ]]; then
        echo "[$(date)] WARNING: $count triggered MCE pages detected" >> "$LOG"
        mcelog --client >> "$LOG"
        return 1
    fi
    return 0
}

while true; do
    check_mce
    sleep 300
done
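
To keep the monitor running unattended, one option is to hand it to systemd as a transient service (the install path here is only an example):

# Run the monitor as a transient systemd unit (hypothetical install path)
systemd-run --unit=mce-monitor /usr/local/bin/mce-monitor.sh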

When dmesg, mcelog, and syslog show different counts:

  • dmesg shows kernel notifications (may be rate-limited)
  • mcelog shows processed events (more detailed)
  • Check /sys/devices/system/machinecheck/machinecheck*/check_interval for the polling interval (see the example below)
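
The polling interval can be inspected and tuned from that same sysfs tree; the value is in seconds and a write takes effect immediately:

# Current machine-check polling interval for CPU 0 (seconds)
cat /sys/devices/system/machinecheck/machinecheck0/check_interval
# Poll every 60 seconds instead (run as root)
echo 60 > /sys/devices/system/machinecheck/machinecheck0/check_interval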

For Intel Xeon X5670 processors (Westmere-EP architecture):

CPUID Vendor Intel Family 6 Model 44

Key characteristics:

  • Integrated memory controller (IMC)
  • Three-channel DDR3 memory architecture
  • The integrated memory controller reports through dedicated MCA banks (BANK 8 in this log); the channel, when identifiable, shows up in the MCA error code