Decoding MCE Logs: How to Diagnose Memory Controller Read Errors in Linux Systems

When dealing with Machine Check Exception (MCE) logs, it's crucial to understand the hierarchy of information presented. The log snippet shows several key components:

MCE 0
CPU 0 BANK 8
MISC 640738dd0009159c ADDR 96236c6c0
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error

Let's break down the most important fields in the MCE output:

CPU 0 BANK 8 → Indicates memory controller bank 8 on CPU0 is reporting errors
ADDR 96236c6c0 → Physical memory address where error occurred
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR → Specifies a read channel error
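As a quick sanity check, the ADDR value can be mapped to the 4 KiB page that mcelog's per-page statistics are keyed on. A minimal Python sketch:

```python
# mcelog's per-page statistics track 4 KiB pages; masking off the low
# 12 bits of the error address recovers the page base it is counted under.
addr = 0x96236c6c0          # ADDR field from the MCE record above
page = addr & ~0xFFF        # align down to the 4 KiB page boundary
print(f"page base: {page:x}")  # → 96236c000
```

Note that the resulting page base, 96236c000, is exactly the first entry in the per-page corrected memory statistics.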

The mcelog daemon provides additional context through its memory error reporting. From the output:

Per page corrected memory statistics:
96236c000: total 20 20 in 24h online triggered
96fb6c000: total 15 15 in 24h online triggered

This shows specific memory pages with high error counts, indicating potential hardware issues.

Here's a script to automatically parse and analyze MCE logs:

#!/bin/bash
# Parse /var/log/messages for MCE errors. Syslog prefixes shift the
# field positions, so scan for keywords instead of fixed columns.
awk '/mcelog/ { for (i = 1; i < NF; i++) {
                    if ($i == "CPU")  cpu  = $(i+1)
                    if ($i == "BANK") bank = $(i+1)
                    if ($i == "ADDR") addr = $(i+1) } }
     /mcelog/ && /MCA:/ { print "Error detected: CPU " cpu " BANK " bank
                          print "Memory address: " addr
                          sub(/.*MCA: */, "")
                          print "Error type: " $0 }' /var/log/messages

For Intel Xeon processors (like the X5670 in this case), you can decode the CPU model specifics:

grep -m 4 -E 'vendor_id|cpu family|model' /proc/cpuinfo
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
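Family 6 / model 44 can be translated into a microarchitecture name. A small illustrative lookup; the table below is a hand-picked subset, not an exhaustive list:

```python
# Hand-picked subset mapping (family, model) from /proc/cpuinfo
# to an Intel microarchitecture name
UARCH = {
    (6, 26): 'Nehalem-EP',       # e.g. Xeon X5570
    (6, 44): 'Westmere-EP',      # e.g. Xeon X5670
    (6, 45): 'Sandy Bridge-EP',
}
family, model = 6, 44            # values from the cpuinfo output above
print(UARCH.get((family, model), 'unknown'))  # → Westmere-EP
```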

To correlate MCE errors with specific DIMMs, use this enhanced parsing script:

#!/usr/bin/python3
import re

# CPU/BANK and ADDR arrive on separate log lines, hence re.DOTALL; the
# non-greedy gap stops at the nearest ADDR, though a record that lacks
# an ADDR field would bleed into the next one.
with open('/var/log/messages') as f:
    log_data = f.read()

mce_pattern = re.compile(r'CPU (\d+) BANK (\d+).*?ADDR ([0-9a-f]+)', re.DOTALL)

for match in mce_pattern.finditer(log_data):
    cpu, bank, addr = match.groups()
    print(f"Hardware error detected: CPU{cpu} Bank{bank} at 0x{addr}")
    # Additional logic to map address to physical DIMM would go here

Based on the output showing 77 corrected errors and multiple triggered pages in 24 hours, this suggests:

  • Sustained correctable errors indicate memory degradation
  • Multiple triggered pages point to potential DIMM failure
  • The concentration in BANK 8 suggests a specific memory controller issue

When analyzing MCE errors in /var/log/messages, focus on these critical components:

CPU 0 BANK 8
MISC 640738dd0009159c ADDR 96236c6c0
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
STATUS 8c0000400001009f

Here's what each field means in technical terms:

  • BANK 8: Indicates memory controller bank (physical location)
  • ADDR field: Physical memory address where error occurred
  • MCA type: Reveals error category (MEMORY CONTROLLER RD_CHANNEL)
  • STATUS code: Contains error severity flags (8c0000400001009f)
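The STATUS register follows the architectural MCi_STATUS bit layout documented in the Intel SDM (vol. 3, ch. 15). A sketch decoding the value above:

```python
status = 0x8c0000400001009f
flags = {
    'VAL':   status >> 63 & 1,   # register holds a valid error
    'OVER':  status >> 62 & 1,   # an earlier error was overwritten
    'UC':    status >> 61 & 1,   # uncorrected error
    'EN':    status >> 60 & 1,   # error reporting was enabled
    'MISCV': status >> 59 & 1,   # MISC register contents valid
    'ADDRV': status >> 58 & 1,   # ADDR register contents valid
    'PCC':   status >> 57 & 1,   # processor context corrupt
}
mca_code = status & 0xFFFF       # compound MCA error code
print(flags, f'{mca_code:#06x}')
```

Here VAL, MISCV, and ADDRV are set while UC is clear (a corrected error), and the code 0x009f matches the memory-controller error pattern 0000 0000 1MMM CCCC with MMM = 001 (read) and CCCC = 1111 (channel unspecified), agreeing with the RD_CHANNELunspecified_ERR decode.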

Use these commands to gather additional context:

# Check CPU microarchitecture details
grep -m1 'model name' /proc/cpuinfo

# Show logged MCE records for a given bank (example for Bank 8)
grep -A 4 'BANK 8' /var/log/mcelog

# Decode raw MCE records: mcelog --ascii parses the text format the
# kernel prints to the ring buffer, not a bare status value
dmesg | mcelog --ascii

The mcelog memory statistics reveal important patterns. Notice the "triggered" flags indicate particularly problematic pages:

96236c000: total 20 20 in 24h online triggered
96fb6c000: total 15 15 in 24h online triggered
9c2edc000: total 15 15 in 24h online triggered
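These lines are easy to parse mechanically. A sketch that flags pages above a chosen 24-hour threshold (the threshold value here is an arbitrary example):

```python
import re

# Sample "Per page corrected memory statistics" lines from mcelog
stats = """\
96236c000: total 20 20 in 24h online triggered
96fb6c000: total 15 15 in 24h online triggered
9c2edc000: total 15 15 in 24h online triggered
"""

LINE = re.compile(r'([0-9a-f]+): total (\d+) (\d+) in 24h')
THRESHOLD = 10                        # arbitrary example threshold
for page, total, last24 in LINE.findall(stats):
    if int(last24) >= THRESHOLD:
        print(f"page 0x{page}: {last24} corrected errors in 24h")
```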

Create a monitoring script like this to catch MCE events:

#!/bin/bash
# MCE monitoring script
LOG=/var/log/mcelog-monitor.log
THRESHOLD=5

check_mce() {
    # Query the running daemon's accumulated statistics via client mode
    local count=$(mcelog --client | grep -c 'triggered')
    if [[ $count -ge $THRESHOLD ]]; then
        echo "[$(date)] WARNING: $count triggered MCE events detected" >> "$LOG"
        mcelog --client >> "$LOG"
        return 1
    fi
    return 0
}

while true; do
    check_mce
    sleep 300
done

When dmesg, mcelog, and syslog show different counts:

  • dmesg shows kernel notifications (may be rate-limited)
  • mcelog shows processed events (more detailed)
  • Check /sys/devices/system/machinecheck/machinecheck*/check_interval for polling settings

For Intel Xeon X5670 processors (Westmere-EP architecture, the 32 nm shrink of Nehalem-EP):

CPUID Vendor Intel Family 6 Model 44

Key characteristics:

  • Integrated memory controller (IMC)
  • Three-channel DDR3 memory architecture
  • High-numbered MCA banks (such as BANK 8 here) report integrated memory controller errors