When dealing with Machine Check Exception (MCE) logs, it's crucial to understand the hierarchy of information presented. The log snippet shows several key components:
MCE 0
CPU 0 BANK 8
MISC 640738dd0009159c ADDR 96236c6c0
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
Let's break down the most important fields in the MCE output:
CPU 0 BANK 8 → Machine-check bank 8 on CPU 0 reported the error; on this processor family mcelog decodes bank 8 as the integrated memory controller
ADDR 96236c6c0 → Physical memory address where the error occurred
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR → A memory read error for which the channel was not identified
The mcelog daemon provides additional context through its memory error reporting. From the output:
Per page corrected memory statistics:
96236c000: total 20 20 in 24h online triggered
96fb6c000: total 15 15 in 24h online triggered
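This shows specific memory pages with high error counts, indicating potential hardware issues. The "triggered" flag means the page crossed the corrected-error threshold configured for mcelog within the 24-hour window.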
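The page addresses in these statistics are the ADDR values from the individual records rounded down to a page boundary. A minimal sketch of that relationship, assuming 4 KiB pages (the usual x86 page size); `page_base` is a name chosen here for illustration:

```python
PAGE_SIZE = 4096  # assumption: standard 4 KiB x86 pages

def page_base(addr_hex):
    """Round a hex physical address down to the start of its page."""
    addr = int(addr_hex, 16)
    return format(addr & ~(PAGE_SIZE - 1), "x")

print(page_base("96236c6c0"))  # 96236c000
```

The ADDR from the record above lands in the first page listed in the statistics, which ties the individual event to the page-level counter.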
Here's a script to automatically parse and analyze MCE logs:
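#!/bin/bash
# Parse /var/log/messages for MCE errors; fields are located by name
# because syslog prepends a timestamp and hostname to every line
awk '
function field(name,   i) { for (i = 1; i < NF; i++) if ($i == name) return $(i + 1) }
/mcelog: CPU/    { cpu = field("CPU"); bank = field("BANK") }
/mcelog: .*ADDR/ { addr = field("ADDR") }
/mcelog: MCA:/   { sub(/.*MCA: /, "")   # keep only the decoded error text
                   print "Error detected: CPU " cpu " BANK " bank
                   print "Memory address: " addr
                   print "Error type: " $0 }' /var/log/messages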
For Intel Xeon processors (like the X5670 in this case), you can decode the CPU model specifics:
grep -m1 -B 3 'model name' /proc/cpuinfo
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
As a first step toward correlating MCE errors with specific DIMMs, this parsing script extracts the CPU, bank, and address of each record:
#!/usr/bin/python3
import re

# Read the whole log at once because each MCE record spans several lines
with open('/var/log/messages') as f:
    log_data = f.read()

# Match a CPU/BANK line and the nearest ADDR field that follows it
mce_pattern = re.compile(r'CPU (\d+) BANK (\d+).*?ADDR ([0-9a-f]+)', re.DOTALL)
for match in mce_pattern.finditer(log_data):
    cpu, bank, addr = match.groups()
    print(f"Hardware error detected: CPU{cpu} Bank{bank} at 0x{addr}")
    # Additional logic to map address to physical DIMM would go here
The output shows 77 corrected errors and multiple triggered pages in 24 hours, which suggests:
- Sustained correctable errors indicate memory degradation
- Multiple triggered pages point to potential DIMM failure
- The concentration in BANK 8 on CPU 0 points at the memory attached to that socket's memory controller
When analyzing MCE errors in /var/log/messages, focus on these critical components:
CPU 0 BANK 8
MISC 640738dd0009159c ADDR 96236c6c0
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
STATUS 8c0000400001009f
Here's what each field means in technical terms:
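- BANK 8: Machine-check bank that reported the error (the integrated memory controller on this CPU, not a physical DIMM location)
- ADDR field: Physical memory address where the error occurred
- MCA type: Reveals error category (MEMORY CONTROLLER RD_CHANNEL)
- STATUS code: Raw status register (8c0000400001009f); its top bits flag validity and whether the error was corrected, and its low 16 bits hold the MCA error code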
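The STATUS value can be unpacked by hand to confirm what mcelog reports. The sketch below assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM (VAL at bit 63, UC at bit 61, the corrected error count in bits 52:38, the MCA error code in bits 15:0); `decode_mci_status` and the dictionary keys are names chosen here for illustration, and model-specific bits are left undecoded:

```python
def decode_mci_status(status_hex):
    """Split an IA32_MCi_STATUS value into its architectural fields."""
    status = int(status_hex, 16)

    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL
        "overflow": bit(62),         # OVER
        "uncorrected": bit(61),      # UC
        "enabled": bit(60),          # EN
        "misc_valid": bit(59),       # MISCV
        "addr_valid": bit(58),       # ADDRV
        "context_corrupt": bit(57),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }

fields = decode_mci_status("8c0000400001009f")
for name, value in fields.items():
    print(f"{name:20} {value}")
```

For this record the register is valid, UC is clear (so the error was corrected), and both ADDR and MISC are marked valid. The MCA error code is 0x009f; read against the SDM's memory controller format `0000 0000 1MMM CCCC`, that is MMM = 001 (memory read) and CCCC = 1111 (channel not specified), which agrees with the RD_CHANNELunspecified_ERR text.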
Use these commands to gather additional context:
# Check CPU microarchitecture details
grep -m1 'model name' /proc/cpuinfo
# Filter MCE records for one bank (example for Bank 8), including the lines that follow each match
grep -w -A 6 'BANK 8' /var/log/messages
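# Decode raw MCE status registers (mcelog --ascii reads labelled fields such as STATUS from stdin, not a bare hex value)
echo 'CPU 0 BANK 8 STATUS 8c0000400001009f ADDR 96236c6c0 MISC 640738dd0009159c' | mcelog --ascii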
The mcelog memory statistics reveal important patterns. The "triggered" flag marks pages that crossed the configured error threshold and are therefore particularly problematic:
96236c000: total 20 20 in 24h online triggered
96fb6c000: total 15 15 in 24h online triggered
9c2edc000: total 15 15 in 24h online triggered
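To rank pages or feed them into other tooling, the statistics lines can be parsed into structured records. This is a minimal sketch that assumes the line shape shown in the sample above and reads the two numbers as the lifetime total and the count within the last 24 hours; `parse_page_stats` and the key names are hypothetical:

```python
import re

# Assumed line shape, copied from the sample above:
#   <page address>: total <count> <count> in 24h <state> [triggered]
PAGE_LINE = re.compile(
    r"^\s*([0-9a-f]+): total (\d+) (\d+) in 24h (\w+)(?: (triggered))?\s*$"
)

def parse_page_stats(text):
    """Return one dict per page line; lines that do not match are skipped."""
    pages = []
    for line in text.splitlines():
        m = PAGE_LINE.match(line)
        if m:
            pages.append({
                "page": m.group(1),
                "total": int(m.group(2)),
                "last_24h": int(m.group(3)),
                "state": m.group(4),
                "triggered": m.group(5) is not None,
            })
    return pages

sample = """\
96236c000: total 20 20 in 24h online triggered
96fb6c000: total 15 15 in 24h online triggered
9c2edc000: total 15 15 in 24h online triggered
"""

pages = parse_page_stats(sample)
worst = max(pages, key=lambda p: p["last_24h"])
print(worst["page"], sum(p["total"] for p in pages))  # 96236c000 50
```

Skipping non-matching lines keeps the parser usable on a full statistics dump, where headers and per-DIMM summaries surround the page lines.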
Create a monitoring script like this to catch MCE events:
#!/bin/bash
# MCE monitoring script
LOG=/var/log/mcelog-monitor.log
THRESHOLD=5

check_mce() {
    # Ask the running mcelog daemon for its accumulated memory error statistics
    local stats count
    stats=$(mcelog --client)
    count=$(grep -c 'triggered' <<< "$stats")
    if [[ $count -ge $THRESHOLD ]]; then
        echo "[$(date)] WARNING: $count triggered MCE events detected" >> "$LOG"
        echo "$stats" >> "$LOG"
        return 1
    fi
    return 0
}

# Poll every five minutes
while true; do
    check_mce
    sleep 300
done
When dmesg, mcelog, and syslog show different counts:
- dmesg shows kernel notifications (may be rate-limited)
- mcelog shows processed events (more detailed)
- Check /sys/devices/system/machinecheck/machinecheck*/check_interval for polling settings
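The polling settings can also be collected programmatically. The sketch below reads every `check_interval` file under the sysfs directory named above and returns the raw text; the kernel documentation describes the value as seconds and notes that it may be printed in hexadecimal, so no numeric conversion is attempted here. `read_check_intervals` is a hypothetical helper name:

```python
from pathlib import Path

def read_check_intervals(base="/sys/devices/system/machinecheck"):
    """Return {directory name: raw check_interval text} for each CPU found."""
    intervals = {}
    for path in sorted(Path(base).glob("machinecheck*/check_interval")):
        try:
            intervals[path.parent.name] = path.read_text().strip()
        except OSError:
            continue  # unreadable entry (permissions, CPU went offline)
    return intervals

print(read_check_intervals())
```

On a machine without this sysfs directory the function returns an empty dictionary, so it is safe to run anywhere.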
For Intel Xeon X5670 processors (Westmere-EP, the 32 nm successor to Nehalem-EP):
CPUID Vendor Intel Family 6 Model 44
Key characteristics:
- Integrated memory controller (IMC)
- Three-channel DDR3 memory architecture
- BANK values identify machine-check register banks rather than memory channels; on this family, bank 8 reports integrated memory controller errors