When dealing with ECC Chipkill errors on HP servers running RHEL 5, the kernel messages follow a specific pattern that can help pinpoint the issue:
    kernel: EDAC k8 MC0: general bus error: participating processor(local node response)
    kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d
    kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
The fields that help identify the faulty DIMM are the controller number plus the row and channel values that the full CE line reports (here "row 2, channel 0"):
- MC0 - Memory Controller 0
- Channel 0 - The memory channel where the error occurred
- Row 2 - The chip-select row (csrow), which corresponds to a rank on a memory module
For production systems where downtime isn't an option, consider these Linux tools:
    # Install edac-utils if not already present
    yum install edac-utils

    # Check memory error counters
    edac-util -v

    # For more detailed information:
    grep -i "memory" /var/log/messages
Slot numbering differs between HP server models, so confirm it against your model's documentation. One typical layout looks like this:
    Memory Controller 0:
      Channel 0: DIMM 1A, 2A, 3A
      Channel 1: DIMM 1B, 2B, 3B
    Memory Controller 1:
      Channel 0: DIMM 4A, 5A, 6A
      Channel 1: DIMM 4B, 5B, 6B
Here's a Python script to help parse the logs and suggest the likely faulty DIMM:
    import re

    def parse_edac_log(log_line):
        """Extract controller, channel and row numbers from an EDAC log line."""
        patterns = {
            'controller': r'MC(\d+)',
            'channel': r'channel (\d+)',
            'row': r'row (\d+)'
        }
        results = {}
        for key, pattern in patterns.items():
            match = re.search(pattern, log_line)
            if match:
                results[key] = int(match.group(1))
        return results

    def map_to_dimm(controller, channel, row):
        """Look up a slot label; the row value indexes into the channel's slot list."""
        dimm_map = {
            0: {0: ['1A', '2A', '3A'], 1: ['1B', '2B', '3B']},
            1: {0: ['4A', '5A', '6A'], 1: ['4B', '5B', '6B']}
        }
        try:
            return dimm_map[controller][channel][row]
        except (KeyError, IndexError):
            # Combination not present in the map above
            return "Unknown"
Configure regular EDAC monitoring with this cron job:
    # Add to crontab -e
    */5 * * * * /usr/sbin/edac-util -v | grep -v "0 0" | mail -s "EDAC Errors on $(hostname)" admin@example.com
In syslog, the complete messages for an ECC Chipkill error on an HP server running RHEL 5 carry additional fields, including the row and channel:
    kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac
    kernel: MC0: CE - no information available: k8_edac Error Overflow set
    kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
Instead of rebooting from the HP SmartStart CD, try these approaches while the system is running:
    # Method 1: Using edac-utils (requires package installation; RHEL uses yum, not apt-get)
    sudo yum install edac-utils
    sudo /etc/init.d/edac load
    sudo edac-util -v

    # Sample output showing DIMM location:
    #   MC0: 1 UE 0 CE K8_EDAC MC0 row:2 channel:0 label:"" (Processors: 0)

    # Method 2: Direct sysfs access
    cat /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
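The `cat` one-liner in Method 2 prints bare numbers without saying which counter each one came from. The following is a minimal sketch that walks the same sysfs files and keeps the controller, csrow and channel names next to each count. It assumes a Python 3 interpreter is available on the host (the stock RHEL 5 Python is much older) and that the kernel exposes a `chN_dimm_label` file beside each counter. The function name `read_ce_counts` and the `root` parameter are hypothetical, introduced only for illustration.

```python
import glob
import os

EDAC_ROOT = "/sys/devices/system/edac/mc"  # same directory as the cat command


def read_ce_counts(root=EDAC_ROOT):
    """Return a list of (mc, csrow, channel, count, label) tuples."""
    rows = []
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        # Recover the names from the path: <root>/<mc>/<csrow>/<ch>_ce_count
        csrow_dir, fname = os.path.split(path)
        mc_dir, csrow = os.path.split(csrow_dir)
        mc = os.path.basename(mc_dir)
        ch = fname[: -len("_ce_count")]

        with open(path) as f:
            count = int(f.read().strip())

        # Assumption: a sibling label file may exist; fall back to "" if not
        label = ""
        label_path = os.path.join(csrow_dir, ch + "_dimm_label")
        if os.path.exists(label_path):
            with open(label_path) as f:
                label = f.read().strip()

        rows.append((mc, csrow, ch, count, label))
    return rows


if __name__ == "__main__":
    for mc, csrow, ch, count, label in read_ce_counts():
        print("%s %s %s: %d corrected errors, label=%r" % (mc, csrow, ch, count, label))
```

An empty label, like the `label ""` in the syslog line, means no DIMM labels have been registered for that slot, so the csrow and channel still have to be mapped to a physical slot by hand.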
Here's a Python script that polls the EDAC counters every 60 seconds and reports new DIMM errors:
    #!/usr/bin/env python3
    # Requires Python 3.6+ (uses f-strings)
    import os
    import time
    from collections import defaultdict

    EDAC_ROOT = '/sys/devices/system/edac/mc'

    # Last seen corrected-error count per counter file
    error_counts = defaultdict(int)

    DIMM_MAP = {
        # Populate with your server's DIMM mapping
        'mc0_csrow2_ch0': 'CPU0_DIMM_A1',
        'mc0_csrow2_ch1': 'CPU0_DIMM_A2',
        # Add remaining DIMM slots
    }

    def scan_edac_errors():
        for mc in os.listdir(EDAC_ROOT):
            path = f'{EDAC_ROOT}/{mc}'
            # Skip non-controller entries (the directory can also hold plain files)
            if not mc.startswith('mc') or not os.path.isdir(path):
                continue
            for csrow in os.listdir(path):
                if not csrow.startswith('csrow'):
                    continue
                for ch in ['ch0', 'ch1']:
                    ce_file = f'{path}/{csrow}/{ch}_ce_count'
                    if not os.path.exists(ce_file):
                        continue
                    with open(ce_file, 'r') as f:
                        count = int(f.read().strip())
                    key = f'{mc}_{csrow}_{ch}'
                    # Counts already present at startup are reported on the first pass
                    if count > error_counts[key]:
                        print(f"New error on {DIMM_MAP.get(key, 'UNKNOWN')}")
                        error_counts[key] = count

    while True:
        scan_edac_errors()
        time.sleep(60)
The key elements in the error message reveal important information:
- MC0: Memory controller 0
- row 2: Chip-select row (csrow); how it maps to a physical DIMM slot varies by server model
- channel 0: Memory channel (often corresponds to DIMM position)
- ECC chipkill x4: The ECC mode (chipkill) and the DRAM device width (x4, i.e. 4-bit-wide chips)
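The fields listed above can be pulled out of the CE line programmatically. The sketch below parses the page, offset, grain, syndrome, row and channel values and combines page and offset into a physical address, since EDAC reports the page as a page frame number. It assumes 4 KiB pages (the x86 default); the names `CE_PATTERN`, `PAGE_SIZE` and `parse_ce_line` are hypothetical and chosen only for this example.

```python
import re

# Matches the "MCn: CE page ..." line; row and channel are optional because
# shortened log excerpts sometimes omit them.
CE_PATTERN = re.compile(
    r"MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-fA-F]+), "
    r"offset 0x(?P<offset>[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-fA-F]+)"
    r"(?:, row (?P<row>\d+), channel (?P<channel>\d+))?"
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_ce_line(line):
    """Return a dict of CE fields, or None if the line is not a detailed CE line."""
    m = CE_PATTERN.search(line)
    if m is None:
        return None
    row = m.group("row")
    channel = m.group("channel")
    info = {
        "mc": int(m.group("mc")),
        "page": int(m.group("page"), 16),
        "offset": int(m.group("offset"), 16),
        "grain": int(m.group("grain")),
        "syndrome": int(m.group("syndrome"), 16),
        "row": int(row) if row is not None else None,
        "channel": int(channel) if channel is not None else None,
    }
    # Physical address = page frame number * page size + offset within the page
    info["phys_addr"] = info["page"] * PAGE_SIZE + info["offset"]
    return info


if __name__ == "__main__":
    sample = ('kernel: MC0: CE page 0xa0, offset 0x40, grain 8, '
              'syndrome 0xb50d, row 2, channel 0, label "": k8_edac')
    fields = parse_ce_line(sample)
    print(fields)
    print(hex(fields["phys_addr"]))  # 0xa0040
```

Lines such as "CE - no information available" carry no location data, so the parser returns `None` for them rather than guessing.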
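For HP ProLiant servers, the row and channel values translate to physical DIMM slots through a model-specific table. Treat the following as an illustration of the decoding pattern and confirm the actual slots against your own model's documentation: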
    # Example mapping for DL380 Gen8:
    row 0 = CPU1 DIMM slots 1-6 (A1-A6)
    row 1 = CPU2 DIMM slots 1-6 (B1-B6)
    channel 0 = slots 1,3,5 (A1,A3,A5 or B1,B3,B5)
    channel 1 = slots 2,4,6 (A2,A4,A6 or B2,B4,B6)
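The example mapping above narrows an error down to a set of candidate slots rather than a single DIMM. A rough sketch of that lookup follows; the tables copy only the rows and channels listed in the example, and the names `ROW_TO_BANK`, `CHANNEL_TO_SLOTS` and `candidate_slots` are hypothetical. Replace the tables with the values for your own server before relying on the result.

```python
# Taken from the example mapping: row selects the CPU bank, channel the slots
ROW_TO_BANK = {0: "A", 1: "B"}                   # row 0 -> CPU1 (A), row 1 -> CPU2 (B)
CHANNEL_TO_SLOTS = {0: (1, 3, 5), 1: (2, 4, 6)}  # odd slots on ch0, even on ch1


def candidate_slots(row, channel):
    """Return the slot labels that match a row/channel pair, or [] if unmapped."""
    bank = ROW_TO_BANK.get(row)
    slots = CHANNEL_TO_SLOTS.get(channel)
    if bank is None or slots is None:
        # The table has no entry for this combination; do not guess
        return []
    return ["%s%d" % (bank, n) for n in slots]


if __name__ == "__main__":
    print(candidate_slots(0, 0))  # ['A1', 'A3', 'A5']
    print(candidate_slots(1, 1))  # ['B2', 'B4', 'B6']
    print(candidate_slots(2, 0))  # [] because row 2 is not in the example table
```

The "row 2" value from the syslog excerpt has no entry in this example table, which is why the lookup returns an empty list instead of a slot. That is a sign the table does not match the machine that produced the log.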