Diagnosing ECC Chipkill Errors on HP Servers: Identifying Faulty DIMMs Without Downtime


When dealing with ECC Chipkill errors on HP servers running RHEL 5, the kernel messages follow a specific pattern that can help pinpoint the issue:

kernel: EDAC k8 MC0: general bus error: participating processor(local node response)
kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac
kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error

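The key elements for identifying the faulty DIMM are the controller number plus the row and channel fields reported on the full CE line:

  • MC0 - Memory controller 0 (on k8 systems each CPU socket has its own controller)
  • channel 0 - The memory channel where the error occurred
  • row 2 - Chip-select row (csrow) 2, which identifies a rank on a DIMM rather than a slot number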
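The CE line also reports a page and an offset, which together give the physical address of the corrected error. The sketch below shows the arithmetic, assuming 4 KiB pages (true for x86 and x86_64); the function name `ce_physical_address` is made up for this example. Errors that cluster around the same addresses point to a failing device rather than random soft errors.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, as on x86/x86_64

def ce_physical_address(line):
    """Return the physical address encoded in an EDAC CE line, or None."""
    m = re.search(r'page 0x([0-9a-fA-F]+), offset 0x([0-9a-fA-F]+)', line)
    if m is None:
        return None
    page = int(m.group(1), 16)
    offset = int(m.group(2), 16)
    # The page field is a page frame number; shift it and add the offset.
    return (page << PAGE_SHIFT) | offset

line = 'kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d'
print(hex(ce_physical_address(line)))  # 0xa0040
```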

For production systems where downtime isn't an option, consider these Linux tools:

# Install edac-utils if not already present
yum install edac-utils

# Check memory error counters
edac-util -v

# For the raw kernel messages (the CE lines do not contain the word "memory"):
grep -iE "EDAC|MC[0-9]+: (CE|UE)" /var/log/messages

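Slot numbering differs between HP models, so check the DIMM slot diagram for your server before pulling a module. The script further down assumes a layout of this form: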

Memory Controller 0:
Channel 0: DIMM 1A, 2A, 3A
Channel 1: DIMM 1B, 2B, 3B

Memory Controller 1:
Channel 0: DIMM 4A, 5A, 6A
Channel 1: DIMM 4B, 5B, 6B

Here's a Python script to help parse the logs and suggest the likely faulty DIMM:

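import re

# Fields of interest in an EDAC "CE" (corrected error) syslog line
PATTERNS = {
    'controller': r'MC(\d+)',
    'channel': r'channel (\d+)',
    'row': r'row (\d+)',
}

def parse_edac_log(log_line):
    """Extract controller, channel and row numbers from one log line.

    A key is omitted when its field does not appear in the line.
    """
    results = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, log_line)
        if match:
            results[key] = int(match.group(1))
    return results

def map_to_dimm(controller, channel, row):
    """Look up a slot label in the example layout above.

    Assumes the row number indexes the slots of a channel in order;
    verify this against your hardware before replacing a DIMM.
    """
    dimm_map = {
        0: {0: ['1A', '2A', '3A'], 1: ['1B', '2B', '3B']},
        1: {0: ['4A', '5A', '6A'], 1: ['4B', '5B', '6B']}
    }
    try:
        return dimm_map[controller][channel][row]
    except (KeyError, IndexError):
        return "Unknown"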
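A single corrected error says little; what matters is whether the same location keeps reporting them. As a rough sketch, the following counts CE lines per (controller, row, channel). The helper name `count_ce_by_location` is hypothetical, and the second sample line is invented for illustration (only the first appears in the logs above).

```python
import re
from collections import Counter

# Matches CE lines that carry row and channel fields
CE_LINE = re.compile(r'MC(\d+): CE .*?row (\d+), channel (\d+)')

def count_ce_by_location(lines):
    """Count corrected errors per (controller, row, channel)."""
    counts = Counter()
    for line in lines:
        m = CE_LINE.search(line)
        if m:
            counts[tuple(int(g) for g in m.groups())] += 1
    return counts

sample = [
    'kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac',
    # Invented second occurrence, for illustration only
    'kernel: MC0: CE page 0xa1, offset 0x80, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac',
    'kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error',
]
for (mc, row, ch), n in count_ce_by_location(sample).most_common():
    print(f'MC{mc} row {row} channel {ch}: {n} corrected error(s)')
```

On a live system the same function can be fed an open file object, for example `count_ce_by_location(open('/var/log/messages'))`.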

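Configure regular EDAC monitoring with a cron job. The version below sends mail only when a non-zero counter is found; adjust the grep pattern if your edac-util version formats its verbose output differently:

# Add to crontab -e
*/5 * * * * out=$(/usr/sbin/edac-util -v | grep -v ': 0 '); [ -n "$out" ] && echo "$out" | mail -s "EDAC Errors on $(hostname)" admin@example.com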

The complete syslog entries for such an error, including the row and channel fields, look like this:

kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac
kernel: MC0: CE - no information available: k8_edac Error Overflow set
kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error

Instead of rebooting into the HP SmartStart CD diagnostics, try these approaches while the system is running:

# Method 1: Using edac-utils (requires package installation; RHEL uses yum, not apt-get)
sudo yum install edac-utils
sudo /etc/init.d/edac load
sudo edac-util -v

# Illustrative output showing the DIMM location (exact format depends on the edac-utils version):
MC0: 0 UE 1 CE  K8_EDAC MC0 row:2 channel:0 label:""  (Processors: 0)

# Method 2: Direct sysfs access (grep . prints each file name next to its counter value)
grep . /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

Here's a Python script to track DIMM errors in real-time:

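#!/usr/bin/env python3
# Requires Python 3.6+ (f-strings); the stock Python 2.4 on RHEL 5 is too old.
import os
import time
from collections import defaultdict

EDAC_ROOT = '/sys/devices/system/edac/mc'
error_counts = defaultdict(int)  # last seen CE count per mc/csrow/channel
DIMM_MAP = {
    # Populate with your server's DIMM mapping
    'mc0_csrow2_ch0': 'CPU0_DIMM_A1',
    'mc0_csrow2_ch1': 'CPU0_DIMM_A2',
    # Add remaining DIMM slots
}

def scan_edac_errors():
    for mc in sorted(os.listdir(EDAC_ROOT)):
        mc_path = os.path.join(EDAC_ROOT, mc)
        # Skip entries that are not memory controller directories
        if not (mc.startswith('mc') and os.path.isdir(mc_path)):
            continue
        for csrow in sorted(os.listdir(mc_path)):
            if not csrow.startswith('csrow'):
                continue
            for ch in ['ch0', 'ch1']:
                ce_file = os.path.join(mc_path, csrow, f'{ch}_ce_count')
                if not os.path.exists(ce_file):
                    continue
                with open(ce_file, 'r') as f:
                    count = int(f.read().strip())
                key = f'{mc}_{csrow}_{ch}'
                # Counts start at 0, so errors already present are reported on the first pass
                if count > error_counts[key]:
                    print(f"New error on {DIMM_MAP.get(key, 'UNKNOWN')} ({key}: {count} CE)")
                # Always store the latest value so a counter reset is handled
                error_counts[key] = count

while True:
    scan_edac_errors()
    time.sleep(60)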
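Instead of maintaining a mapping by hand, the labels the kernel already knows can be read from the `chN_dimm_label` files that sit next to the counters in the csrow sysfs layout. These stay empty until labels are registered, for example with `edac-ctl --register-labels`. The following is a minimal sketch; the function name `read_edac_counters` and its `root` parameter are choices made for this example so that it can be pointed at a test directory.

```python
import glob
import os
import re

def read_edac_counters(root='/sys/devices/system/edac/mc'):
    """Return {(mc, csrow, ch): (ce_count, label)} for every channel found."""
    result = {}
    pattern = os.path.join(root, 'mc*', 'csrow*', 'ch*_ce_count')
    for ce_file in sorted(glob.glob(pattern)):
        m = re.search(r'mc(\d+)[/\\]csrow(\d+)[/\\]ch(\d+)_ce_count$', ce_file)
        if not m:
            continue
        with open(ce_file) as f:
            count = int(f.read().strip())
        # The label file shares the chN_ prefix with the counter file
        label_file = ce_file[:-len('ce_count')] + 'dimm_label'
        label = ''
        if os.path.exists(label_file):
            with open(label_file) as f:
                label = f.read().strip()
        result[tuple(int(g) for g in m.groups())] = (count, label)
    return result

if __name__ == '__main__':
    for (mc, row, ch), (count, label) in sorted(read_edac_counters().items()):
        print(f'mc{mc} csrow{row} ch{ch}: {count} CE, label={label or "(unset)"}')
```

Returning a plain dictionary keeps the snapshot easy to compare against an earlier one, which is all the polling loop above needs.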

The key elements in the error message reveal important information:

  • MC0: Memory controller 0
  • row 2: Chip-select row (csrow); how it maps to a physical DIMM slot varies by server model
  • channel 0: Memory channel (often corresponds to DIMM position)
  • ECC chipkill x4: Chipkill ECC error on x4 (4-bit-wide) DRAM devices

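For HP ProLiant servers, the physical DIMM location depends on the model, so confirm any mapping against the DIMM slot diagram for your system. The k8 EDAC driver in the logs above indicates an AMD Opteron platform, whereas the DL380 Gen8 is Intel-based, so treat the following layout as illustrative only: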

# Example mapping for DL380 Gen8:
row 0 = CPU1 DIMM slots 1-6 (A1-A6)
row 1 = CPU2 DIMM slots 1-6 (B1-B6)
channel 0 = slots 1,3,5 (A1,A3,A5 or B1,B3,B5)
channel 1 = slots 2,4,6 (A2,A4,A6 or B2,B4,B6)
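The example table can be expressed as a small lookup that returns the candidate slots for a row and channel pair. This is only a sketch of the table as written: the names `ROW_TO_BANK`, `CHANNEL_TO_SLOTS` and `candidate_slots` are invented here, and the values must be replaced with the layout of the actual server.

```python
# Encodes the example table above; values are illustrative, not a verified layout.
ROW_TO_BANK = {0: 'A', 1: 'B'}                   # row -> slot letter (CPU1 = A, CPU2 = B)
CHANNEL_TO_SLOTS = {0: (1, 3, 5), 1: (2, 4, 6)}  # channel -> slot numbers

def candidate_slots(row, channel):
    """Return the slot names a (row, channel) pair could refer to, or []."""
    if row not in ROW_TO_BANK or channel not in CHANNEL_TO_SLOTS:
        return []
    letter = ROW_TO_BANK[row]
    return [f'{letter}{n}' for n in CHANNEL_TO_SLOTS[channel]]

print(candidate_slots(0, 0))  # ['A1', 'A3', 'A5']
print(candidate_slots(1, 1))  # ['B2', 'B4', 'B6']
print(candidate_slots(2, 0))  # [] (row 2 is not covered by the example table)
```

Note that the row 2 reported in the logs above falls outside this example table, which is a sign that the table does not describe the machine that produced those logs.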