When working with Sun X2200-M2 servers (or any ECC-enabled hardware), you'll encounter two primary error types:
```python
# Sample ECC error classification
ECC_ERRORS = {
    'correctable': {
        'severity': 'warning',
        'description': 'single-bit errors fixed transparently by ECC',
        'system_impact': 'system continues operation',
    },
    'uncorrectable': {
        'severity': 'critical',
        'description': 'multi-bit errors that ECC cannot repair',
        'system_impact': 'machine check / system reset',
    },
}
```
Your IPMI logs show correctable errors from CPU0 DIMM2, while kernel EDAC messages reveal more frequent occurrences. This discrepancy happens because:
- IPMI logs at hardware level with threshold triggers
- EDAC reports every corrected error at OS level
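To see this mismatch directly, you can put the two counts side by side. A minimal sketch, assuming the sysfs layout used elsewhere in this section (per-channel `ce_count` files under `/sys/devices/system/edac`):

```bash
#!/bin/bash
# Compare the BMC-side and OS-side views of corrected errors.
sel_count=$(ipmitool sel list | grep -c "Correctable ECC")

edac_count=0
for f in /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count; do
    [ -r "$f" ] && edac_count=$(( edac_count + $(cat "$f") ))
done

echo "SEL correctable ECC entries: $sel_count"
echo "EDAC corrected error count : $edac_count"
```

On a healthy system both numbers stay flat; a growing EDAC count with few SEL entries simply means the BMC's logging threshold hasn't been crossed yet.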
For deeper analysis, use these commands:
```bash
# Check detailed memory error counters
ipmitool sel elist -v | grep -i ECC

# Get per-DIMM error stats (Linux-specific)
grep -H . /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
```
While isolated correctable errors aren't immediately dangerous, they can indicate:
- Aging memory chips nearing failure
- Marginal voltage/timing issues
- Possible physical connection problems
Create a monitoring script like this to track error rates:
```bash
#!/bin/bash
# Alert when one DIMM accumulates too many correctable ECC events.
ERROR_THRESHOLD=10
DIMM_LOCATION="CPU0 DIMM2"   # must match the string used in the SEL output

current_errors=$(ipmitool sel list | grep "$DIMM_LOCATION" | grep -c "Correctable ECC")

if [ "$current_errors" -gt "$ERROR_THRESHOLD" ]; then
    echo "WARNING: $DIMM_LOCATION has $current_errors correctable ECC errors" | \
        mail -s "ECC Warning" admin@example.com
fi
```
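To run this check unattended, schedule it with cron. A sketch assuming the script is saved as `/usr/local/bin/check_ecc.sh` (hypothetical path) and made executable:

```bash
# /etc/cron.d/ecc-check (hypothetical file): run the check hourly as root
0 * * * * root /usr/local/bin/check_ecc.sh
```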
For mission-critical systems:
| Error Rate | Recommended Action |
| --- | --- |
| < 1/day | Monitor weekly logs |
| 1-10/day | Schedule replacement within 30 days |
| > 10/day | Replace at the next maintenance window |
For non-critical systems, you might adopt a more relaxed approach, but document all occurrences for future reference.
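As a quick way to see which row of the table applies, count today's correctable events in the SEL; this assumes the MM/DD/YYYY timestamp format shown in the sample output later in this section:

```bash
# Correctable ECC events logged today
ipmitool sel list | grep "Correctable ECC" | grep -c "$(date +%m/%d/%Y)"
```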
When dealing with persistent correctable errors:
```bash
# Drop the page cache so cached pages are rewritten when reused
# (this does not repair the DIMM and at best masks errors temporarily)
echo 1 > /proc/sys/vm/drop_caches

# Check power/voltage readings - marginal supply voltages affect memory
ipmitool dcmi power reading        # if DCMI is supported
ipmitool sensor | grep -i volt

# Run memtester on the suspect region during maintenance; EDAC reports a
# page frame number, so the physical address is page << 12 (0x42a194000 here)
memtester -p 0x42a194000 100M 1
```
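If a particular page keeps logging corrected errors between maintenance windows, kernels built with CONFIG_MEMORY_FAILURE expose a soft-offline knob that migrates the page's contents and retires it. This is only a sketch, assuming that interface exists on your kernel and reusing the page frame from the EDAC output below:

```bash
# Soft-offline the page reported by EDAC (page frame 0x42a194 -> address 0x42a194000);
# the kernel migrates its contents and stops allocating from that page.
echo 0x42a194000 > /sys/devices/system/memory/soft_offline_page
```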
When working with Sun X2200-M2 servers equipped with ECC memory, you might encounter warnings like "correctable ECC errors detected" in the eLOM logs. These differ from uncorrectable errors, which cause immediate system resets: correctable errors are handled by the ECC mechanism itself, but they shouldn't be ignored completely.
The error output typically looks like this:
```
# ssh regress11 ipmitool sel elist
1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
```
Additionally, the kernel might report more frequent EDAC errors:
```
EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
```
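The `CE page` and `offset` fields locate the failing word: assuming 4 KiB pages, the physical address is `(page << 12) + offset`, which you can compute directly:

```bash
# Physical address of the corrected error: (page frame << 12) + offset
printf '0x%x\n' $(( (0x42a194 << 12) + 0x60 ))   # -> 0x42a194060
```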
For continuous monitoring, you can set up a script to parse these logs. Here's a basic Python example:
```python
import subprocess

def check_ecc_errors():
    # Pull the SEL from the BMC and report any correctable ECC entries.
    result = subprocess.run(['ipmitool', 'sel', 'elist'], stdout=subprocess.PIPE)
    lines = result.stdout.decode('utf-8').split('\n')
    ecc_errors = [line for line in lines if 'Correctable ECC' in line]
    if ecc_errors:
        print(f"Found {len(ecc_errors)} ECC correctable errors:")
        for error in ecc_errors:
            print(error)
    else:
        print("No ECC correctable errors detected")

check_ecc_errors()
```
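One way to put this to work is a small wrapper that mails the report only when errors are found. A sketch assuming the script above is saved as `/usr/local/bin/check_ecc.py` (hypothetical path) and a working local mail setup:

```bash
#!/bin/bash
# Mail the report only if the checker found correctable errors.
report=$(python3 /usr/local/bin/check_ecc.py)
if echo "$report" | grep -q "^Found"; then
    echo "$report" | mail -s "ECC correctable errors on $(hostname)" admin@example.com
fi
```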
While correctable errors don't require immediate action, they indicate potential memory issues. Consider these thresholds:
- 1-2 errors per week: Monitor closely
- 3-10 errors per week: Schedule memory replacement
- More than 10 errors per week: Replace immediately
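To see which bracket you fall into, sum the events logged over the past week; this sketch assumes GNU date and the MM/DD/YYYY SEL timestamps shown above:

```bash
#!/bin/bash
# Sum correctable ECC events over the last 7 days of SEL entries.
sel=$(ipmitool sel list | grep "Correctable ECC")
total=0
for i in $(seq 0 6); do
    day=$(date -d "-$i day" +%m/%d/%Y)
    total=$(( total + $(grep -c "$day" <<< "$sel") ))
done
echo "Correctable ECC events in the last 7 days: $total"
```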
For deeper analysis, use these commands:
```bash
# Check detailed memory information
dmidecode -t memory

# Check EDAC counters
cat /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

# Check for memory errors in the kernel log
dmesg | grep -i ECC
```
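When a module does need to be pulled, it helps to map the DIMM name in the error to a physical slot and part number; dmidecode lists these per module (field names vary slightly between BIOS versions):

```bash
# List slot labels, sizes and part numbers to locate the failing DIMM
dmidecode -t memory | grep -E 'Locator|Size|Part Number'
```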
In production environments:
- Implement automated monitoring for ECC events
- Maintain spare memory modules for critical systems
- Document all ECC events and replacement history
- In high-availability systems, consider replacing memory modules that show any correctable errors
When replacing memory:
```bash
# Proper shutdown procedure
shutdown -h now

# After replacement, verify the new memory
memtester 4G 1

# Confirm no new correctable ECC events are logged
ipmitool sel list | grep -i ECC
```
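If you also want to start from a clean event log and confirm the replacement modules still run with ECC enabled, something like this works on most IPMI-capable systems (clearing the SEL discards the old history, so archive it first):

```bash
# Archive and then clear the old event log so new entries are unambiguous
ipmitool sel elist > /var/log/ecc-history-$(date +%Y%m%d).log
ipmitool sel clear

# Verify the memory controller still reports ECC for the new modules
dmidecode -t memory | grep -i 'error correction'
```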