When working with Sun X2200-M2 servers (or any ECC-enabled hardware), you'll encounter two primary error types:
```python
# Sample ECC error classification
ECC_ERRORS = {
    'correctable': {
        'severity': 'warning',
        'description': 'single-bit errors fixed transparently by ECC',
        'system_impact': 'system continues operation',
    },
    'uncorrectable': {
        'severity': 'critical',
        'description': 'multi-bit errors that ECC cannot repair',
        'system_impact': 'machine check / system reset',
    },
}
```
Your IPMI logs show correctable errors from CPU0 DIMM2, while kernel EDAC messages reveal more frequent occurrences. This discrepancy happens because:
- IPMI logs at hardware level with threshold triggers
- EDAC reports every corrected error at OS level
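To see this mismatch directly, you can put the two counts side by side. A minimal sketch, assuming the sysfs layout used elsewhere in this section (per-channel `ce_count` files under `/sys/devices/system/edac`):

```bash
#!/bin/bash
# Compare the BMC-side and OS-side views of corrected errors.
sel_count=$(ipmitool sel list | grep -c "Correctable ECC")

edac_count=0
for f in /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count; do
    [ -r "$f" ] && edac_count=$(( edac_count + $(cat "$f") ))
done

echo "SEL correctable ECC entries: $sel_count"
echo "EDAC corrected error count : $edac_count"
```

On a healthy system both numbers stay flat; a growing EDAC count with few SEL entries simply means the BMC's logging threshold hasn't been crossed yet.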
For deeper analysis, use these commands:
```bash
# Check detailed memory error counters
ipmitool sel elist -v | grep -i ECC

# Get per-DIMM error stats (Linux-specific)
grep -H . /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
```
While isolated correctable errors aren't immediately dangerous, they can indicate:
- Aging memory chips nearing failure
- Marginal voltage/timing issues
- Possible physical connection problems
Create a monitoring script like this to track error rates:
```bash
#!/bin/bash
# Alert when one DIMM accumulates too many correctable ECC events.
ERROR_THRESHOLD=10
DIMM_LOCATION="CPU0 DIMM2"   # must match the string used in the SEL output

current_errors=$(ipmitool sel list | grep "$DIMM_LOCATION" | grep -c "Correctable ECC")

if [ "$current_errors" -gt "$ERROR_THRESHOLD" ]; then
    echo "WARNING: $DIMM_LOCATION has $current_errors correctable ECC errors" | \
        mail -s "ECC Warning" admin@example.com
fi
```
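To run this check unattended, schedule it with cron. A sketch assuming the script is saved as `/usr/local/bin/check_ecc.sh` (hypothetical path) and made executable:

```bash
# /etc/cron.d/ecc-check (hypothetical file): run the check hourly as root
0 * * * * root /usr/local/bin/check_ecc.sh
```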
For mission-critical systems:
| Error Rate | Recommended Action |
| --- | --- |
| < 1/day | Monitor weekly logs |
| 1-10/day | Schedule replacement within 30 days |
| > 10/day | Replace at the next maintenance window |
For non-critical systems, you might adopt a more relaxed approach, but document all occurrences for future reference.
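As a quick way to see which row of the table applies, count today's correctable events in the SEL; this assumes the MM/DD/YYYY timestamp format shown in the sample output later in this section:

```bash
# Correctable ECC events logged today
ipmitool sel list | grep "Correctable ECC" | grep -c "$(date +%m/%d/%Y)"
```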
When dealing with persistent correctable errors:
```bash
# Drop the page cache so cached pages are rewritten when reused
# (this does not repair the DIMM and at best masks errors temporarily)
echo 1 > /proc/sys/vm/drop_caches

# Check power/voltage readings - marginal supply voltages affect memory
ipmitool dcmi power reading        # if DCMI is supported
ipmitool sensor | grep -i volt

# Run memtester on the suspect region during maintenance; EDAC reports a
# page frame number, so the physical address is page << 12 (0x42a194000 here)
memtester -p 0x42a194000 100M 1
```
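If a particular page keeps logging corrected errors between maintenance windows, kernels built with CONFIG_MEMORY_FAILURE expose a soft-offline knob that migrates the page's contents and retires it. This is only a sketch, assuming that interface exists on your kernel and reusing the page frame from the EDAC output below:

```bash
# Soft-offline the page reported by EDAC (page frame 0x42a194 -> address 0x42a194000);
# the kernel migrates its contents and stops allocating from that page.
echo 0x42a194000 > /sys/devices/system/memory/soft_offline_page
```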
When working with Sun X2200-M2 servers equipped with ECC memory, you might encounter warnings like "correctable ECC errors detected" in the eLOM logs. These differ from uncorrectable errors, which cause immediate system resets: correctable errors are handled by the ECC mechanism itself, but they shouldn't be ignored completely.
The error output typically looks like this:
```
# ssh regress11 ipmitool sel elist
1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
```
Additionally, the kernel might report more frequent EDAC errors:
```
EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
```
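The `CE page` and `offset` fields locate the failing word: assuming 4 KiB pages, the physical address is `(page << 12) + offset`, which you can compute directly:

```bash
# Physical address of the corrected error: (page frame << 12) + offset
printf '0x%x\n' $(( (0x42a194 << 12) + 0x60 ))   # -> 0x42a194060
```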
For continuous monitoring, you can set up a script to parse these logs. Here's a basic Python example:
```python
import subprocess

def check_ecc_errors():
    # Pull the SEL from the BMC and report any correctable ECC entries.
    result = subprocess.run(['ipmitool', 'sel', 'elist'], stdout=subprocess.PIPE)
    lines = result.stdout.decode('utf-8').split('\n')
    ecc_errors = [line for line in lines if 'Correctable ECC' in line]
    if ecc_errors:
        print(f"Found {len(ecc_errors)} ECC correctable errors:")
        for error in ecc_errors:
            print(error)
    else:
        print("No ECC correctable errors detected")

check_ecc_errors()
```
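One way to put this to work is a small wrapper that mails the report only when errors are found. A sketch assuming the script above is saved as `/usr/local/bin/check_ecc.py` (hypothetical path) and a working local mail setup:

```bash
#!/bin/bash
# Mail the report only if the checker found correctable errors.
report=$(python3 /usr/local/bin/check_ecc.py)
if echo "$report" | grep -q "^Found"; then
    echo "$report" | mail -s "ECC correctable errors on $(hostname)" admin@example.com
fi
```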
While correctable errors don't require immediate action, they indicate potential memory issues. Consider these thresholds:
- 1-2 errors per week: Monitor closely
- 3-10 errors per week: Schedule memory replacement
- More than 10 errors per week: Replace immediately
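To see which bracket you fall into, sum the events logged over the past week; this sketch assumes GNU date and the MM/DD/YYYY SEL timestamps shown above:

```bash
#!/bin/bash
# Sum correctable ECC events over the last 7 days of SEL entries.
sel=$(ipmitool sel list | grep "Correctable ECC")
total=0
for i in $(seq 0 6); do
    day=$(date -d "-$i day" +%m/%d/%Y)
    total=$(( total + $(grep -c "$day" <<< "$sel") ))
done
echo "Correctable ECC events in the last 7 days: $total"
```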
For deeper analysis, use these commands:
```bash
# Check detailed memory information
dmidecode -t memory

# Check EDAC counters
cat /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

# Check for memory errors in the kernel log
dmesg | grep -i ECC
```
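When a module does need to be pulled, it helps to map the DIMM name in the error to a physical slot and part number; dmidecode lists these per module (field names vary slightly between BIOS versions):

```bash
# List slot labels, sizes and part numbers to locate the failing DIMM
dmidecode -t memory | grep -E 'Locator|Size|Part Number'
```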
In production environments:
- Implement automated monitoring for ECC events
- Maintain spare memory modules for critical systems
- Document all ECC events and replacement history
- In high-availability systems, consider replacing memory modules that show any correctable errors
When replacing memory:
```bash
# Proper shutdown procedure
shutdown -h now

# After replacement, verify the new memory
memtester 4G 1

# Confirm no new correctable ECC events are logged
ipmitool sel list | grep -i ECC
```
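If you also want to start from a clean event log and confirm the replacement modules still run with ECC enabled, something like this works on most IPMI-capable systems (clearing the SEL discards the old history, so archive it first):

```bash
# Archive and then clear the old event log so new entries are unambiguous
ipmitool sel elist > /var/log/ecc-history-$(date +%Y%m%d).log
ipmitool sel clear

# Verify the memory controller still reports ECC for the new modules
dmidecode -t memory | grep -i 'error correction'
```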