Understanding 3Ware RAID Status: DEGRADED vs ECC-ERROR in tw_cli Output for 9650SE Controllers

// Sample tw_cli output showing critical disk states
$ tw_cli /c0 show all
Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     DEGRADED         u0     931.51 GB   1953525168    5QJ07MAH
p1     ECC-ERROR        u0     931.51 GB   1953525168    5QJ0DCW9
p2     OK               u0     931.51 GB   1953525168    5QJ0DW9C
p3     OK               u0     931.51 GB   1953525168    5QJ0CKXJ

DEGRADED status indicates a disk has completely failed or been removed from the array. The controller can no longer communicate with the disk at a hardware level. In RAID configurations, this typically triggers automatic rebuild operations when a hot spare is available.

ECC-ERROR status represents a more subtle failure where the disk remains accessible but returns corrupt data. This often manifests during rebuild operations when the controller encounters uncorrectable read errors. The disk mechanics and firmware are still functional, but data integrity cannot be guaranteed.
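
Both states can be pulled out of the `tw_cli /c0 show all` listing programmatically. A minimal sketch, with the sample listing above inlined as a variable (on a live system, pipe the real tw_cli output instead):

```shell
# Filter a tw_cli port listing for problem states.
# The sample output from above is inlined here; against real hardware,
# replace the printf with: tw_cli /c0 show all
sample='p0     DEGRADED         u0     931.51 GB   1953525168    5QJ07MAH
p1     ECC-ERROR        u0     931.51 GB   1953525168    5QJ0DCW9
p2     OK               u0     931.51 GB   1953525168    5QJ0DW9C
p3     OK               u0     931.51 GB   1953525168    5QJ0CKXJ'

# Print port and status for any drive that is not OK
printf '%s\n' "$sample" | awk '$2 ~ /DEGRADED|ECC-ERROR/ {print $1, $2}'
# Output:
# p0 DEGRADED
# p1 ECC-ERROR
```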

// Alarm log showing the failure progression
$ tw_cli /c0 show alarms
c0   [Sun Nov 20 07:47:23 2011]  INFO      Rebuild started: unit=0
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Drive ECC error reported: port=1
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Source drive error occurred: port=1
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Rebuild failed: unit=0

This sequence reveals a cascade failure:

  1. Initial disk failure (port 0 marked DEGRADED)
  2. Rebuild attempt started using parity data
  3. Second disk (port 1) encountered ECC errors during rebuild
  4. Rebuild process aborted at 97% completion
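
The same cascade can be spotted mechanically by scanning the alarm log for ERROR-severity entries. A small sketch using the log excerpt above (on a live system, pipe in `tw_cli /c0 show alarms`):

```shell
# Count ERROR-severity alarm entries and flag a failed rebuild.
# The alarm log from above is inlined; on a live system use:
#   tw_cli /c0 show alarms
alarms='c0   [Sun Nov 20 07:47:23 2011]  INFO      Rebuild started: unit=0
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Drive ECC error reported: port=1
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Source drive error occurred: port=1
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Rebuild failed: unit=0'

printf '%s\n' "$alarms" | grep -c 'ERROR'    # prints 3
printf '%s\n' "$alarms" | grep -q 'Rebuild failed' && echo "rebuild failed"
```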

For ECC error scenarios, consider these tw_cli commands:

# Force continuation of rebuild despite ECC errors (ignoreECC is a
# unit-level setting, so it is applied to /c0/u0, not the controller)
tw_cli /c0/u0 set ignoreECC=on

# Manually restart the rebuild onto the replacement drive (port 0 in
# this scenario; disk= takes the port number of the drive to rebuild)
tw_cli /c0/u0 start rebuild disk=0

# Check SMART attributes of problematic disk
tw_cli /c0/p1 show smart

Important caveats when using ignoreECC:

  • This should only be used as a last resort for data recovery
  • Any sectors with ECC errors will contain corrupt data in the rebuilt array
  • Immediately backup data after successful rebuild
  • Replace all disks showing errors as soon as possible
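
If a forced rebuild does complete, a sensible cleanup sequence looks like the following (a command sketch, not verified output; adjust controller and unit IDs to your system):

```shell
# After a forced rebuild completes: restore strict ECC handling,
# verify the unit, then back up before trusting the array again.
tw_cli /c0/u0 set ignoreECC=off
tw_cli /c0/u0 start verify
# Back up immediately -- sectors that had ECC errors hold corrupt data.
```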

Automated monitoring script for 3Ware arrays:

#!/bin/bash
CONTROLLER="/c0"
LOG_FILE="/var/log/3ware_monitor.log"

# Check array status
STATUS=$(tw_cli $CONTROLLER show all | grep -E 'DEGRADED|ECC-ERROR')

if [ -n "$STATUS" ]; then
    echo "[$(date)] CRITICAL: Disk errors detected" >> $LOG_FILE
    echo "$STATUS" >> $LOG_FILE
    # Add email alert or other notification here
fi

# Force periodic verification (recommended weekly; verify is a
# unit-level command)
tw_cli $CONTROLLER/u0 start verify &>> $LOG_FILE
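
To run the script unattended, a cron entry is the usual approach. The paths and schedule below are illustrative, not part of any 3Ware tooling:

```shell
# /etc/cron.d/3ware-monitor -- example schedule (paths are assumptions):
# run the monitor every 15 minutes, and a verify on Sunday at 02:00
*/15 * * * *  root  /usr/local/sbin/3ware_monitor.sh
0 2 * * 0     root  /usr/sbin/tw_cli /c0/u0 start verify
```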

When working with 3Ware's tw_cli utility, two critical disk states often cause confusion:

// Example tw_cli output showing problem states
$ tw_cli /c0 show all | grep -E 'DEGRADED|ECC-ERROR'
p0     DEGRADED         u0     931.51 GB   1953525168    5QJ07MAH
p1     ECC-ERROR        u0     931.51 GB   1953525168    5QJ0DCW9

DEGRADED indicates a physical disk failure where the RAID controller can no longer communicate with the drive. This typically requires immediate replacement.

ECC-ERROR signifies uncorrectable read errors at the sector level. The drive still responds, but the controller hit unreadable sectors on it while reading source data during the rebuild.

The alarm log sequence reveals the failure progression:

  1. Initial DEGRADED state on port 0 (p0)
  2. Rebuild attempted using parity data
  3. ECC errors detected on port 1 (p1) during rebuild
  4. Rebuild failed at 97% completion

For administrators facing similar situations:

# First, verify the array status
tw_cli /c0 show all

# Attempt forced rebuild, ignoring source-drive ECC errors (use with
# caution; disk= names the port of the replacement drive)
tw_cli /c0/u0 start rebuild disk=0 ignoreECC

# Check SMART attributes on problematic drives
smartctl -a /dev/twa0 -d 3ware,1
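
To sweep every port rather than a single drive, smartctl's 3ware addressing can be looped. A sketch assuming four ports behind /dev/twa0 (both the device node and port count are assumptions for this example; the `3ware,N` index is the port number):

```shell
# Quick SMART health check for each port behind the 9650SE.
# /dev/twa0 and the port range are example values -- adjust to match
# your controller.
for port in 0 1 2 3; do
    echo "=== port $port ==="
    smartctl -H -d 3ware,"$port" /dev/twa0
done
```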

To avoid this situation:

# Run periodic verification (recommended weekly; schedule it via cron
# or the controller's verify task schedule, since verify is run
# per-unit rather than with a schedule= option)
tw_cli /c0/u0 start verify

# Email alerts for early detection are configured through the 3DM2
# web interface rather than tw_cli; alternatively, cron a script that
# checks `tw_cli /c0 show alarms` and mails on ERROR entries.

For critical data situations:

  1. Create sector-by-sector disk images using ddrescue
  2. Attempt reconstruction using RAID reconstruction tools
  3. Consult professional data recovery services for physical media issues
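
Step 1 with GNU ddrescue might look like the following (the device name and output paths are illustrative; always image onto a different physical disk than the one being rescued):

```shell
# Image the ECC-erroring drive sector by sector, keeping a map file
# so interrupted runs can resume and retry only the bad areas.
# /dev/sdb and the /mnt/rescue paths are example values.
ddrescue -d -r3 /dev/sdb /mnt/rescue/p1.img /mnt/rescue/p1.map
```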

For newer 3Ware/Avago/Broadcom controllers:

# Check battery backup unit status (the BBU is its own object)
tw_cli /c0/bbu show status

# View the controller's diagnostic log
tw_cli /c0 show diag