How a Single Failing SATA Disk Can Cripple Hardware RAID-10 Arrays: Diagnosis and Solutions for SysAdmins


During our recent outage, disk 8 (DEV_ID: /dev/twa0 [3ware_disk_08]) was the misbehaving drive. The 3ware RAID controller log showed repeated timeout events:

# dmesg | grep twa0
[32451.283911] twa0: ERROR: (0x04:0x0046): Disk timeout: port=8,reset=1
[32451.283945] twa0: ERROR: (0x04:0x0046): SCSI command failed: host=0 channel=0 id=8 lun=0
[32451.284012] twa0: ERROR: (0x04:0x0046): Auto-REMAP failed: port=8

The 3ware 9650SE controller in our system exhibits a "retry storm" behavior that isn't properly documented. When a disk starts throwing read errors but hasn't completely failed, the controller may:

  • Enter an aggressive retry loop (up to 30 seconds per I/O)
  • Block all array I/O during retries
  • Fail to properly mark the disk as failed

This explains why our monitoring showed disk 8's read errors correlating with system-wide latency spikes.
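
This failure mode can be confirmed from the host without waiting on the controller: watch whether the count of controller timeout messages and the exported unit's service times climb together. A minimal sketch, assuming the array is exported as /dev/sda and sysstat's iostat is installed:

#!/bin/bash
# Every 10 s, print the running count of 3ware timeout events alongside the
# unit's latest extended iostat sample; if both climb together, a single
# member disk is stalling the whole array.
while sleep 10; do
  echo "=== $(date '+%H:%M:%S')  timeouts so far: $(dmesg | grep -c 'Disk timeout') ==="
  iostat -dx /dev/sda 1 2 | grep '^sda' | tail -n 1   # second sample = current await/%util
done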

For immediate mitigation, we implemented these changes:

# Force shorter timeout on the 3ware controller (the 9650SE binds to the 3w-9xxx driver)
echo 15 > /sys/bus/pci/drivers/3w-9xxx/0:0X:0X/host0/scsi_host/host0/link_timeout
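
Independently of the controller, the kernel's generic per-device SCSI command timeout (30 s by default) can be lowered on the exported unit so the OS gives up sooner; a sketch, assuming the array appears as /dev/sda:

# Check, then lower, the SCSI command timeout for the exported RAID unit
cat /sys/block/sda/device/timeout
echo 15 > /sys/block/sda/device/timeout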

# Check SMART error log, self-test log, and overall health for port 8
smartctl -d 3ware,8 -l error /dev/twa0
smartctl -d 3ware,8 -l selftest /dev/twa0
smartctl -d 3ware,8 -H /dev/twa0
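
Rather than running these checks by hand, the same 3ware device syntax works in smartd.conf, so smartd polls every port behind the controller and mails on changes; a sketch with one line per port (port numbers and mail address are ours to adjust):

# /etc/smartd.conf excerpt: monitor individual disks behind the 3ware controller
/dev/twa0 -d 3ware,7 -a -m root
/dev/twa0 -d 3ware,8 -a -m root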

We've since updated our monitoring to catch these scenarios earlier:

#!/bin/bash
# Custom Nagios check: alert when any 3ware unit or drive is not in the OK state
TWCLI="/usr/sbin/tw_cli"
NOT_OK=$($TWCLI /c0 show | grep -c "DEGRADED\|REBUILDING\|INITIALIZING\|VERIFYING")
if [ "$NOT_OK" -gt 0 ]; then
  echo "CRITICAL: $NOT_OK units/drives in DEGRADED, REBUILDING, INITIALIZING, or VERIFYING state"
  exit 2
fi
echo "OK: controller reports all units and drives healthy"
exit 0

Additionally, we've switched to ZFS-based storage, where error handling and caching are controlled from the OS rather than by opaque RAID firmware:

# failmode=wait suspends pool I/O on catastrophic failure until devices recover or are replaced
zpool create -o failmode=wait tank mirror /dev/disk/by-id/ata-XXXXX /dev/disk/by-id/ata-YYYYY
# sync=disabled trades synchronous-write durability for lower latency (crash-consistency risk accepted)
zfs set sync=disabled tank
# primarycache=all keeps both data and metadata cached in ARC
zfs set primarycache=all tank
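
ZFS also makes pool health trivially scriptable; a minimal check we can drop into cron or Nagios (the "all pools are healthy" string is exactly what zpool status -x prints when nothing is wrong):

#!/bin/bash
# Exit CRITICAL if any pool is degraded or any vdev is accumulating errors
if ! zpool status -x | grep -q 'all pools are healthy'; then
  zpool status -x
  exit 2
fi
echo "OK: all pools healthy"
exit 0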

What we learned the hard way: Enterprise SATA RAID requires enterprise-grade disks with proper TLER (Time Limited Error Recovery). A consumer disk will keep retrying a bad sector internally, often for a minute or more, while an enterprise disk gives up within a bounded time and reports the error, letting the controller serve the read from the surviving mirror (we now verify this per drive with the SCT ERC check shown below).

Our current disk procurement spec now includes:

  • Minimum 1M hours MTBF
  • TLER/CERC support (300-500ms timeout)
  • Explicit RAID-optimized firmware
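
Whether a candidate drive actually enforces a bounded error-recovery time can be checked, and on many drives set, from the OS via SMART's SCT Error Recovery Control before the disk ever joins an array. A sketch, with /dev/sdX standing in for the drive under test:

# Query the drive's current SCT ERC read/write recovery limits
smartctl -l scterc /dev/sdX

# Set both limits to 0.5 s to match the spec above (values are in units of 100 ms);
# consumer drives without TLER/ERC typically reject this or forget it after a power cycle
smartctl -l scterc,5,5 /dev/sdX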

During the production incident, our 3ware-based RAID-10 array experienced catastrophic performance degradation despite being nominally healthy. The smoking gun appeared in the SMART logs, which showed disk 7 developing bad sectors while disk 8 exhibited strange read error fluctuations that correlated perfectly with the I/O wait spikes.

# Sample SMART error output showing critical patterns
Nov 15 06:49:44 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_07], 
  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 171 to 170
Nov 15 06:49:45 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_10], 
  16 Currently unreadable (pending) sectors

The 3ware controller's timeout behavior became the critical failure point. When disk 8 entered a degraded state:

  • Each I/O operation hitting problematic sectors triggered 30-second retries
  • The controller's conservative error recovery locked the entire bus
  • DRBD replication compounded the problem by retrying failed writes (see the configuration sketch below)
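
On the DRBD side, the usual mitigation is to let a node detach from a failing backing device rather than block behind it; a configuration sketch (resource name and file are illustrative, not our exact production config):

# /etc/drbd.d/r0.res excerpt (illustrative)
resource r0 {
    disk {
        on-io-error detach;   # go diskless on local I/O errors instead of
                              # stalling replication behind controller retries
    }
}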

Here's the Python script we implemented for proactive disk health monitoring:

import subprocess
import time
from collections import deque

DISK_TIMEOUT_THRESHOLD = 5  # seconds
ERROR_WINDOW_SIZE = 10
error_history = {}

def check_disk_health(device):
    try:
        start = time.time()
        result = subprocess.run(
            ["smartctl", "-A", device],
            capture_output=True,
            text=True,
            timeout=DISK_TIMEOUT_THRESHOLD
        )
        
        # Parse critical SMART attributes
        raw_read_error = parse_smart_attribute(result.stdout, "Raw_Read_Error_Rate")
        pending_sectors = parse_smart_attribute(result.stdout, "Current_Pending_Sector")
        
        # Track trends
        if device not in error_history:
            error_history[device] = deque(maxlen=ERROR_WINDOW_SIZE)
            
        error_history[device].append((raw_read_error, pending_sectors))
        
        return analyze_trends(device)
        
    except subprocess.TimeoutExpired:
        return {"status": "CRITICAL", "reason": f"Timeout ({DISK_TIMEOUT_THRESHOLD}s)"}

For 3ware/LSI cards, these CLI commands proved essential:

# Reduce error recovery time (3ware-specific)
tw_cli /c0 set dr=disable
tw_cli /c0 set ecc=disable
tw_cli /c0 set cc=on

# Alternative for LSI MegaRAID:
MegaCli -AdpSetProp -EnableJBOD -0 -aALL
MegaCli -AdpSetProp -DiskTimeOut -30 -aALL

Key takeaways from our incident:

  1. RAID-10 provides redundancy but not performance isolation
  2. Controller firmware plays a critical role in failure scenarios
  3. SMART monitoring alone won't catch all failure modes
  4. Disk timeouts should be tuned for your workload

We've since implemented a multi-layered monitoring approach combining:

  • Real-time I/O latency percentiles (via Prometheus; see the query sketch below)
  • Controller-level error counters
  • Periodic destructive read tests
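
For the Prometheus side, node_exporter's disk counters are enough to derive average per-device read latency (true percentiles need a histogram-capable exporter, so treat this as a coarse first alarm); a sketch of one alerting expression, with the 100 ms threshold left for you to tune:

# Average read latency per device over the last 5 minutes, flagged above 100 ms
rate(node_disk_read_time_seconds_total[5m])
  / rate(node_disk_reads_completed_total[5m]) > 0.1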