How a Single Failing SATA Disk Can Cripple Hardware RAID-10 Arrays: Diagnosis and Solutions for SysAdmins


During our recent outage, disk 8 (DEV_ID: /dev/twa0 [3ware_disk_08]) was the misbehaving drive. The 3ware RAID controller log showed repeated timeout events:

# dmesg | grep twa0
[32451.283911] twa0: ERROR: (0x04:0x0046): Disk timeout: port=8,reset=1
[32451.283945] twa0: ERROR: (0x04:0x0046): SCSI command failed: host=0 channel=0 id=8 lun=0
[32451.284012] twa0: ERROR: (0x04:0x0046): Auto-REMAP failed: port=8

The 3ware 9650SE controller in our system exhibits a "retry storm" behavior that isn't properly documented. When a disk starts throwing read errors but hasn't completely failed, the controller may:

  • Enter an aggressive retry loop (up to 30 seconds per I/O)
  • Block all array I/O during retries
  • Fail to properly mark the disk as failed

This explains why our monitoring showed disk 8's read errors correlating with system-wide latency spikes.
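
This failure mode can be confirmed from the host without waiting on the controller: watch whether the count of controller timeout messages and the exported unit's service times climb together. A minimal sketch, assuming the array is exported as /dev/sda and sysstat's iostat is installed:

#!/bin/bash
# Every 10 s, print the running count of 3ware timeout events alongside the
# unit's latest extended iostat sample; if both climb together, a single
# member disk is stalling the whole array.
while sleep 10; do
  echo "=== $(date '+%H:%M:%S')  timeouts so far: $(dmesg | grep -c 'Disk timeout') ==="
  iostat -dx /dev/sda 1 2 | grep '^sda' | tail -n 1   # second sample = current await/%util
done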

For immediate mitigation, we implemented these changes:

# Force shorter timeout on the 3ware controller (the 9650SE binds to the 3w-9xxx driver)
echo 15 > /sys/bus/pci/drivers/3w-9xxx/0:0X:0X/host0/scsi_host/host0/link_timeout
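
Independently of the controller, the kernel's generic per-device SCSI command timeout (30 s by default) can be lowered on the exported unit so the OS gives up sooner; a sketch, assuming the array appears as /dev/sda:

# Check, then lower, the SCSI command timeout for the exported RAID unit
cat /sys/block/sda/device/timeout
echo 15 > /sys/block/sda/device/timeout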

# Check SMART error log, self-test log, and overall health for port 8
smartctl -d 3ware,8 -l error /dev/twa0
smartctl -d 3ware,8 -l selftest /dev/twa0
smartctl -d 3ware,8 -H /dev/twa0
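
Rather than running these checks by hand, the same 3ware device syntax works in smartd.conf, so smartd polls every port behind the controller and mails on changes; a sketch with one line per port (port numbers and mail address are ours to adjust):

# /etc/smartd.conf excerpt: monitor individual disks behind the 3ware controller
/dev/twa0 -d 3ware,7 -a -m root
/dev/twa0 -d 3ware,8 -a -m root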

We've since updated our monitoring to catch these scenarios earlier:

#!/bin/bash
# Custom Nagios check: alert when any 3ware unit or drive is not in the OK state
TWCLI="/usr/sbin/tw_cli"
NOT_OK=$($TWCLI /c0 show | grep -c "DEGRADED\|REBUILDING\|INITIALIZING\|VERIFYING")
if [ "$NOT_OK" -gt 0 ]; then
  echo "CRITICAL: $NOT_OK units/drives in DEGRADED, REBUILDING, INITIALIZING, or VERIFYING state"
  exit 2
fi
echo "OK: controller reports all units and drives healthy"
exit 0

Additionally, we've switched to ZFS-based storage, where error handling and caching are controlled from the OS rather than by opaque RAID firmware:

# failmode=wait suspends pool I/O on catastrophic failure until devices recover or are replaced
zpool create -o failmode=wait tank mirror /dev/disk/by-id/ata-XXXXX /dev/disk/by-id/ata-YYYYY
# sync=disabled trades synchronous-write durability for lower latency (crash-consistency risk accepted)
zfs set sync=disabled tank
# primarycache=all keeps both data and metadata cached in ARC
zfs set primarycache=all tank
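
ZFS also makes pool health trivially scriptable; a minimal check we can drop into cron or Nagios (the "all pools are healthy" string is exactly what zpool status -x prints when nothing is wrong):

#!/bin/bash
# Exit CRITICAL if any pool is degraded or any vdev is accumulating errors
if ! zpool status -x | grep -q 'all pools are healthy'; then
  zpool status -x
  exit 2
fi
echo "OK: all pools healthy"
exit 0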

What we learned the hard way: Enterprise SATA RAID requires enterprise-grade disks with proper TLER (Time Limited Error Recovery). A consumer disk will keep retrying a bad sector internally, often for a minute or more, while an enterprise disk gives up within a bounded time and reports the error, letting the controller serve the read from the surviving mirror (we now verify this per drive with the SCT ERC check shown below).

Our current disk procurement spec now includes:

  • Minimum 1M hours MTBF
  • TLER/CERC support (300-500ms timeout)
  • Explicit RAID-optimized firmware
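
Whether a candidate drive actually enforces a bounded error-recovery time can be checked, and on many drives set, from the OS via SMART's SCT Error Recovery Control before the disk ever joins an array. A sketch, with /dev/sdX standing in for the drive under test:

# Query the drive's current SCT ERC read/write recovery limits
smartctl -l scterc /dev/sdX

# Set both limits to 0.5 s to match the spec above (values are in units of 100 ms);
# consumer drives without TLER/ERC typically reject this or forget it after a power cycle
smartctl -l scterc,5,5 /dev/sdX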

During the production incident, our 3ware-based RAID-10 array experienced catastrophic performance degradation despite being nominally healthy. The smoking gun appeared in the SMART logs, which showed disk 7 developing bad sectors while disk 8 exhibited strange read error fluctuations that correlated perfectly with the I/O wait spikes.

# Sample SMART error output showing critical patterns
Nov 15 06:49:44 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_07], 
  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 171 to 170
Nov 15 06:49:45 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_10], 
  16 Currently unreadable (pending) sectors

The 3ware controller's timeout behavior became the critical failure point. When disk 8 entered a degraded state:

  • Each I/O operation hitting problematic sectors triggered 30-second retries
  • The controller's conservative error recovery locked the entire bus
  • DRBD replication compounded the problem by retrying failed writes (see the configuration sketch below)
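
On the DRBD side, the usual mitigation is to let a node detach from a failing backing device rather than block behind it; a configuration sketch (resource name and file are illustrative, not our exact production config):

# /etc/drbd.d/r0.res excerpt (illustrative)
resource r0 {
    disk {
        on-io-error detach;   # go diskless on local I/O errors instead of
                              # stalling replication behind controller retries
    }
}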

Here's the Python script we implemented for proactive disk health monitoring:

import subprocess
import time
from collections import deque

DISK_TIMEOUT_THRESHOLD = 5  # seconds
ERROR_WINDOW_SIZE = 10
error_history = {}

def check_disk_health(device):
    try:
        start = time.time()
        result = subprocess.run(
            ["smartctl", "-A", device],
            capture_output=True,
            text=True,
            timeout=DISK_TIMEOUT_THRESHOLD
        )
        
        # Parse critical SMART attributes
        raw_read_error = parse_smart_attribute(result.stdout, "Raw_Read_Error_Rate")
        pending_sectors = parse_smart_attribute(result.stdout, "Current_Pending_Sector")
        
        # Track trends
        if device not in error_history:
            error_history[device] = deque(maxlen=ERROR_WINDOW_SIZE)
            
        error_history[device].append((raw_read_error, pending_sectors))
        
        return analyze_trends(device)
        
    except subprocess.TimeoutExpired:
        return {"status": "CRITICAL", "reason": f"Timeout ({DISK_TIMEOUT_THRESHOLD}s)"}

For 3ware/LSI cards, these CLI commands proved essential:

# Reduce error recovery time (3ware-specific)
tw_cli /c0 set dr=disable
tw_cli /c0 set ecc=disable
tw_cli /c0 set cc=on

# Alternative for LSI MegaRAID:
MegaCli -AdpSetProp -EnableJBOD -0 -aALL
MegaCli -AdpSetProp -DiskTimeOut -30 -aALL

Key takeaways from our incident:

  1. RAID-10 provides redundancy but not performance isolation
  2. Controller firmware plays a critical role in failure scenarios
  3. SMART monitoring alone won't catch all failure modes
  4. Disk timeouts should be tuned for your workload

We've since implemented a multi-layered monitoring approach combining:

  • Real-time I/O latency percentiles (via Prometheus; see the query sketch below)
  • Controller-level error counters
  • Periodic destructive read tests
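
For the Prometheus side, node_exporter's disk counters are enough to derive average per-device read latency (true percentiles need a histogram-capable exporter, so treat this as a coarse first alarm); a sketch of one alerting expression, with the 100 ms threshold left for you to tune:

# Average read latency per device over the last 5 minutes, flagged above 100 ms
rate(node_disk_read_time_seconds_total[5m])
  / rate(node_disk_reads_completed_total[5m]) > 0.1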