During our recent outage, we observed that disk 8 (DEV_ID: /dev/twa0 [3ware_disk_08]) exhibited this exact behavior. The 3ware RAID controller log showed repeated timeout events:
# dmesg | grep twa0
[32451.283911] twa0: ERROR: (0x04:0x0046): Disk timeout: port=8,reset=1
[32451.283945] twa0: ERROR: (0x04:0x0046): SCSI command failed: host=0 channel=0 id=8 lun=0
[32451.284012] twa0: ERROR: (0x04:0x0046): Auto-REMAP failed: port=8
The 3ware 9650SE controller in our system implements a "retry storm" behavior that isn't properly documented. When a disk starts throwing read errors but hasn't completely failed, the controller may:
- Enter an aggressive retry loop (up to 30 seconds per I/O)
- Block all array I/O during retries
- Fail to properly mark the disk as failed
This explains why our monitoring showed disk 8's read errors correlating with system-wide latency spikes.
For immediate mitigation, we implemented these changes:
# Shorten the kernel's SCSI command timeout for the unit the controller exports (default is 30s; sdX = exported unit)
echo 15 > /sys/block/sdX/device/timeout
# Pull the SMART error log, self-test log, and overall health for the suspect disk
smartctl -d 3ware,8 -l error /dev/twa0
smartctl -d 3ware,8 -l selftest /dev/twa0
smartctl -d 3ware,8 -H /dev/twa0
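Beyond the single suspect disk, the same smartctl syntax can sweep every port behind the controller. A minimal sketch, assuming ports 0-11 on a 12-bay chassis (adjust the range to your backplane):
for port in $(seq 0 11); do
    echo "=== port $port ==="
    smartctl -d 3ware,$port -H /dev/twa0
done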
We've since updated our monitoring to catch these scenarios earlier:
#!/bin/bash
# Custom Nagios check: alert when the 3ware controller reports units or drives not in an OK state
TWCLI="/usr/sbin/tw_cli"
NOT_OK=$($TWCLI /c0 show | grep -c "DEGRADED\|REBUILDING\|INITIALIZING\|VERIFYING")
if [ "$NOT_OK" -gt 0 ]; then
    echo "CRITICAL: $NOT_OK units/drives degraded, rebuilding, initializing, or verifying"
    exit 2
fi
echo "OK: controller /c0 reports all units and drives healthy"
exit 0
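The tw_cli state check only fires once the controller has marked something as degraded or rebuilding; to catch the retry storms themselves we'd also want to watch the kernel log for the timeout pattern shown at the top. A rough companion sketch (the match string mirrors our dmesg output and may need adjusting for other driver versions):
#!/bin/bash
if dmesg | grep -q "twa0: ERROR.*Disk timeout"; then
    echo "CRITICAL: 3ware disk timeout present in the kernel ring buffer"
    exit 2
fi
echo "OK: no 3ware disk timeouts in the kernel ring buffer"
exit 0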
Additionally, we've switched to ZFS-based storage, which gives us end-to-end checksumming and per-device error counters:
# failmode=wait (the default) blocks pool I/O until a faulted device recovers instead of returning errors
zpool create -o failmode=wait tank mirror /dev/disk/by-id/ata-XXXXX /dev/disk/by-id/ata-YYYYY
zfs set sync=disabled tank       # drops synchronous write guarantees to cut latency (a crash-safety trade-off)
zfs set primarycache=all tank    # cache both data and metadata in ARC (the default, set explicitly)
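On the ZFS side the equivalent health check is simpler, because the pool keeps per-device read/write/checksum error counters itself. A minimal sketch for the pool created above, assuming a weekly cron slot for the scrub:
# Weekly scrub plus a cheap health poll
zpool scrub tank
zpool status -x | grep -q "all pools are healthy" || echo "CRITICAL: check 'zpool status tank'"
Since `zpool status -x` prints a one-line summary when everything is healthy, the grep doubles as an exit-code test for Nagios-style wrappers.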
What we learned the hard way: Enterprise SATA RAID requires enterprise-grade disks with proper TLER (Time Limited Error Recovery). Consumer disks will retry indefinitely, while enterprise disks fail fast.
Our current disk procurement spec now includes:
- Minimum 1M hours MTBF
- TLER/CERC support (300-500ms timeout; verification sketched after this list)
- Explicit RAID-optimized firmware
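One way to verify the TLER/CERC line item during qualification, rather than trusting the datasheet, is smartctl's SCT ERC support; a sketch where the port number is a placeholder and the timer values are in 100 ms units:
# Query the drive's current error-recovery-control timers (port N behind the 3ware card)
smartctl -d 3ware,N -l scterc /dev/twa0
# Set read/write ERC to 0.5s each (5 x 100 ms), in line with the spec above
smartctl -d 3ware,N -l scterc,5,5 /dev/twa0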
During a recent production incident, our 3ware-based RAID-10 array experienced catastrophic performance degradation despite being nominally healthy. The smoking gun appeared in the SMART logs: disk 7's Raw_Read_Error_Rate kept drifting and disk 10 accumulated pending (unreadable) sectors, while disk 8's read error fluctuations correlated exactly with the I/O wait spikes.
# Sample SMART error output showing critical patterns
Nov 15 06:49:44 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_07], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 171 to 170
Nov 15 06:49:45 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_10], 16 Currently unreadable (pending) sectors
The 3ware controller's timeout behavior became the critical failure point. When disk 8 entered a degraded state:
- Each I/O operation hitting problematic sectors triggered 30-second retries
- The controller's conservative error recovery locked the entire bus
- DRBD replication compounded the problem by retrying failed writes
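The DRBD leg of that chain is tunable: depending on version and configuration, a resource can keep passing local I/O errors upward instead of cutting the failing disk loose, and that behaviour is controlled by the on-io-error handler. A hedged sketch of the setting, assuming a resource named r0 and DRBD 8.x config syntax:
resource r0 {
  disk {
    on-io-error detach;   # go diskless and serve I/O from the peer instead of retrying against the failing local disk
  }
}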
Here's the Python script we implemented for proactive disk health monitoring:
import subprocess
import time
from collections import deque

DISK_TIMEOUT_THRESHOLD = 5  # seconds before a smartctl call is treated as a hung disk
ERROR_WINDOW_SIZE = 10      # samples of history kept per device
error_history = {}

def parse_smart_attribute(output, name):
    # Minimal parser for the `smartctl -A` table: returns the RAW_VALUE column
    # for the named attribute, or None if the drive doesn't report it.
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == name:
            return int(fields[9])
    return None

def analyze_trends(device):
    # Minimal trend check: pending sectors or a moving read-error counter are a warning.
    history = error_history[device]
    read_errors, pending = history[-1]
    if pending:
        return {"status": "WARNING", "reason": f"{pending} pending sectors"}
    if len(history) >= 2 and read_errors != history[0][0]:
        return {"status": "WARNING", "reason": "Raw_Read_Error_Rate is changing"}
    return {"status": "OK", "reason": ""}

def check_disk_health(device):
    try:
        start = time.time()
        result = subprocess.run(
            ["smartctl", "-A", device],
            capture_output=True,
            text=True,
            timeout=DISK_TIMEOUT_THRESHOLD,
        )
        # Parse critical SMART attributes
        raw_read_error = parse_smart_attribute(result.stdout, "Raw_Read_Error_Rate")
        pending_sectors = parse_smart_attribute(result.stdout, "Current_Pending_Sector")
        # Track trends over the last ERROR_WINDOW_SIZE samples
        if device not in error_history:
            error_history[device] = deque(maxlen=ERROR_WINDOW_SIZE)
        error_history[device].append((raw_read_error, pending_sectors))
        verdict = analyze_trends(device)
        verdict["smartctl_seconds"] = time.time() - start  # slow responses are themselves a warning sign
        return verdict
    except subprocess.TimeoutExpired:
        return {"status": "CRITICAL", "reason": f"Timeout ({DISK_TIMEOUT_THRESHOLD}s)"}
For 3ware/LSI cards, these CLI commands proved essential:
# Reduce error recovery time (3ware-specific)
tw_cli /c0 set dr=disable
tw_cli /c0 set ecc=disable
tw_cli /c0 set cc=on
# Alternative for LSI MegaRAID:
MegaCli -AdpSetProp -EnableJBOD -0 -aALL
MegaCli -AdpSetProp -DiskTimeOut -30 -aALL
Key takeaways from our incident:
- RAID-10 provides redundancy but not performance isolation
- Controller firmware plays a critical role in failure scenarios
- SMART monitoring alone won't catch all failure modes
- Disk timeouts should be tuned for your workload
We've since implemented a multi-layered monitoring approach combining:
- Real-time I/O latency percentiles (via Prometheus; ad-hoc spot checks are sketched after this list)
- Controller-level error counters
- Periodic destructive read tests
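For ad-hoc spot checks while tuning, plain iostat and tw_cli give a quick view of the same territory; a rough sketch:
iostat -xd 5        # watch the await and %util columns for the unit the controller exports
tw_cli /c0 show     # unit, drive, and rebuild status straight from the controller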