RAID 6 Data Recovery: Handling Triple Disk Failure in 16-Drive SAS Array During OS Boot Failure


When dealing with enterprise storage systems, encountering multiple simultaneous disk failures in RAID 6 arrays can create complex recovery scenarios. Your situation involving:

  • 16-drive SAS configuration
  • Triple disk failure
  • Degraded array state
  • OS boot failure

requires careful handling to avoid permanent data loss. Unlike RAID 5, which tolerates a single disk failure, RAID 6 is designed to withstand two simultaneous disk failures; a third failure pushes the array beyond its design limits.

Step 1: Hardware Assessment
First verify physical disk health through controller utilities. Example SAS diagnostic command:

sas2ircu 0 display
# Returns controller and disk status
# Look for "Ready" state and correct WWN

Step 2: Create Sector-Level Images
Before any recovery attempts, create forensic copies of failed drives:

dd if=/dev/sdX of=/mnt/backup/sdX.img bs=1M conv=noerror,sync
# conv=noerror,sync keeps going past read errors and pads unreadable blocks with zeros
# With bs=1M a single bad sector zero-fills up to 1 MiB; use a smaller block size on badly damaged disks
# Repeat for each failed drive
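
If GNU ddrescue is available in the rescue environment, it copes with failing media better than plain dd because it skips bad areas, retries them later, and keeps a resumable map file; a minimal sketch (device and output paths are placeholders):

ddrescue -d -r3 /dev/sdX /mnt/backup/sdX.img /mnt/backup/sdX.mapfile
# -d uses direct disc access; -r3 retries bad sectors three times
# The map file lets an interrupted pass resume where it stopped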

Using a live CD with proper SAS support is indeed the recommended approach:

# Recommended recovery distros:
# - SystemRescueCD (with mdadm support)
# - Knoppix STD
# - Ubuntu Server Live CD

# After booting, load appropriate modules:
modprobe mpt3sas
modprobe raid456    # the md RAID 6 personality is provided by raid456 on current kernels
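
Once the controller driver and RAID personality are loaded, confirm the kernel sees the member disks and any existing md metadata before touching the array; a quick check, assuming mdadm is included in the live environment:

cat /proc/mdstat                      # registered RAID personalities and any auto-assembled arrays
mdadm --examine --scan                # look for md superblocks on all block devices
lsblk -o NAME,SIZE,TYPE,WWN           # cross-check detected disks against the controller's view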

For advanced recovery scenarios, consider these approaches:

  • Force-assemble the degraded array:
mdadm --assemble --force /dev/md0 /dev/sd[a-m]
# /dev/sd[a-m] stands for the 13 remaining functional member disks; adjust to your actual device names
  • Manual P/Q parity recalculation (requires deep technical knowledge; first recover the array geometry, as sketched after this list):
# This requires custom scripting matched to your chunk size, layout, and disk order
# Pseudo-code example only; recalculate_parity is not a real utility
for stripe in $(seq 0 "$total_stripes"); do
    recalculate_parity --stripe "$stripe" --disks /dev/sd[a-p]
done
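
Before attempting any manual parity work, recover the exact array geometry (chunk size, layout, device roles, UUID, event counts) from the surviving members' superblocks, since any recalculation must match it; a sketch, assuming md metadata is still readable and /dev/sd[a-m] are the surviving members:

for d in /dev/sd[a-m]; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Raid Level|Chunk Size|Layout|Device Role|Array UUID|Events'
done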

When DIY methods fail, consider professional tools:

Tool            Best For
R-Studio        File-level recovery
UFS Explorer    RAID reconstruction
ReclaiMe        Automatic parameter detection

For future configurations, consider improving resilience:

# Example: adding hot spares to an existing mdadm array (device names are examples)
mdadm /dev/md0 --add /dev/sdq /dev/sdr
# Disks added while the array is healthy remain available as hot spares
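
A quick way to confirm the spares were registered and the array is healthy:

mdadm --detail /dev/md0 | grep -E 'State|Spare Devices'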

Remember that a 16-drive RAID 6 array has significant rebuild times; for large arrays, consider alternatives such as RAID 60.
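
As a rough sanity check on rebuild duration, the lower bound is drive capacity divided by sustained rebuild speed; the figures below are illustrative assumptions, not measurements:

DRIVE_TB=8            # assumed drive size
REBUILD_MB_S=100      # assumed sustained rebuild rate
echo "$(( DRIVE_TB * 1000000 / REBUILD_MB_S / 3600 )) hours minimum per failed disk"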


When three disks fail simultaneously in a RAID 6 array (which normally tolerates two-disk failures), the array becomes completely inaccessible. This is particularly problematic with SAS drives in enterprise environments where immediate data access is critical. The system's inability to boot confirms the storage subsystem failure has cascaded to the OS level.

Before attempting recovery:

  • Physically label all failed drives
  • Document the original disk order and slot mapping in the array (see the capture loop after this list)
  • Check SMART status of remaining disks: smartctl -a /dev/sdX
  • Create sector-by-sector images of failed drives if possible: dd if=/dev/sdX of=/mnt/backup/failed_disk1.img bs=1M conv=noerror,sync
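
A simple way to capture that documentation before powering anything down; the output directory is a placeholder, and /dev/disk/by-path records which controller port each device sits behind:

mkdir -p /mnt/backup/array-docs
ls -l /dev/disk/by-path/ > /mnt/backup/array-docs/slot-map.txt
for d in /dev/sd?; do
    smartctl -a "$d" > "/mnt/backup/array-docs/$(basename "$d").smart.txt" 2>&1
done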

Using a live CD can bypass the corrupted OS installation:


# Example SystemRescueCD boot commands:
boot: rescue64 docache
# docache copies the rescue system into RAM; exact labels and options vary between SystemRescueCD releases
# Then assemble the remaining array members (adjust the device list to the surviving disks):
mdadm --assemble --force /dev/md0 /dev/sd[b-z] --verbose
# Check array status:
mdadm --detail /dev/md0
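
If the forced assembly succeeds, resist mounting read-write straight away; check the filesystem without modifying anything first (assumes the array holds a filesystem directly rather than LVM, and /mnt/recovery is a placeholder mount point):

fsck -n /dev/md0                      # report-only check, makes no changes
mkdir -p /mnt/recovery
mount -o ro /dev/md0 /mnt/recovery    # read-only mount to verify data before any writes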

When standard tools fail, consider:


#!/usr/bin/env python3
# RAID 6 recovery script fragment
# Illustrative only: 'raiddriver' is a hypothetical library, not a published package
from raiddriver import RAID6Driver, RAIDDegradedError

def rebuild_parity(drives):
    # Stripe size must match the geometry of the original array
    raid = RAID6Driver(stripe_size=512)
    try:
        raid.load_config('/etc/mdadm.conf')
        return raid.rebuild(max_failures=3)
    except RAIDDegradedError:
        print("Insufficient disks for automatic rebuild")
        return False

For SAS environments:

  • Use sas2ircu to check controller and drive slot status
  • SAS drives often report different failure modes than SATA drives (see the smartctl example below)
  • Enterprise arrays may have vendor-specific recovery procedures
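
For example, SMART output for a SAS drive centers on grown-defect and error-counter data rather than SATA-style attribute tables; the fields worth watching (device name is a placeholder):

smartctl -a /dev/sdX | grep -E 'Health|grown defect|Non-medium error'
# A growing defect list or rising non-medium error count usually precedes outright failure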

Post-recovery recommendations:


# Add to cron for a daily one-shot array health check (adjust the mail address)
0 3 * * * /usr/sbin/mdadm --monitor --scan --oneshot --mail=admin@domain.com
# Continuous SMART monitoring: run smartd as a service rather than from cron, e.g.
# systemctl enable --now smartd
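
It is also worth scheduling a periodic consistency scrub so silent parity mismatches are caught before the next disk failure; the schedule below is only an example:

# Monthly md consistency scrub (the kernel re-reads every stripe and verifies P/Q parity)
0 4 1 * * echo check > /sys/block/md0/md/sync_action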