RAID-5 Dual Disk Failure: Recovery Strategies and Low-Level mdadm Commands for Linux Sysadmins


When two disks fail simultaneously in a RAID-5 array (especially with large 3TB SATA drives), we're dealing with one of the worst-case scenarios for storage administrators. The Dell PERC controller's behavior you're observing - where one disk shows as "missing" and another as "degraded" - suggests possible controller firmware issues combined with actual disk failures.

A plausible failure sequence (the metadata check after this list helps confirm the order):

1. Disk 1 fails completely (electrical/mechanical)
2. During rebuild attempt, Disk 3 experiences URE (Unrecoverable Read Error)
3. Controller marks Disk 3 as degraded due to timeout
4. Rebuild stalls (here at 1%) because the surviving parity can no longer reconstruct the affected stripes
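
To confirm that ordering, compare the per-member superblocks: the disk with the oldest "Update Time" and lowest "Events" count left the array first. A quick check, assuming the members are /dev/sdb through /dev/sdf as in the commands below:

# Compare member metadata to reconstruct the failure order
mdadm --examine /dev/sd[b-f] | grep -E "^/dev|Update Time|Events|Array State"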

First, examine the array status without relying on the PERC BIOS:

# Install mdadm if not present
yum install mdadm -y

# Examine array components
mdadm --examine /dev/sd[b-f] | grep -A5 "Event"

# Force assemble in degraded mode
mdadm --assemble --force /dev/md0 /dev/sd[b-f] --verbose
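
Before attempting anything destructive, snapshot the current metadata so the original geometry can be recovered later. A minimal sketch, assuming the same /dev/sd[b-f] members (paths are examples):

# Save each member's superblock details for later reference
mkdir -p /root/md0-metadata
for disk in b c d e f; do
    mdadm --examine /dev/sd${disk} > /root/md0-metadata/sd${disk}.examine.txt
done
# Also record the assembled-array line (works only if md0 is currently up)
mdadm --detail --scan > /root/md0-metadata/mdadm-scan.txt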

If forced assembly fails, recreating the array metadata is a last resort; it overwrites the superblocks, so the parameters must match the original exactly:

# Recreate the array in place. DANGER: device order, chunk size, and
# metadata version must all match the original; --assume-clean prevents
# an immediate resync that would rewrite parity over your data.
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 \
    --force --assume-clean /dev/sd[b-f]

# Then attempt file system check
fsck -vy /dev/md0
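
Before letting fsck write anything, a read-only pass is safer. A sketch, assuming an ext filesystem on /dev/md0 (mount point is an example):

# Dry run: report problems, change nothing
fsck -vn /dev/md0

# If that looks sane, mount read-only and spot-check the data
mkdir -p /mnt/md0-check
mount -o ro /dev/md0 /mnt/md0-check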

Check SMART data on remaining drives:

smartctl -a /dev/sdb | grep -i "reallocated\|pending\|uncorrectable"
for disk in b c d e f; do
    smartctl -H /dev/sd${disk} | grep "test result"
done
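
Beyond the attribute dump, an extended self-test forces each drive to scan its own surface and often exposes pending sectors that -H alone misses. Note it adds load, so weigh that against rebuild stress:

# Kick off a long (extended) self-test on each member
for disk in b c d e f; do
    smartctl -t long /dev/sd${disk}
done

# Hours later, review the results
smartctl -l selftest /dev/sdb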

If all DIY methods fail and data is critical, professional services can:

  • Read platters directly in clean rooms
  • Reconstruct sector-by-sector using specialized hardware
  • Handle firmware corruption cases

For future setups, consider migrating to RAID-6 or ZFS with proper monitoring:

# Example ZFS pool creation
zpool create tank raidz2 sdb sdc sdd sde sdf
zpool status -v
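
raidz2 only pays off if latent errors are found while redundancy is still intact, so pair it with periodic scrubs. A minimal crontab sketch, assuming the pool name tank from above (the schedule is an example):

# Weekly scrub, Sunday 03:00 (crontab entry)
0 3 * * 0 /usr/sbin/zpool scrub tank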



When two disks fail simultaneously in a RAID-5 array (especially during rebuild operations), you're facing one of storage administration's worst-case scenarios. The odds aren't as remote as you'd hope: research puts the chance of a second disk failure during rebuild on large SATA arrays at around 22% (StorageReview, 2021).

Your scenario exhibits classic symptoms of what we call "cascading failure":

  1. Disk 1 physically fails (potentially bad sectors or controller issues)
  2. Disk 3 experiences read errors during rebuild stress
  3. The RAID controller marks Disk 3 as degraded once timeout thresholds are hit (the kernel-log check below usually confirms this)
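
Step 2 usually leaves a trail in the kernel log: medium errors, unrecovered reads, and SATA link resets on the stressed disk. A quick check (the sdc name is an assumption for Disk 3):

# Look for read errors and link resets logged around the rebuild window
dmesg | grep -iE "medium error|unrecovered read|i/o error|ata[0-9]+"
journalctl -k --since "2 days ago" | grep -i sdc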

First, check the current array status:

sudo mdadm --detail /dev/mdX
cat /proc/mdstat
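
The sysfs attributes expose more granular state than /proc/mdstat; for example, assuming the array is md0:

cat /sys/block/md0/md/array_state   # e.g. clean, degraded, inactive
cat /sys/block/md0/md/degraded      # count of missing members
cat /sys/block/md0/md/sync_action   # idle, resync, recover, check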

For forced array assembly (last resort):

sudo mdadm --assemble --force /dev/mdX /dev/sd[bcde]1 --verbose
sudo mdadm --manage /dev/mdX --add /dev/sdf1
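
Once the spare is added, watch the rebuild live; if it crawls, the kernel's resync throttle can be raised (the values below are illustrative, in KB/s):

# Watch rebuild progress
watch -n 5 cat /proc/mdstat

# Raise the resync floor/ceiling if the rebuild is being starved
sudo sysctl -w dev.raid.speed_limit_min=50000
sudo sysctl -w dev.raid.speed_limit_max=200000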

When standard rebuilds stall, try these advanced techniques:

# Check disk for bad blocks
sudo badblocks -sv /dev/sdX > badblocks.txt

# Create disk image skipping errors
sudo ddrescue -d -r3 /dev/sdX /mnt/recovery/sdX.img /mnt/recovery/sdX.log
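
With clean(ish) images in hand, the array can be assembled from the images themselves via loop devices, leaving the failing disks untouched. A sketch, assuming one image per member under /mnt/recovery and that the loops come up as loop0 through loop3:

# Attach each rescued image to a loop device
for img in /mnt/recovery/sd?.img; do
    sudo losetup -f --show "${img}"
done

# Assemble read-only from the loops so the images stay pristine
sudo mdadm --assemble --readonly /dev/md1 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3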

Modify your monitoring to catch early signs:

# SMART check every 30 minutes (crontab entry); awk prints output only when
# a raw value is nonzero, so cron mails you only on an actual hit
*/30 * * * * /usr/sbin/smartctl -A /dev/sdX | awk '/Reallocated_Sector_Ct|Current_Pending_Sector/ && $10 != 0'
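
For something sturdier than a cron one-liner, smartd can track attribute changes and schedule self-tests itself. An /etc/smartd.conf sketch (device, schedule string, and mail address are examples):

# Monitor all attributes, enable offline testing and attribute autosave,
# short self-test daily at 02:00, long test Saturdays at 03:00, mail on trouble
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root@localhost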

Consider migrating to RAID-6 for arrays built on disks larger than ~2TB (where the odds of a URE during a full rebuild become significant), and stagger disk replacements so drives from the same manufacturing batch don't wear out together.