How to Remove “Faulty” State from RAID 1 Array Without Rebuilding (mdadm Guide)

In Linux software RAID (mdadm), when a drive is marked as "faulty", it typically means the system has detected an I/O error or other issue that makes the drive unreliable for the array. This can happen for various reasons:

  • Physical disk errors
  • Accidental administrative commands
  • Temporary connection issues
  • False positives from disk health monitoring

Before proceeding, verify this is a false positive situation:

# Check disk health
smartctl -a /dev/sdX

# Check kernel logs
dmesg | grep -i error

# Check mdadm detail
mdadm --detail /dev/mdX

If you see actual hardware errors, replacing the disk is the correct solution.
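The verification step can be scripted: smartctl's exit code is a bitmask documented in smartctl(8), so the relevant bits can be decoded instead of parsing text output. A minimal sketch (bit meanings are from the man page; the helper name is my own):

```shell
# Decode the smartctl exit-status bitmask (see smartctl(8), EXIT STATUS).
# Bit 3 = disk reports failing, bit 4 = prefail attributes at/below
# threshold, bit 6 = device error log contains errors.
decode_smart_status() {
    status=$1
    msgs=""
    [ $((status & 8)) -ne 0 ]  && msgs="${msgs}DISK_FAILING "
    [ $((status & 16)) -ne 0 ] && msgs="${msgs}PREFAIL_BELOW_THRESHOLD "
    [ $((status & 64)) -ne 0 ] && msgs="${msgs}ERROR_LOG_ENTRIES "
    [ -z "$msgs" ] && msgs="OK "
    printf '%s\n' "${msgs% }"
}

# On a real system (not run here):
#   smartctl -q silent -a /dev/sdX; decode_smart_status $?
```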

When you're certain the drive is healthy, follow these steps:

1. Stop the Array Temporarily

# Unmount filesystems first
umount /mount_point

# Stop the array
mdadm --stop /dev/mdX

2. Reassemble with the --force Flag

The --force option tells mdadm to assemble the array even though a member is marked faulty, trusting the existing data (note that --assume-clean is a --create/--build option and is not accepted by --assemble):

mdadm --assemble --force /dev/mdX /dev/sdX1 /dev/sdY1
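Before forcing assembly, it is prudent to compare the Events counters that mdadm --examine prints for each member: small differences are normally safe to force, large ones mean the copies have really diverged. A sketch of extracting the counter (helper name and sample text are illustrative):

```shell
# Extract the "Events" counter from `mdadm --examine` output read on stdin.
get_event_count() {
    awk -F: '/^ *Events/ {gsub(/ /, "", $2); print $2}'
}

# On a real system, compare the members:
#   mdadm --examine /dev/sdX1 | get_event_count
#   mdadm --examine /dev/sdY1 | get_event_count
```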

3. Remove the Failed Designation

If the member still shows as faulty after reassembly, remove it and add it back. With a write-intent bitmap, --re-add copies only the blocks that changed; a plain --add forces a full resync:

mdadm --manage /dev/mdX --remove /dev/sdX1
mdadm --manage /dev/mdX --re-add /dev/sdX1

If --re-add is refused (no bitmap or stale metadata), use --add instead.

For systems where stopping the array isn't possible, fail, remove, and re-add the member in place (without a write-intent bitmap the re-add triggers a full resync):

# Mark as failed (if not already)
mdadm --manage /dev/mdX --fail /dev/sdX1

# Remove from array
mdadm --manage /dev/mdX --remove /dev/sdX1

# Re-add the same device
mdadm --manage /dev/mdX --add /dev/sdX1

For frequent false positives, a monitoring script can automate the recovery (use with care: it blindly re-adds any device whose SMART error log is clean):

#!/bin/bash
# Re-add /dev/sdX1 to /dev/mdX when it is marked faulty but SMART shows no
# logged errors. Run smartctl against the whole disk (/dev/sdX), not the
# partition.
MDSTATUS=$(mdadm --detail /dev/mdX | grep "faulty")
if [[ $MDSTATUS == *"/dev/sdX1"* ]]; then
    smartctl -a /dev/sdX | grep -q "No Errors Logged" &&
    mdadm --manage /dev/mdX --remove /dev/sdX1 &&
    mdadm --manage /dev/mdX --add /dev/sdX1
fi
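If you prefer not to match on a single grep line, the device table at the bottom of mdadm --detail output can be parsed instead; a sketch (the awk filter and the sample line in the test are illustrative):

```shell
# List member devices whose state column contains "faulty", reading
# `mdadm --detail` output on stdin. The device path is the last field
# of each row in the device table.
faulty_members() {
    awk '$NF ~ /^\/dev\// && /faulty/ {print $NF}'
}

# On a real system:
#   mdadm --detail /dev/mdX | faulty_members
```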

  • Always have complete backups before manipulating RAID arrays
  • Monitor sync progress after re-adding: cat /proc/mdstat
  • Consider adding a hot spare with mdadm --manage /dev/mdX --add /dev/sdZ1 (an extra member beyond the active count is kept as a spare)

As a concrete walk-through, consider a RAID 1 array /dev/md0 where a disk has been incorrectly marked as "faulty" due to:

  • Accidental removal of the wrong disk
  • Temporary I/O errors
  • False-positive SMART warnings
  • Improper shutdown procedures

First verify your array status:


cat /proc/mdstat
mdadm --detail /dev/md0

Sample output might show:


Personalities : [raid1] 
md0 : active raid1 sdb1[2](F) sda1[1]
      976630528 blocks super 1.2 [2/1] [_U]
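The (F) marker and the [_U] pattern in this output are machine-checkable: an underscore inside the brackets means a missing or failed member. A sketch that flags degraded arrays from mdstat-format text (a hypothetical helper, fed sample text in the usage below):

```shell
# Report md arrays whose status brackets (e.g. [_U]) contain "_",
# i.e. at least one member is missing or failed.
# Reads /proc/mdstat-format text on stdin.
degraded_arrays() {
    awk '/^md/ {name=$1}
         /\[[U_]+\] *$/ {if ($NF ~ /_/) print name}'
}

# On a real system:
#   degraded_arrays < /proc/mdstat
```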

To safely remove the faulty flag without array reconstruction:


# Stop the array first
mdadm --stop /dev/md0

# Reassemble with --force flag
mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1

# Verify disk is back in sync
watch cat /proc/mdstat
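Instead of watching the full mdstat output, the resync percentage can be pulled out for scripting. A sketch assuming the usual "recovery = N%" (or "resync = N%") line format from /proc/mdstat:

```shell
# Print the recovery/resync percentage from mdstat-format text on stdin,
# or "done" when no recovery line is present.
resync_percent() {
    awk '/(recovery|resync) =/ {for (i = 1; i <= NF; i++) if ($i ~ /%$/) {print $i; found=1}}
         END {if (!found) print "done"}'
}

# On a real system:
#   resync_percent < /proc/mdstat
```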

For live systems where stopping isn't possible:


# Remove then re-add the disk
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1
mdadm /dev/md0 --add /dev/sdb1

# Check resync progress (mdadm --detail reports it as "Rebuild Status")
mdadm --detail /dev/md0 | grep -i "rebuild status"

To prevent repeat incidents:

  • Implement proper monitoring with mdadm --monitor
  • Set up email alerts for RAID events
  • Schedule regular array checks: echo check > /sys/block/md0/md/sync_action
  • Use UUIDs instead of device names in mdadm.conf
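For the last point, mdadm can emit the UUID-keyed configuration lines itself; a sketch (the UUID and name shown are made-up placeholders):

```
# Append UUID-based ARRAY lines to the config:
#   mdadm --detail --scan >> /etc/mdadm.conf
# which produces entries of the form (UUID/name are placeholders):
ARRAY /dev/md0 metadata=1.2 name=host:0 UUID=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9
```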

If manual recovery fails, recreating the array over the existing disks is a last resort:

mdadm --create /dev/md0 --level=1 --raid-devices=2 --assume-clean /dev/sda1 /dev/sdb1

Warning: only do this with verified-good disks and current backups. The device order, metadata version, and data offset must exactly match the original array, or the filesystem on it will be destroyed.