How to Cancel an In-Progress ZFS Pool Disk Replacement: Resolving Stuck Replacing State



When dealing with ZFS storage pools, you might encounter situations where a disk replacement gets stuck in progress. This typically happens when:

  • The original disk shows SMART errors but isn't completely failed
  • The replacement disk develops issues (like DMA_WRITE errors) during resilvering
  • The process keeps restarting at certain percentages

Running zpool scrub -s tank does not help here: at best it halts the scan, and the disks stay grouped under a "replacing" vdev. While that vdev exists, you cannot start another replacement for the same slot. The pool status might look something like this:

pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since [timestamp]
    10.1G scanned at 42.4M/s, 1.01G issued at 4.24M/s, 10.1G total
    1.01G resilvered, 10.00% done
config:

NAME                     STATE     READ WRITE CKSUM
tank                     DEGRADED     0     0     0
  raidz1-0               DEGRADED     0     0     0
    da0                  ONLINE       0     0     0
    da1                  ONLINE       0     0     0
    replacing-2          DEGRADED     0     0     0
      da2                ONLINE       0     0     0
      da3                ONLINE       0     0     4
    da4                  ONLINE       0     0     0
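
Before cancelling anything, it helps to confirm which disk is the old one and which is the new one. The following is a minimal sketch, not an official tool: list_replacing is a hypothetical helper that reads zpool status text on stdin, and it assumes that a replacing-N vdev lists the original disk first and the replacement second, as in the output above.

```shell
#!/bin/sh
# list_replacing (hypothetical helper): read `zpool status` text on stdin and
# print one line per in-progress replacement:
#   <replacing-vdev> <old-device> <new-device>
# Assumption: the first child of a replacing-N vdev is the original disk and
# the second child is the replacement.
list_replacing() {
    awk '
        # A replacing-N line starts a new group; forget any previous children.
        $1 ~ /^replacing-[0-9]+$/ { vdev = $1; n = 0; next }
        # The next two non-empty lines are the children of that group.
        vdev != "" && NF > 0 {
            child[++n] = $1
            if (n == 2) { print vdev, child[1], child[2]; vdev = "" }
        }
    '
}

# Live usage (requires a real pool): zpool status tank | list_replacing
```

For the status shown above this would print replacing-2 da2 da3, which marks da3 as the disk to detach.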

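To cancel the replacement and return to the original disk:

  1. Detach the replacement disk. Detaching the new member of a replacing vdev is what cancels the replacement, so the replacing-2 entry disappears and da2 returns to its slot in raidz1-0:
    zpool detach tank da3
  2. If the pool still reports old error counts afterwards, reset them:
    zpool clear tank
  3. Verify the status:
    zpool status tank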
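
The verification can also be scripted. This is a small sketch under the assumption that a cancelled replacement leaves no replacing-N line in the status output; replacement_cleared is a hypothetical helper name, not part of ZFS.

```shell
#!/bin/sh
# replacement_cleared (hypothetical helper): read `zpool status` text on stdin.
# Returns 0 if no replacing-N vdev is listed, 1 if one is still present.
replacement_cleared() {
    ! grep -Eq '^[[:space:]]*replacing-[0-9]+[[:space:]]'
}

# Live usage (requires a real pool):
#   zpool status tank | replacement_cleared && echo "replacement cancelled"
```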

If you want to temporarily use a USB disk instead:

# First cancel the existing replacement as above
zpool detach tank da3

# Then offline the problematic disk
zpool offline tank da2

# Finally, replace it with the USB disk (daX is the USB disk's device node)
zpool replace tank da2 /dev/daX

General precautions:

  • Always have good backups before performing storage operations
  • Monitor SMART status regularly with smartctl -a /dev/daX
  • Consider setting up email alerts for ZFS events
  • For production systems, consider using hot spares instead of manual replacements
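
As a starting point for alerting, a periodic job can check the one-line summary from zpool status -x. The sketch below assumes the summary reads either "all pools are healthy" or "pool 'NAME' is healthy"; pool_is_healthy is a hypothetical helper, and what to do on failure (mail, logging) is left to you.

```shell
#!/bin/sh
# pool_is_healthy (hypothetical helper): read the output of
# `zpool status -x [pool]` on stdin. Returns 0 when the output is one of the
# "healthy" one-liners, 1 for anything else (degraded, faulted, resilvering).
pool_is_healthy() {
    grep -Eq "^(all pools are healthy|pool '.*' is healthy)$"
}

# Live usage (requires a real pool), e.g. from cron:
#   zpool status -x tank | pool_is_healthy || echo "tank needs attention"
```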

If the above commands don't work, try exporting and reimporting the pool:

zpool export tank
zpool import tank

If the import is refused because the pool appears to be in use by another system, you might need to force it:

zpool import -f tank

When a ZFS disk replacement gets stuck mid-process (especially due to hardware errors), you might encounter a situation where disks remain permanently marked as "replacing" in zpool status:

# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered...
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Oct 23 14:32:46 2023
        10.1G scanned at 12.4M/s, 1.01G issued at 1.24M/s, 10.1G total
        0 resilvered, 10.00% done

config:

    NAME                      STATE     READ WRITE CKSUM
    tank                      DEGRADED     0     0     0
      raidz1-0                DEGRADED     0     0     0
        replacing-0           OFFLINE      0     0     0
          ada1                OFFLINE      0     0     0  (DMA_WRITE errors)
          ada2                ONLINE       0     0     0  (resilvering, DMA_WRITE errors)
        ada3                  ONLINE       0     0     0
        ada4                  ONLINE       0     0     0
        ada5                  ONLINE       0     0     0
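
To tell whether the resilver is advancing or restarting at the same point, the percentage can be pulled out of the scan section and logged over time. This is a minimal sketch that assumes the "N resilvered, P% done" wording shown above; resilver_pct is a hypothetical helper name.

```shell
#!/bin/sh
# resilver_pct (hypothetical helper): read `zpool status` text on stdin and
# print the resilver completion percentage (without the % sign), if present.
resilver_pct() {
    sed -n 's/.*resilvered, \([0-9.]*\)% done.*/\1/p'
}

# Live usage (requires a real pool), sampled once a minute:
#   while :; do date; zpool status tank | resilver_pct; sleep 60; done
```

A value that keeps dropping back to a low number suggests the resilver is restarting rather than progressing.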

Running zpool scrub -s stops the scan at best; it does not remove the replacing vdev. Attempting to detach either disk results in:

# zpool detach tank ada1
cannot detach ada1: no valid replicas

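Here's a step-by-step approach for FreeBSD. Note that -F is recovery mode: it rewinds the pool to an earlier transaction group and discards the most recent writes, so treat it as a last resort.

# First export the pool (ensure no active operations)
zpool export tank

# Dry run: check whether a recovery import would succeed, without changing anything
zpool import -d /dev -f -F -n tank

# Import in recovery mode (options go before the pool name)
zpool import -d /dev -f -F tank

# Check whether the original disk is back in a normal state
zpool status tank

# Now replace the failing disk with your temporary USB disk
# (zpool attach on a member disk builds mirrors and does not apply to raidz)
zpool replace tank ada1 da0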

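If the above doesn't work, inspect the pool configuration rather than editing it by hand:

# Dump the cached configuration (the cache file is /boot/zfs/zpool.cache on older FreeBSD releases; newer ones may use /etc/zfs/zpool.cache)
zdb -U /boot/zfs/zpool.cache -C tank

# Dump the vdev label stored on a member disk
zdb -l /dev/ada3

The cache file is a binary nvlist, so it cannot be edited with vi or nano, and the replacing- entry also lives in the on-disk labels and pool configuration. Look for replacing- in the zdb output to confirm what the pool believes. The entry normally goes away only when the new disk is detached or the resilver completes.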

When dealing with problematic disks:

  1. Always pre-test replacement disks. This write test is destructive and erases the disk: badblocks -ws /dev/da0
  2. Start the replacement with plain zpool replace; the only property its -o option accepts is ashift: zpool replace tank ada1 ada2
  3. Monitor with: zpool status -v tank && smartctl -a /dev/ada1

Remember that forced operations may require a reboot to fully clear kernel device states.