How to Diagnose and Fix ZFS Pool Errors When “zpool status” Shows Unrecoverable Device Issues


The zpool status output indicates that one device in your RAIDZ1 vdev has logged both read errors (3) and write errors (1.13 million). The key lines to focus on are:

gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca  ONLINE       3 1.13M     0
status: One or more devices has experienced an unrecoverable error
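
If you want a quick yes/no on whether any pool needs attention, zpool status also has a terse mode that lists only unhealthy pools:

# Show only pools with problems; prints "all pools are healthy" otherwise
zpool status -x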

Before proceeding with any repairs, run these diagnostic commands:

# Check SMART status of the affected disk (replace daX with your actual disk identifier)
smartctl -a /dev/daX

# Run a short self-test
smartctl -t short /dev/daX

# Watch live per-disk I/O load and latency across all physical providers
gstat -p -I 1s
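
If the short test passes but the error counters keep climbing, a long self-test exercises the entire surface; it runs in the background, so check the self-test log afterwards:

# Queue a full-surface self-test, then review the log once it finishes
smartctl -t long /dev/daX
smartctl -l selftest /dev/daX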

Yes, you should absolutely initiate a scrub immediately. The scrub will:

  • Verify checksums for all data
  • Attempt to repair any correctable errors
  • Identify whether the errors are persistent

# Start a scrub
zpool scrub raid2

# Monitor progress
zpool status -v raid2

After the scrub completes, check the status again:

zpool status -v raid2

Look for these possible outcomes:

  1. Errors cleared: every device column reads 0 and the summary line reports "No known data errors"
  2. Errors persist: the error counts remain or keep climbing
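
If you want to script the wait, a minimal sketch is to poll until the scrub line disappears and then print the final state (the exact "scrub in progress" wording varies slightly between ZFS versions):

# Poll until the scrub finishes, then show the result
while zpool status raid2 | grep -q "scrub in progress"; do
    sleep 300
done
zpool status -v raid2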

For persistent errors, consider these steps:

# Review datasets, snapshots and properties for anything unexpected
zfs list -t all
zfs get all raid2

# Export and re-import the pool so the devices are re-probed (requires taking the pool offline briefly)
zpool export raid2
zpool import raid2
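
If zpool status -v flags permanent errors in specific files, restore those files from backup, then reset the error log and re-verify:

# Clear the logged errors, then confirm the pool stays clean
zpool clear raid2
zpool scrub raid2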

Consider replacement if:

  • SMART tests show failing sectors
  • Errors recur after multiple scrubs
  • The disk shows high latency in gstat

Replacement command example:

zpool replace raid2 gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca /dev/newdisk
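
If the failing drive is still attached, you may want to stop ZFS from using it before physically pulling it, then watch the resilver after issuing the replace (a sketch reusing the gptid from above):

# Take the failing disk offline before swapping the hardware
zpool offline raid2 gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca

# After the replace, resilver progress shows up here
zpool status -v raid2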

Set up regular scrubs (weekly recommended):

# Add to root's crontab: scrub every Sunday at 03:00
0 3 * * 0 /sbin/zpool scrub raid2
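
If you would rather not manage a raw cron entry, FreeBSD's periodic(8) framework can schedule scrubs for you; the knob names below are as I recall them, so confirm them in /etc/defaults/periodic.conf:

# Let periodic(8) run the scrubs; the threshold is the interval between scrubs in days
sysrc -f /etc/periodic.conf daily_scrub_zfs_enable="YES"
sysrc -f /etc/periodic.conf daily_scrub_zfs_default_threshold="7"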

For enterprise environments, consider these additional monitoring tools:

# Install ZFS reporting (zfs-stats) and snapshot management (zfsnap) tools
pkg install zfs-stats zfsnap
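
Once installed, zfs-stats gives a quick text report of ARC usage and ZFS tunables (flags may differ slightly between versions):

# Print the full ARC / ZFS statistics report
zfs-stats -a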

The zpool status output shows a critical warning:

status: One or more devices has experienced an unrecoverable error.
    An attempt was made to correct the error. Applications are unaffected.

This indicates that device gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca has recorded:

  • 3 read errors
  • 1.13 million write errors
  • 0 checksum errors

First steps to verify disk health:

smartctl -a /dev/adaX  # Replace X with actual disk identifier
zpool scrub raid2      # Initiate data integrity verification

Check these critical SMART attributes after running the command:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0013   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
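
A quick way to pull just those rows out of the full report is to filter on the attribute IDs; non-zero raw values for 5, 197 or 198 are the strongest indicators of a dying disk (the raw value is the last column):

# Show only the reallocated/pending/uncorrectable sector counters
smartctl -A /dev/adaX | awk '$1 == 5 || $1 == 197 || $1 == 198'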

If SMART shows issues, consider these diagnostic commands:

# Check for physical connection issues:
camcontrol devlist
diskinfo -v /dev/adaX

# Monitor scrub progress:
zpool status -v raid2

# Walk pool metadata and verify block checksums (slow; add -e only if the pool is exported):
zdb -bcsv raid2

Scenario 1: Temporary glitch
If SMART shows a clean bill of health:

zpool clear raid2
zpool scrub raid2

Scenario 2: Failing disk
For confirmed hardware issues:

zpool replace raid2 gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca new_disk_id

Add these to your regular maintenance routine:

# Weekly scrub via cron:
0 3 * * 0 /sbin/zpool scrub raid2

# Keep the pool servicing I/O after a catastrophic failure instead of hanging (optional policy choice)
zpool set failmode=continue raid2

# Make sure ZFS is enabled at boot
sysrc -f /etc/rc.conf zfs_enable="YES"
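
For actual email notification of pool problems, the simplest route on FreeBSD is the daily periodic(8) status mail (knob name as I recall it; confirm in /etc/defaults/periodic.conf):

# Include "zpool status" output in root's daily periodic status mail
sysrc -f /etc/periodic.conf daily_status_zfs_enable="YES"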

Implement proactive monitoring with this Nagios check:

#!/bin/sh
# Sum the READ/WRITE/CKSUM counters reported by "zpool status" and
# compare the total against warning/critical thresholds.
WARN=10
CRIT=50

# -p prints exact counter values instead of abbreviations such as "1.13M".
# Pool and vdev rows repeat their children's totals, so the sum over-counts;
# tune the thresholds with that in mind.
errors=$(zpool status -p | awk '
    $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { sum += $3 + $4 + $5 }
    END { print sum + 0 }')

if [ "$errors" -ge "$CRIT" ]; then
    echo "CRITICAL: $errors ZFS device errors detected"
    exit 2
elif [ "$errors" -ge "$WARN" ]; then
    echo "WARNING: $errors ZFS device errors detected"
    exit 1
else
    echo "OK: $errors ZFS device errors detected"
    exit 0
fi
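
To wire this into monitoring, save the script somewhere like /usr/local/libexec/check_zfs.sh (the path is only an example), make it executable, and point your Nagios/NRPE command definition at it:

# Hypothetical NRPE command definition; adjust the path to wherever you saved the script
command[check_zfs]=/usr/local/libexec/check_zfs.sh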