The zpool status output indicates that one of the devices in your RAIDZ1 vdev has recorded both read errors (3) and write errors (1.13M). The key line to focus on is:
NAME                                          STATE   READ  WRITE  CKSUM
gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca    ONLINE     3  1.13M      0
status: One or more devices has experienced an unrecoverable error
Before proceeding with any repairs, run these diagnostic commands:
# Check SMART health of the affected disk (replace daX with the actual device name)
smartctl -a /dev/daX
# Run a short SMART self-test
smartctl -t short /dev/daX
# Watch per-disk I/O load and latency, refreshing every second
gstat -p -I 1s
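If you are not sure which /dev/daX the gptid label corresponds to, the GEOM label list maps one to the other. A minimal sketch on FreeBSD/FreeNAS (the grep pattern is just the first chunk of the gptid from your own output):
# Map the gptid used by ZFS to the underlying partition, e.g. da3p2 -> use /dev/da3 with smartctl
glabel status | grep 5fe33556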
Yes, you should absolutely initiate a scrub immediately. The scrub will:
- Verify checksums for all data
- Attempt to repair any correctable errors
- Identify if the errors are persistent
# Start a scrub
zpool scrub raid2
# Monitor progress
zpool status -v raid2
After the scrub completes, check the status again:
zpool status -v raid2
Look for these possible outcomes:
- If the errors were transient: the scan line reports the scrub completed with 0 errors and the summary reads "errors: No known data errors" (the per-device counters themselves remain until you reset them; see the note below)
- If errors persist: the device error counts keep growing or the scrub reports data errors
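If the scrub comes back clean and SMART looks healthy, you can reset the per-device counters so any new errors stand out immediately:
# Clear the error counters for the suspect device (or omit the device to clear the whole pool)
zpool clear raid2 gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca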
For persistent errors, consider these steps:
# Review datasets, snapshots, and pool/dataset properties
zfs list -t all
zfs get all raid2
# As a last resort, export and re-import the pool (this unmounts all of its datasets)
zpool export raid2
zpool import raid2
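After the re-import, zpool status -x gives a quick confirmation, since it only prints pools that still have problems:
# Should report "all pools are healthy" if nothing is wrong
zpool status -x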
Consider replacement if:
- SMART tests show failing sectors
- Errors recur after multiple scrubs
- The disk shows persistently high latency in gstat
Replacement command example:
zpool replace raid2 gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca /dev/newdisk
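If the new disk goes into the same bay, a common sequence is to offline the failing member first, swap the hardware, then replace and watch the resilver. A sketch, where /dev/daY stands for whatever name the new disk receives:
# Take the failing member offline before pulling it
zpool offline raid2 gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca
# After inserting the new disk, start the replacement and monitor the resilver
zpool replace raid2 gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca /dev/daY
zpool status -v raid2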
Set up regular scrubs (weekly recommended):
# Add to crontab
0 3 * * 0 /sbin/zpool scrub raid2
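On FreeBSD/FreeNAS you can also drive scrubs through the stock periodic(8) framework instead of a raw crontab entry. A sketch, assuming the standard knobs from /etc/defaults/periodic.conf (the threshold is the number of days allowed between scrubs):
# Enable the periodic scrub job and scrub any pool not scrubbed in the last 7 days
sysrc -f /etc/periodic.conf daily_scrub_zfs_enable="YES"
sysrc -f /etc/periodic.conf daily_scrub_zfs_default_threshold="7"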
For larger or unattended deployments, these packages help with ongoing statistics and snapshot management:
# Install ZFS statistics and snapshot-management utilities
pkg install zfs-stats zfsnap
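As a quick check that the tooling works, zfs-stats can dump every section it collects in one go (the -a flag should print all of its ARC and tuning statistics):
# Print all ZFS statistics gathered by zfs-stats
zfs-stats -a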
The zpool status output shows a warning that should not be ignored:
status: One or more devices has experienced an unrecoverable error.
An attempt was made to correct the error. Applications are unaffected.
This indicates that device gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca
has recorded:
- 3 read errors
- 1.13 million write errors
- 0 checksum errors
First steps to verify disk health:
smartctl -a /dev/adaX # Replace X with actual disk identifier
zpool scrub raid2 # Initiate data integrity verification
Check these critical SMART attributes after running the command:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b  100   100   016    Pre-fail Always   -           0
  5 Reallocated_Sector_Ct   0x0013  100   100   005    Pre-fail Always   -           0
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always   -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline  -           0
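Nonzero raw values for Reallocated_Sector_Ct, Current_Pending_Sector, or Offline_Uncorrectable are the strongest indicators that the drive itself is failing. Before deciding, a long self-test that exercises the whole surface is worth running; a sketch:
# Start a long (full-surface) self-test; it runs in the background on the drive
smartctl -t long /dev/adaX
# Once the reported duration has elapsed, read the self-test log
smartctl -l selftest /dev/adaX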
If SMART shows issues, consider these diagnostic commands:
# Check for physical connection issues:
camcontrol devlist
diskinfo -v /dev/adaX
# Monitor scrub progress:
zpool status -v raid2
# Walk the pool and verify block checksums with zdb (read-only but heavyweight; add -e only if the pool is exported):
zdb -bcsv raid2
Scenario 1: Temporary glitch
If SMART shows a clean bill of health:
zpool clear raid2
zpool scrub raid2
Scenario 2: Failing disk
For confirmed hardware issues:
zpool replace raid2 gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca new_disk_id
Add these to your regular maintenance routine:
# Weekly scrub via cron:
0 3 * * 0 /sbin/zpool scrub raid2
# Return errors to applications instead of suspending all I/O after a catastrophic pool failure:
zpool set failmode=continue raid2
# Make sure ZFS starts (and pools mount) at boot:
sysrc -f /etc/rc.conf zfs_enable="YES"
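For actual e-mailed health reports, FreeBSD's daily periodic(8) run can include a zpool status summary in the report it already mails to root; a minimal sketch:
# Include "zpool status" output in the daily periodic e-mail
sysrc -f /etc/periodic.conf daily_status_zfs_enable="YES"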
Implement proactive monitoring with this Nagios check:
#!/bin/sh
WARN=10
CRIT=50
# The "errors:" line of zpool status reads either "No known data errors"
# or "<N> data errors, use '-v' for a list"; sum the numeric counts across pools.
errors=$(zpool status | awk '/errors:/ { if ($2 ~ /^[0-9]+$/) n += $2 } END { print n + 0 }')
if [ "$errors" -ge "$CRIT" ]; then
    echo "CRITICAL: $errors ZFS data errors detected"
    exit 2
elif [ "$errors" -ge "$WARN" ]; then
    echo "WARNING: $errors ZFS data errors detected"
    exit 1
else
    echo "OK: $errors ZFS data errors detected"
    exit 0
fi
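To deploy it, save the script under a name your monitoring agent can call (the path and filename below are just examples), make it executable, and run it once by hand to confirm the output and exit code:
chmod +x /usr/local/libexec/nagios/check_zfs_errors.sh
/usr/local/libexec/nagios/check_zfs_errors.sh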