Before anything else, perform a visual inspection of all drives. Then check SMART attributes using smartctl:
# Install smartmontools if needed
sudo apt install smartmontools
# Check basic SMART info for /dev/sdX
sudo smartctl -i /dev/sdX
# Run short self-test
sudo smartctl -t short /dev/sdX
# Check test results
sudo smartctl -l selftest /dev/sdX
A destructive read-write test is the most thorough way to detect early failures:
# WARNING: This will erase all data!
sudo badblocks -b 4096 -wsv /dev/sdX
# Non-destructive read-only alternative
sudo badblocks -b 4096 -sv /dev/sdX
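If you script the destructive run, an extra guard against pointing it at a disk that is in use is cheap insurance. Here is a minimal sketch, assuming a Linux system where mounted filesystems are listed in /proc/mounts. `device_in_use` is a hypothetical helper name, and it only sees mounts, not ZFS pool members, swap, or md arrays:

```shell
#!/bin/bash
# Succeed (exit 0) if the disk itself or one of its partitions is listed
# in the mounts file. Arguments: device path, optional mounts file
# (defaults to /proc/mounts; the override exists only for testing).
device_in_use() {
    local dev=$1 mounts=${2:-/proc/mounts}
    awk -v dev="$dev" '
        # Match /dev/sdb itself, or /dev/sdb1, /dev/nvme0n1p2, and so on.
        $1 == dev || (index($1, dev) == 1 && substr($1, length(dev) + 1) ~ /^p?[0-9]+$/) { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$mounts"
}

# Usage (commented out so that sourcing this file touches no disk):
# if device_in_use /dev/sdX; then
#     echo "refusing: /dev/sdX has mounted filesystems" >&2
# else
#     sudo badblocks -b 4096 -wsv /dev/sdX
# fi
```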
Schedule extended SMART tests overnight for all drives:
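for drive in /dev/sd{b..k}; do
    sudo smartctl -t long "$drive"
done
# Monitor progress (run next day)
for drive in /dev/sd{b..k}; do
    echo "$drive:"
    # "NN% of test remaining" is printed by -c (capabilities), not by the self-test log
    sudo smartctl -c "$drive" | grep -i "test remaining"
    # Newest self-test log entry ("# 1" is the most recent)
    sudo smartctl -l selftest "$drive" | grep "^# 1"
done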
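Instead of checking by hand the next day, you can poll until the extended test finishes. This is a rough sketch, assuming ATA drives where `smartctl -c` reports progress as "NN% of test remaining". The function names and the 10-minute polling interval are my own choices, not anything smartmontools provides:

```shell
#!/bin/bash
# Read `smartctl -c` output on stdin and print the remaining percentage.
# Returns non-zero when no self-test is in progress.
selftest_remaining() {
    grep -oE '[0-9]+% of test remaining' | grep -oE '^[0-9]+'
}

# Poll one drive until its self-test is no longer running, then show the log.
# Needs root, because it calls smartctl directly.
wait_for_selftest() {
    local drive=$1 pct
    while pct=$(smartctl -c "$drive" | selftest_remaining); do
        echo "$drive: ${pct}% remaining"
        sleep 600   # assumed interval; extended tests take hours
    done
    smartctl -l selftest "$drive"
}
```

Running `wait_for_selftest /dev/sdb` as root after starting the long test prints progress until the test ends, then the self-test log.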
After basic validation, create a temporary pool for stress testing:
# Create test pool (adjust devices accordingly)
sudo zpool create -f -o ashift=12 testpool raidz2 /dev/sd{b..k}
# Generate random test data
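# openssl encrypts /dev/zero with a throwaway key to produce a random-looking stream.
# The 10G size and the output file name are examples only; adjust as needed.
openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" \
    -nosalt < /dev/zero | head -c 10G | sudo tee /testpool/random.bin > /dev/null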
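To make the stress test self-checking, you can hash the data as it is written and again after reading it back. This is a minimal sketch; `write_and_verify` is a hypothetical helper, and it assumes GNU coreutils (`head -c`, `sha256sum`). The read-back may be served from cache rather than from the disks, so it complements a scrub rather than replacing one:

```shell
#!/bin/bash
# Write a file of random data into a directory, hashing it on the way in,
# then re-read it and compare hashes. Arguments: directory, size in MiB.
write_and_verify() {
    local dir=$1 size_mib=$2
    local file="$dir/stress_$$.bin" sum_before sum_after
    # tee writes the stream to disk while sha256sum hashes the same bytes.
    sum_before=$(head -c "$((size_mib * 1024 * 1024))" /dev/urandom | tee "$file" | sha256sum | cut -d' ' -f1)
    sync
    sum_after=$(sha256sum < "$file" | cut -d' ' -f1)
    rm -f "$file"
    [ "$sum_before" = "$sum_after" ]
}
```

Run it as root against the pool's mountpoint, for example `write_and_verify /testpool 1024`, follow up with `zpool scrub testpool`, and remove the pool with `zpool destroy testpool` when you are done.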
Here's a comprehensive test script for multiple drives:
#!/bin/bash
DEVICES=(/dev/sd{b..k})
for device in "${DEVICES[@]}"; do
    echo "=== Testing $device ==="
    # SMART short test
    smartctl -t short "$device"
    sleep 2m
    # Show the newest self-test log entry ("# 1" is the most recent)
    smartctl -l selftest "$device" | grep "^# 1"
    # Badblocks read-only scan (non-destructive)
    badblocks -b 4096 -sv -o "${device##*/}_badblocks.txt" "$device"
    # SMART extended test
    smartctl -t long "$device"
    echo "Started long test on $device"
done
echo "All tests initiated. Monitor progress with:"
echo "smartctl -l selftest /dev/sdX"
Keep monitoring for at least 72 hours after initial tests:
# watch runs its command with sh, which may not expand {b..k}, so use a glob instead
sudo watch -n 3600 'for d in /dev/sd[b-k]; do \
    printf "%s:\n" "$d"; \
    smartctl -a "$d" | grep -E "Temperature|Reallocated|Pending|Uncorrectable"; \
done'
Key warning signs to watch for:
- Reallocated sectors > 0
- Pending sectors > 0
- Uncorrectable sectors > 0
- Temperature consistently > 50°C
- Any SMART test failures
- ZFS checksum errors during scrub
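The sector-count checks in that list are easy to automate. Here is a small sketch, assuming the usual ATA attribute table from `smartctl -A`, where the attribute name is the 2nd column and the raw value the 10th. `flag_smart_attrs` is a made-up name:

```shell
#!/bin/bash
# Read `smartctl -A` output on stdin and print every Reallocated/Pending/
# Uncorrectable attribute whose raw value is non-zero.
# Exit status is 1 if anything was flagged, 0 otherwise.
flag_smart_attrs() {
    awk '
        $2 ~ /Reallocated|Pending|Uncorrectable/ && $10 + 0 > 0 {
            printf "WARN %s raw=%s\n", $2, $10
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

For example, `sudo smartctl -A /dev/sdX | flag_smart_attrs || echo "inspect /dev/sdX"` prints a warning line per non-zero attribute.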
When setting up a storage server with multiple new HDDs (especially in a ZFS RAID-Z2 configuration like your 10x2TB WD Red setup), proper pre-deployment testing is crucial. Hard drive failure rates follow the "bathtub curve": failures are most likely either early in a drive's life (infant mortality) or after years of use. Here's my professional testing protocol, starting with SMART checks:
# Basic SMART quick test
smartctl -t short /dev/sdX
# Extended SMART test (takes hours but thorough)
smartctl -t long /dev/sdX
# Check reallocated sectors count
smartctl -A /dev/sdX | grep Reallocated_Sector_Ct
# Check pending sectors
smartctl -A /dev/sdX | grep Current_Pending_Sector
I recommend running a full read/write cycle using badblocks (destructive test - only for new drives):
badblocks -wsv -b 4096 -t random -o badblocks.log /dev/sdX
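This performs:
- A destructive write test (-w). By default this runs 4 passes, writing the patterns 0xaa, 0x55, 0xff and 0x00 and reading each one back to verify it
- Progress and verbose output (-s, -v), a 4096-byte block size (-b 4096, matching 4K sectors) and a log of bad blocks (-o badblocks.log)
- A random data pattern (-t random) in place of the 4 fixed patterns, which makes this a single write-and-verify pass; drop -t random to run all 4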
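It helps to estimate how long a destructive run will take before you start one. Here is a back-of-the-envelope sketch, assuming each pattern costs one full write plus one full read-back. The 150 MB/s sustained throughput is an assumed figure, not a measurement, and `estimate_badblocks_hours` is a hypothetical helper:

```shell
#!/bin/bash
# Rough duration of a badblocks -w run in whole hours (rounded down).
# Arguments: disk size in bytes, sustained throughput in MB/s, number of patterns.
estimate_badblocks_hours() {
    local bytes=$1 mb_per_s=$2 patterns=$3
    # Seconds for one full sweep of the disk at the given throughput.
    local secs_per_sweep=$(( bytes / (mb_per_s * 1000000) ))
    # Each pattern needs two sweeps: one write, one read-back.
    echo $(( secs_per_sweep * 2 * patterns / 3600 ))
}

estimate_badblocks_hours 2000000000000 150 1   # one pattern on a 2 TB drive: prints 7
estimate_badblocks_hours 2000000000000 150 4   # four patterns: prints 29
```

Real drives slow down toward the inner tracks, so treat the result as a lower bound.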
Once individual drives pass testing, create your ZFS pool with proper ashift:
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj
Then perform a scrub to verify the entire array:
zpool scrub tank
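The scrub runs in the background, so the result has to be read from `zpool status` once it has finished. Here is a sketch of an automated check, assuming the usual status table with NAME, STATE, READ, WRITE and CKSUM columns. `zpool_error_rows` is a name I made up:

```shell
#!/bin/bash
# Read `zpool status` output on stdin and print every pool, vdev or disk row
# whose READ, WRITE or CKSUM counter is not 0.
# Exit status is 1 if any such row was found, 0 otherwise.
zpool_error_rows() {
    awk '
        NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|REMOVED|OFFLINE)$/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

For example, `zpool status tank | zpool_error_rows || echo "errors found on tank"`.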
Here's a bash script I use to automate testing across multiple drives:
#!/bin/bash
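for drive in /dev/sd{a..j}; do
    echo "Testing $drive..."
    smartctl -t short "$drive"
    sleep 2m # Wait for short test completion
    # -H prints "test result: PASSED" or "FAILED", so match PASSED explicitly
    smartctl -H "$drive" | grep -q "test result: PASSED" || echo "SMART health check failed for $drive"
    # Read-only scan; badblocks rejects -t random unless -w or -n is given
    badblocks -sv -b 4096 -o "${drive##*/}_badblocks.log" "$drive"
done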
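Run this command as root to watch SMART attributes during testing (it uses the glob /dev/sd[a-j] because watch runs the command with sh, which may not expand {a..j}):
watch -n 60 'for d in /dev/sd[a-j]; do echo "$d"; smartctl -A "$d" | grep -E "Reallocated|Pending|Uncorrectable"; done'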
Red flags to watch for:
- Any reallocated sectors (should be 0 on new drives)
- Pending sectors that don't clear after multiple tests
- Rising UDMA CRC errors (could indicate cable issues)
- High seek error rates or spin retries
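A rising counter only shows up when you compare readings over time. Here is a minimal sketch, assuming you save `smartctl -A` output to files before and after testing and that the raw value sits in the 10th column. `attr_raw` and `attr_rose` are hypothetical helper names:

```shell
#!/bin/bash
# Print the raw value of one named attribute from `smartctl -A` output on stdin.
attr_raw() {
    awk -v name="$1" '$2 == name { print $10 + 0 }'
}

# Succeed (exit 0) if the attribute's raw value is higher in the second
# snapshot than in the first. Arguments: attribute name, before file, after file.
attr_rose() {
    local name=$1 before after
    before=$(attr_raw "$name" < "$2")
    after=$(attr_raw "$name" < "$3")
    # A missing attribute is treated as 0.
    [ "${after:-0}" -gt "${before:-0}" ]
}

# Usage (commented out; needs root and a real drive):
# smartctl -A /dev/sdX > before.txt
# ... run the tests ...
# smartctl -A /dev/sdX > after.txt
# attr_rose UDMA_CRC_Error_Count before.txt after.txt && echo "CRC errors rising: check cables"
```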