Best Methods to Stress Test New WD RED HDDs for ZFS RAID-Z2 Storage Server Deployment


Before anything else, perform a visual inspection of all drives. Then check SMART attributes using smartctl:


# Install smartmontools if needed
sudo apt install smartmontools

# Check basic SMART info for /dev/sdX
sudo smartctl -i /dev/sdX

# Run short self-test
sudo smartctl -t short /dev/sdX

# Check test results
sudo smartctl -l selftest /dev/sdX
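
If you want a scriptable pass/fail instead of reading the log by eye, a small filter over the self-test log is enough. This is only a sketch: `last_selftest_ok` is a hypothetical helper name, and it assumes the ATA log layout in which the newest entry starts with `# 1` and a clean run reads "Completed without error".

```bash
# Succeeds only if the newest entry ("# 1") of `smartctl -l selftest`
# output on stdin reads "Completed without error" (assumed ATA log layout).
last_selftest_ok() {
  awk '
    /^# *1 / { found = 1; ok = ($0 ~ /Completed without error/) }
    END      { exit !(found && ok) }
  '
}
```

Usage would look like `sudo smartctl -l selftest /dev/sdX | last_selftest_ok || echo "check /dev/sdX"`.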

A destructive read-write test is the most thorough way to detect early failures:


# WARNING: This will erase all data!
sudo badblocks -b 4096 -wsv /dev/sdX

# Non-destructive read-only alternative
sudo badblocks -b 4096 -sv /dev/sdX

Schedule extended SMART tests overnight for all drives:


for drive in /dev/sd{b..k}; do
  sudo smartctl -t long $drive
done

# Monitor progress (run next day); the percentage remaining is reported by -c
for drive in /dev/sd{b..k}; do
  echo "$drive:"
  sudo smartctl -c "$drive" | grep -i "test remaining"
done
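
For scripting it can be handy to get the remaining percentage as a bare number. A minimal sketch, assuming the `smartctl -c` wording "NN% of test remaining" that ATA drives print while a self-test runs; `selftest_remaining` is a hypothetical helper name.

```bash
# Prints the percentage of the running self-test still remaining, taken
# from `smartctl -c` output on stdin. Prints nothing if no test is running.
selftest_remaining() {
  awk '/% of test remaining/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+%$/) { sub(/%/, "", $i); print $i; exit }
  }'
}
```

For example, `sudo smartctl -c /dev/sdX | selftest_remaining` prints a number while the long test is still going and nothing once it has finished.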

After basic validation, create a temporary pool for stress testing:


# Create test pool (adjust devices accordingly)
sudo zpool create -f -o ashift=12 testpool raidz2 /dev/sd{b..k}

# Generate random test data
openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" \
  -nosalt 

Here's a comprehensive test script for multiple drives:


#!/bin/bash

DEVICES=(/dev/sd{b..k})

for device in "${DEVICES[@]}"; do
  echo "=== Testing $device ==="
  
  # SMART short test (usually done within about 2 minutes)
  smartctl -t short "$device"
  sleep 2m
  # "# 1" is the most recent entry in the self-test log
  smartctl -l selftest "$device" | grep '^# *1 '
  
  # Badblocks read-only pass (nothing is written to the drive)
  badblocks -b 4096 -sv -o "${device##*/}_badblocks.txt" "$device"
  
  # SMART extended test
  smartctl -t long "$device"
  echo "Started long test on $device"
done

echo "All tests initiated. Monitor progress with:"
echo "smartctl -l selftest /dev/sdX"
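
The script above works through the drives one at a time, so the badblocks stage for ten drives takes roughly ten times as long as for one. Since each drive is an independent device, the read passes can run side by side. A minimal sketch, with `run_on_all` and `read_test` as hypothetical helper names; it assumes the script runs as root and that your controller copes with all drives being read at once.

```bash
# run_on_all CMD DEV...: start CMD once per device in the background,
# wait for every job, and return non-zero if any of them failed.
# The body runs in a subshell so its variables do not leak.
run_on_all() (
  cmd=$1; shift
  pids=""
  for dev in "$@"; do
    "$cmd" "$dev" &
    pids="$pids $!"
  done
  rc=0
  for pid in $pids; do
    wait "$pid" || rc=1
  done
  exit "$rc"
)

# Read-only badblocks pass, one log file per drive (e.g. sdb_badblocks.txt).
# -s is left out because progress lines from parallel jobs would interleave.
read_test() {
  badblocks -b 4096 -v -o "${1##*/}_badblocks.txt" "$1"
}
```

In a bash script run as root you would then call `run_on_all read_test /dev/sd{b..k}`.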

Keep monitoring for at least 72 hours after initial tests:


# watch runs the command through sh, which may not expand {b..k}, so use a glob
watch -n 3600 'for d in /dev/sd[b-k]; do \
  printf "%s: " "$d"; \
  smartctl -a "$d" | grep -E "Temperature|Reallocated|Pending|Uncorrectable"; \
done'

Key warning signs to watch for:

  • Reallocated sectors > 0
  • Pending sectors > 0
  • Uncorrectable sectors > 0
  • Temperature consistently > 50°C
  • Any SMART test failures
  • ZFS checksum errors during scrub
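
The SMART items in this list can be checked mechanically. The sketch below assumes the usual ATA attribute table from `smartctl -A`, with the raw value in the tenth column and the attribute names WD drives typically report; other vendors may name them differently. `check_smart_attrs` is a hypothetical helper name, and its optional argument is the temperature limit (default 50, as in the list above).

```bash
# Reads `smartctl -A` output on stdin. Prints a WARN line for each
# non-zero reallocated/pending/uncorrectable raw value and for a
# temperature above the limit; returns 1 if anything was flagged.
check_smart_attrs() {
  awk -v max_temp="${1:-50}" '
    $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
      printf "WARN %s raw=%s\n", $2, $10; bad = 1
    }
    $2 == "Temperature_Celsius" && $10 + 0 > max_temp + 0 {
      printf "WARN %s raw=%s\n", $2, $10; bad = 1
    }
    END { exit bad + 0 }
  '
}
```

Something like `sudo smartctl -A /dev/sdX | check_smart_attrs || echo "/dev/sdX needs a closer look"` would fit into the monitoring loop.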

When setting up a storage server with multiple new HDDs (especially in a ZFS RAID-Z2 configuration like your 10x2TB WD Red setup), pre-deployment testing is crucial. Hard drive failures follow the "bathtub curve": the failure rate is highest right at the start of service (infant mortality) and rises again after years of use. Here's my testing protocol:

# Basic SMART quick test
smartctl -t short /dev/sdX

# Extended SMART test (takes hours but thorough)
smartctl -t long /dev/sdX

# Check reallocated sectors count
smartctl -A /dev/sdX | grep Reallocated_Sector_Ct

# Check pending sectors
smartctl -A /dev/sdX | grep Current_Pending_Sector

I recommend running a full read/write cycle using badblocks (destructive test - only for new drives):

badblocks -wsv -b 4096 -t random -o badblocks.log /dev/sdX

This performs:

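  • A destructive write test (-w): each pattern is written across the whole drive and then read back to verify. By default there are four patterns (0xaa, 0x55, 0xff, 0x00); with -t random, as here, there is a single random-pattern pass
  • Progress display (-s), verbose output (-v), and a 4096-byte block size (-b 4096, matching 4K sectors)
  • A random data pattern (-t random): blocks are filled with random bits instead of the fixed patterns; the drive is still accessed sequentially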
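
It helps to know roughly how long this will take before you start. With -w, badblocks writes each pattern across the whole drive and then reads it back, so the cost is two full passes per pattern. A back-of-the-envelope sketch: `badblocks_hours` is a hypothetical helper, and the 150 MB/s average throughput used below is an assumption, not a measured figure for these drives.

```bash
# badblocks_hours SIZE_GB MB_PER_S PATTERNS
# Rough duration in hours of `badblocks -w`: one write pass plus one
# read pass per pattern, at a constant average throughput.
badblocks_hours() {
  LC_ALL=C awk -v gb="$1" -v mbps="$2" -v patterns="$3" 'BEGIN {
    seconds = gb * 1000 / mbps * 2 * patterns
    printf "%.1f\n", seconds / 3600
  }'
}

badblocks_hours 2000 150 1   # single pattern (e.g. -t random): 7.4
badblocks_hours 2000 150 4   # default four patterns: 29.6
```

So under that assumption a 2 TB drive needs about 7.4 hours for one pattern and close to 30 hours for the default four.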

Once individual drives pass testing, create your ZFS pool with proper ashift:

zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj

Then perform a scrub to verify the entire array:

zpool scrub tank
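
The scrub runs in the background, so the result has to be read from `zpool status tank` once it has finished. A minimal sketch of a filter that flags any device with non-zero error counters; `zpool_errors` is a hypothetical helper name, and it assumes the usual `NAME STATE READ WRITE CKSUM` table layout.

```bash
# Reads `zpool status` output on stdin. Prints every row of the device
# table whose READ, WRITE or CKSUM counter is not "0" and returns 1 if
# at least one such row exists.
zpool_errors() {
  awk '
    $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
    in_table && NF == 0           { in_table = 0 }
    in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
      print $1, "read=" $3, "write=" $4, "cksum=" $5; bad = 1
    }
    END { exit bad + 0 }
  '
}
```

After the scrub completes, `zpool status tank | zpool_errors && echo "no device errors"` gives a quick verdict.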

Here's a bash script I use to automate testing across multiple drives:

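#!/bin/bash
for drive in /dev/sd{a..j}; do
  echo "Testing $drive..."
  smartctl -t short "$drive"
  sleep 2m  # Wait for short test completion
  # -H prints "test result: PASSED" for a healthy ATA drive
  smartctl -H "$drive" | grep -q "test result: PASSED" || echo "SMART health check failed for $drive"
  # Read-only pass; badblocks rejects -t random unless -w or -n is given
  badblocks -sv -b 4096 -o "${drive##*/}_badblocks.log" "$drive"
done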

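Run this command to watch SMART attributes during testing (it uses a glob because watch runs the command through sh, which may not support brace expansion):

watch -n 60 'for d in /dev/sd[a-j]; do echo "$d"; smartctl -A "$d" | grep -E "Reallocated|Pending|Uncorrectable"; done'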

Red flags to watch for:

  • Any reallocated sectors (should be 0 on new drives)
  • Pending sectors that don't clear after multiple tests
  • Rising UDMA CRC errors (could indicate cable issues)
  • High seek error rates or spin retries
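
"Rising" only means something if you compare against an earlier reading, so it is worth saving `smartctl -A` output before the stress tests and again afterwards. A minimal sketch, assuming the ATA attribute table with the raw value in the tenth column and the common attribute name `UDMA_CRC_Error_Count`; `attr_raw` and `attr_increased` are hypothetical helper names.

```bash
# attr_raw NAME: print the raw value of one attribute from
# `smartctl -A` output on stdin.
attr_raw() {
  awk -v name="$1" '$2 == name { print $10; exit }'
}

# attr_increased NAME BEFORE_FILE AFTER_FILE: succeed if the raw value
# is larger in the second snapshot than in the first. A missing
# attribute counts as 0. Runs in a subshell so variables stay local.
attr_increased() (
  before=$(attr_raw "$1" < "$2")
  after=$(attr_raw "$1" < "$3")
  [ "${after:-0}" -gt "${before:-0}" ]
)
```

For example, save `smartctl -A /dev/sdX > sdX_before.txt` before testing and `sdX_after.txt` afterwards, then run `attr_increased UDMA_CRC_Error_Count sdX_before.txt sdX_after.txt && echo "check the cabling on sdX"`.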