I recently encountered an Intel X25-M SSD that was flagged as "failed" in a ZFS array, yet functioned normally when tested externally. This points to one of the following:
- Overly aggressive ZFS error reporting with older SSDs
- SMART threshold misalignment
- Controller-level issues not visible at the filesystem level
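To narrow down which of these applies, it helps to put ZFS's view of the device next to the drive's own SMART verdict. Below is a minimal sketch, assuming a hypothetical pool name and device path:
import subprocess

def run(cmd):
    # Capture stdout even when the command exits non-zero
    # (smartctl does this for drives it considers unhealthy).
    return subprocess.run(cmd, capture_output=True, text=True).stdout

pool_status = run(['zpool', 'status', '-v', 'tank'])        # 'tank' is a placeholder pool name
smart_health = run(['sudo', 'smartctl', '-H', '/dev/sda'])  # device under suspicion

print(pool_status)   # READ/WRITE/CKSUM columns for the flagged device
print(smart_health)  # a PASSED verdict here while ZFS logs errors points more
                     # toward the transport or controller than the flash itself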
For the Intel X25-M series (and most modern SSDs), these SMART attributes matter most:
# Linux smartctl example:
sudo smartctl -A /dev/sda
# Critical attributes to monitor:
5 Reallocated_Sector_Ct
171 Program_Fail_Count
172 Erase_Fail_Count
174 Unexpected_Power_Loss
177 Wear_Leveling_Count
181 Program_Fail_Count_Total
182 Erase_Fail_Count_Total
187 Reported_Uncorrect
Here's a Python script that evaluates SSD health using smartctl output:
import subprocess
import re

def check_ssd_health(device):
    try:
        output = subprocess.check_output(
            ['smartctl', '-A', device],
            stderr=subprocess.STDOUT
        ).decode('utf-8')

        # smartctl -A rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        def column(name, pattern, default=0):
            match = re.search(rf'{name}\s+{pattern}', output, re.MULTILINE)
            return int(match.group(1)) if match else default

        results = {
            # raw counts from the last column
            'reallocated': column('Reallocated_Sector_Ct', r'.*?(\d+)\s*$'),
            'program_fail': column('Program_Fail_Count_Total', r'.*?(\d+)\s*$'),
            # the normalized VALUE column counts down from 100 as the flash wears,
            # so 100 minus it approximates the percentage of rated wear consumed
            # (some Intel drives report this as Media_Wearout_Indicator instead)
            'wear_used_pct': 100 - column('Wear_Leveling_Count', r'\S+\s+(\d+)', default=100),
        }

        # Rough heuristic thresholds -- tune to the drive and workload
        health_status = "GOOD"
        if results['reallocated'] > 100:
            health_status = "WARNING"
        if results['wear_used_pct'] > 80:
            health_status = "CRITICAL"
        if results['program_fail'] > 0:
            health_status = "FAILING"
        return health_status
    except Exception as e:
        return f"ERROR: {str(e)}"

print(check_ssd_health('/dev/sda'))
When SMART data is inconclusive, perform these additional checks:
# Full surface read test (non-destructive)
sudo badblocks -sv /dev/sdX
# Write performance test (destructive - backup first!)
sudo dd if=/dev/zero of=/dev/sdX bs=1M count=1000 conv=fdatasync
# Compare against manufacturer specs:
Intel X25-M G2 specs:
- Sequential Read: Up to 250 MB/s
- Sequential Write: Up to 100 MB/s
- 4KB Random Read: Up to 35,000 IOPS
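To turn the comparison into something repeatable, here's a rough sketch that runs the dd write test above and checks the reported throughput against the sequential write figure. The device path is a placeholder, the test is just as destructive as the raw dd command, and the parsing assumes GNU coreutils dd's summary line:
import re
import subprocess

SPEC_SEQ_WRITE_MBPS = 100  # sequential write spec quoted above

result = subprocess.run(
    ['sudo', 'dd', 'if=/dev/zero', 'of=/dev/sdX', 'bs=1M', 'count=1000', 'conv=fdatasync'],
    capture_output=True, text=True
)
# GNU dd prints its summary on stderr, e.g. "... copied, 10.2 s, 98.3 MB/s"
match = re.search(r'([\d.]+)\s*MB/s', result.stderr)
if match:
    measured = float(match.group(1))
    print(f"Measured {measured} MB/s against a {SPEC_SEQ_WRITE_MBPS} MB/s spec")
    if measured < 0.5 * SPEC_SEQ_WRITE_MBPS:
        print("Well below spec -- worth deeper investigation")
else:
    print("Could not parse dd output:", result.stderr)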
For ZFS arrays, consider these additional factors:
- Check zpool status -v for detailed error counts (a parsing sketch follows this list)
- Examine kernel logs for ATA/SCSI transport errors
- Test with zpool clear and monitor recurrence
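For the first item, here is a small sketch that pulls the per-device READ/WRITE/CKSUM counts out of zpool status -v. It assumes the standard column layout and a placeholder pool name, and does not handle large counts abbreviated as e.g. 1.2K:
import re
import subprocess

def device_errors(pool):
    out = subprocess.run(['zpool', 'status', '-v', pool],
                         capture_output=True, text=True).stdout
    errors = {}
    for line in out.splitlines():
        # device rows look like: "  sda  ONLINE  0  0  0"
        m = re.match(r'\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$', line)
        if m:
            name, _state, rd, wr, ck = m.groups()
            errors[name] = {'read': int(rd), 'write': int(wr), 'cksum': int(ck)}
    return errors

print(device_errors('tank'))  # 'tank' is a placeholder pool name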
When an SSD gets flagged as "failed" in a ZFS array but appears functional elsewhere, it's often a sign that we need to look beyond surface-level diagnostics. The case of the Intel X25-M drive you mentioned highlights the importance of proper health assessment tools.
Here are the most reliable methods to check SSD health programmatically:
# Linux smartctl example
sudo smartctl -a /dev/sdX
# Windows PowerShell alternative
Get-PhysicalDisk | Get-StorageReliabilityCounter | Format-List
Key SMART attributes to monitor for Intel SSDs:
- Percentage Used: current wear level indicator (reported directly on NVMe drives)
- Media Wearout Indicator: normalized value that starts at 100 and counts down as program/erase cycles accumulate
- Available Reserved Space: remaining spare blocks, critical for both reliability and sustained performance
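On SATA-attached Intel drives these show up as ATA attributes rather than NVMe log fields. A sketch, assuming smartctl 7.0+ for JSON output and a placeholder device path; the exact attribute names vary a little between drive generations:
import json
import subprocess

def intel_wear_attributes(device):
    out = subprocess.run(['sudo', 'smartctl', '-j', '-A', device],
                         capture_output=True, text=True).stdout
    data = json.loads(out)
    # attribute names as commonly reported for Intel SATA SSDs -- treat as assumptions
    wanted = {'Media_Wearout_Indicator', 'Available_Reservd_Space'}
    table = data.get('ata_smart_attributes', {}).get('table', [])
    return {row['name']: row['value'] for row in table if row['name'] in wanted}

print(intel_wear_attributes('/dev/sda'))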
For modern NVMe drives, the nvme-cli tool provides deeper insights:
# Install nvme-cli (Linux)
sudo apt install nvme-cli
# Check SMART log
sudo nvme smart-log /dev/nvme0
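nvme-cli can also emit JSON, which is easier to consume from a script than the human-readable table. A minimal sketch; the field names (e.g. percent_used) are an assumption and shift slightly between nvme-cli versions:
import json
import subprocess

out = subprocess.run(['sudo', 'nvme', 'smart-log', '/dev/nvme0', '-o', 'json'],
                     capture_output=True, text=True).stdout
log = json.loads(out)
for key in ('critical_warning', 'percent_used', 'media_errors', 'unsafe_shutdowns'):
    print(key, log.get(key))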
Here's a Python script to monitor SSD health programmatically:
import subprocess
import json

def check_ssd_health(device):
    try:
        output = subprocess.check_output(
            ['sudo', 'smartctl', '-j', '-a', device],
            stderr=subprocess.STDOUT
        )
        data = json.loads(output)

        if data['smart_status']['passed']:
            print(f"{device} health: GOOD")
            # nvme_smart_health_information_log is only present for NVMe devices
            print(f"Wear Level: {data['nvme_smart_health_information_log']['percentage_used']}%")
        else:
            print(f"{device} health: WARNING")
    except subprocess.CalledProcessError as e:
        print(f"Error checking {device}: {e.output.decode()}")

check_ssd_health("/dev/nvme0n1")
Consider replacing your SSD when:
- Reallocated sector count exceeds manufacturer threshold
- Wear leveling indicator reaches 90%+
- Uncorrectable error count shows consistent growth
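That last point is about the trend, not a single reading, so it pays to log the raw value periodically and compare. A sketch using the Reported_Uncorrect attribute, with a hypothetical log path and device:
import json
import subprocess
import time

LOG_FILE = '/var/log/ssd_uncorrect.log'  # hypothetical location

def current_uncorrect(device):
    out = subprocess.run(['sudo', 'smartctl', '-j', '-A', device],
                         capture_output=True, text=True).stdout
    data = json.loads(out)
    for row in data.get('ata_smart_attributes', {}).get('table', []):
        if row['name'] == 'Reported_Uncorrect':
            return row['raw']['value']
    return None

with open(LOG_FILE, 'a') as f:
    f.write(f"{int(time.time())} {current_uncorrect('/dev/sda')}\n")
# Run from cron and diff successive entries; steady growth matters more than
# any single absolute number.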
ZFS has strict error handling. To test if a drive is truly ZFS-worthy:
# Create test zpool
sudo zpool create testpool /dev/sdX
# Run scrub test
sudo zpool scrub testpool
# Check status
sudo zpool status -v testpool
# Destroy test pool
sudo zpool destroy testpool
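The same cycle can be wrapped in a script so the pass/fail decision is mechanical. A sketch with placeholder pool and device names; note that a scrub of a freshly created, empty pool finishes almost instantly, so write some data into it first if you want the scrub to actually exercise the drive:
import subprocess

def zfs_burn_in(device, pool='testpool'):
    run = lambda *cmd: subprocess.run(['sudo', *cmd], check=True)
    run('zpool', 'create', pool, device)
    try:
        run('zpool', 'scrub', pool)
        # Scrub runs asynchronously; in real use, poll until it reports completion.
        status = subprocess.run(['sudo', 'zpool', 'status', '-v', pool],
                                capture_output=True, text=True).stdout
        print(status)
        return 'No known data errors' in status
    finally:
        run('zpool', 'destroy', pool)

print(zfs_burn_in('/dev/sdX'))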