How to Diagnose SSD Health Programmatically: SMART Tools and Low-Level Checks for Intel X-25M Drives


I recently encountered an Intel X-25M SSD that was flagged as "failed" in a ZFS array, yet functioned normally when tested externally. This points to one of the following:

  • ZFS's error reporting being overly aggressive with older SSDs
  • SMART threshold misalignment
  • Controller-level issues not visible at the filesystem level

For the Intel X-25M series (and most SATA SSDs of its era), these SMART attributes matter most. Note that attribute IDs and names vary by vendor; Intel drives report wear through 233 Media_Wearout_Indicator rather than 177 Wear_Leveling_Count:

# Linux smartctl example:
sudo smartctl -A /dev/sda

# Critical attributes to monitor:
5   Reallocated_Sector_Ct
171 Program_Fail_Count
172 Erase_Fail_Count
174 Unexpected_Power_Loss
177 Wear_Leveling_Count
181 Program_Fail_Count_Total
182 Erase_Fail_Count_Total
187 Reported_Uncorrect
232 Available_Reservd_Space
233 Media_Wearout_Indicator

Here's a Python script that evaluates SSD health using smartctl output. Unlike a naive regex chain, it tolerates attributes that a given drive doesn't report:

import re
import subprocess

# Attributes of interest, keyed by the name smartctl prints.
ATTRS = {
    'Reallocated_Sector_Ct': 'reallocated',
    'Wear_Leveling_Count': 'wear_leveling',
    'Program_Fail_Count_Total': 'program_fail',
}

def check_ssd_health(device):
    try:
        output = subprocess.check_output(
            ['smartctl', '-A', device],
            stderr=subprocess.STDOUT
        ).decode('utf-8')
    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        return f"ERROR: {e}"

    # The raw value is the last column of each attribute row; skip
    # attributes this drive doesn't report instead of crashing.
    results = {}
    for name, key in ATTRS.items():
        match = re.search(rf'{name}\s+.*?(\d+)\s*$', output, re.MULTILINE)
        if match:
            results[key] = int(match.group(1))

    # Thresholds are illustrative; tune them to your drive's spec sheet.
    health_status = "GOOD"
    if results.get('reallocated', 0) > 100:
        health_status = "WARNING"
    if results.get('wear_leveling', 0) > 80:
        health_status = "CRITICAL"
    if results.get('program_fail', 0) > 0:
        health_status = "FAILING"

    return health_status

print(check_ssd_health('/dev/sda'))

When SMART data is inconclusive, perform these additional checks:

# Full surface read test (non-destructive)
sudo badblocks -sv /dev/sdX

# Write performance test (destructive - backup first!)
sudo dd if=/dev/zero of=/dev/sdX bs=1M count=1000 conv=fdatasync

# Compare against manufacturer specs:
Intel X-25M G2 Specs:
- Sequential Read: Up to 250 MB/s
- Sequential Write: Up to 100 MB/s
- 4KB Random Read: Up to 35,000 IOPS
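The throughput figure dd reports can be compared against those specs programmatically. A minimal sketch; the sample summary line and the 50%-of-spec threshold are illustrative assumptions, not vendor guidance:

```python
import re

# Spec-sheet sequential write for the X-25M G2 (MB/s).
SPEC_SEQ_WRITE_MBS = 100

def parse_dd_throughput(dd_summary):
    """Extract the MB/s figure from dd's final summary line."""
    match = re.search(r'([\d.]+)\s*MB/s', dd_summary)
    return float(match.group(1)) if match else None

# Illustrative summary line in the style GNU dd prints:
sample = "1048576000 bytes (1.0 GB) copied, 11.2 s, 93.6 MB/s"
measured = parse_dd_throughput(sample)
if measured is not None and measured < 0.5 * SPEC_SEQ_WRITE_MBS:
    print(f"Write throughput {measured} MB/s is well below spec "
          f"({SPEC_SEQ_WRITE_MBS} MB/s); suspect a degraded drive")
```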

For ZFS arrays, consider these additional factors:

  • Check zpool status -v for detailed error counts
  • Examine kernel logs for ATA/SCSI transport errors
  • Test with zpool clear and monitor recurrence
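The per-device READ/WRITE/CKSUM counters from zpool status can also be checked in a script. A sketch that parses the status table; the sample text mimics zpool's layout, and the pool and device names are hypothetical:

```python
import re

def parse_zpool_errors(status_output):
    """Return {device: (read, write, cksum)} from `zpool status` output."""
    errors = {}
    # Device rows look like: "  sda  ONLINE  0  0  3"
    for line in status_output.splitlines():
        m = re.match(r'\s+(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)\s*$', line)
        if m:
            errors[m.group(1)] = tuple(int(x) for x in m.groups()[1:])
    return errors

sample = """\
  pool: tank
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          sda       ONLINE       0     0     3
"""
print(parse_zpool_errors(sample))
```

A nonzero CKSUM count with zero READ/WRITE errors often indicates data returned without transport errors but failing ZFS's checksums, which is exactly the pattern that gets a drive flagged while it still "works" elsewhere.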

When an SSD gets flagged as "failed" in a ZFS array but appears functional elsewhere, it's often a sign that surface-level diagnostics aren't enough. The Intel X-25M case above highlights the importance of proper health assessment tools.

Here are the most reliable methods to check SSD health programmatically:


# Linux smartctl example
sudo smartctl -a /dev/sdX

# Windows PowerShell alternative
Get-PhysicalDisk | Get-StorageReliabilityCounter | Format-List

Key SMART attributes to monitor for Intel SSDs:

  • Media Wearout Indicator (233): normalized value that counts down from 100 as program/erase cycles accumulate
  • Available Reserved Space (232): spare blocks remaining for remapping; critical once it starts shrinking
  • Percentage Used: the equivalent wear figure reported by NVMe drives

For modern NVMe drives, the nvme-cli tool provides deeper insights:


# Install nvme-cli (Linux)
sudo apt install nvme-cli

# Check SMART log
sudo nvme smart-log /dev/nvme0

Here's a JSON-based variant that uses smartctl's structured output (requires smartmontools 7.0 or newer for the -j flag):


import json
import subprocess

def check_ssd_health(device):
    # Avoid shell=True; pass arguments as a list.
    proc = subprocess.run(
        ['sudo', 'smartctl', '-j', '-a', device],
        capture_output=True
    )
    # smartctl sets nonzero exit bits even when output is usable
    # (e.g. for failing attributes), so parse rather than bail out.
    try:
        data = json.loads(proc.stdout)
    except json.JSONDecodeError:
        print(f"Error checking {device}: {proc.stderr.decode().strip()}")
        return

    if data.get('smart_status', {}).get('passed'):
        print(f"{device} health: GOOD")
        # Only NVMe drives expose percentage_used; SATA drives omit it.
        nvme_log = data.get('nvme_smart_health_information_log')
        if nvme_log:
            print(f"Wear Level: {nvme_log['percentage_used']}%")
    else:
        print(f"{device} health: WARNING")

check_ssd_health("/dev/nvme0n1")

Consider replacing your SSD when:

  • Reallocated sector count exceeds the manufacturer threshold
  • Wear reaches 90%+ (equivalently, Intel's Media_Wearout_Indicator drops toward 10)
  • Uncorrectable error count shows consistent growth
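Those criteria can be encoded as a simple replacement check. A sketch; the attribute keys, thresholds, and "consistent growth" test are illustrative assumptions, not vendor-defined rules:

```python
def should_replace(smart, uncorrectable_history=None):
    """smart: dict of current SMART values; history: past uncorrectable counts."""
    if smart.get('reallocated', 0) > smart.get('reallocated_threshold', 100):
        return True
    if smart.get('wear_percent', 0) >= 90:
        return True
    # "Consistent growth": the count rose in every recent sample.
    h = uncorrectable_history
    if h and len(h) >= 3 and all(b > a for a, b in zip(h, h[1:])):
        return True
    return False

print(should_replace({'wear_percent': 92}))            # wear past 90%
print(should_replace({'reallocated': 5}, [1, 2, 4]))   # growing uncorrectables
```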

ZFS has strict error handling. To test whether a drive is truly ZFS-worthy (note: this destroys any existing data on /dev/sdX):


# Create test zpool
sudo zpool create testpool /dev/sdX

# Write some test data so the scrub has blocks to verify
sudo dd if=/dev/urandom of=/testpool/testfile bs=1M count=1024

# Run scrub test
sudo zpool scrub testpool

# Check status
sudo zpool status -v testpool

# Destroy test pool
sudo zpool destroy testpool