Interpreting SMART Self-Test “Completed: Read Failure” – Is Your Drive Failing?

The SMART (Self-Monitoring, Analysis, and Reporting Technology) self-test log shows critical information about your drive's health. In your case, /dev/sde reports multiple "Completed: read failure" entries:

# smartctl -l selftest /dev/sde
...
# 1  Extended offline    Completed: read failure       90%      8981         976642822
# 3  Extended offline    Completed: read failure       90%      8981         976642822

This is significantly different from healthy drives that show "Completed without error". The repeated failures at the same LBA (Logical Block Address) 976642822 strongly suggest physical media degradation.
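
To get a feel for where that sector sits, convert the LBA to a byte offset. The arithmetic below assumes 512-byte logical sectors; verify yours first with blockdev:

# Confirm the logical sector size (typically 512)
blockdev --getss /dev/sde

# Byte offset of the failing LBA: 976642822 * 512 = 500041124864,
# i.e. roughly 500 GB into the disk (on a 500 GB drive, near the end
# of the addressable space)
echo $(( 976642822 * 512 ))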

While the drive might appear functional now, these attributes deserve attention:

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

The presence of even 1 pending sector (197) means your drive has identified sectors it can't reliably read but hasn't yet reallocated them.
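
If you want to know which file (if any) occupies the bad sector, you can translate the disk LBA into a filesystem block and ask the filesystem. A minimal sketch for ext4, assuming a hypothetical partition /dev/sde1 that starts at sector 2048 and uses 4096-byte blocks (check both with fdisk -l and tune2fs -l):

# Filesystem block containing the bad LBA:
# (disk_LBA - partition_start_LBA) * 512 / 4096
echo $(( (976642822 - 2048) * 512 / 4096 ))

# Map that block number to an inode, then the inode to a path (ext4 only)
debugfs -R "icheck <block_number>" /dev/sde1
debugfs -R "ncheck <inode_number>" /dev/sde1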

Run these commands to gather more evidence:

# Check overall health
smartctl -H /dev/sde

# Get full attribute details
smartctl -A /dev/sde

# Attempt to read the problematic LBA (bs=512 assumes 512-byte logical
# sectors; the read may be very slow or fail outright on a bad sector)
dd if=/dev/sde bs=512 count=1 skip=976642822 of=/tmp/test_lba

If you need to recover data from this potentially failing drive:

# Create a raw image first (safer than copying files off a failing disk);
# -d uses direct disc access, -r3 retries bad areas three times, and the
# final argument is a mapfile that lets an interrupted run resume
ddrescue -d -r3 /dev/sde /mnt/safe_storage/sde.img /mnt/safe_storage/sde.log

# Then mount the image read-only for recovery
losetup -fP /mnt/safe_storage/sde.img
mount -o ro /dev/loop0p1 /mnt/recovery

Given these symptoms:

  • Multiple read failures in extended tests
  • Consistent failure at specific LBAs
  • 8,981 power-on hours (roughly a year of cumulative operation)

You should replace this drive promptly if it holds any critical data.

For better monitoring, a Python script along these lines can poll the drive and flag any increase in the pending-sector count:

import re
import subprocess
import time

def pending_sectors(device):
    """Return the raw Current_Pending_Sector count, or None if not found."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
    )
    match = re.search(r"Current_Pending_Sector.*?(\d+)\s*$",
                      result.stdout, re.MULTILINE)
    return int(match.group(1)) if match else None

def monitor_smart(device, interval=3600):
    last = None
    while True:
        count = pending_sectors(device)
        print(f"{time.ctime()}: Current_Pending_Sector = {count}")
        if last is not None and count is not None and count > last:
            print(f"WARNING: pending sectors rose from {last} to {count}")
        last = count
        time.sleep(interval)  # check hourly by default

monitor_smart("/dev/sde")
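
To keep it running in the background, nohup works for a quick test (the script path below is an assumed example; a systemd service is the more durable option):

nohup python3 /usr/local/bin/smart_monitor.py >> /var/log/smart_monitor.log 2>&1 &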

The SMART output for /dev/sde shows the same repeating pattern of read failures during extended offline tests:

# 1  Extended offline    Completed: read failure       90%      8981         976642822
# 3  Extended offline    Completed: read failure       90%      8981         976642822

Key observations from the test log:

  • Both failures logged with 90% of the test remaining, i.e. the test aborted early rather than running to completion
  • Identical LBA (976642822) failing across multiple tests
  • 8,981 power-on hours (moderate drive age)
  • Host-aborted and interrupted tests suggest I/O stability issues

The healthy /dev/sdc shows contrasting patterns:

# 2  Extended offline    Completed without error       00%      9431         -
# 3  Extended offline    Completed without error       00%      8368         -

Notable differences:

  • 00% remaining indicates the tests ran to completion
  • No LBA errors recorded
  • Clean attribute values (Reallocated_Sector_Ct=0, Current_Pending_Sector=0)

These attributes deserve immediate attention:

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

A single pending sector (RAW_VALUE=1) means the drive has flagged a sector it cannot read reliably; it will be reallocated only when that sector is next written. This correlates with the consistent read failure at LBA 976642822.

For deeper diagnostics, run:

# Check raw SMART data
smartctl -x /dev/sde

# Test specific problematic LBA
hdparm --read-sector 976642822 /dev/sde

# Compact attribute listing, useful for periodic polling
smartctl -d ata -A -f brief /dev/sde

For the affected sector:

# Attempt reallocation by overwriting the sector
# (DESTROYS the data in that sector; back up first)
hdparm --repair-sector 976642822 --yes-i-know-what-i-am-doing /dev/sde

# Re-test just that LBA with a selective self-test
smartctl -t select,976642822-976642822 /dev/sde
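
Once the selective test finishes, re-read the self-test log to confirm whether that LBA now passes:

# The selective test should appear as a new entry in the log
smartctl -l selftest /dev/sde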

To automate monitoring, create a cron job that runs a script like this (a sample cron entry follows the script):

#!/bin/bash
DEVICE=/dev/sde
FAILING_LBA=976642822
THRESHOLD=3
LOG="/var/log/drive_health.log"

count=$(smartctl -l selftest "$DEVICE" | grep -c "$FAILING_LBA")
if [ "$count" -ge "$THRESHOLD" ]; then
    echo "$(date) - Critical: LBA $FAILING_LBA failed $count times" >> "$LOG"
    # Add notification logic here (mail, webhook, etc.)
fi
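
A matching /etc/cron.d entry to run the check hourly (the script path is an assumed example):

# /etc/cron.d/drive_health: minute hour day month weekday user command
0 * * * * root /usr/local/bin/check_failing_lba.sh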

Immediate replacement is recommended when:

  • Pending sector count increases
  • Read failures occur in new LBAs
  • Reallocated sector count rises
  • Multiple extended tests fail consecutively
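
Before decommissioning the drive, it is worth capturing one final full report for your records:

# Save a complete SMART snapshot before the drive is replaced
smartctl -x /dev/sde > /root/sde_final_smart_report.txt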

For enterprise environments, consider establishing these thresholds:

# Zabbix trigger examples (item keys depend on your SMART monitoring template)
{vfs.dev.smart[all,/dev/sde].reallocated_sectors.count.last()}>10
or
{vfs.dev.smart[all,/dev/sde].pending_sectors.count.last()}>5
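
These triggers assume agent items that expose the SMART counters. One way to provide them is a UserParameter in zabbix_agentd.conf; the key below (custom.smart.pending) is a made-up example, and awk's $NF picks the raw value from the last column of smartctl's attribute table:

# zabbix_agentd.conf: expose the raw Current_Pending_Sector count for a device
UserParameter=custom.smart.pending[*],smartctl -A $1 | awk '/Current_Pending_Sector/ {print $NF}'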