The SMART (Self-Monitoring, Analysis, and Reporting Technology) self-test log shows critical information about your drive's health. In your case, /dev/sde
reports multiple "Completed: read failure" entries:
# smartctl -l selftest /dev/sde
...
# 1 Extended offline Completed: read failure 90% 8981 976642822
# 3 Extended offline Completed: read failure 90% 8981 976642822
This is significantly different from healthy drives that show "Completed without error". The repeated failures at the same LBA (Logical Block Address) 976642822 strongly suggest physical media degradation.
While the drive might appear functional now, these attributes deserve attention:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
Even a single pending sector (attribute 197) means the drive has identified a sector it cannot read reliably but has not yet reallocated it; reallocation normally happens only when that sector is next written.
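A quick way to pull just these raw counters for trend-watching (the awk field numbers assume the standard ten-column smartctl -A table):
smartctl -A /dev/sde | awk '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {print $2, $10}'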
Run these commands to gather more evidence:
# Check overall health
smartctl -H /dev/sde
# Get full attribute details
smartctl -A /dev/sde
# Attempt to read the problematic LBA directly (assumes 512-byte logical sectors; iflag=direct bypasses the page cache so the read actually hits the disk)
dd if=/dev/sde bs=512 count=1 skip=976642822 of=/tmp/test_lba iflag=direct
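If that read fails, the kernel normally logs an I/O error naming the failing sector, which is useful corroborating evidence:
# Look for recent I/O errors against sde in the kernel log
dmesg | grep -iE 'sde|i/o error'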
If you need to recover data from this potentially failing drive:
# Create a raw image first (safer than direct copying)
ddrescue -d -r3 /dev/sde /mnt/safe_storage/sde.img /mnt/safe_storage/sde.log
# Then attach the image and mount it read-only for recovery
losetup -fP --show /mnt/safe_storage/sde.img
mount -o ro /dev/loop0p1 /mnt/recovery   # use the loop device that losetup printed
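From there, copying data out of the read-only mount is straightforward; a minimal example (the destination path is an assumption, adjust to your environment):
rsync -a /mnt/recovery/ /mnt/safe_storage/recovered/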
Given these symptoms:
- Multiple read failures in extended tests
- Consistent failure at specific LBAs
- 8,981 power-on hours (roughly a year of continuous operation)
You should consider replacing this drive immediately for any critical storage needs.
For ongoing monitoring, consider a Python script along these lines to poll the SMART attributes periodically:
import subprocess
import time

def monitor_smart(device):
    while True:
        result = subprocess.run(
            ["smartctl", "-A", device],
            capture_output=True,
            text=True
        )
        print(f"SMART data at {time.ctime()}:")
        print(result.stdout)
        time.sleep(3600)  # Check hourly

monitor_smart("/dev/sde")
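Saved as, say, smart_monitor.py (the filename and log path are illustrative), it can be left running in the background:
nohup python3 smart_monitor.py >> /var/log/smart_monitor.log 2>&1 &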
The SMART (Self-Monitoring, Analysis and Reporting Technology) output clearly shows repeating patterns of read failures during extended offline tests on /dev/sde:
# 1 Extended offline Completed: read failure 90% 8981 976642822
# 3 Extended offline Completed: read failure 90% 8981 976642822
Key observations from the test log:
- Consistent failure with 90% of the test remaining (each run aborts at the same early point)
- Identical LBA (976642822) failing across multiple tests
- 8,981 power-on hours (about a year of continuous operation)
- Host-aborted and interrupted tests suggest I/O stability issues
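To corroborate the stability concern noted above, the device error log (kept separately from the self-test log) records recent ATA errors together with the LBA involved:
smartctl -l error /dev/sde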
The healthy /dev/sdc shows contrasting patterns:
# 2 Extended offline Completed without error 00% 9431 -
# 3 Extended offline Completed without error 00% 8368 -
Notable differences:
- 0% remaining indicates full test completion
- No LBA errors recorded
- Clean attribute values (Reallocated_Sector_Ct=0, Current_Pending_Sector=0)
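A quick side-by-side of the sector-health attributes on both drives makes the contrast explicit (device names taken from the output above):
for dev in /dev/sdc /dev/sde; do
    echo "== $dev =="
    smartctl -A "$dev" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
done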
These attributes deserve immediate attention:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
A pending sector (RAW_VALUE=1) means the drive has flagged a sector it could not read reliably; it will be reallocated only when that sector is next written. This correlates with the consistent read failure at LBA 976642822.
Run these commands to gather more evidence:
# Check raw SMART data
smartctl -x /dev/sde
# Test specific problematic LBA
hdparm --read-sector 976642822 /dev/sde
# Re-read the attribute table in brief format (one-shot; rerun periodically for monitoring)
smartctl -d ata -A -f brief /dev/sde
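For genuinely continuous observation, wrapping the one-shot query in watch is the simplest option (the 300-second interval is arbitrary):
watch -n 300 'smartctl -d ata -A -f brief /dev/sde'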
For the affected sector, two follow-ups are possible:
# Attempt to force reallocation by rewriting the sector (this destroys whatever data the sector held)
hdparm --repair-sector 976642822 --yes-i-know-what-i-am-doing /dev/sde
# Re-test just that LBA range with a selective self-test to confirm whether it still fails
smartctl -t select,976642822-976642822 /dev/sde
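After either step, re-check the self-test log and the two sector counters; if the remap succeeded, Current_Pending_Sector should drop to 0 and Reallocated_Sector_Ct may increment:
smartctl -l selftest /dev/sde | head -n 10
smartctl -A /dev/sde | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'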
Create a cron job with this bash script:
#!/bin/bash
FAILING_LBA=976642822
THRESHOLD=3
LOG="/var/log/drive_health.log"

count=$(smartctl -l selftest /dev/sde | grep -c "$FAILING_LBA")
if [ "$count" -ge "$THRESHOLD" ]; then
    echo "$(date) - Critical: LBA $FAILING_LBA failed $count times" >> "$LOG"
    # Add notification logic here
fi
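Saved as an executable script (the path below is illustrative), it can then be scheduled, for example every six hours:
# crontab entry
0 */6 * * * /usr/local/bin/check_sde_lba.sh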
Immediate replacement is recommended when:
- Pending sector count increases
- Read failures occur in new LBAs
- Reallocated sector count rises
- Multiple extended tests fail consecutively
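The "count increases" criteria require comparing values across runs; a minimal sketch of such a trend check, with an assumed state-file location:
#!/bin/bash
# Compare the current pending-sector count against the value stored by the previous run
STATE=/var/tmp/sde_pending.last
current=$(smartctl -A /dev/sde | awk '/Current_Pending_Sector/ {print $10}')
previous=$(cat "$STATE" 2>/dev/null || echo 0)
if [ "${current:-0}" -gt "$previous" ]; then
    echo "$(date) - Pending sector count on /dev/sde rose from $previous to $current"
fi
echo "${current:-0}" > "$STATE"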
For enterprise environments, consider establishing thresholds along these lines (the exact item keys depend on how your monitoring collects SMART data):
# Zabbix trigger example
{vfs.dev.smart[all,/dev/sde].reallocated_sectors.count.last()}>10
or
{vfs.dev.smart[all,/dev/sde].pending_sectors.count.last()}>5
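The same thresholds can be checked with a standalone script if you are not running Zabbix (the limits mirror the example above; raw-value parsing assumes the standard smartctl -A layout):
#!/bin/bash
realloc=$(smartctl -A /dev/sde | awk '/Reallocated_Sector_Ct/ {print $10}')
pending=$(smartctl -A /dev/sde | awk '/Current_Pending_Sector/ {print $10}')
if [ "${realloc:-0}" -gt 10 ] || [ "${pending:-0}" -gt 5 ]; then
    echo "ALERT: /dev/sde exceeds sector-health thresholds (reallocated=$realloc, pending=$pending)"
fi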