When your monitoring server logged these kernel messages, it revealed several critical hardware-level issues:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: failed command: READ DMA
res 51/40:9f:41:68:35/00:00:00:00:00/e0 Emask 0x9 (media error)
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 3500097
Let's decode the most significant error indicators:
- UNC (Uncorrectable Error): The drive encountered data it couldn't correct using ECC
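- BMDMA stat 0x24 (where the full log includes it): The bus-master DMA status register has its interrupt bit (0x04) and drive-0 DMA-capable bit (0x20) set, but not its error bit (0x02), so the DMA engine itself did not fault; the failure was reported by the drive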
- Medium Error: Physical media surface problem at LBA 3500097
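- Auto reallocate failed: The drive could not remap the sector to a spare. This usually means the original data could not be read (a sector is normally remapped only once it is rewritten), and less often that the spare sectors are exhausted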
Before replacing the drive, gather forensic data:
# Check SMART attributes
smartctl -a /dev/sda
# Force offline testing
smartctl -t offline /dev/sda
# Check reallocated sector count
smartctl -A /dev/sda | grep -E "Reallocated_Sector|Pending_Sector"
# Get full error log
smartctl -l error /dev/sda
A healthy drive should show:
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
Your failing drive likely shows non-zero values for these attributes, indicating physical media degradation.
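The check described above can be automated. The following is a minimal sketch, assuming the usual ten-column `smartctl -A` layout (ID#, name, flag, value, worst, threshold, type, updated, when-failed, raw value) and a purely numeric raw value; the function name `suspicious_attributes` and the set of watched IDs are illustrative choices, not part of smartmontools.

```python
import re

# Assumed column layout of `smartctl -A`:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_LINE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)

# Reallocated, pending, and offline-uncorrectable sector counts.
WATCHED_IDS = {5, 197, 198}


def suspicious_attributes(smartctl_output):
    """Return (name, raw_value) for watched attributes with a non-zero raw value."""
    findings = []
    for line in smartctl_output.splitlines():
        match = ATTR_LINE.match(line)
        if match is None:
            continue  # header lines, blank lines, unrelated output
        if int(match["id"]) in WATCHED_IDS and int(match["raw"]) > 0:
            findings.append((match["name"], int(match["raw"])))
    return findings
```

Feed it the captured text of `smartctl -A /dev/sda`; an empty list means none of the watched counters has moved off zero.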
For SATA devices, smartctl and sdparm allow deeper inspection:
# Extended report: device statistics, error logs, and SATA PHY event counters
smartctl -x /dev/sda
# View device identification (SCSI INQUIRY)
sdparm --inquiry /dev/sda
# Get link-level (transport) error counters
smartctl -l sataphy /dev/sda
Enable additional debugging if the issue persists with replacement drives:
# Enable the libata tracepoints (requires root and a mounted debugfs; the enable file takes 0 or 1)
echo 1 > /sys/kernel/debug/tracing/events/libata/enable
# Capture full error trace
dmesg -wH | tee ata_errors.log
Implement this Python script to monitor disk health:
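#!/usr/bin/env python3
"""Report overall SMART health and the reallocated-sector count for one disk."""
import json
import subprocess


def check_disk_health(device):
    # smartctl's exit status is a bit mask, so a non-zero code does not mean
    # the JSON is missing; parse stdout regardless of the return code.
    result = subprocess.run(
        ['smartctl', '-j', '-a', device],
        capture_output=True,
        text=True
    )
    return json.loads(result.stdout)


def raw_value(data, attr_id):
    # Look the attribute up by ID; its position in the table varies by drive.
    for attr in data.get('ata_smart_attributes', {}).get('table', []):
        if attr['id'] == attr_id:
            return attr['raw']['value']
    return None


if __name__ == "__main__":
    disk = "/dev/sda"
    data = check_disk_health(disk)
    if data.get('smart_status', {}).get('passed'):
        print(f"Disk {disk} healthy")
    else:
        print(f"ALERT: Disk {disk} failing!")
    print(f"Reallocated sectors: {raw_value(data, 5)}")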
After drive replacement, analyze the old drive's behavior patterns:
# Extract all disk-related kernel messages
journalctl -k -b | grep -E 'ata|sd' > disk_errors_full.log
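# Generate a sector error map (read-only scan; -b 512 makes the reported block numbers match sector numbers)
# Substitute the old drive's current device name if it is no longer /dev/sda
badblocks -b 512 -v -o bad_sectors.txt /dev/sda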
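To see whether the failures cluster on a few sectors or spread across the disk, the extracted kernel log can be summarized. This is a minimal sketch, assuming the log contains lines of the form `I/O error, dev sda, sector 3500097` as in the excerpt above; the function name `failing_sectors` is illustrative.

```python
import re
from collections import Counter

# Matches both "end_request: I/O error, dev sda, sector N" and newer
# kernel variants that use a different prefix before "I/O error".
IO_ERROR = re.compile(r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")


def failing_sectors(log_text):
    """Count how many times each (device, sector) pair was reported."""
    counts = Counter()
    for match in IO_ERROR.finditer(log_text):
        counts[(match["dev"], int(match["sector"]))] += 1
    return counts


if __name__ == "__main__":
    # disk_errors_full.log is the file written by the journalctl command above.
    with open("disk_errors_full.log", encoding="utf-8", errors="replace") as fh:
        for (dev, sector), hits in failing_sectors(fh.read()).most_common(20):
            print(f"{dev} sector {sector}: {hits} error(s)")
```

A handful of sectors repeated many times suggests a localized defect, while a long list of distinct sectors suggests the surface is degrading more widely.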
When your Linux system reports disk errors like these, it's essentially telling you a story about failing hardware communication. Let's dissect the key components:
Jul 11 23:52:30 monit kernel: [ 25.255908] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 11 23:52:30 monit kernel: [ 25.256410] ata1.00: cmd c8/00:c0:20:68:35/00:00:00:00:00/e0 tag 0 dma 98304 in
Jul 11 23:52:30 monit kernel: [ 25.256416] res 51/40:9f:41:68:35/00:00:00:00:00/e0 Emask 0x9 (media error)
Jul 11 23:52:30 monit kernel: [ 25.256933] ata1.00: error: { UNC }
UNC (Uncorrectable Error): The drive encountered data it couldn't read or correct using its ECC (Error Correction Code). This is serious - the sector is fundamentally unreadable.
Media Error: The physical storage medium (platter surface for HDDs, NAND cells for SSDs) has developed faults that prevent reliable data reading.
DMA Read Failure: The system attempted to read data via Direct Memory Access (fast hardware transfer) but the operation failed at the physical level.
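The `res` line carries the drive's returned task file, and the failing LBA can be read straight out of it. The sketch below assumes the libata field order `status/error:nsect:lbal:lbam:lbah/<five high-order bytes>/device` and a 28-bit command such as READ DMA (0xc8), so the high-order bytes are ignored; `decode_res` is an illustrative name.

```python
import re

RES = re.compile(
    r"res (?P<status>[0-9a-f]{2})/(?P<error>[0-9a-f]{2}):(?P<nsect>[0-9a-f]{2}):"
    r"(?P<lbal>[0-9a-f]{2}):(?P<lbam>[0-9a-f]{2}):(?P<lbah>[0-9a-f]{2})/"
    r"(?:[0-9a-f]{2}:){4}[0-9a-f]{2}/(?P<device>[0-9a-f]{2})"
)


def decode_res(line):
    """Decode status, error flags, and the LBA28 address from a libata 'res' line."""
    match = RES.search(line)
    if match is None:
        raise ValueError("no 'res' task file found in line")
    f = {name: int(value, 16) for name, value in match.groupdict().items()}
    # LBA28: low nibble of the device register holds bits 27..24.
    lba = ((f["device"] & 0x0F) << 24) | (f["lbah"] << 16) | (f["lbam"] << 8) | f["lbal"]
    return {
        "lba": lba,
        "err": bool(f["status"] & 0x01),  # ERR bit of the status register
        "unc": bool(f["error"] & 0x40),   # UNC bit of the error register
    }


if __name__ == "__main__":
    print(decode_res("res 51/40:9f:41:68:35/00:00:00:00:00/e0 Emask 0x9 (media error)"))
```

For the line in this log, the address bytes `41:68:35` assemble to 0x356841, the error byte 0x40 is the UNC flag, and the status byte 0x51 has its ERR bit set.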
The log shows the system tried to read sector 3500097 (hex 0x356841) but the drive reported:
Jul 11 23:52:30 monit kernel: [ 25.552543] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
This means:
- The drive detected a bad sector
- It tried to automatically reallocate it to a spare sector
- The reallocation failed (often because no spare sectors remain)
Before replacing the drive, you should confirm its SMART status:
# Install smartmontools if needed
sudo apt install smartmontools
# Check SMART attributes
sudo smartctl -a /dev/sda
Key indicators of failure:
5 Reallocated_Sector_Ct 0x0033 001 001 036 Pre-fail FAILING_NOW
197 Current_Pending_Sector 0x0012 099 099 000 Old_age FAILING_NOW
198 Offline_Uncorrectable 0x0010 099 099 000 Old_age FAILING_NOW
If you need to recover data before replacement, try forcing a read with ddrescue:
sudo apt install gddrescue
sudo ddrescue -d -r3 /dev/sda /mnt/backup/image.img /mnt/backup/logfile.log
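After a ddrescue run, the third argument (the log or map file) records which regions were recovered. The sketch below totals the bytes per status, assuming the GNU ddrescue mapfile layout of `position size status` data lines with hexadecimal numbers, where `+` means recovered and `-` means a bad sector; `summarize_mapfile` is an illustrative name.

```python
# Status characters assumed from the GNU ddrescue mapfile format:
# '?' non-tried, '*' non-trimmed, '/' non-scraped, '-' bad sector, '+' finished.
STATUS_CHARS = set("?*/-+")


def summarize_mapfile(text):
    """Return total bytes per status character found in a ddrescue mapfile."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # comment or blank line
        # Data lines have three fields ending in a status character; the
        # current-position line has a different shape and is skipped.
        if len(fields) != 3 or fields[2] not in STATUS_CHARS:
            continue
        size = int(fields[1], 0)  # base 0 accepts the 0x prefix
        totals[fields[2]] = totals.get(fields[2], 0) + size
    return totals


if __name__ == "__main__":
    with open("/mnt/backup/logfile.log", encoding="utf-8") as fh:
        totals = summarize_mapfile(fh.read())
    print(f"recovered: {totals.get('+', 0)} bytes, bad: {totals.get('-', 0)} bytes")
```

If the bad total is small, the damaged region can often be mapped back to specific files before deciding what to restore from backup.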
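Or find which file owns the bad block and handle it at the filesystem level:
# For ext4, with the filesystem unmounted.
# icheck expects a filesystem block number, not the disk sector from the kernel log:
# block = (sector - partition_start_sector) * 512 / filesystem_block_size
sudo debugfs -w /dev/sda1
debugfs: icheck <block_number>
debugfs: ncheck <inode_number>
# clri zeroes the inode and makes the file unrecoverable; use it only for a file you have already written off
debugfs: clri <inode_number>
debugfs: quit
# Scan for unreadable blocks and add them to the filesystem's bad-block list
sudo e2fsck -c /dev/sda1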
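The sector number in the kernel log counts 512-byte sectors from the start of the whole disk, while debugfs works in filesystem blocks counted from the start of the partition, so the number has to be converted first. This is a minimal sketch of that arithmetic; the partition start of 2048 and the 4096-byte block size used in the example are assumptions, and the real values come from `fdisk -l /dev/sda` and the "Block size" line of `tune2fs -l /dev/sda1`.

```python
def lba_to_fs_block(lba, partition_start_lba, fs_block_size=4096, sector_size=512):
    """Convert a whole-disk sector number to a block number inside a partition."""
    if lba < partition_start_lba:
        raise ValueError("sector lies before the start of the partition")
    byte_offset = (lba - partition_start_lba) * sector_size
    return byte_offset // fs_block_size


if __name__ == "__main__":
    # Hypothetical layout: partition starts at sector 2048, 4 KiB blocks.
    print(lba_to_fs_block(3500097, partition_start_lba=2048))  # 437256
```

The result is the value to pass to `icheck`. If the sector falls outside every partition, or inside one that is not ext4, this approach does not apply.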
These errors indicate physical media degradation. Even if the drive appears to work temporarily, it will:
- Develop more bad sectors
- Risk complete failure during write operations
- Potentially corrupt files during transfers