Diagnosing and Troubleshooting ATA/SATA Disk Media Errors (UNC, BMDMA Stat 0x24) in Linux Kernel Logs



The monitoring server logged the following kernel messages, which point to a hardware-level media problem on the disk:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: failed command: READ DMA
res 51/40:9f:41:68:35/00:00:00:00:00/e0 Emask 0x9 (media error)
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 3500097

Let's decode the most significant error indicators:

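  • UNC (Uncorrectable Error): The drive could not recover the sector's data, even after applying ECC
  • BMDMA stat 0x24: The bus-master DMA status at the time of the error; 0x24 sets only the interrupt (0x04) and drive-0-DMA-capable (0x20) bits, not the DMA error bit (0x02), so the fault was reported by the drive rather than by the DMA engine
  • Medium Error: Physical media problem at LBA 3500097
  • Auto reallocate failed: The sense text libata reports for an ATA UNC error; the sector was not remapped, but this alone does not prove that the drive's spare sectors are exhausted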
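
The register values in these messages are bit fields, so it can help to decode them mechanically. The following is a minimal sketch, assuming the conventional ATA status/error bit layout and the PCI bus-master IDE status layout; the table and function names are illustrative and not part of any library.

```python
# Bit names for the ATA status register, the ATA error register, and the
# bus-master (BMDMA) status register. Names follow common ATA/libata usage.
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
                   0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
ATA_ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
                  0x08: "MCR", 0x04: "ABRT", 0x02: "TK0NF", 0x01: "AMNF"}
BMDMA_STATUS_BITS = {0x80: "SIMPLEX", 0x40: "DRIVE1_DMA_CAPABLE",
                     0x20: "DRIVE0_DMA_CAPABLE", 0x04: "INTERRUPT",
                     0x02: "ERROR", 0x01: "ACTIVE"}


def decode_bits(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in sorted(names.items(), reverse=True)
            if value & mask]


# Values taken from the log: "res 51/40:..." and "BMDMA stat 0x24".
print("status 0x51:", decode_bits(0x51, ATA_STATUS_BITS))
print("error  0x40:", decode_bits(0x40, ATA_ERROR_BITS))
print("bmdma  0x24:", decode_bits(0x24, BMDMA_STATUS_BITS))
```

The first byte of the res line (0x51) is the status register and the second (0x40) is the error register, which is where the UNC flag comes from.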

Before replacing the drive, gather forensic data:

# Check SMART attributes
smartctl -a /dev/sda

# Force offline testing
smartctl -t offline /dev/sda

# Check reallocated sector count
smartctl -A /dev/sda | grep -E "Reallocated_Sector|Pending_Sector"

# Get full error log
smartctl -l error /dev/sda

A healthy drive should show:

Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

A failing drive will likely show non-zero raw values (the last column) for these attributes, indicating physical media degradation.
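
To check these attributes from a script rather than by eye, the text table can be parsed directly. This is a rough sketch that assumes the usual ten-column `smartctl -A` layout with a leading ID column; the sample text and its raw value of 8 are made up for illustration, and `parse_smart_attributes` is a hypothetical helper name.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

# Illustrative sample only; real values come from `smartctl -A /dev/sda`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""


def parse_smart_attributes(text):
    """Map attribute name to raw value for each attribute table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            attrs[fields[1]] = int(raw) if raw.isdigit() else raw
    return attrs


attrs = parse_smart_attributes(SAMPLE)
suspect = {name: attrs[name] for name in WATCHED if attrs.get(name) not in (None, 0)}
print(suspect)
```

Some drives print raw values with extra text after the number; the sketch keeps those as strings rather than guessing how to interpret them.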

For SCSI/SATA devices, deeper inspection is possible:

# View device identification (INQUIRY data)
sdparm --inquiry /dev/sda

# Get the extended comprehensive ATA error log
smartctl -l xerror /dev/sda

# Get SATA Phy event counters (link-level transport errors)
smartctl -l sataphy /dev/sda

Enable additional debugging if the issue persists with replacement drives:

# Enable the libata trace events (requires root and a mounted debugfs)
echo 1 > /sys/kernel/debug/tracing/events/libata/enable

# Capture full error trace
dmesg -wH | tee ata_errors.log

Implement this Python script to monitor disk health:

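#!/usr/bin/env python3
"""Report SMART health and the reallocated-sector count for one disk."""
import json
import subprocess
import sys

def check_disk_health(device):
    # smartctl's exit status is a bit mask, so a non-zero code does not
    # mean the JSON output is missing; do not pass check=True here.
    result = subprocess.run(
        ['smartctl', '-j', '-a', device],
        capture_output=True,
        text=True
    )
    return json.loads(result.stdout)

def raw_value(data, attr_id):
    # Look the attribute up by ID; its position in the table varies by drive.
    for attr in data.get('ata_smart_attributes', {}).get('table', []):
        if attr['id'] == attr_id:
            return attr['raw']['value']
    return None

if __name__ == "__main__":
    disk = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    data = check_disk_health(disk)
    passed = data.get('smart_status', {}).get('passed')

    if passed:
        print(f"Disk {disk} healthy")
    else:
        print(f"ALERT: Disk {disk} failing or SMART status unavailable!")
    # Attribute 5 is Reallocated_Sector_Ct; print it in both cases.
    print(f"Reallocated sectors: {raw_value(data, 5)}")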

After drive replacement, analyze the old drive's behavior patterns:

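# Extract all disk-related kernel messages (match ataN and sdX, not every word containing "ata")
journalctl -k -b | grep -E 'ata[0-9]+|sd[a-z]' > disk_errors_full.log

# Generate a bad-block map with a read-only scan; use the old drive's current device name
# (block numbers in the output are in 1024-byte units by default, not 512-byte sectors)
badblocks -sv -o bad_sectors.txt /dev/sda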
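
Once the kernel messages are saved, the failing sectors can be tallied to see whether errors cluster in one region. A minimal sketch, assuming the log contains lines of the form "I/O error, dev sda, sector N" as in the excerpt above; `failing_sectors` is an illustrative name.

```python
import re
from collections import Counter

# Matches both "end_request: I/O error, dev sda, sector 3500097" and the
# newer kernel wording that uses the same "I/O error, dev X, sector N" core.
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def failing_sectors(log_text):
    """Count how often each (device, sector) pair appears in the log."""
    counts = Counter()
    for match in IO_ERROR_RE.finditer(log_text):
        counts[(match.group(1), int(match.group(2)))] += 1
    return counts


sample = (
    "end_request: I/O error, dev sda, sector 3500097\n"
    "ata1.00: failed command: READ DMA\n"
    "end_request: I/O error, dev sda, sector 3500097\n"
)
for (dev, sector), hits in sorted(failing_sectors(sample).items()):
    print(f"{dev} sector {sector}: {hits} error(s)")
```

To run it against the saved file, read disk_errors_full.log with `open(...).read()` and pass the text to `failing_sectors`.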

When a Linux system reports disk errors like these, the kernel is relaying what the drive itself said about a failed command. Let's dissect the key components:

Jul 11 23:52:30 monit kernel: [   25.255908] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 11 23:52:30 monit kernel: [   25.256410] ata1.00: cmd c8/00:c0:20:68:35/00:00:00:00:00/e0 tag 0 dma 98304 in
Jul 11 23:52:30 monit kernel: [   25.256416]          res 51/40:9f:41:68:35/00:00:00:00:00/e0 Emask 0x9 (media error)
Jul 11 23:52:30 monit kernel: [   25.256933] ata1.00: error: { UNC }

UNC (Uncorrectable Error): The drive encountered data it couldn't read or correct using its ECC (Error Correction Code). This is serious - the sector is fundamentally unreadable.

Media Error: The physical storage medium (platter surface for HDDs, NAND cells for SSDs) has developed faults that prevent reliable data reading.

DMA Read Failure: The system attempted to read data via Direct Memory Access (fast hardware transfer) but the operation failed at the physical level.
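
The cmd and res lines carry the ATA taskfile registers, so the requested range and the failing LBA can be computed from them. The sketch below assumes the libata layout of first/feature-or-error:count:lbal:lbam:lbah/hob bytes/device and handles the 28-bit case used by READ DMA (0xc8); `parse_taskfile` is a hypothetical helper, and the `lba48` switch is an assumption for EXT commands.

```python
def parse_taskfile(tf, lba48=False):
    """Decode a libata 'cmd' or 'res' taskfile string into LBA and count."""
    first, low, hob, device = tf.split("/")
    second, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (int(x, 16) for x in hob.split(":"))
    lba = lbal | lbam << 8 | lbah << 16
    if lba48:
        # 48-bit commands keep the upper LBA and count bytes in the HOB registers.
        lba |= hob_lbal << 24 | hob_lbam << 32 | hob_lbah << 40
        count = nsect | hob_nsect << 8
    else:
        # 28-bit commands keep LBA bits 27:24 in the low nibble of device.
        lba |= (int(device, 16) & 0x0F) << 24
        count = nsect
    # 'first' is the command opcode (cmd line) or status register (res line);
    # 'second' is the feature (cmd line) or error register (res line).
    return {"first": int(first, 16), "second": second, "count": count, "lba": lba}


cmd = parse_taskfile("c8/00:c0:20:68:35/00:00:00:00:00/e0")
res = parse_taskfile("51/40:9f:41:68:35/00:00:00:00:00/e0")
print("requested:", cmd["count"], "sectors starting at LBA", cmd["lba"])
print("failed at LBA", res["lba"], "which is", res["lba"] - cmd["lba"], "sectors into the request")
```

For this log the request covers 192 sectors (192 x 512 = 98304 bytes, matching "dma 98304 in") starting at LBA 3500064, and the drive stopped at LBA 3500097.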

The log shows the read failed at sector 3500097 (hex 0x356841, the LBA bytes 41:68:35 in the res line), and the SCSI layer reported:

Jul 11 23:52:30 monit kernel: [   25.552543] sd 0:0:0:0: [sda]  Add. Sense: Unrecovered read error - auto reallocate failed

This means:

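  1. The drive detected a sector it could not read
  2. The kernel reported this as "Unrecovered read error - auto reallocate failed", which is how libata translates an ATA UNC error into SCSI sense data
  3. The sector has not been remapped; drives typically reallocate a pending sector only on a later write, and cannot do so at all once no spare sectors remain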

Before replacing the drive, you should confirm its SMART status:

# Install smartmontools if needed
sudo apt install smartmontools

# Check SMART attributes
sudo smartctl -a /dev/sda

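Key indicators of failure (illustrative; the UPDATED and RAW_VALUE columns are omitted):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      WHEN_FAILED
  5 Reallocated_Sector_Ct   0x0033   001   001   036    Pre-fail  FAILING_NOW
197 Current_Pending_Sector  0x0012   099   099   000    Old_age   -
198 Offline_Uncorrectable   0x0010   099   099   000    Old_age   -

Attributes 197 and 198 have a threshold of 000, so smartctl normally does not flag them as FAILING_NOW; for these, any non-zero raw value is the warning sign.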

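If you need to recover data before replacement, image the drive with ddrescue, which copies the readable areas first and then retries the bad ones (write the image to a different disk):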

sudo apt install gddrescue
sudo ddrescue -d -r3 /dev/sda /mnt/backup/image.img /mnt/backup/logfile.log
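
After a ddrescue run, the third argument (logfile.log above) records which regions were rescued. The following sketch totals the bytes per status, assuming the GNU ddrescue mapfile layout of one status line followed by "pos size status" block lines, where "+" means rescued and "-" means bad sector. The sample content is hypothetical and `summarize_mapfile` is an illustrative name.

```python
# Hypothetical mapfile content, for illustration only.
SAMPLE_MAPFILE = """\
# Mapfile. Created by GNU ddrescue
# current_pos  current_status  current_pass
0x6AD08400     +               1
#      pos        size  status
0x00000000  0x6AD08200  +
0x6AD08200  0x00000200  -
0x6AD08400  0x10000000  +
"""


def summarize_mapfile(text):
    """Return total bytes per status character from a ddrescue mapfile."""
    totals = {}
    seen_status_line = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if not seen_status_line:
            # The first non-comment line is the current position/status, not a block.
            seen_status_line = True
            continue
        _pos, size, status = line.split()[:3]
        totals[status] = totals.get(status, 0) + int(size, 16)
    return totals


totals = summarize_mapfile(SAMPLE_MAPFILE)
print("rescued bytes:", totals.get("+", 0))
print("bad-sector bytes:", totals.get("-", 0))
```

In the sample, the single 512-byte bad block starts at byte offset 0x6AD08200, which is sector 3500097 multiplied by 512.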

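Or find out which file owns the bad block in the filesystem:

# For ext4. icheck expects a filesystem block number, not a disk sector:
#   fs_block = (sector - partition_start_sector) * 512 / fs_block_size
sudo debugfs -w /dev/sda1
debugfs: icheck <fs_block_number>
debugfs: ncheck <inode_number>
# Destructive: clri zeroes the inode, so the file is lost; run e2fsck afterwards
debugfs: clri <inode_number>
debugfs: quit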
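
debugfs icheck works on filesystem block numbers, while the kernel reports 512-byte sectors counted from the start of the disk, so the sector has to be converted first. A small sketch of that arithmetic follows; the partition start of 2048 and block size of 4096 are assumptions for illustration (the real values can be read from /sys/block/sda/sda1/start and from the "Block size" line of `tune2fs -l /dev/sda1`), and `sector_to_fs_block` is a hypothetical helper.

```python
def sector_to_fs_block(disk_sector, partition_start_sector,
                       fs_block_size=4096, sector_size=512):
    """Convert an absolute disk sector to a block number inside a partition."""
    if disk_sector < partition_start_sector:
        raise ValueError("sector lies before the start of the partition")
    byte_offset = (disk_sector - partition_start_sector) * sector_size
    # Integer division: several sectors share one filesystem block.
    return byte_offset // fs_block_size


# Assumed values: partition starts at sector 2048, ext4 block size is 4096.
block = sector_to_fs_block(3500097, partition_start_sector=2048)
print("filesystem block:", block)
```

With these assumed values the result is block 437256, which is the number to pass to icheck.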

These errors indicate physical media degradation. Even if the drive appears to work temporarily, it is likely to:

  • Develop more bad sectors
  • Risk complete failure during write operations
  • Potentially corrupt files during transfers