Diagnosing and Resolving SCSI MEDIUM ERRORs in ESXi: ZFS Data Corruption and Disk Failure Analysis


When working with ESXi 4.1.0 and ZFS storage, encountering SCSI MEDIUM ERRORs with "Unrecovered read error - auto reallocate failed" messages typically indicates a failing physical disk: the drive hit a sector it could not read and was unable to remap it. The error pattern looks like this:

(da1:mpt0:0:1:0): READ(10). CDB: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4

First check the ESXi logs through SSH:

# tail -n 100 /var/log/vmkernel.log | grep -iE "error|fail"
# esxcli storage core device list
# esxcli storage core device stats get -d mpx.vmhba1:C0:T1:L0
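
To check whether ESXi has already flagged the device, the same command can be scoped to a single device to show its status fields (the mpx.vmhba1:C0:T1:L0 name is the example identifier used above; substitute the identifier reported on your host):

# esxcli storage core device list -d mpx.vmhba1:C0:T1:L0 | grep -iE 'status|offline'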

For SMART data (if supported):

# esxcli storage core device smart get -d mpx.vmhba1:C0:T1:L0
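
In that SMART output, the attributes most relevant to a MEDIUM ERROR are the reallocated and pending sector counts. A rough filter is sketched below; attribute names vary by drive and SMART plugin, so treat the pattern as a starting point:

# esxcli storage core device smart get -d mpx.vmhba1:C0:T1:L0 | grep -iE 'sector|error'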

After rebooting, ZFS reports corruption in specific files. To verify pool integrity, run the following on the system that owns the pool (the da1/mpt0 messages above come from that system's FreeBSD CAM layer, not from ESXi itself):

# zpool scrub backup
# zpool status -v backup
# zdb -l /dev/da1
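
If the scrub completes but permanent errors remain, "zpool status -v" prints the affected file paths; the sketch below simply captures that trailing section of the output. Files listed there have no intact copy left in the pool, so they need to be restored from backup before the error counts can be cleared:

# zpool status -v backup | sed -n '/errors:/,$p'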

Check storage adapter health in vSphere Client:

  1. Navigate to Host > Configuration > Storage Adapters
  2. Review the mpt0 adapter status
  3. Check for any storage path failures

Verify datastore health:

# vim-cmd hostsvc/storage/device_info
# esxcli storage filesystem list
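
vmkfstools can also report the device extents backing a given datastore, which helps confirm that the failing disk is the one underneath the affected datastore (the volume name below is a placeholder):

# vmkfstools -Ph /vmfs/volumes/your-datastore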

Immediate steps:

  • Backup all accessible data immediately
  • Replace the suspect SATA drive
  • Consider upgrading to a more recent ESXi version (4.1 is EOL)

For ZFS recovery:

# zpool clear backup
# zpool replace backup da1 new_device
# to migrate to new storage, use zfs send/receive (see the sketch below)
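
A minimal send/receive sketch, assuming the damaged pool is still readable and a replacement pool named newpool has already been created (the snapshot name and newpool are placeholders):

# zfs snapshot -r backup@migrate
# zfs send -R backup@migrate | zfs receive -d newpool

The -R/-d pair preserves the dataset layout and properties; with known-bad files it can be safer to send datasets individually so a single unreadable block does not abort the entire stream.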

Implement regular storage monitoring. The two jobs belong on different systems: the esxcli entry goes in the ESXi host's crontab, while the scrub must run on the system where the pool is imported, since ESXi has no ZFS tools:

# ESXi host crontab: hourly device statistics
0 * * * * esxcli storage core device stats get -d mpx.vmhba1:C0:T1:L0 >> /var/log/device_stats.log

# ZFS host crontab: daily scrub at 03:00
0 3 * * * zpool scrub backup

Consider these ESXi advanced settings for better error handling:

Disk.DiskRetryTimeout = 30
SCSI.DevicePollDisable = 0
SCSI.DevicePollPeriod = 30000
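
These can be changed in the vSphere Client under Configuration > Advanced Settings, or from the shell with esxcfg-advcfg. The sketch below assumes the option names listed above exist on your build; advanced option names differ between ESXi releases, so read each value back before and after setting it:

# esxcfg-advcfg -g /Disk/DiskRetryTimeout
# esxcfg-advcfg -s 30 /Disk/DiskRetryTimeout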

When working with ESXi 4.1.0 environments using ZFS-backed datastores, the following error pattern indicates potential disk failure:

(da1:mpt0:0:1:0): READ(10). CDB: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): CAM status: SCSI Status Error
(da1:mpt0:0:1:0): SCSI status: Check Condition
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4

To verify disk health in ESXi when encountering such errors:

  1. Check ESXi logs via SSH:
    esxcli system syslog config get
    vim-cmd hostsvc/hosthardware | grep -A10 storage
  2. Verify SMART status (if supported), using the device identifier reported by "esxcli storage core device list" (typically an naa.* or mpx.* name):
    esxcli storage core device smart get -d <device_id>
  3. Monitor I/O performance (a batch-mode capture example follows this list):
    esxtop
    # Press 'd' for disk view
    # Watch for high DAVG/cmd (device latency) and KAVG/cmd (kernel latency) values
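
For an unattended capture, esxtop can also run in batch mode and write its counters to a CSV for later review (the delay, iteration count, and output path below are arbitrary examples):

esxtop -b -d 10 -n 60 > /tmp/esxtop_capture.csv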

When ZFS reports corruption after ESXi disk errors:

# First attempt repair:
zpool scrub backup

# If errors persist, export/import:
zpool export backup
zpool import backup

# For severe cases, consider:
zpool clear backup
zpool replace backup da1 [new_device]
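
Replacing a vdev starts a resilver; its progress, and any errors found along the way, can be checked until it completes:

zpool status backup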

For VMs hanging during disk operations:

  • Check VMX configuration for disk settings:
    scsi0:0.present = "TRUE"
    scsi0:0.fileName = "backup.vmdk"
    scsi0:0.mode = "independent-persistent"
  • Force-remove the hung virtual disk from the ESXi CLI (device.diskremove takes the controller and unit numbers as separate arguments, followed by whether to delete the backing VMDK):
    vim-cmd vmsvc/getallvms | grep [VM_NAME]
    vim-cmd vmsvc/device.diskremove [VM_ID] 0 1 false

Create a disk health check script on the ESXi host (scripts under /etc/rc.local.d/ only run at boot, so it also needs a crontab entry to run on a schedule; see the note after the script):

cat > /etc/rc.local.d/daily_diskcheck.sh << 'EOF'
#!/bin/sh
# Append a timestamped snapshot of device state and error counters
LOG=/var/log/disk_health.log
echo "=== $(date) ===" >> $LOG
esxcli storage core device list | grep -E 'Display Name:|Status:|Is Offline:' >> $LOG
esxcli storage core device stats get | grep -iE 'fail' >> $LOG
EOF
chmod +x /etc/rc.local.d/daily_diskcheck.sh
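
To run it on a schedule, one option is to append an entry to the ESXi root crontab; the path below is correct for recent ESXi builds, direct crontab edits are lost on reboot unless they are re-applied (for example from /etc/rc.local.d/local.sh), and crond must be restarted (or the host rebooted) before the new entry takes effect:

echo '0 * * * * /etc/rc.local.d/daily_diskcheck.sh' >> /var/spool/cron/crontabs/root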

For confirmed failing disks:

  1. Put host in maintenance mode:
    esxcli system maintenanceMode set --enable true
  2. Physically replace the disk
  3. Rescan storage:
    esxcli storage core adapter rescan --all
  4. Remove the failed datastore and recreate it if necessary:
    esxcli storage filesystem list
    vim-cmd hostsvc/datastore/destroy [datastore_name]
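
After the replacement datastore is back in place, take the host out of maintenance mode:

esxcli system maintenanceMode set --enable false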