Diagnosing and Resolving SCSI MEDIUM ERRORs in ESXi: ZFS Data Corruption and Disk Failure Analysis


When working with ESXi 4.1.0 and ZFS storage, encountering SCSI MEDIUM ERRORs with "Unrecovered read error - auto reallocate failed" messages typically indicates a failing physical disk: the drive hit a sector it could not read and was unable to remap it. The error pattern looks like this:

(da1:mpt0:0:1:0): READ(10). CDB: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4

First check the ESXi logs through SSH:

# tail -n 100 /var/log/vmkernel.log | grep -iE "error|fail"
# esxcli storage core device list
# esxcli storage core device stats get -d mpx.vmhba1:C0:T1:L0
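
To check whether ESXi has already flagged the device, the same command can be scoped to a single device to show its status fields (the mpx.vmhba1:C0:T1:L0 name is the example identifier used above; substitute the identifier reported on your host):

# esxcli storage core device list -d mpx.vmhba1:C0:T1:L0 | grep -iE 'status|offline'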

For SMART data (if supported):

# esxcli storage core device smart get -d mpx.vmhba1:C0:T1:L0
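
In that SMART output, the attributes most relevant to a MEDIUM ERROR are the reallocated and pending sector counts. A rough filter is sketched below; attribute names vary by drive and SMART plugin, so treat the pattern as a starting point:

# esxcli storage core device smart get -d mpx.vmhba1:C0:T1:L0 | grep -iE 'sector|error'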

After rebooting, ZFS reports corruption in specific files. To verify pool integrity, run the following on the system that owns the pool (the da1/mpt0 messages above come from that system's FreeBSD CAM layer, not from ESXi itself):

# zpool scrub backup
# zpool status -v backup
# zdb -l /dev/da1
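
If the scrub completes but permanent errors remain, "zpool status -v" prints the affected file paths; the sketch below simply captures that trailing section of the output. Files listed there have no intact copy left in the pool, so they need to be restored from backup before the error counts can be cleared:

# zpool status -v backup | sed -n '/errors:/,$p'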

Check storage adapter health in vSphere Client:

  1. Navigate to Host > Configuration > Storage Adapters
  2. Review the mpt0 adapter status
  3. Check for any storage path failures

Verify datastore health:

# vim-cmd hostsvc/storage/device_info
# esxcli storage filesystem list
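
vmkfstools can also report the device extents backing a given datastore, which helps confirm that the failing disk is the one underneath the affected datastore (the volume name below is a placeholder):

# vmkfstools -Ph /vmfs/volumes/your-datastore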

Immediate steps:

  • Backup all accessible data immediately
  • Replace the suspect SATA drive
  • Consider upgrading to a more recent ESXi version (4.1 is EOL)

For ZFS recovery:

# zpool clear backup
# zpool replace backup da1 new_device
# to migrate to new storage, use zfs send/receive (see the sketch below)
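
A minimal send/receive sketch, assuming the damaged pool is still readable and a replacement pool named newpool has already been created (the snapshot name and newpool are placeholders):

# zfs snapshot -r backup@migrate
# zfs send -R backup@migrate | zfs receive -d newpool

The -R/-d pair preserves the dataset layout and properties; with known-bad files it can be safer to send datasets individually so a single unreadable block does not abort the entire stream.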

Implement regular storage monitoring. The two jobs belong on different systems: the esxcli entry goes in the ESXi host's crontab, while the scrub must run on the system where the pool is imported, since ESXi has no ZFS tools:

# ESXi host crontab: hourly device statistics
0 * * * * esxcli storage core device stats get -d mpx.vmhba1:C0:T1:L0 >> /var/log/device_stats.log

# ZFS host crontab: daily scrub at 03:00
0 3 * * * zpool scrub backup

Consider these ESXi advanced settings for better error handling:

Disk.DiskRetryTimeout = 30
SCSI.DevicePollDisable = 0
SCSI.DevicePollPeriod = 30000
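
These can be changed in the vSphere Client under Configuration > Advanced Settings, or from the shell with esxcfg-advcfg. The sketch below assumes the option names listed above exist on your build; advanced option names differ between ESXi releases, so read each value back before and after setting it:

# esxcfg-advcfg -g /Disk/DiskRetryTimeout
# esxcfg-advcfg -s 30 /Disk/DiskRetryTimeout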

When working with ESXi 4.1.0 environments using ZFS-backed datastores, the following error pattern indicates potential disk failure:

(da1:mpt0:0:1:0): READ(10). CDB: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): CAM status: SCSI Status Error
(da1:mpt0:0:1:0): SCSI status: Check Condition
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4

To verify disk health in ESXi when encountering such errors:

  1. Check ESXi logs via SSH:
    esxcli system syslog config get
    vim-cmd hostsvc/hosthardware | grep -A10 storage
  2. Verify SMART status (if supported), using the device identifier reported by "esxcli storage core device list" (typically an naa.* or mpx.* name):
    esxcli storage core device smart get -d <device_id>
  3. Monitor I/O performance (a batch-mode capture example follows this list):
    esxtop
    # Press 'd' for disk view
    # Watch for high DAVG/cmd (device latency) and KAVG/cmd (kernel latency) values
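
For an unattended capture, esxtop can also run in batch mode and write its counters to a CSV for later review (the delay, iteration count, and output path below are arbitrary examples):

esxtop -b -d 10 -n 60 > /tmp/esxtop_capture.csv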

When ZFS reports corruption after ESXi disk errors:

# First attempt repair:
zpool scrub backup

# If errors persist, export/import:
zpool export backup
zpool import backup

# For severe cases, consider:
zpool clear backup
zpool replace backup da1 [new_device]
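
Replacing a vdev starts a resilver; its progress, and any errors found along the way, can be checked until it completes:

zpool status backup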

For VMs hanging during disk operations:

  • Check VMX configuration for disk settings:
    scsi0:0.present = "TRUE"
    scsi0:0.fileName = "backup.vmdk"
    scsi0:0.mode = "independent-persistent"
  • Force-remove the hung virtual disk from the ESXi CLI (device.diskremove takes the controller and unit numbers as separate arguments, followed by whether to delete the backing VMDK):
    vim-cmd vmsvc/getallvms | grep [VM_NAME]
    vim-cmd vmsvc/device.diskremove [VM_ID] 0 1 false

Create a disk health check script on the ESXi host (scripts under /etc/rc.local.d/ only run at boot, so it also needs a crontab entry to run on a schedule; see the note after the script):

cat > /etc/rc.local.d/daily_diskcheck.sh << 'EOF'
#!/bin/sh
# Append a timestamped snapshot of device state and error counters
LOG=/var/log/disk_health.log
echo "=== $(date) ===" >> $LOG
esxcli storage core device list | grep -E 'Display Name:|Status:|Is Offline:' >> $LOG
esxcli storage core device stats get | grep -iE 'fail' >> $LOG
EOF
chmod +x /etc/rc.local.d/daily_diskcheck.sh
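
To run it on a schedule, one option is to append an entry to the ESXi root crontab; the path below is correct for recent ESXi builds, direct crontab edits are lost on reboot unless they are re-applied (for example from /etc/rc.local.d/local.sh), and crond must be restarted (or the host rebooted) before the new entry takes effect:

echo '0 * * * * /etc/rc.local.d/daily_diskcheck.sh' >> /var/spool/cron/crontabs/root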

For confirmed failing disks:

  1. Put host in maintenance mode:
    esxcli system maintenanceMode set --enable true
  2. Physically replace the disk
  3. Rescan storage:
    esxcli storage core adapter rescan --all
  4. Remove the failed datastore and recreate it if necessary:
    esxcli storage filesystem list
    vim-cmd hostsvc/datastore/destroy [datastore_name]
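
After the replacement datastore is back in place, take the host out of maintenance mode:

esxcli system maintenanceMode set --enable false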