How to Recover from VMware ESXi Boot Device (USB/SD Card) Failure in a vSphere Cluster


When an SD card or USB drive hosting VMware ESXi fails, you'll typically encounter symptoms like:

  • Lost connectivity to the device backing the boot filesystem
  • Embedded Flash/SD-CARD: Error writing media [X], physical block [Y]: Stack Exception
  • Host configuration changes not persisting

This isn't just theoretical - I've personally experienced three such failures in production environments. The iLO logs on HP ProLiant DL380p Gen8 servers are particularly good at flagging these issues early.

For a vSphere cluster with SAN storage (the ideal scenario):

  1. Verify VM availability (they should continue running)
  2. Check vCenter for host connection status
  3. Review iLO/iDRAC logs for storage errors
  4. Document current host configuration
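Steps 1-3 can usually be done from an SSH session on the affected host as well, since ESXi keeps running from memory after the boot device dies (standard vim-cmd/esxcli commands):

```shell
# VMs registered on this host - with SAN storage they should still be running
vim-cmd vmsvc/getallvms

# A dead boot device typically makes the bootbank unreadable
ls /bootbank/ || echo "bootbank unreadable - boot device likely failed"
```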

Here's the step-by-step recovery process I've standardized:

# First, collect diagnostic data before rebooting
esxcli system syslog config get
esxcli hardware memory get
esxcli storage core device list

Then proceed with:

  1. Place host in maintenance mode
  2. Power down gracefully
  3. Replace failed SD card/USB
  4. Reinstall ESXi using the same version and build

After fresh ESXi install, use PowerCLI to restore settings:

Connect-VIServer -Server vcenter.example.com
# $host is a reserved automatic variable in PowerShell, so use a different name
$vmhost = Get-VMHost -Name "esxi-host-01"

# Restore network config
Get-VMHostNetwork -VMHost $vmhost | Set-VMHostNetwork -HostName "esxi-host-01" -Domain "corp.local"

# Rescan storage so the host rediscovers its FC paths and VMFS datastores
Get-VMHostStorage -VMHost $vmhost -RescanAllHba -RescanVmfs

Implement these best practices:

  • Use enterprise-grade SD cards (avoid consumer-grade)
  • Enable persistent logging to shared storage
  • Regularly back up host configurations
  • Monitor SD card health via SNMP
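For the persistent-logging item above, a minimal sketch (the datastore path and hostname are example values):

```shell
# Point ESXi's syslog at a directory on shared storage so logs survive
# loss of the boot media, then reload the syslog daemon
esxcli system syslog config set --logdir=/vmfs/volumes/shared-datastore/logs/esxi-host-01
esxcli system syslog reload
```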

Example SNMP monitoring configuration:

# Configure ESXi SNMP for hardware monitoring
esxcli system snmp set --communities "monitoring" --enable true
esxcli system snmp set --targets "snmp.example.com@162/monitoring"
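To verify the agent end-to-end (hostname and community are the example values above; net-snmp tools assumed on the monitoring side):

```shell
# On the ESXi host: send a test trap to the configured targets
esxcli system snmp test

# From the monitoring server: poll the agent
snmpwalk -v2c -c monitoring esxi-host-01 SNMPv2-MIB::sysDescr
```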

For critical environments, consider:

  • Booting from SAN (FC/iSCSI)
  • Using SATADOM devices
  • Implementing Auto Deploy with stateless caching

Here's a sample Auto Deploy rule (the target cluster is passed via -Item, and the rule only takes effect once activated with Add-DeployRule):

New-DeployRule -Name "ESXi-7.0-Cluster" -Item (Get-DeployImageProfile "ESXi-7.0.0-xxxxxx-standard"), (Get-Cluster "Production-Cluster") -Pattern "vendor=HP", "model=ProLiant DL380p Gen8"
Add-DeployRule -DeployRule "ESXi-7.0-Cluster"

From the host's own shell, the recovery workflow for a SAN-backed cluster looks like this:

# Step 1: Put host in maintenance mode
esxcli system maintenanceMode set --enable true

# Step 2: Verify storage connectivity
esxcli storage core adapter list
esxcli storage core path list

# Step 3: Back up the configuration (backup_config prints a download
# URL for the bundle - copy it off the host before rebooting)
vim-cmd hostsvc/firmware/sync_config
vim-cmd hostsvc/firmware/backup_config
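On the host itself the bundle typically lands under /scratch/downloads, so it can be copied straight to shared storage before the reboot (destination path is an example value):

```shell
# Get the config bundle off the host before powering down
cp /scratch/downloads/*/configBundle.tgz /vmfs/volumes/shared-datastore/backups/
```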

Option A: Replace the failed media

# After reinstalling ESXi on the replacement media, restore the saved bundle
# (the host must be in maintenance mode; it reboots automatically when done)
vim-cmd hostsvc/firmware/restore_config /tmp/configBundle.tgz

Option B: Migrate to more reliable boot options

  • Dual SD cards in RAID 1 (for supported servers)
  • Booting from SAN LUN
  • Internal M.2 SSD with an adequate write-endurance rating

ESXi's cron table doesn't survive reboots, so a background script launched from local.sh is a common workaround for taking regular configuration backups:

# Add to /etc/rc.local.d/local.sh (before the final "exit 0" line)
/bin/auto-backup.sh &

# Script content (/bin/auto-backup.sh) - make it executable with chmod +x
#!/bin/sh
# Take a configuration backup once a day
while true; do
    vim-cmd hostsvc/firmware/backup_config
    sleep 86400
done
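If each day's bundle is also copied to shared storage, old copies are worth pruning. A minimal sketch; the naming scheme and retention count are assumptions:

```shell
# prune_backups DIR KEEP - delete all but the KEEP newest configBundle-*.tgz in DIR
prune_backups() {
    dir=$1
    keep=$2
    ls -1t "$dir"/configBundle-*.tgz 2>/dev/null | tail -n +$((keep + 1)) | while read -r old; do
        rm -f "$old"
    done
}

# Example: keep the last 7 daily bundles
# prune_backups /vmfs/volumes/shared-datastore/backups 7
```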

Implement these ESXCLI commands to monitor boot device health:

# Check device wear level
esxcli storage core device smart get -d t10.ATA_____Samsung_SSD_860_PRO_1TB_______________

# Monitor IO errors
esxcli system syslog config get
esxcli system syslog mark --message="Boot device health check"
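The log signatures from the symptom list can also be grepped for directly; a small helper (the function name is mine) that could back a monitoring check:

```shell
# check_bootdev_errors LOGFILE - exit 0 if known boot-device failure
# signatures appear in LOGFILE
check_bootdev_errors() {
    grep -qiE "Lost connectivity to the device.*backing the boot filesystem|Error writing media" "$1"
}

# On a live host:
# check_bootdev_errors /var/log/vmkernel.log && echo "boot device errors found"
```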