Understanding ESXi Dual-Boot Mechanism: /bootbank vs /altbootbank Partition Behavior in Failure Scenarios



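VMware ESXi implements a failsafe boot mechanism through two boot bank partitions. /bootbank holds the hypervisor image the host is currently running, while /altbootbank holds the fallback image, typically the previous version left behind by the last patch or upgrade. Both are small FAT-formatted partitions on the boot device. The vmkfstools utility manages VMFS volumes and virtual disks and plays no part in this redundancy.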

During system initialization, the bootloader follows this sequence:

1. Attempt to load from /bootbank
2. If checksum validation fails, switch to /altbootbank
3. If both partitions fail, enter recovery mode

The selection happens in the bootloader, before the VMkernel loads, so failover needs no operator input.
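
The selection logic can be modelled in a few lines of shell. This is a minimal sketch, not the bootloader's actual code. It assumes each bank carries a boot.cfg with bootstate= and updated= fields, that a bootstate of 0 or 1 means the bank is bootable while anything else marks it failed or invalid, and that the bootable bank with the higher updated counter wins. The function names cfg_field and pick_bank are hypothetical.

```shell
#!/bin/sh
# Simplified model of boot bank selection (illustrative only).

# cfg_field FILE KEY: print the value of the first KEY=... line in FILE.
cfg_field() {
    sed -n "s/^$2=//p" "$1" 2>/dev/null | head -n 1
}

# pick_bank DIR_A DIR_B: print the directory whose boot.cfg should be booted.
# Assumption: bootstate 0 or 1 is bootable, any other value is not.
pick_bank() {
    best=""
    best_updated=-1
    for bank in "$1" "$2"; do
        state=$(cfg_field "$bank/boot.cfg" bootstate)
        updated=$(cfg_field "$bank/boot.cfg" updated)
        case "$state" in
            0|1) ;;              # bootable
            *) continue ;;       # failed, invalid, or boot.cfg unreadable
        esac
        case "$updated" in
            ''|*[!0-9]*) updated=0 ;;   # treat a missing counter as 0
        esac
        if [ "$updated" -gt "$best_updated" ]; then
            best=$bank
            best_updated=$updated
        fi
    done
    if [ -n "$best" ]; then
        echo "$best"
    else
        echo "recovery"          # neither bank is bootable
        return 1
    fi
}
```

On a running host, pick_bank /bootbank /altbootbank would print the bank this model expects to be chosen at the next boot. On a tie, the first argument wins.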

The system updates /altbootbank in these scenarios:

- During ESXi patches/upgrades (the new image is staged in the alternate bank and becomes active at the next reboot)
- After successful boot from /altbootbank (prompts for repair)
- Manual intervention via CLI commands

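Example CLI sequence to force synchronization. The /Misc/AlternateBootBank option is not present on every build, so confirm it first with "esxcli system settings advanced list -o /Misc/AlternateBootBank":

vim-cmd hostsvc/maintenance_mode_enter
esxcli system settings advanced set -o /Misc/AlternateBootBank -i 1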
reboot

When booting from the alternate partition, ESXi logs this event in three locations:

/var/log/vmkernel.log
/var/log/boot.gz
dmesg output

Entries along these lines indicate the fallback (exact wording varies by release):

WARNING: LINUXBOOT: Booting from alternate bank
NOTICE: BOOTBANK: Primary bank checksum failure (0xbadc0de)
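
Searching those files by hand gets tedious, so here is a minimal sketch of a helper that scans several logs at once. It assumes gzip and awk are available in the host shell and that the interesting lines mention "bootbank" or "alternate bank". The function name scan_boot_logs and the search pattern are illustrative choices, not something ESXi ships.

```shell
#!/bin/sh
# scan_boot_logs FILE...: print bootbank-related lines from each readable log,
# prefixed with the file name. Files ending in .gz are decompressed on the fly;
# missing or unreadable files are skipped silently.
scan_boot_logs() {
    for log in "$@"; do
        [ -r "$log" ] || continue
        case "$log" in
            *.gz) gzip -dc "$log" ;;
            *)    cat "$log" ;;
        esac | grep -i -E 'bootbank|alternate bank' \
             | awk -v f="$log" '{ print f ": " $0 }'
    done
}
```

A typical call would be scan_boot_logs /var/log/vmkernel.log /var/log/boot.gz. Adjust the pattern once you know how your release words these messages.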

To simulate a boot failure scenario:

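# Corrupt primary bank (simulating filesystem error). Lab hosts only: this destroys the active image.
# File names differ by release; list /bootbank first and target the kernel module present there.
dd if=/dev/zero of=/bootbank/vmkernel.gz bs=1k count=100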
sync
reboot

The system should then fail over to /altbootbank automatically. If the fallback bank is also unusable, expect a purple diagnostic screen instead (reported error code: APD BOOTBANK_CORRUPT).

Routine checks on a healthy host:

1. Regularly confirm which device and filesystem the host booted from:

esxcli system boot device get

2. Verify synchronization status (vsish node paths vary by build; browse them with "vsish -e ls /system/" if this one is missing):

vsish -e get /system/bootMode

3. Monitor for auto-repair attempts:

grep -i bootbank /var/log/syslog.log
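
Another way to judge whether the two banks hold the same image is to compare the metadata in each bank's boot.cfg. The following is a minimal sketch, assuming both files carry build=, updated=, and bootstate= lines (verify the field names on your host). bank_summary and banks_match are hypothetical helper names.

```shell
#!/bin/sh
# bank_summary DIR: print "build=<v> updated=<n> bootstate=<s>" for DIR/boot.cfg.
# Fails if the file cannot be read.
bank_summary() {
    [ -r "$1/boot.cfg" ] || return 1
    awk -F= '
        $1 == "build"     { build = $2 }
        $1 == "updated"   { updated = $2 }
        $1 == "bootstate" { state = $2 }
        END { printf "build=%s updated=%s bootstate=%s\n", build, updated, state }
    ' "$1/boot.cfg"
}

# banks_match DIR_A DIR_B: succeed only if both banks report the same,
# non-empty build string.
banks_match() {
    a=$(bank_summary "$1") || return 1
    b=$(bank_summary "$2") || return 1
    [ "${a%% *}" != "build=" ] && [ "${a%% *}" = "${b%% *}" ]
}

# Example run against the live banks:
#   bank_summary /bootbank; bank_summary /altbootbank
#   banks_match /bootbank /altbootbank || echo "banks hold different builds"
```

Differing builds are normal right after a patch, since the alternate bank then holds the previous image. The comparison is most useful for spotting a bank whose boot.cfg is missing or unreadable.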


VMware ESXi employs a dual-bank boot system where /bootbank serves as the primary boot partition while /altbootbank acts as a failover. This redundancy mechanism ensures system availability even when the primary boot partition becomes corrupted.

The system automatically switches to /altbootbank under these conditions:

  • CRC checksum validation failure in /bootbank
  • Boot loader cannot locate or read the primary partition
  • Kernel panic during boot from primary partition

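When failover occurs, confirm which bank the host is running from:

# Check current boot partition
# (this namespace may be absent on your build; "ls -ld /bootbank /altbootbank"
# shows which volume each path currently resolves to)
esxcli system boot partition get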

# Example output:
#    Current: altbootbank
#    Next Boot: altbootbank
#    Active: True

The /altbootbank isn't static. It gets updated during:

  • ESXi patches and upgrades
  • Successful boots (after 3 consecutive successful boots from primary)
  • Manual sync operations

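To flush the running configuration to the boot bank:

# Persist host configuration (state.tgz) to /bootbank.
# This saves configuration only; it does not copy the image between banks.
/sbin/auto-backup.sh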

Consider this common troubleshooting sequence when primary boot fails:

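# 1. Look for boot-related fields in the host hardware summary
vim-cmd hostsvc/hosthardware | grep -i boot

# 2. Compare partition contents (if system boots)
diff -r /bootbank/ /altbootbank/

# 3. Manual recovery if automatic failover did not happen
#    (the "boot partition" namespace may be absent on your build; the reboot
#    command requires the host to be in maintenance mode)
esxcli system boot partition set -p altbootbank
esxcli system shutdown reboot -r "Boot partition recovery"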

Implement these checks in your automation scripts:

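#!/bin/sh
# Check boot partition health.
# Assumes "esxcli system boot partition get" is available and prints a
# "Current:" line as in the sample output above; adjust for your build.
CURRENT_BANK=$(esxcli system boot partition get 2>/dev/null | awk '/Current:/ {print $2}')

if [ "$CURRENT_BANK" = "altbootbank" ]; then
    logger -p user.warn "ESXi booted from alternate partition"
    # Trigger alerting system here
fi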

Regular maintenance should include:

  • Monthly partition checksum verification
  • Pre-upgrade partition backups
  • Post-upgrade partition synchronization
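
The checksum verification can be scripted. The following is a minimal sketch, assuming sha256sum and find are available in the host shell (substitute md5sum if not). make_manifest and verify_manifest are hypothetical helper names, and the manifest location is up to you.

```shell
#!/bin/sh
# make_manifest DIR: print "<sha256>  <relative path>" for every file in DIR,
# ordered by path so two manifests can be compared line by line.
make_manifest() {
    ( cd "$1" && find . -type f | sort | while IFS= read -r f; do
          sha256sum "$f"
      done )
}

# verify_manifest DIR MANIFEST: succeed if DIR still matches the saved manifest.
verify_manifest() {
    current=$(make_manifest "$1") || return 1
    [ "$current" = "$(cat "$2")" ]
}

# Example: record a baseline once, then check it from a scheduled job.
#   make_manifest /bootbank > /path/to/bootbank.sha256
#   verify_manifest /bootbank /path/to/bootbank.sha256 || echo "bootbank changed"
```

Store the manifest outside the bank being checked, and expect it to change after every patch. Files that hold saved configuration, such as state.tgz, are rewritten whenever the configuration is backed up, so you may want to filter them out before comparing.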