Troubleshooting Active-Backup Bonding Failover Issues in RHEL 6.4: Mode 1 Not Switching on Link Failure



On this RHEL 6.4 system with Broadcom NetXtreme II NICs, the bond initializes correctly but does not fail over when the active link actually goes down:

# Current bond status check
cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:64:f8:ef:60

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:64:f8:ef:62
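
When this status needs to be captured from a script rather than read by eye, the same file can be condensed with awk. The following is a minimal sketch: the function name bond_summary is hypothetical, and it assumes the "Label: value" layout shown in the output above.

```shell
# bond_summary FILE: print "active <slave>", then one "<slave> <MII status>"
# line per slave. FILE defaults to /proc/net/bonding/bond0.
bond_summary() {
    awk -F': ' '
        /^Currently Active Slave:/ { print "active " $2 }
        /^Slave Interface:/        { slave = $2 }
        # The first MII Status line describes the bond itself, so it is
        # ignored until a Slave Interface line has been seen.
        /^MII Status:/ && slave != "" { print slave " " $2; slave = "" }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Calling bond_summary with no argument reads the live bond0 status; passing a saved copy of the file makes it easy to compare states before and after a cable pull.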

Verify the following for proper active-backup operation:

  • Update the bonding options in ifcfg-bond0:
    BONDING_OPTS="mode=1 miimon=100 primary=eth0 fail_over_mac=1"
  • Make sure NetworkManager is not interfering:
    chkconfig NetworkManager off
    service NetworkManager stop

For HP ProCurve switches, these settings are recommended:

interface 1
   no lacp
   spanning-tree portfast
!
interface 2
   no lacp
   spanning-tree portfast

To monitor bond transitions in real-time:

watch -n 0.5 "cat /proc/net/bonding/bond0 | grep -e 'Active' -e 'MII' -e 'Slave'"

Force a manual failover for testing:

ifdown eth0
sleep 5
ifup eth0
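
Taking the slave down with ifdown is an administrative change, which is not quite the same event as a link failure on the wire. To exercise only the active-slave switch, the bonding driver's sysfs files can be used instead. This is a sketch that assumes the bond is named bond0 and exposes /sys/class/net/bond0/bonding/slaves and active_slave; the helper name set_active_slave and the SYSFS_NET override are hypothetical, the latter existing only so the function can be tried against a scratch directory.

```shell
# set_active_slave BOND SLAVE: ask the bonding driver to make SLAVE active.
# SYSFS_NET is a hypothetical override of the sysfs root, for dry runs.
set_active_slave() {
    bond=$1
    slave=$2
    dir="${SYSFS_NET:-/sys/class/net}/$bond/bonding"

    # Refuse interface names that are not currently enslaved to this bond.
    if ! tr ' ' '\n' < "$dir/slaves" | grep -qxF "$slave"; then
        echo "$slave is not a slave of $bond" >&2
        return 1
    fi
    echo "$slave" > "$dir/active_slave"
}
```

Run as root, for example: set_active_slave bond0 eth1. The command ifenslave -c bond0 eth1 requests the same change.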

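Driver-wide defaults can be set in /etc/modprobe.d/bonding.conf. These are not debugging options; they set the link-monitoring interval and delays for every bond, and on RHEL 6 per-bond settings in BONDING_OPTS are generally the preferred place for them: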

options bonding max_bonds=2 miimon=100 downdelay=200 updelay=200

Here's a verified working configuration for RHEL 6.4:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.11.222
NETMASK=255.255.255.0
GATEWAY=192.168.11.1
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS="mode=1 miimon=100 primary=eth0 fail_over_mac=1 use_carrier=0"
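
The bond definition also needs matching slave files, which are not listed here. A typical sketch for eth0 follows; ifcfg-eth1 would differ only in DEVICE. The NM_CONTROLLED=no line is an assumption based on the NetworkManager advice above, not part of the original configuration.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0
# (ifcfg-eth1 is identical apart from DEVICE)
DEVICE=eth0
# Enslave this interface to bond0; the IP address lives on the bond.
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
# Assumption: keep NetworkManager away from the slave interface.
NM_CONTROLLED=no
```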

After making changes, stop networking, reload the bonding module while the bond is down, and start networking again:

service network stop
rmmod bonding
modprobe bonding
service network start

When working with NIC bonding in RHEL 6.4 (kernel-2.6.32-358.el6), the active-backup (mode=1) configuration appears to initialize correctly but fails to perform failover when the primary interface loses connectivity. The system shows all bonding components as operational through standard diagnostic commands:

# Check bond status
cat /proc/net/bonding/bond0

# Output should show:
Ethernet Channel Bonding Driver: v3.6.0
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

A properly functioning active-backup bond shows:

  • Automatic promotion of the backup NIC when the active NIC fails
  • Gratuitous ARP announcements after failover, so that peers and switches update their MAC address mappings
  • Reliable carrier detection through MII/ETHTOOL

To verify the actual failover behavior, run these diagnostic commands while unplugging the primary NIC:

# Monitor bond events in real-time
tail -f /var/log/messages | grep bond

# Check active slave changes
watch -n 1 'grep "Active Slave" /proc/net/bonding/bond0'

# Verify ARP updates (run from another host; 192.168.11.222 is the bond0 address used in this setup)
arp -an | grep 192.168.11.222

From experience with Broadcom BCM5708 NICs on HP hardware, several factors could disrupt failover:

Network Manager Interference

Even with NM_CONTROLLED=no in the ifcfg files, NetworkManager may still interfere. Disable it completely:

service NetworkManager stop
chkconfig NetworkManager off
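
Before disabling the service, it can be worth checking what the ifcfg files say about NM_CONTROLLED. The following sketch lists bond and eth files that do not opt out of NetworkManager; the function name nm_unmanaged_check is hypothetical, and the default directory assumes the standard RHEL network-scripts location.

```shell
# nm_unmanaged_check DIR: print every ifcfg-bond*/ifcfg-eth* file in DIR that
# does not contain NM_CONTROLLED=no. Returns 0 if all files opt out, else 1.
nm_unmanaged_check() {
    dir=${1:-/etc/sysconfig/network-scripts}
    status=0
    for f in "$dir"/ifcfg-bond* "$dir"/ifcfg-eth*; do
        [ -f "$f" ] || continue   # skip patterns that matched nothing
        if ! grep -Eqi '^NM_CONTROLLED="?no"?[[:space:]]*$' "$f"; then
            echo "$f"
            status=1
        fi
    done
    return $status
}
```

An empty result with exit status 0 means every checked file already sets NM_CONTROLLED=no.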

Switch Port Configuration

Some switches require special port settings for bonding. Verify these ProCurve 1800-8G settings:

interface 1-2
   spanning-tree disable
   no lacp
exit

Driver-Specific Issues

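The bnx2 driver may need specific parameters. Its disable_msi option switches the NIC from MSI to legacy interrupts, which helps rule out interrupt delivery problems; confirm the options your driver build accepts with modinfo bnx2 before relying on any of them. Create /etc/modprobe.d/bnx2.conf:

options bnx2 disable_msi=1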

Modify your bond0 configuration with these enhanced parameters:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.11.222
NETMASK=255.255.255.0
GATEWAY=192.168.11.1
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS="mode=1 miimon=100 primary=eth0 fail_over_mac=1 updelay=2000 downdelay=2000 use_carrier=1"

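Key parameters explained:

  • fail_over_mac=1: The bond takes the MAC address of the currently active slave, so the bond's MAC changes on failover instead of one address being forced onto both slaves
  • updelay/downdelay=2000: Wait 2000 ms before enabling a recovered slave or disabling a failed one; both values should be multiples of miimon, and downdelay adds directly to the failover time
  • use_carrier=1: Use the driver's carrier state for link detection; this is the default, while use_carrier=0 falls back to MII/ethtool queries, which can help with drivers that report carrier unreliably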
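
The bonding driver rounds updelay and downdelay down to a multiple of miimon, so a mismatched value silently behaves differently from what was written. A small sketch that checks a BONDING_OPTS string for this follows; the function name check_bond_delays is hypothetical.

```shell
# check_bond_delays "OPTS": return 1 (with a message on stderr) when miimon is
# missing or zero, or when updelay/downdelay is not a multiple of miimon.
check_bond_delays() {
    opts=$1
    miimon=0
    rc=0
    # First pass: pick up the miimon value, if any.
    for kv in $opts; do
        case $kv in miimon=*) miimon=${kv#miimon=} ;; esac
    done
    if [ "$miimon" -le 0 ]; then
        echo "miimon is unset or 0: MII link monitoring is disabled" >&2
        return 1
    fi
    # Second pass: compare each delay against miimon.
    for kv in $opts; do
        case $kv in
            updelay=*|downdelay=*)
                val=${kv#*=}
                if [ $((val % miimon)) -ne 0 ]; then
                    echo "${kv%%=*}=$val is not a multiple of miimon=$miimon" >&2
                    rc=1
                fi ;;
        esac
    done
    return $rc
}
```

For example, check_bond_delays "mode=1 miimon=100 updelay=2000 downdelay=2000" returns 0, while a string containing updelay=250 is flagged.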

After implementing these changes, test failover with this procedure:

# Start continuous ping test
ping -I bond0 192.168.11.1

# In another terminal, monitor bond status
watch -n 0.5 'cat /proc/net/bonding/bond0 | grep -E "Active|MII"'

# Physically disconnect eth0 cable
# Should observe:
# 1. Brief ping interruption (a few packets; downdelay=2000 alone adds about 2 s)
# 2. Active slave changes to eth1 in watch output
# 3. Ping resumes automatically
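
To keep a record of when the switch actually happened, the watch command can be replaced by a small polling loop that prints a timestamped line on every change of the active slave. This is only a sketch: the function name watch_active_slave and its arguments are hypothetical, and it assumes the "Currently Active Slave:" line format shown earlier.

```shell
# watch_active_slave FILE INTERVAL COUNT: poll FILE every INTERVAL seconds,
# COUNT times, and print a timestamped line whenever the active slave changes.
watch_active_slave() {
    file=${1:-/proc/net/bonding/bond0}
    interval=${2:-1}
    count=${3:-60}
    prev=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Extract the value after the label; empty if no slave is active.
        cur=$(sed -n 's/^Currently Active Slave: //p' "$file")
        if [ "$cur" != "$prev" ]; then
            echo "$(date '+%H:%M:%S') active slave: ${prev:-none} -> ${cur:-none}"
            prev=$cur
        fi
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Running watch_active_slave /proc/net/bonding/bond0 1 120 in a second terminal during the cable pull gives timestamps that can be lined up with the gaps in the ping output.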

For production environments, verify these additional components:

# Check kernel bonding support
grep BONDING /boot/config-$(uname -r)

# Verify that the NIC driver and bonding modules are both loaded
lsmod | grep -E 'bnx2|bonding'

# Rebuild the initramfs so it picks up the modprobe.d changes
dracut -f -v