Troubleshooting Active-Backup Bonding Failover Issues in RHEL 6.4: Mode 1 Not Switching on Link Failure



On this RHEL 6.4 system with Broadcom NetXtreme II NICs, the bond initializes correctly but does not fail over when the active link goes down:

# Current bond status check
cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:64:f8:ef:60

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:64:f8:ef:62
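
The status file is easier to compare across test runs when reduced to one line per slave. The following is a minimal sketch, assuming the v3.6.0 file layout shown above; the function name bond_summary is made up for illustration.

```shell
#!/bin/bash
# Summarize a bonding status file: the active slave, then each slave's
# MII status and link failure count. bond_summary is a hypothetical helper.
# The path defaults to bond0; pass another file to inspect a saved copy.
bond_summary() {
    local status_file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        /^Currently Active Slave:/ { print "active=" $2 }
        /^Slave Interface:/        { slave = $2 }
        # The bond-level "MII Status" line appears before any slave section,
        # so it is skipped while slave is still empty.
        /^MII Status:/ && slave != ""         { mii[slave] = $2 }
        /^Link Failure Count:/ && slave != "" {
            print slave " mii=" mii[slave] " failures=" $2
        }
    ' "$status_file"
}

# Only query the live bond when it actually exists on this host.
if [ -r /proc/net/bonding/bond0 ]; then
    bond_summary
fi
```

A Link Failure Count that stays at 0 after a cable pull suggests the driver never saw the link drop, which points at link monitoring rather than at the switch.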

The following elements must be verified for proper active-backup operation:

  • Update bonding options in ifcfg-bond0:
    BONDING_OPTS="mode=1 miimon=100 primary=eth0 fail_over_mac=1"
  • Verify NetworkManager isn't interfering:
    chkconfig NetworkManager off
    service NetworkManager stop

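For HP ProCurve switches, the two bond ports should be plain access ports with LACP off and spanning tree set to forward immediately. The syntax below is illustrative and varies by model and firmware (ProCurve firmware usually names the portfast equivalent admin-edge-port):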

interface 1
   no lacp
   spanning-tree portfast
!
interface 2
   no lacp
   spanning-tree portfast

To monitor bond transitions in real-time:

watch -n 0.5 "grep -e 'Active' -e 'MII' -e 'Slave' /proc/net/bonding/bond0"

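Force a manual failover for testing. Note that ifdown on a slave also detaches it from the bond until ifup re-enslaves it, so this tests slave removal rather than link-loss detection: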

ifdown eth0
sleep 5
ifup eth0
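
An alternative that keeps both slaves enslaved is to ask the driver to change the active slave through sysfs. This is a sketch only: switch_active_slave is a hypothetical helper, and the third argument exists purely so the function can be pointed at a scratch directory instead of the real /sys/class/net.

```shell
#!/bin/bash
# Switch the active slave of an active-backup bond via the bonding sysfs
# interface (/sys/class/net/<bond>/bonding/active_slave). Needs root.
# switch_active_slave is an illustrative name, not a standard tool.
switch_active_slave() {
    local bond="$1" new_slave="$2" sysfs_root="${3:-/sys/class/net}"
    local ctl="$sysfs_root/$bond/bonding/active_slave"

    if [ ! -w "$ctl" ]; then
        echo "cannot write $ctl (bond missing or not root?)" >&2
        return 1
    fi
    echo "was: $(cat "$ctl")"
    echo "$new_slave" > "$ctl" || return 1
    echo "now: $(cat "$ctl")"
}

# Example (run as root on the bonded host):
#   switch_active_slave bond0 eth1
```

If this switch works but a cable pull does not trigger failover, the bond itself is healthy and the fault lies in link detection.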

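Module-wide defaults can be set in /etc/modprobe.d/bonding.conf. They are not debugging aids: they apply to every bond the module creates, and values in BONDING_OPTS still take effect per bond. Red Hat's RHEL 6 documentation prefers BONDING_OPTS over this file, so treat it as optional: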

options bonding max_bonds=2 miimon=100 downdelay=200 updelay=200

Here's a verified working configuration for RHEL 6.4:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.11.222
NETMASK=255.255.255.0
GATEWAY=192.168.11.1
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS="mode=1 miimon=100 primary=eth0 fail_over_mac=1 use_carrier=0"
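
The bond definition only works if each slave has a matching ifcfg file, which the write-up does not show. The fragment below is a sketch of a typical RHEL 6 slave definition, not the original poster's files; NM_CONTROLLED=no is an assumption that matches the advice to keep NetworkManager away from the bond.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (assumed slave definition)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
NM_CONTROLLED=no

# /etc/sysconfig/network-scripts/ifcfg-eth1 is identical apart from DEVICE=eth1
```

The slaves carry no IPADDR of their own; addressing lives only in ifcfg-bond0.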

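After making changes, stop networking, reload the bonding module so it starts from a clean state, then start networking again. Run this from the console, not over a session that travels through bond0:

service network stop
rmmod bonding
modprobe bonding
service network start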

When working with NIC bonding in RHEL 6.4 (kernel-2.6.32-358.el6), the active-backup (mode=1) configuration appears to initialize correctly but fails to perform failover when the primary interface loses connectivity. The system shows all bonding components as operational through standard diagnostic commands:

# Check bond status
cat /proc/net/bonding/bond0

# Output should show:
Ethernet Channel Bonding Driver: v3.6.0
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

The key indicators of a properly functioning active-backup bond are:

  • Automatic promotion of the backup NIC when the active NIC fails
  • Gratuitous ARP sent after failover so peers and switches learn the new MAC/port mapping
  • Reliable carrier detection through the miimon link monitor (driver carrier state or MII/ethtool)
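
Carrier detection can be checked directly by comparing what ethtool reports with the kernel's carrier flag for each slave. The sketch below assumes the slaves are eth0 and eth1, as in this setup; ethtool_link is a hypothetical helper that extracts the "Link detected" field from ethtool output.

```shell
#!/bin/bash
# Print the value of "Link detected:" from `ethtool IFACE` output on stdin.
# Exits non-zero if the field is absent. ethtool_link is an illustrative name.
ethtool_link() {
    awk -F': ' '/Link detected:/ { print $2; found = 1 } END { exit !found }'
}

# Compare ethtool's view with the kernel carrier flag (1 = link up).
for nic in eth0 eth1; do
    [ -e "/sys/class/net/$nic/carrier" ] || continue
    # Reading carrier fails while the interface is administratively down.
    carrier=$(cat "/sys/class/net/$nic/carrier" 2>/dev/null) || carrier="unknown"
    if command -v ethtool >/dev/null 2>&1; then
        link=$(ethtool "$nic" 2>/dev/null | ethtool_link) || link="unknown"
    else
        link="ethtool not installed"
    fi
    echo "$nic: carrier=$carrier ethtool_link=$link"
done
```

If the two views disagree after a cable pull, changing use_carrier selects which one the bonding driver trusts.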

To verify the actual failover behavior, run these diagnostic commands while unplugging the primary NIC:

# Monitor bond events in real-time
tail -f /var/log/messages | grep bond

# Check active slave changes
watch -n 1 'grep "Active Slave" /proc/net/bonding/bond0'

# Verify ARP updates (run from another host)
arp -an | grep 192.168.11.222

From experience with Broadcom BCM5708 NICs on HP hardware, several factors could disrupt failover:

Network Manager Interference

Even with NM_CONTROLLED=no in the ifcfg files, NetworkManager may still interfere with the bond. Disable it completely:

service NetworkManager stop
chkconfig NetworkManager off
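
If NetworkManager has to stay installed, it helps to confirm that every interface file opts out of it. This is a rough sketch; nm_managed_files is a hypothetical helper, and the directory argument defaults to the standard RHEL 6 location.

```shell
#!/bin/bash
# List ifcfg files that do NOT contain NM_CONTROLLED=no, meaning that
# NetworkManager may still try to manage them. Illustrative helper only.
nm_managed_files() {
    local dir="${1:-/etc/sysconfig/network-scripts}" f
    for f in "$dir"/ifcfg-*; do
        [ -f "$f" ] || continue
        # Accept quoted or unquoted values, in any letter case.
        grep -Eiq '^NM_CONTROLLED="?no"?[[:space:]]*$' "$f" || echo "$f"
    done
}

nm_managed_files
```

An empty result means every ifcfg file in the directory is explicitly excluded from NetworkManager.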

Switch Port Configuration

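Some switches need specific port settings for bonding to fail over cleanly: the ports must not be in an LACP trunk, and spanning tree must not hold a port in a blocking or learning state after link-up. The ProCurve 1800-8G is a web-managed model, so read the listing below as shorthand for the equivalent settings in its web interface: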

interface 1-2
   spanning-tree disable
   no lacp
exit

Driver-Specific Issues

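The bnx2 driver has few tunables. Check what it accepts with modinfo -p bnx2 before adding options, because an unrecognized option (such as debug) can stop the module from loading on this kernel. Its documented parameter is disable_msi (default 0); setting it to 1 forces legacy interrupts, which is worth testing if interrupt delivery is suspect. Create /etc/modprobe.d/bnx2.conf:

options bnx2 disable_msi=1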

Modify your bond0 configuration with these enhanced parameters:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.11.222
NETMASK=255.255.255.0
GATEWAY=192.168.11.1
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS="mode=1 miimon=100 primary=eth0 fail_over_mac=1 updelay=2000 downdelay=2000 use_carrier=1"

Key parameters explained:

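  • fail_over_mac=1: the bond takes the MAC address of whichever slave is active, so its MAC changes on failover instead of both slaves sharing one address
  • updelay=2000: wait 2 s after a link comes back before using it again, giving the switch port time to start forwarding
  • downdelay=2000: wait 2 s after link loss before declaring the slave failed; this rides out brief flaps but delays failover by the same amount
  • use_carrier=1: take link state from the driver's carrier flag (the default) instead of the older MII/ethtool ioctls selected by use_carrier=0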
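
The bonding driver documentation states that updelay and downdelay should be multiples of miimon and are otherwise rounded down, and that they are ignored when miimon is 0. The arithmetic can be sketched as follows; effective_delay is a hypothetical helper, not part of any bonding tool.

```shell
#!/bin/bash
# Compute the delay the bonding driver would actually apply, given miimon
# and a requested updelay/downdelay (both in milliseconds).
# effective_delay is an illustrative name.
effective_delay() {
    local miimon="$1" delay="$2"
    # With MII monitoring off, the delays have no effect.
    if [ "$miimon" -le 0 ]; then
        echo 0
        return
    fi
    # Integer division rounds down to the nearest multiple of miimon.
    echo $(( delay / miimon * miimon ))
}

effective_delay 100 2000   # settings used in this configuration
effective_delay 100 250    # a non-multiple is rounded down
```

With miimon=100, a requested delay of 250 therefore behaves like 200, while the 2000 used here is applied unchanged.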

After implementing these changes, test failover with this procedure:

# Start continuous ping test
ping -I bond0 192.168.11.1

# In another terminal, monitor bond status
watch -n 0.5 'cat /proc/net/bonding/bond0 | grep -E "Active|MII"'

# Physically disconnect eth0 cable
# Should observe:
# 1. Ping interruption of roughly downdelay plus one miimon interval (about 2 s here, so 2-3 lost packets)
# 2. Active slave changes to eth1 in watch output
# 3. Ping resumes automatically
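
To put timestamps on the switchover instead of watching the screen, the status file can be polled in a loop. This is a minimal sketch; active_slave and log_transitions are hypothetical helpers, and the poll count is a parameter so the loop ends on its own.

```shell
#!/bin/bash
# Print the currently active slave from a bonding status file.
active_slave() {
    sed -n 's/^Currently Active Slave: //p' "${1:-/proc/net/bonding/bond0}"
}

# Poll the status file and print a timestamped line whenever the active
# slave changes. Arguments: status file, interval in seconds, poll count.
log_transitions() {
    local file="$1" interval="$2" polls="$3"
    local prev="" cur i
    for (( i = 0; i < polls; i++ )); do
        cur=$(active_slave "$file")
        if [ "$cur" != "$prev" ]; then
            echo "$(date '+%H:%M:%S') active slave: ${prev:-none} -> ${cur:-none}"
            prev="$cur"
        fi
        sleep "$interval"
    done
}

# Example: poll five times a second for one minute while pulling the cable.
#   log_transitions /proc/net/bonding/bond0 0.2 300
```

Comparing these timestamps with the gap in the ping sequence numbers shows whether the outage comes from the bond or from the switch relearning the MAC address.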

For production environments, verify these additional components:

# Check kernel bonding support
grep BONDING /boot/config-$(uname -r)

# Confirm the NIC driver and bonding modules are both loaded
lsmod | grep -E 'bnx2|bonding'

# Rebuild the initramfs for the running kernel so it includes the current module configuration
dracut -f -v
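
It is also worth confirming that the options in ifcfg-bond0 actually reached the driver, since the status output above shows Primary Slave: None. The sketch below prints each configured option next to its runtime value from sysfs. bonding_opts and compare_opts are hypothetical helpers, and some sysfs files report a name plus a number (for example "active-backup 1"), so the comparison is left to the reader.

```shell
#!/bin/bash
# Print each BONDING_OPTS entry from an ifcfg file on its own line.
# The file is sourced in a subshell so its variables do not leak.
bonding_opts() {
    ( . "$1" && printf '%s\n' $BONDING_OPTS )
}

# Show configured versus runtime values. Arguments: ifcfg file and the
# bond's sysfs directory (normally /sys/class/net/bond0/bonding).
compare_opts() {
    local cfg="$1" bond_dir="$2"
    local kv key want have
    for kv in $(bonding_opts "$cfg"); do
        key=${kv%%=*}
        want=${kv#*=}
        have=$(cat "$bond_dir/$key" 2>/dev/null) || have="(unreadable)"
        echo "$key configured=$want runtime=$have"
    done
}

# Only run against the live system when both paths exist.
cfg=/etc/sysconfig/network-scripts/ifcfg-bond0
if [ -r "$cfg" ] && [ -d /sys/class/net/bond0/bonding ]; then
    compare_opts "$cfg" /sys/class/net/bond0/bonding
fi
```

An empty runtime value for primary would match the "Primary Slave: None" line and indicate that BONDING_OPTS was not applied when the bond came up.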