Keepalived VIP Failover Issue: Handling Systemd Network Reloads in Ubuntu 18.04



When using keepalived for VIP failover between Ubuntu 18.04 VMs, we encounter a critical edge case during systemd-networkd reloads. The virtual IP gets dropped during network service restarts, but keepalived fails to detect this condition because:

  • VRRP advertisements continue between nodes
  • Basic connectivity checks pass (ping works)
  • No native VIP presence verification exists

An initial attempt, a tracked script that greps the ip addr output for the VIP, leaves the node in an unstable state:

vrrp_script chk_proxyip {
    script "/sbin/ip addr |/bin/grep 1.2.3.4"
    interval 2
    fall 2
    rise 2
}

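The fundamental issue is that once a node enters FAULT state, it needs manual intervention to recover: keepalived removes the VIP from a faulted node, so a check that only looks for the VIP locally can never pass again. This creates an availability paradox where: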

  1. The backup node takes over correctly
  2. The original master remains in FAULT indefinitely
  3. Failback never occurs automatically

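A more robust setup combines three pieces: a tracked health-check script, non-preemptive VRRP settings, and systemd service ordering. Start with the tracked script: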

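vrrp_script chk_vip_health {
    script "/usr/local/bin/vip_healthcheck.sh"
    interval 3    # run the script every 3 seconds
    timeout 2     # treat a run longer than 2 seconds as failed
    rise 2        # 2 consecutive successes mark the check as up
    fall 2        # 2 consecutive failures mark the check as down
    # No weight is set (default 0), so a failed check puts the instance into
    # FAULT. A negative weight would only lower the priority, and with
    # nopreempt the peer would not take over from a lower-priority MASTER.
}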

The healthcheck script should verify both local presence and remote accessibility:

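#!/bin/bash
# /usr/local/bin/vip_healthcheck.sh
# keepalived treats exit 0 as success and any other status as failure.

VIP="1.2.3.4"
# -F matches the dots literally; -w stops 1.2.3.4 from matching 11.2.3.4
LOCAL_CHECK=$(ip -o -4 addr show | grep -wF -- "$VIP")

if [ -n "$LOCAL_CHECK" ]; then
    # VIP exists locally: verify it responds. A ping to a local address is
    # answered by this host itself, so this mainly confirms the address is up.
    ping -c1 -w1 "$VIP" >/dev/null 2>&1 || exit 1
    exit 0
else
    # VIP is not local: verify that some other node answers for it.
    # Caveat: if no node currently holds the VIP, this fails on every node.
    ping -c1 -w1 "$VIP" >/dev/null 2>&1 || exit 1
    exit 0
fi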
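
One limitation of a check like this is that it cannot tell a MASTER that has lost the VIP from a BACKUP that was never supposed to hold it. Below is a minimal sketch of a state-aware variant. It assumes (this is not part of the original setup) that something, for example a notify script, writes the current VRRP state to a file. STATE_FILE and its path, as well as the function names has_vip, decide, and vip_healthcheck, are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch of a state-aware VIP check. The state file path is an assumption:
# something (for example a notify script) must write MASTER/BACKUP/FAULT there.
VIP="${VIP:-1.2.3.4}"
IFACE="${IFACE:-ens160}"
STATE_FILE="${STATE_FILE:-/run/keepalived-VI_1.state}"   # hypothetical path

# has_vip ADDR: reads `ip -o -4 addr show` output on stdin and succeeds only
# if ADDR appears as an exact address (so 11.2.3.4 does not match 1.2.3.4).
has_vip() {
    awk -v vip="$1" '{ split($4, a, "/"); if (a[1] == vip) found = 1 }
                     END { exit(found ? 0 : 1) }'
}

# decide STATE PRESENT: only a MASTER is expected to hold the VIP, so the
# check fails only when STATE is MASTER and PRESENT is not "yes".
decide() {
    if [ "$1" = "MASTER" ] && [ "$2" != "yes" ]; then
        return 1
    fi
    return 0
}

# vip_healthcheck: glue that reads the recorded state and the live addresses.
vip_healthcheck() {
    state=$(cat "$STATE_FILE" 2>/dev/null || echo UNKNOWN)
    if ip -o -4 addr show dev "$IFACE" 2>/dev/null | has_vip "$VIP"; then
        present=yes
    else
        present=no
    fi
    decide "$state" "$present"
}
```

To use it as the tracked script, end the file with a call to vip_healthcheck. A node in BACKUP or FAULT then passes the check and stays eligible for takeover, while a MASTER that has lost the address fails it.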

Modify your keepalived.conf with these critical parameters:

vrrp_instance VI_1 {
    state BACKUP
    interface ens160
    virtual_router_id 101
    priority 100  # Set identical priority on both nodes
    advert_int 1
    nopreempt     # Critical for stable operation
    authentication {
        auth_type PASS
        auth_pass secret
    }
    track_script {
        chk_vip_health
    }
    virtual_ipaddress {
        1.2.3.4
    }
    notify_master "/usr/local/bin/vip_takeover.sh"
    notify_backup "/usr/local/bin/vip_release.sh"
}
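
The two notify scripts referenced above are not shown. keepalived adds and removes the addresses listed under virtual_ipaddress on its own, so these scripts do not need to touch the VIP. One plausible job for them is to record the current state so that other tooling can read it. The sketch below is an assumption about what such a helper could look like, not the original scripts; record_state and the STATE_FILE path are hypothetical.

```shell
#!/bin/sh
# Sketch of a helper shared by the notify scripts. The path is hypothetical.
STATE_FILE="${STATE_FILE:-/run/keepalived-VI_1.state}"

# record_state STATE: write STATE atomically (temp file + rename) so that a
# concurrent reader never sees a half-written file.
record_state() {
    tmp="${STATE_FILE}.tmp.$$"
    printf '%s\n' "$1" > "$tmp" && mv -f "$tmp" "$STATE_FILE"
}
```

Under this assumption, vip_takeover.sh would call record_state MASTER and vip_release.sh would call record_state BACKUP. A third script registered with notify_fault could call record_state FAULT.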

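Create a systemd drop-in so that keepalived starts only after the network is online. This ordering applies when the units are started (for example at boot); it does not by itself react to a later systemd-networkd reload:

# /etc/systemd/system/keepalived.service.d/10-network-dependency.conf
[Unit]
After=network-online.target
Wants=network-online.target

[Service]
# Give the network a few seconds to settle before keepalived starts
ExecStartPre=/bin/sleep 5
# Delay between automatic restarts (only used if Restart= is set for the unit)
RestartSec=10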

The ideal behavior should be:

1. NodeA (MASTER) loses VIP during networkd reload
2. Healthcheck fails → NodeA enters FAULT state
3. NodeB detects missing VRRP ads → becomes MASTER
4. When NodeA recovers:
   - VIP healthcheck passes
   - NodeA becomes BACKUP
   - NodeB remains MASTER until failure
5. Controlled failback only during maintenance windows

This approach provides stable VIP availability while preventing the "ping-pong" effect between nodes. The key improvements are:

  • Dual verification (local presence + remote response)
  • nopreempt to prevent unnecessary failbacks
  • Proper systemd service ordering
  • State transition scripts for clean handoffs

The same failure mode also defeats a simple ping-based check. The VIP is dropped during a systemd-networkd reload (common during updates on Ubuntu 18.04), yet VRRP advertisements and node-to-node pings keep working, so a tracked script that pings the VIP looks like a natural fix:

vrrp_script chk_proxyip {
    script "/bin/ping -c 1 -w 1 1.2.3.4"
    interval 2
}

This creates a race condition where:

  1. Master detects VIP loss and faults
  2. Backup promotes and acquires VIP
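  3. The VIP answers pings again (now from the backup), so the original master's check passes and, having the higher priority, it resumes advertisements and preempts
  4. The VIP flaps between the nodes in a failover loop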

The following enhanced configuration addresses the problem with a weighted check and state notifications:

vrrp_script chk_vip {
    script "/usr/local/bin/check_vip_state.sh"
    interval 2
    weight -50  # Penalize but don't force failover
    fall 2
    rise 1
}

vrrp_instance VI_1 {
    ...
    track_script {
        chk_vip
    }
    notify "/usr/local/bin/keepalived_state.sh"
}
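
The effect of weight on the election can be checked with simple arithmetic. The sketch below is a simplified model of how keepalived adjusts the priority for a tracked script: a negative weight is applied while the check is failing, a positive weight while it is passing, and the result stays within 1 to 254. The function name effective_priority and the base priorities 100 and 90 are assumptions for illustration.

```shell
#!/bin/sh
# effective_priority BASE WEIGHT RESULT
# RESULT is "ok" or "fail" (the outcome of the tracked script).
effective_priority() {
    base=$1
    weight=$2
    result=$3
    prio=$base
    if [ "$weight" -lt 0 ] && [ "$result" = "fail" ]; then
        prio=$((base + weight))      # negative weight: penalty while failing
    elif [ "$weight" -gt 0 ] && [ "$result" = "ok" ]; then
        prio=$((base + weight))      # positive weight: bonus while passing
    fi
    # Keep the value inside the range usable by a non-owner router
    if [ "$prio" -lt 1 ]; then prio=1; fi
    if [ "$prio" -gt 254 ]; then prio=254; fi
    echo "$prio"
}

# Example: NodeA (base 100) fails its check, NodeB (base 90) passes
a=$(effective_priority 100 -50 fail)
b=$(effective_priority 90 -50 ok)
echo "NodeA=$a NodeB=$b"    # prints NodeA=50 NodeB=90
```

With these example values NodeB outranks NodeA once NodeA's check fails. That only leads to a takeover if the gap between the base priorities is smaller than the penalty and NodeB is allowed to preempt.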

check_vip_state.sh:

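#!/bin/bash
# Check both local VIP presence AND service health via the VIP.
# Note: on a BACKUP node the VIP is legitimately absent, so this check
# fails there as well and the weight applies on both nodes.

# -F matches the dots literally; -w stops 1.2.3.4 from matching 11.2.3.4
if ! ip -o -4 addr show | grep -qwF "1.2.3.4"; then
    # Local check failed - VIP missing
    exit 1
fi

# Remote service check (adjust for your service).
# -f makes curl return non-zero on HTTP errors (4xx/5xx).
if ! curl -fs -m 1 http://1.2.3.4:80/health &>/dev/null; then
    exit 1
fi
exit 0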

keepalived_state.sh:

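#!/bin/bash
# keepalived calls this script as: <script> TYPE NAME STATE
TYPE=$1     # "GROUP" or "INSTANCE"
NAME=$2     # name of the group or instance, e.g. VI_1
STATE=$3    # "MASTER", "BACKUP" or "FAULT"

case "$STATE" in
    "FAULT")
        # --no-block: do not wait for the restart, which also stops this script.
        # If the fault persists, this can restart keepalived repeatedly.
        systemctl --no-block try-restart keepalived
        ;;
    "MASTER")
        # keepalived adds the VIP itself; this is a safety net, so ignore
        # the error if the address already exists
        /sbin/ip addr add 1.2.3.4/32 dev ens160 2>/dev/null || true
        ;;
    "BACKUP")
        # Ignore the error if the address is already gone
        /sbin/ip addr del 1.2.3.4/32 dev ens160 2>/dev/null || true
        ;;
esac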

Key elements of this configuration:

  • Weighted checks: Avoid immediate failover for transient issues
  • State notifications: Clean VIP management during transitions
  • Dual verification: Checks both VIP presence and service health
  • Self-healing: Automatic restart in fault state

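For critical deployments, consider adding these to your keepalived.conf. Be careful with vrrp_strict: in strict mode keepalived does not accept VRRP authentication and typically firewalls traffic addressed to the VIP, which conflicts with the auth_type PASS setting and the ping-based checks shown earlier: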

global_defs {
    vrrp_strict  # Enforce protocol compliance
    vrrp_garp_master_refresh 60  # Refresh ARP regularly
    vrrp_garp_master_repeat 2
}

For troubleshooting, follow the keepalived journal and watch address changes in real time:

journalctl -u keepalived -f
ip monitor address
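
The address monitor can also be scripted, for instance to log the moment the VIP disappears. This is a minimal sketch, assuming the one-line output format of ip -o monitor address, where a removal is reported as a line starting with "Deleted". The helper names is_vip_deletion and watch_vip and the vip-watch log tag are hypothetical.

```shell
#!/bin/sh
VIP="${VIP:-1.2.3.4}"

# is_vip_deletion ADDR: reads monitor output on stdin and succeeds if it
# reports the removal of exactly ADDR, e.g.
#   Deleted 2: ens160    inet 1.2.3.4/32 scope global ens160
is_vip_deletion() {
    awk -v vip="$1" '$1 == "Deleted" { split($5, a, "/"); if (a[1] == vip) hit = 1 }
                     END { exit(hit ? 0 : 1) }'
}

# watch_vip: follow address events and log every removal of the VIP.
watch_vip() {
    ip -o monitor address | while read -r line; do
        if printf '%s\n' "$line" | is_vip_deletion "$VIP"; then
            logger -t vip-watch "VIP $VIP was removed from this node"
        fi
    done
}
```

Calling watch_vip in a terminal or from a small service writes a log entry for each removal, which can then be lined up with the keepalived journal to see whether a state change followed.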