Keepalived VIP Failover Issue: Handling Systemd Network Reloads in Ubuntu 18.04

When using keepalived for VIP failover between Ubuntu 18.04 VMs, we encounter a critical edge case during systemd-networkd reloads. The virtual IP gets dropped during network service restarts, but keepalived fails to detect this condition because:

VRRP advertisements continue between nodes
Basic connectivity checks pass (ping works)
No native VIP presence verification exists

An initial attempt that greps the output of ip addr for the VIP leaves the cluster in an unstable state:

vrrp_script chk_proxyip {
    script "/sbin/ip addr |/bin/grep 1.2.3.4"
    interval 2
    fall 2
    rise 2
}

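The fundamental issue is that a node only holds the VIP while it is MASTER, so this check always fails on a BACKUP node and pushes it into FAULT state. Once in FAULT, the node cannot acquire the VIP, the check can never pass again, and recovery requires manual intervention. This creates an availability paradox where: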

The backup node takes over correctly
The original master remains in FAULT indefinitely
Failback never occurs automatically

The fix combines three pieces: a tracked healthcheck script, a non-preempting VRRP instance, and systemd service ordering. Start with the tracked script:

vrrp_script chk_vip_health {
    script "/usr/local/bin/vip_healthcheck.sh"
    interval 3
    timeout 2
    rise 2
    fall 2
    weight -50  # Failure lowers priority by 50; with no weight a failure forces FAULT
}

The healthcheck script should verify both local presence and remote accessibility:

#!/bin/bash
# /usr/local/bin/vip_healthcheck.sh

VIP="1.2.3.4"
# -F treats the dots literally; -w keeps 1.2.3.4 from matching 1.2.3.40
LOCAL_CHECK=$(ip -o addr show | grep -wF "$VIP")

# If VIP exists locally, verify it responds
if [ -n "$LOCAL_CHECK" ]; then
    if ! ping -c1 -w1 "$VIP" >/dev/null; then
        exit 1
    fi
    exit 0
else
    # If VIP doesn't exist locally, verify it exists somewhere
    if ping -c1 -w1 "$VIP" >/dev/null; then
        exit 0
    fi
    exit 1
fi
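
It can help to enumerate what the script returns for each combination of inputs. The sketch below mirrors the branch structure of vip_healthcheck.sh with the two probes replaced by plain yes/no arguments; the function names decide and truth_table are made up for this illustration. Under that assumption, both branches return the ping result, so as written the script behaves like a plain reachability check and the local-presence test does not change the outcome.

```shell
#!/bin/bash
# decide LOCAL PING: mirror the branches of vip_healthcheck.sh.
# LOCAL and PING are "yes" or "no" and stand in for the real
# "ip addr" and "ping" probes (illustrative names, not keepalived API).
decide() {
    local local_present="$1" ping_ok="$2"
    if [ "$local_present" = "yes" ]; then
        # VIP is on this node: healthy only if it answers
        [ "$ping_ok" = "yes" ] && return 0
        return 1
    else
        # VIP is not on this node: healthy only if it answers elsewhere
        [ "$ping_ok" = "yes" ] && return 0
        return 1
    fi
}

# truth_table: print the exit status for all four input combinations.
truth_table() {
    local l p r
    for l in yes no; do
        for p in yes no; do
            if decide "$l" "$p"; then r=0; else r=1; fi
            echo "local=$l ping=$p exit=$r"
        done
    done
}
```

Calling truth_table prints one line per combination, which makes it easy to see where the two branches would need to differ if the local check is meant to influence the result.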

Modify your keepalived.conf with these critical parameters:

vrrp_instance VI_1 {
    state BACKUP  # nopreempt requires initial state BACKUP on both nodes
    interface ens160
    virtual_router_id 101
    priority 100  # Set identical priority on both nodes
    advert_int 1
    nopreempt     # Critical for stable operation
    authentication {
        auth_type PASS
        auth_pass secret
    }
    track_script {
        chk_vip_health
    }
    virtual_ipaddress {
        1.2.3.4
    }
    notify_master "/usr/local/bin/vip_takeover.sh"
    notify_backup "/usr/local/bin/vip_release.sh"
}
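
The configuration references vip_takeover.sh and vip_release.sh without showing them. The following is only a sketch of what a takeover script might contain, not the original author's script: it logs the transition and sends unsolicited ARP for the VIP. It assumes the iputils version of arping (package iputils-arping on Ubuntu) and introduces a hypothetical DRY_RUN switch so the commands can be inspected without being executed. keepalived already sends gratuitous ARP itself on becoming MASTER, so the arping call is a belt-and-braces measure.

```shell
#!/bin/bash
# Hypothetical sketch of /usr/local/bin/vip_takeover.sh.
# VIP and IFACE match the example config; DRY_RUN is an assumption
# added here so the commands can be printed instead of run.
VIP="1.2.3.4"
IFACE="ens160"

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

takeover() {
    # Record the transition in syslog next to keepalived's own messages.
    run logger -t vip_takeover "became MASTER for $VIP on $IFACE"
    # Unsolicited ARP so neighbours refresh their caches (iputils arping flags).
    run arping -U -c 3 -I "$IFACE" "$VIP"
}
```

A real script would end with a call to takeover. Running DRY_RUN=1 takeover in a shell that has sourced the functions prints the two commands without touching the network.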

Create a systemd drop-in unit to properly sequence service restarts:

# /etc/systemd/system/keepalived.service.d/10-network-dependency.conf
[Unit]
After=network-online.target
Wants=network-online.target

# Give the network time to settle before keepalived (re)starts.
# RestartSec only takes effect when a Restart= policy is set for the unit.
[Service]
RestartSec=10
ExecStartPre=/bin/sleep 5
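
After adding the drop-in, it can be worth confirming that the ordering actually took effect. The sketch below assumes the drop-in has been loaded with systemctl daemon-reload and inspects the After= line printed by systemctl show; the helper name has_ordering is made up for this example.

```shell
#!/bin/bash
# has_ordering TARGET: read one "After=..." line (as printed by
# "systemctl show -p After UNIT") on stdin and succeed if TARGET is listed.
has_ordering() {
    local target="$1" line
    # Accept input with or without a trailing newline; fail on empty input.
    IFS= read -r line || [ -n "$line" ] || return 1
    # Pad with spaces so only whole unit names match.
    case " ${line#After=} " in
        *" $target "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A typical use would be: systemctl show -p After keepalived | has_ordering network-online.target && echo "ordering ok".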

The ideal behavior should be:

1. NodeA (MASTER) loses VIP during networkd reload
2. Healthcheck fails → NodeA enters FAULT state
3. NodeB detects missing VRRP advertisements → becomes MASTER
4. When NodeA recovers:
   - VIP healthcheck passes
   - NodeA becomes BACKUP
   - NodeB remains MASTER until failure
5. Controlled failback only during maintenance windows

This approach provides stable VIP availability while preventing the "ping-pong" effect between nodes. The key improvements are:

Dual verification (local presence + remote response)
nopreempt to prevent unnecessary failbacks
Proper systemd service ordering
State transition scripts for clean handoffs

To restate the problem: during systemd-networkd reloads (common during updates on Ubuntu 18.04), the virtual IP is dropped, but Keepalived does not trigger a failover because:

VRRP advertisements continue between nodes
Basic connectivity checks pass (nodes can ping each other)
No native VIP presence monitoring exists

The attempted ping-based solution shows fundamental VRRP limitations:

vrrp_script chk_proxyip {
    script "/bin/ping -c 1 -w 1 1.2.3.4"
    interval 2
}

This creates a race condition where:

Master detects VIP loss and faults
Backup promotes and acquires VIP
Original master (higher priority) resumes advertisements
Endless failover loop occurs

This enhanced configuration addresses the problem through state-aware checking:

vrrp_script chk_vip {
    script "/usr/local/bin/check_vip_state.sh"
    interval 2
    weight -50  # Penalize but don't force failover
    fall 2
    rise 1
}

vrrp_instance VI_1 {
    ...
    track_script {
        chk_vip
    }
    notify "/usr/local/bin/keepalived_state.sh"
}

check_vip_state.sh:

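#!/bin/bash
# Check both local VIP presence AND remote accessibility
VIP="1.2.3.4"

# -F treats the dots literally; -w keeps 1.2.3.4 from matching 1.2.3.40
if ! ip -o addr show | grep -qwF "$VIP"; then
    # Local check failed - VIP missing
    exit 1
fi

# Remote service check (adjust for your service)
# -f makes curl fail on HTTP error responses, not only on connection errors
if ! curl -fs -m 1 "http://$VIP:80/health" &>/dev/null; then
    exit 1
fi
exit 0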
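
A presence test can also parse the address field instead of pattern-matching whole lines, which rules out partial matches by construction. The sketch below assumes the one-line-per-address layout of ip -o addr show, where the third field is the address family and the fourth is ADDRESS/PREFIX; the helper name has_vip is hypothetical.

```shell
#!/bin/bash
# has_vip VIP: read "ip -o addr show"-style lines on stdin and succeed only
# if one of them carries exactly VIP as an IPv4 address (prefix length ignored).
has_vip() {
    local vip="$1"
    awk -v vip="$vip" '
        $3 == "inet" {
            split($4, parts, "/")          # parts[1] is the bare address
            if (parts[1] == vip) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

In a check script this would be used as: ip -o addr show | has_vip 1.2.3.4 || exit 1.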

keepalived_state.sh:

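#!/bin/bash
# keepalived passes the transition details as positional arguments
TYPE=$1    # "INSTANCE" or "GROUP"
NAME=$2    # instance or group name, e.g. VI_1
STATE=$3   # MASTER, BACKUP, or FAULT

case "$STATE" in
    "FAULT")
        systemctl try-restart keepalived
        ;;
    "MASTER")
        # keepalived adds the VIP itself; this is a safety net, so an
        # "already exists" error is ignored
        /sbin/ip addr add 1.2.3.4/32 dev ens160 2>/dev/null || true
        ;;
    "BACKUP")
        # Likewise, ignore the error if the VIP is already gone
        /sbin/ip addr del 1.2.3.4/32 dev ens160 2>/dev/null || true
        ;;
esac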
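
Restarting keepalived from the FAULT branch can loop if the underlying check keeps failing after the restart. One way to bound that, offered as a sketch and not as part of the original setup, is a cooldown guard around the restart. The function name restart_allowed, the stamp file, and the 60-second cooldown in the usage line are all assumptions.

```shell
#!/bin/bash
# restart_allowed STAMP_FILE COOLDOWN_SECONDS [NOW]
# Succeeds, and records NOW in STAMP_FILE, only if at least COOLDOWN_SECONDS
# have passed since the timestamp stored there. NOW defaults to the current
# epoch time; passing it explicitly makes the function easy to test.
restart_allowed() {
    local stamp="$1" cooldown="$2" now="${3:-$(date +%s)}" last=0
    [ -r "$stamp" ] && last="$(cat "$stamp")"
    # An empty or missing stamp evaluates to 0 inside the arithmetic.
    if [ $(( now - last )) -ge "$cooldown" ]; then
        echo "$now" > "$stamp"
        return 0
    fi
    return 1
}
```

In the FAULT branch this would read: restart_allowed /run/keepalived-restart.stamp 60 && systemctl try-restart keepalived.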

The key elements of this configuration are:

Weighted checks: Avoid immediate failover for transient issues
State notifications: Clean VIP management during transitions
Dual verification: Checks both VIP presence and service health
Self-healing: Automatic restart in fault state

For critical deployments, add these to your keepalived.conf:

global_defs {
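    # Enforce protocol compliance. Strict mode does not support authentication,
    # so it cannot be combined with the auth_type PASS block shown earlier.
    vrrp_strict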
    vrrp_garp_master_refresh 60  # Refresh ARP regularly
    vrrp_garp_master_repeat 2
}

Use these commands to watch keepalived logs and address changes while troubleshooting:

journalctl -u keepalived -f
ip monitor address
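
When reading through a longer log, it can be easier to reduce it to the sequence of VRRP states. The sketch below assumes keepalived logs transitions with the phrase "Entering MASTER STATE" (and likewise for BACKUP and FAULT); the exact wording can differ between versions, so treat the pattern as an assumption to adjust. The function name vrrp_transitions is made up for this example.

```shell
#!/bin/bash
# vrrp_transitions: read keepalived log lines on stdin and print the state
# from each "Entering <STATE> STATE" message, one per line, in order.
# Lines without that phrase are ignored.
vrrp_transitions() {
    grep -oE 'Entering (MASTER|BACKUP|FAULT) STATE' | awk '{ print $2 }'
}
```

For example: journalctl -u keepalived --no-pager | vrrp_transitions. A healthy failover on the node taking over would be expected to show BACKUP followed by MASTER.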
