When using keepalived for VIP failover between Ubuntu 18.04 VMs, we encounter a critical edge case during systemd-networkd reloads. The virtual IP gets dropped during network service restarts, but keepalived fails to detect this condition because:
- VRRP advertisements continue between nodes
- Basic connectivity checks pass (ping works)
- No native VIP presence verification exists
Initial attempts using ip addr | grep checks create unstable states:
vrrp_script chk_proxyip {
    script "/sbin/ip addr | /bin/grep 1.2.3.4"
    interval 2
    fall 2
    rise 2
}
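The fundamental issue is that a node in FAULT state never holds the VIP, so a check that only passes when the VIP is present locally can never succeed again. The node stays in FAULT until someone intervenes manually. This creates an availability paradox where: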
- The backup node takes over correctly
- The original master remains in FAULT indefinitely
- Failback never occurs automatically
We need a layered setup: a tracked health check, a non-preempting VRRP instance, and proper service ordering. Start with the tracked script:
vrrp_script chk_vip_health {
    script "/usr/local/bin/vip_healthcheck.sh"
    interval 3
    timeout 2
    rise 2
    fall 2
    weight -50   # non-zero weight: a failed check lowers priority by 50 instead of forcing FAULT
}
The healthcheck script should verify both local presence and remote accessibility:
#!/bin/bash
# /usr/local/bin/vip_healthcheck.sh
VIP="1.2.3.4"

# -F matches the address literally (dots are not wildcards); -w rejects partial matches
LOCAL_CHECK=$(ip -o addr show | grep -wF "$VIP")

# If VIP exists locally, verify it responds
if [ -n "$LOCAL_CHECK" ]; then
    if ! ping -c1 -w1 "$VIP" >/dev/null; then
        exit 1
    fi
    exit 0
else
    # If VIP doesn't exist locally, verify it exists somewhere
    if ping -c1 -w1 "$VIP" >/dev/null; then
        exit 0
    fi
    exit 1
fi
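The local-presence test in the script above matches the VIP as text. As a hedged alternative, the sketch below compares the address field of ip -o -4 addr show output exactly, so an address such as 11.2.3.40 can never be mistaken for the VIP. The function name has_vip is hypothetical, and the field position assumes the usual one-line iproute2 output format.

```shell
#!/bin/sh
# has_vip VIP
# Reads `ip -o -4 addr show` style output on stdin and succeeds only if
# one of the listed addresses equals VIP exactly (prefix length ignored).
has_vip() {
    awk -v vip="$1" '
        # Field 4 is ADDRESS/PREFIX, for example 1.2.3.4/32
        { split($4, parts, "/"); if (parts[1] == vip) found = 1 }
        END { exit !found }
    '
}
```

In the health check this could be called as: ip -o -4 addr show | has_vip "$VIP"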
Modify your keepalived.conf with these critical parameters:
vrrp_instance VI_1 {
    state BACKUP
    interface ens160
    virtual_router_id 101
    priority 100    # Set identical priority on both nodes
    advert_int 1
    nopreempt       # Critical for stable operation
    authentication {
        auth_type PASS
        auth_pass secret
    }
    track_script {
        chk_vip_health
    }
    virtual_ipaddress {
        1.2.3.4
    }
    notify_master "/usr/local/bin/vip_takeover.sh"
    notify_backup "/usr/local/bin/vip_release.sh"
}
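The configuration above references vip_takeover.sh and vip_release.sh without showing them. A minimal sketch of what they might contain follows, assuming the iputils arping and the logger tools are installed. The function names, the DRY_RUN switch, and the choice to announce the VIP with an extra gratuitous ARP are illustrative assumptions, not part of the original setup; keepalived sends its own gratuitous ARPs on promotion, so the extra announcement is optional.

```shell
#!/bin/sh
# Assumed defaults, taken from the example configuration
VIP="${VIP:-1.2.3.4}"
IFACE="${IFACE:-ens160}"

# run CMD...: execute the command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Possible body for vip_takeover.sh: announce the VIP and log the event
vip_takeover() {
    run arping -U -c 3 -I "$IFACE" "$VIP"
    run logger -t keepalived-notify "MASTER: announced $VIP on $IFACE"
}

# Possible body for vip_release.sh: only log, keepalived removes the VIP itself
vip_release() {
    run logger -t keepalived-notify "BACKUP: released $VIP on $IFACE"
}
```

Setting DRY_RUN=1 prints the commands instead of running them, which makes it possible to review the script on a node without touching the network.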
Create a systemd drop-in unit to properly sequence service restarts:
# /etc/systemd/system/keepalived.service.d/10-network-dependency.conf
[Unit]
After=network-online.target
Wants=network-online.target
# Give the network time to settle before keepalived starts or restarts
[Service]
RestartSec=10
ExecStartPre=/bin/sleep 5
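The fixed five-second sleep in the drop-in above is a guess at how long the network needs. One hedged alternative is to poll for a concrete condition with an upper bound. The helper below is only a sketch; the name wait_until, the one-second polling step, and the example condition in the note that follows are assumptions, not part of the original configuration.

```shell
#!/bin/sh
# wait_until LIMIT_SECONDS CMD...
# Re-runs CMD about once per second until it succeeds or LIMIT_SECONDS
# of waiting have passed. Returns 0 on success, 1 on timeout.
wait_until() {
    limit="$1"
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

A wrapper script used as ExecStartPre could, for example, call wait_until 15 sh -c 'ip -o -4 addr show dev ens160 | grep -q inet' to wait until the interface has an IPv4 address. After adding or changing a drop-in, run systemctl daemon-reload so systemd picks it up.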
The ideal behavior should be:
1. NodeA (MASTER) loses VIP during networkd reload
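2. Healthcheck fails → NodeA enters FAULT state (this requires the tracked script to have no weight; with a weight set, only NodeA's priority drops)
3. NodeB detects missing VRRP advertisements → becomes MASTER
4. When NodeA recovers:
   - VIP healthcheck passes
   - NodeA becomes BACKUP
   - NodeB remains MASTER until failure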
5. Controlled failback only during maintenance windows
This approach provides stable VIP availability while preventing the "ping-pong" effect between nodes. The key improvements are:
- Dual verification (local presence + remote response)
- nopreempt to prevent unnecessary failbacks
- Proper systemd service ordering
- State transition scripts for clean handoffs
During systemd-networkd reloads (common during updates on Ubuntu 18.04), the virtual IP gets dropped but Keepalived fails to trigger failover because:
- VRRP advertisements continue between nodes
- Basic connectivity checks pass (nodes can ping each other)
- No native VIP presence monitoring exists
The attempted ping-based solution shows fundamental VRRP limitations:
vrrp_script chk_proxyip {
    script "/bin/ping -c 1 -w 1 1.2.3.4"
    interval 2
}
This creates a race condition where:
- Master detects VIP loss and faults
- Backup promotes and acquires VIP
- Original master (higher priority) resumes advertisements
- Endless failover loop occurs
This enhanced configuration solves the problem through state-aware checking:
vrrp_script chk_vip {
    script "/usr/local/bin/check_vip_state.sh"
    interval 2
    weight -50    # Penalize but don't force failover
    fall 2
    rise 1
}
vrrp_instance VI_1 {
    ...
    track_script {
        chk_vip
    }
    notify "/usr/local/bin/keepalived_state.sh"
}
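To see what weight -50 means in practice: with a negative weight, keepalived lowers the instance's priority by that amount while the tracked script is failing and restores it once the script recovers; with a positive weight it adds the amount while the script is passing. The effective priority stays within 1 to 254. The sketch below only reproduces that arithmetic so priorities can be sanity-checked before deployment. The function name and the example numbers are illustrative assumptions.

```shell
#!/bin/sh
# effective_priority BASE WEIGHT CHECK
# CHECK is "ok" or "fail". Prints the priority the instance would advertise.
effective_priority() {
    base="$1"
    weight="$2"
    check="$3"
    prio="$base"
    if [ "$weight" -lt 0 ] && [ "$check" = "fail" ]; then
        prio=$((base + weight))    # negative weight: penalty while failing
    elif [ "$weight" -gt 0 ] && [ "$check" = "ok" ]; then
        prio=$((base + weight))    # positive weight: bonus while passing
    fi
    # Keep the result inside the range usable by a non-owner router
    if [ "$prio" -lt 1 ]; then prio=1; fi
    if [ "$prio" -gt 254 ]; then prio=254; fi
    echo "$prio"
}
```

For example, effective_priority 100 -50 fail prints 50. For that penalty to cause a failover, the peer needs an effective priority above 50 and must be allowed to preempt.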
check_vip_state.sh:
#!/bin/bash
# Check both local VIP presence AND remote accessibility
# -w and -F match the address as a whole, literal string
if ! ip addr show | grep -qwF "1.2.3.4"; then
    # Local check failed - VIP missing
    exit 1
fi
# Remote service check (adjust for your service)
# -f makes curl return non-zero on HTTP errors (4xx/5xx)
if ! curl -fs -m 1 http://1.2.3.4:80/health &>/dev/null; then
    exit 1
fi
exit 0
keepalived_state.sh:
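#!/bin/bash
# keepalived passes: $1 = GROUP or INSTANCE, $2 = name, $3 = new state
TYPE=$1
NAME=$2
STATE=$3
case "$STATE" in
    "FAULT")
        systemctl try-restart keepalived
        ;;
    "MASTER")
        # "replace" does not fail if keepalived has already added the address
        /sbin/ip addr replace 1.2.3.4/32 dev ens160
        ;;
    "BACKUP")
        # Ignore the error if the address is already gone
        /sbin/ip addr del 1.2.3.4/32 dev ens160 2>/dev/null || true
        ;;
esac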
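Restarting keepalived from its own FAULT notification can loop if the fault persists, since each restart may re-enter FAULT and trigger another restart. A hedged way to bound this is a cooldown between restarts. The sketch below is an illustration only: the function name restart_allowed, the stamp file, and the cooldown value in the usage note are all hypothetical.

```shell
#!/bin/sh
# restart_allowed STAMP_FILE COOLDOWN_SECONDS [NOW]
# Succeeds (and records NOW in STAMP_FILE) if no restart was recorded
# within the last COOLDOWN_SECONDS; fails otherwise.
# NOW defaults to the current time in seconds since the epoch.
restart_allowed() {
    stamp="$1"
    cooldown="$2"
    now="${3:-$(date +%s)}"
    last=0
    if [ -r "$stamp" ]; then
        last=$(cat "$stamp")
    fi
    # Treat an empty or non-numeric stamp as "never restarted"
    case "$last" in
        ''|*[!0-9]*) last=0 ;;
    esac
    if [ $((now - last)) -lt "$cooldown" ]; then
        return 1
    fi
    echo "$now" > "$stamp"
    return 0
}
```

In the FAULT branch this could be used as: restart_allowed /run/keepalived-restart.stamp 300 && systemctl try-restart keepalived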
Key points of this configuration:
- Weighted checks: Avoid immediate failover for transient issues
- State notifications: Clean VIP management during transitions
- Dual verification: Checks both VIP presence and service health
- Self-healing: Automatic restart in fault state
For critical deployments, add these to your keepalived.conf:
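global_defs {
    vrrp_strict                    # Enforce strict VRRP compliance (not compatible with an authentication block)
    vrrp_garp_master_refresh 60    # Re-send gratuitous ARP every 60 s while MASTER
    vrrp_garp_master_repeat 2      # Send 2 gratuitous ARPs after each transition to MASTER
}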
Use these commands to watch state and address changes while troubleshooting:
journalctl -u keepalived -f
ip monitor address