When working with Keepalived's VRRP script monitoring, we often encounter situations where the backup server fails to take over despite detecting a primary service failure. The symptom you're seeing - where the script fails (returns exit code 1) but no failover occurs - typically indicates one of several configuration issues.
Let's examine the key components in your setup that might be causing this behavior:
vrrp_script chk_script {
    script "/usr/local/bin/failover.sh"
    interval 2
    weight 2
}
vrrp_instance HAInstance {
    track_script {
        chk_script weight 20
    }
}
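The weight parameter appears twice with different values: 2 in the script definition and 20 in the track_script section. These are not two independent settings that add up:
- The weight in vrrp_script is the default adjustment for every instance that tracks the script.
- A weight given in track_script overrides that default for this instance only.
The effective weight here is therefore 20. Because it is positive, Keepalived adds 20 to the priority while the script succeeds and drops that bonus when it fails. Whether that causes a failover depends on the base priorities of both nodes.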
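To make the weight arithmetic concrete, here is a minimal shell sketch of how a single tracked script changes a node's priority. The function name and the "ok"/"fail" arguments are illustrative assumptions, not part of Keepalived. The model ignores the weight 0 case (where a failing script puts the instance into the FAULT state) and the clamping of priorities to the valid range.

```shell
#!/bin/bash
# Simplified model (assumption: one tracked script, no clamping).
# usage: effective_priority BASE WEIGHT STATUS   (STATUS is "ok" or "fail")
effective_priority() {
    local base=$1 weight=$2 status=$3
    local prio=$base
    if [ "$weight" -gt 0 ] && [ "$status" = "ok" ]; then
        prio=$((base + weight))   # positive weight: bonus while the script succeeds
    elif [ "$weight" -lt 0 ] && [ "$status" = "fail" ]; then
        prio=$((base + weight))   # negative weight: penalty while the script fails
    fi
    echo "$prio"
}

effective_priority 100 20 ok     # prints 120
effective_priority 100 20 fail   # prints 100
effective_priority 100 -20 fail  # prints 80
```

With a negative weight the healthy priority equals the configured base priority, which tends to make the numbers in packet captures easier to reason about.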
Here's the corrected configuration:
vrrp_script chk_script {
    script "/usr/local/bin/failover.sh"
    interval 2
    weight -20 # Negative value: lower the priority by 20 while the script fails
}
vrrp_instance HAInstance {
    track_script {
        chk_script
    }
}
Your monitoring script could be enhanced for better reliability:
#!/bin/bash
SERVICE='process'
if pgrep -x "$SERVICE" >/dev/null; then
    exit 0
else
    logger -t keepalived "Service $SERVICE failed - triggering failover"
    exit 1
fi
Key improvements:
- Uses pgrep, which is more reliable than grepping ps output
- Adds logging to help with troubleshooting
- Simplified logic flow
To verify your setup is working properly, follow this test sequence:
- On master: systemctl status keepalived (verify MASTER state)
- On backup: systemctl status keepalived (verify BACKUP state)
- Stop the monitored service on master
- Check logs on both nodes: journalctl -u keepalived -f
- Verify VIP migration with ip addr show eth0
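The last step can be scripted. The sketch below is a small helper that reads the one-line-per-address output of ip -o addr show on standard input and reports whether a given address is assigned. The function name has_vip and the sample lines are illustrative assumptions; the addresses are the ones used in this answer.

```shell
#!/bin/bash
# has_vip ADDRESS: succeeds if ADDRESS appears as an IPv4 address
# in `ip -o addr show` style output read from stdin.
has_vip() {
    grep -qF "inet $1/"
}

# Illustrative sample of `ip -o -4 addr show dev eth0` output (assumed shape).
sample='2: eth0    inet 10.10.10.8/16 brd 10.10.255.255 scope global eth0
2: eth0    inet 10.10.10.10/16 scope global secondary eth0'

if printf '%s\n' "$sample" | has_vip 10.10.10.10; then
    echo "VIP present"
else
    echo "VIP absent"
fi
```

On a real node the helper would be fed live data, for example: ip -o -4 addr show dev eth0 | has_vip 10.10.10.10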
Enable detailed logging. Verbose output is normally switched on with the -D (--log-detail) command-line option rather than with a keyword in keepalived.conf; a typical global_defs block for this setup looks like this:
global_defs {
    notification_email {
        admin@example.com
    }
    notification_email_from keepalived@example.com
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    enable_script_security
    script_user root
    router_id MY_KEEPALIVED
}
Running the daemon with -D will provide detailed information about VRRP transitions and script execution in your system logs.
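When the log is busy, it can help to show only the state transitions. The following is a minimal sketch of such a filter. The function name is hypothetical, the sample lines are illustrative rather than captured from a real system, and the exact message prefix varies between Keepalived versions, so the pattern matches only the "Entering ... STATE" part.

```shell
#!/bin/bash
# Keep only VRRP state transition messages from log text on stdin.
vrrp_transitions() {
    grep -E 'Entering (MASTER|BACKUP|FAULT) STATE'
}

# Illustrative log lines (assumed format, not real output).
sample='Keepalived[1200]: Starting VRRP child process, pid=1234
Keepalived_vrrp[1234]: (HAInstance) Entering BACKUP STATE
Keepalived_vrrp[1234]: (HAInstance) Entering MASTER STATE'

printf '%s\n' "$sample" | vrrp_transitions
```

Against the live journal this would be used as: journalctl -u keepalived --no-pager | vrrp_transitions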
Other common causes to rule out:
- Mismatched virtual_router_id between nodes
- Firewall blocking VRRP multicast (or unicast) packets
- Incorrect interface specification
- nopreempt present or missing contrary to the behavior you intend
- Script execution permission issues
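The script permission item is quick to check by hand. As a minimal sketch, the hypothetical helper below verifies only that the check script exists and is executable; it does not reproduce the stricter ownership checks that enable_script_security applies.

```shell
#!/bin/bash
# check_script_file PATH: report whether PATH exists and is executable.
check_script_file() {
    local path=$1
    if [ ! -f "$path" ]; then
        echo "missing: $path"
        return 1
    fi
    if [ ! -x "$path" ]; then
        echo "not executable: $path"
        return 1
    fi
    echo "ok: $path"
}

# Example (path taken from the configuration in this answer):
#   check_script_file /usr/local/bin/failover.sh
```

It is also worth running the script manually as the configured script_user and checking its exit code with echo $?.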
When working with Keepalived's VRRP scripts, a common frustration occurs when the script detects failure (exits with status 1) but the expected failover doesn't trigger. In your case, the chk_script properly detects the stopped process but the backup server doesn't assume the MASTER state.
Let's examine some critical configuration elements that might be causing this behavior:
vrrp_instance HAInstance {
    state BACKUP
    nopreempt
    track_script {
        chk_script weight 20
    }
}
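The nopreempt parameter is particularly important here: it stops this node from preempting a master that is still sending advertisements, even when this node's priority is higher. A failed check script only lowers the master's priority; the master keeps advertising, so a backup with nopreempt set will not take over. It takes over only if the advertisements stop entirely.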
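As a rough mental model of how nopreempt interacts with a priority drop, the sketch below encodes the decision a backup node makes. It is a deliberate simplification with hypothetical names, not Keepalived's actual implementation, and it ignores timers and tie-breaking on equal priority.

```shell
#!/bin/bash
# usage: backup_takes_over MASTER_ALIVE MASTER_PRIO BACKUP_PRIO NOPREEMPT
#   MASTER_ALIVE and NOPREEMPT are "yes" or "no". Prints "yes" or "no".
backup_takes_over() {
    local master_alive=$1 master_prio=$2 backup_prio=$3 nopreempt=$4
    if [ "$master_alive" = "no" ]; then
        echo "yes"      # advertisements stopped: take over regardless of nopreempt
        return
    fi
    if [ "$backup_prio" -gt "$master_prio" ] && [ "$nopreempt" = "no" ]; then
        echo "yes"      # master still advertising, but with a lower priority
        return
    fi
    echo "no"
}

# Master's script failed (priority 100 -> 80), backup sits at 90:
backup_takes_over yes 80 90 yes   # prints no  (nopreempt blocks the takeover)
backup_takes_over yes 80 90 no    # prints yes
backup_takes_over no  80 90 yes   # prints yes (master is gone entirely)
```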
Your current weight configuration has potential issues:
vrrp_script chk_script {
    weight 2
}
track_script {
    chk_script weight 20
}
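With these values the track_script weight of 20 takes effect, and because it is positive the node gains 20 priority points while the script succeeds and loses that bonus when it fails (exit 1). Failover only follows if the resulting priority falls below the peer's effective priority, so the weight must be larger than the gap between the two nodes' base priorities.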
Here's a modified configuration that should work more reliably:
vrrp_script chk_script {
    script "/usr/local/bin/failover.sh"
    interval 2
    weight -20 # Negative weight when script fails
    fall 2 # Required consecutive failures
    rise 2 # Required consecutive successes
}
vrrp_instance HAInstance {
    state BACKUP
    interface eth0
    virtual_router_id 8
    priority 100 # Base priority
    advert_int 1
    # Removed nopreempt for testing
    # Note: if your build rejects the two keywords below, current upstream
    # releases use unicast_src_ip and unicast_peer { ... } instead
    vrrp_unicast_bind 10.10.10.8
    vrrp_unicast_peer 10.10.10.9
    virtual_ipaddress {
        10.10.10.10/16 dev eth0
    }
    track_script {
        chk_script
    }
}
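The interval and fall settings together determine how quickly a failure is acted on. As a rough estimate (an approximation that ignores the script's own run time), the check must fail fall times in a row, spaced interval seconds apart, before the priority changes. The function name below is hypothetical.

```shell
#!/bin/bash
# usage: detection_delay INTERVAL FALL
# Approximate upper bound, in seconds, before a failing check changes the priority.
detection_delay() {
    local interval=$1 fall=$2
    echo $((interval * fall))
}

detection_delay 2 2   # prints 4 (the values from the configuration above)
```

The VRRP takeover itself adds further time on top of this, on the order of a few advertisement intervals.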
Your current script could be improved for better reliability:
#!/bin/bash
SERVICE='process'
if pgrep -x "$SERVICE" >/dev/null
then
    exit 0
else
    logger -t keepalived "Service $SERVICE failed"
    exit 1
fi
To properly debug this setup:
# On master server:
keepalived -n -l -D -S 0 # -n: stay in foreground, -l: log to console, -D: detailed log
# On backup server:
tcpdump -i eth0 -n 'proto 112' # Monitor VRRP advertisements
Check the priority values in the VRRP packets when your service fails - this will confirm if the weight reduction is working as expected.
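To pull just the priority out of the capture, a small filter can help. The sketch below assumes the usual one-line tcpdump rendering of a VRRP advertisement, which contains "vrid N, prio N"; the sample line is illustrative and the function name is hypothetical.

```shell
#!/bin/bash
# Print "vrid=N prio=N" for each VRRP advertisement line read from stdin.
extract_prio() {
    sed -n 's/.*vrid \([0-9]*\), prio \([0-9]*\).*/vrid=\1 prio=\2/p'
}

# Illustrative tcpdump line (assumed format).
sample='IP 10.10.10.8 > 10.10.10.9: VRRPv2, Advertisement, vrid 8, prio 80, authtype none, intvl 1s, length 20'
printf '%s\n' "$sample" | extract_prio   # prints: vrid=8 prio=80
```

For live traffic, add -l to tcpdump so its output is line-buffered when piped: tcpdump -l -i eth0 -n 'proto 112' | extract_prio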
If failover still does not happen, check for:
- Firewall rules blocking VRRP multicast (or unicast) traffic
- Insufficient priority difference after weight reduction
- Script execution permissions or path issues
- Network interface stability problems