Troubleshooting Keepalived VRRP Script Failover Issues: Why Your Backup Server Won’t Take Over

With Keepalived's VRRP script monitoring, a backup server sometimes fails to take over even though the check script has detected a service failure on the primary. That symptom, where the script fails (returns exit code 1) but no failover occurs, usually comes down to how the script weight, the instance priorities, and the preemption settings interact.

Let's examine the key components in your setup that might be causing this behavior:

vrrp_script chk_script {
    script "/usr/local/bin/failover.sh"
    interval 2
    weight 2
}

vrrp_instance HAInstance {
    track_script {
        chk_script weight 20
    }
}

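The weight parameter appears twice with different values (2 in the script definition and 20 in the track section). A weight given in track_script overrides the default from vrrp_script, so the effective weight here is 20. Because that value is positive, it raises the priority while the script succeeds instead of lowering it when the script fails, which may not leave the master below the backup after a failure.

Weight can be set in two places in Keepalived:

Script-level weight (the default for every instance that tracks the script)
Track-script weight (overrides that default for the one instance)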
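
To make the sign of the weight concrete, here is a minimal sketch of the priority arithmetic. The helper name effective_priority is hypothetical and not part of Keepalived, and the clamp to the range 1..254 reflects my understanding of how Keepalived bounds the result rather than something stated in this article.

```shell
#!/bin/bash
# effective_priority BASE WEIGHT SCRIPT_EXIT
# Hypothetical helper: prints the priority a node would advertise for one
# tracked script, given its base priority, the weight, and the script's
# last exit code (0 = success, non-zero = failure).
effective_priority() {
    local base=$1 weight=$2 script_exit=$3
    local prio=$base

    if [ "$weight" -gt 0 ] && [ "$script_exit" -eq 0 ]; then
        # Positive weight: added while the script succeeds
        prio=$((base + weight))
    elif [ "$weight" -lt 0 ] && [ "$script_exit" -ne 0 ]; then
        # Negative weight: subtracted while the script fails
        prio=$((base + weight))
    fi

    # Assumed bounds: keep the result within 1..254
    if [ "$prio" -lt 1 ]; then prio=1; fi
    if [ "$prio" -gt 254 ]; then prio=254; fi

    echo "$prio"
}
```

With a base priority of 100, a weight of 20 gives 120 while the check passes and 100 when it fails, whereas a weight of -20 gives 100 while it passes and 80 when it fails.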

Here's the corrected configuration:

vrrp_script chk_script {
    script "/usr/local/bin/failover.sh"
    interval 2
    weight -20  # Lowers priority by 20 while the script is failing
}

vrrp_instance HAInstance {
    track_script {
        chk_script
    }
}

Your monitoring script could be enhanced for better reliability:

#!/bin/bash
SERVICE='process'
if pgrep -x "$SERVICE" >/dev/null; then
    exit 0
else
    logger -t keepalived "Service $SERVICE failed - triggering failover"
    exit 1
fi

Key improvements:

Uses pgrep which is more reliable than grepping ps output
Adds logging to help with troubleshooting
Simplified logic flow

To verify your setup is working properly, follow this test sequence:

On master: systemctl status keepalived (verify MASTER state)
On backup: systemctl status keepalived (verify BACKUP state)
Stop the monitored service on master
Check logs on both nodes: journalctl -u keepalived -f
Verify VIP migration with ip addr show eth0
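
The last step can be scripted so that it gives a yes/no answer instead of output to read by eye. This is a rough sketch: has_vip is a hypothetical helper, and 10.10.10.10 is only an example address that you would replace with your own VIP.

```shell
#!/bin/bash
# has_vip VIP
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and
# succeeds if VIP is assigned. The leading " inet " and trailing "/" keep
# 10.10.10.1 from matching 10.10.10.10.
has_vip() {
    local vip=$1
    grep -qF " inet ${vip}/"
}

# Usage on either node (interface and address are examples):
#   ip -o -4 addr show dev eth0 | has_vip 10.10.10.10 && echo "VIP is on this node"
```

Running it on both nodes before and after stopping the service shows whether the address actually moved.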

Enable detailed logging in keepalived.conf:

global_defs {
    notification_email {
        admin@example.com
    }
    notification_email_from keepalived@example.com
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    enable_script_security
    script_user root
    router_id MY_KEEPALIVED
    vrrp_debug  # Enable verbose logging
}

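With enable_script_security set, Keepalived refuses to run a script as root if any part of its path is writable by a non-root user, and script_user sets the account the check script runs as. For verbose logging of VRRP transitions and script results, the documented mechanism is the -D (--log-detail) command-line option; vrrp_debug may not be recognized by every Keepalived version, so check the startup log for an unknown-keyword warning.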

Other common pitfalls to rule out:

Mismatched virtual_router_id between nodes
Firewall blocking VRRP multicast (or unicast) packets
Incorrect interface specification
nopreempt present or absent contrary to the failover behavior you want
Script execution permission issues
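
The first item in this list is easy to check mechanically. The sketch below pulls the virtual_router_id values out of a configuration so the two nodes can be compared; vrids_in_config is a hypothetical helper, and the file path in the usage comment is the usual default location, which may differ on your system.

```shell
#!/bin/bash
# vrids_in_config
# Hypothetical helper: prints every virtual_router_id value found in a
# Keepalived configuration read from stdin. Text after '#' or '!' is
# treated as a comment and ignored.
vrids_in_config() {
    awk '{ sub(/[#!].*/, "") } $1 == "virtual_router_id" { print $2 }'
}

# Usage (run on both nodes and compare the output):
#   vrids_in_config < /etc/keepalived/keepalived.conf
```

Both nodes should print the same value for the instance they share.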

When working with Keepalived's VRRP scripts, a common frustration occurs when the script detects failure (exits with status 1) but the expected failover doesn't trigger. In your case, the chk_script properly detects the stopped process but the backup server doesn't assume the MASTER state.

Let's examine some critical configuration elements that might be causing this behavior:

vrrp_instance HAInstance {
    state BACKUP
    nopreempt
    track_script {
        chk_script weight 20
    }
}

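The nopreempt parameter is particularly important here: it stops a higher-priority node from preempting a master that is still sending advertisements. A failing check with a weight only lowers the master's priority; the master keeps advertising, so a backup configured with nopreempt will not take over even when its own priority is now higher.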

Your current weight configuration has potential issues:

vrrp_script chk_script {
    weight 2
}

track_script {
    chk_script weight 20
}

The weight values need careful consideration. When the script fails (exit 1), the priority change must leave the master's effective priority below the backup's; otherwise the master wins the election again and the VIP stays where it is.
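
A quick way to sanity-check a weight before deploying it is to do the comparison by hand. The following sketch assumes a negative weight, a backup whose own check is passing, and preemption allowed (no nopreempt); the helper name and the example priorities of 110 and 100 are hypothetical.

```shell
#!/bin/bash
# master_loses_election MASTER_BASE BACKUP_BASE WEIGHT
# Hypothetical helper: succeeds when a failing check with a negative WEIGHT
# drops the master's priority strictly below the backup's base priority.
# A strict comparison is used because equal priorities do not guarantee
# that the backup wins.
master_loses_election() {
    local master_base=$1 backup_base=$2 weight=$3
    local master_failed=$((master_base + weight))
    [ "$master_failed" -lt "$backup_base" ]
}

# Usage with example priorities:
#   master_loses_election 110 100 -20 && echo "failover expected"
#   master_loses_election 110 100 -2  || echo "weight too small"
```

With a master at 110 and a backup at 100, a weight of -2 leaves the master at 108 and nothing happens, while -20 drops it to 90 and the backup can take over.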

Here's a modified configuration that should work more reliably:

vrrp_script chk_script {
    script "/usr/local/bin/failover.sh"
    interval 2
    weight -20    # Subtract 20 from priority while the script fails
    fall 2        # Required consecutive failures
    rise 2        # Required consecutive successes
}

vrrp_instance HAInstance {
    state BACKUP
    interface eth0
    virtual_router_id 8
    priority 100   # Base priority
    advert_int 1
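    # Removed nopreempt for testing
    # These unicast keywords may only be accepted by older patched builds;
    # upstream releases use unicast_src_ip and a unicast_peer { } block
    vrrp_unicast_bind 10.10.10.8
    vrrp_unicast_peer 10.10.10.9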
    
    virtual_ipaddress {
        10.10.10.10/16 dev eth0
    }
    
    track_script {
        chk_script
    }
}

Your current script could be improved for better reliability:

#!/bin/bash
SERVICE='process'
if pgrep -x "$SERVICE" >/dev/null
then
    exit 0
else
    logger -t keepalived "Service $SERVICE failed"
    exit 1
fi

To properly debug this setup:

# On master server:
keepalived -n -l -D -S 0   # -n stays in foreground, -l logs to console, -D adds detail

# On backup server:
tcpdump -i eth0 -n 'proto 112'   # Monitor VRRP advertisements

Check the priority values in the VRRP packets when your service fails - this will confirm if the weight reduction is working as expected.
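
To watch only the priority field instead of reading full packet lines, the capture can be piped through a small filter. This is a sketch: adv_priorities is a hypothetical helper, and it assumes tcpdump prints each advertisement with a "prio N" field, which is how VRRP packets are normally decoded.

```shell
#!/bin/bash
# adv_priorities
# Hypothetical helper: reads tcpdump output on stdin and prints the
# priority from each line that contains a "prio N" field.
adv_priorities() {
    sed -n 's/.*prio \([0-9][0-9]*\).*/\1/p'
}

# Usage on the backup (-l makes tcpdump flush each line as it arrives):
#   tcpdump -l -i eth0 -n 'proto 112' | adv_priorities
```

When the monitored service stops on the master, the printed value should drop by the configured weight after the fall count is reached.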

If the advertised priorities look right and failover still does not happen, check for:

Firewall rules blocking VRRP multicast (or unicast) traffic
Insufficient priority difference after weight reduction
Script execution permission or path issues
Network interface stability problems
