When building a high availability (HA) solution for Linux systems handling data collection or compute nodes, we need to consider several critical factors:
- Ability to detect both hard crashes and system hangs
- Minimal state transfer requirements (as storage isn't shared)
- Automatic failover with clean recovery when the primary comes back online
- Rsync-based synchronization between nodes
Here are the most robust options available for Linux, ordered by implementation complexity:
1. Keepalived + Custom Scripts (Lightweight)
Best for simple VIP failover scenarios:
# Sample keepalived.conf
vrrp_script chk_application {
    script "/usr/local/bin/check_app.sh"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    interface eth0
    state MASTER
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        192.168.1.100/24
    }
    track_script {
        chk_application
    }
}
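The tracked script only needs to exit 0 when the application is healthy; with weight 2, success raises this node's VRRP priority. A minimal sketch of check_app.sh, assuming the application answers HTTP on localhost:8080 (port and path are placeholders):

#!/bin/bash
# check_app.sh - exit 0 if healthy; keepalived adds "weight" to this
# node's VRRP priority while the check succeeds
curl -fsS --max-time 2 http://127.0.0.1:8080/health >/dev/null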
Pros: Extremely lightweight, easy to implement
Cons: Limited to IP failover, requires custom scripting for application checks
Effort: a few hours to 2-3 days depending on how much custom scripting is needed; minimal maintenance
2. Corosync + Pacemaker (Mid-range)
The de facto standard for Linux HA clustering:
# Sample corosync.conf
totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastport: 5405
        ttl: 1
    }
}
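Note that with the unicast (udpu) transport, corosync also needs an explicit nodelist; a sketch with placeholder addresses:

# nodelist required for unicast transport
nodelist {
    node {
        ring0_addr: 192.168.1.11
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.12
        nodeid: 2
    }
}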
Pros: Mature solution, handles both node and application failures
Cons: Steeper learning curve
Effort: 1 week setup, ongoing maintenance required
3. Docker Swarm/Kubernetes (Modern approach)
For containerized workloads:
# Sample docker-compose.yml with restart policies
version: '3'
services:
  app:
    image: myapp:latest
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
Pros: Built-in health checks and failover
Cons: Requires containerization
Effort: Varies based on existing infrastructure
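The restart policy above only reacts when the container exits; a healthcheck lets the orchestrator also replace a container that is running but hung. A sketch, assuming the app serves a /health endpoint on port 8080 (both are placeholders):

# Added under services.app in the compose file above
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3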
To handle cases where the server is pingable but unresponsive:
- Implement application-level health checks (not just ICMP)
- Use STONITH (Shoot The Other Node In The Head) when available
- Consider watchdog timers at both hardware and software levels
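For the software route, the softdog kernel module plus the watchdog daemon will hard-reset a node whose userspace stops responding, even while it still answers pings. A minimal sketch; the test binary path is a placeholder:

# Load a software watchdog if no hardware watchdog is present
modprobe softdog

# /etc/watchdog.conf (excerpt)
watchdog-device = /dev/watchdog
interval        = 10
test-binary     = /usr/local/bin/check_app.sh

# Start the daemon that pets the watchdog
systemctl enable --now watchdog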
One way to wire rsync into Pacemaker is to run the sync job itself as a cluster resource so it always follows the active node. A sketch using the ocf:heartbeat:anything agent and a hypothetical rsync_loop.sh wrapper script:
# crm configure syntax: ocf:heartbeat:anything runs an arbitrary
# long-lived script and monitors it via its pid
primitive rsync_sync ocf:heartbeat:anything \
    params binfile="/usr/local/bin/rsync_loop.sh" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    op monitor interval="20" timeout="40"
Combine this with a custom monitor script that verifies application health beyond just process existence.
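The monitor can also check that collected data is actually fresh, which catches hangs that leave the process alive; a sketch, assuming the process name myapp and the /app/data path used in the cron example further below:

#!/bin/bash
# app_monitor.sh - healthy only if the process runs AND data is fresh
pgrep -x myapp >/dev/null || exit 1
# Unhealthy if nothing under /app/data changed in the last 10 minutes
find /app/data -type f -mmin -10 | grep -q . || exit 1
exit 0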
| Solution | Cost | Setup Time | Maintenance |
|---|---|---|---|
| Keepalived | Free | Low | Low |
| Corosync/Pacemaker | Free | Medium | Medium |
| Commercial (RH Cluster Suite) | $$ | Medium | Low-Medium |
| Kubernetes | Free/$$ | High | High |
For your specific case of non-database workloads with rsync-based sync, I'd recommend starting with Pacemaker/Corosync as it provides the right balance of features without being overly complex.
Going deeper on the two main open-source options, here are implementation details, plus notes on state synchronization, split-brain prevention, commercial alternatives, and post-setup testing.
1. Pacemaker + Corosync (Open Source)
The most robust open-source solution, comparable to Solaris VCS:
# Install on CentOS/RHEL 7 (pcs 0.9 syntax shown; on RHEL 8+ use
# "pcs host auth" and "pcs cluster setup mycluster node1 node2"):
sudo yum install pacemaker corosync pcs
sudo systemctl start pcsd
sudo systemctl enable pcsd
sudo pcs cluster auth node1 node2
sudo pcs cluster setup --name mycluster node1 node2
sudo pcs cluster start --all
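With the cluster running, resources are defined through pcs; a minimal sketch creating the floating IP used in the keepalived examples (same placeholder address):

# Virtual IP managed by the cluster
sudo pcs resource create cluster_vip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24 \
    op monitor interval=10s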
Pros:
- Mature and feature-complete
- Supports complex failover scenarios
- Active/passive or active/active configurations
Cons:
- Steeper learning curve (expect several days for initial setup and testing)
- Requires careful configuration
2. Keepalived (Lightweight Option)
Excellent for simpler scenarios with VIP failover:
# Sample keepalived.conf
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
    notify_master "/path/to/start_scripts.sh"
    notify_backup "/path/to/stop_scripts.sh"
}
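The notify hooks are where the application itself gets started and stopped on failover. A minimal sketch of the master-side script; the service name myapp is a placeholder:

#!/bin/bash
# start_scripts.sh - invoked by keepalived when this node becomes MASTER
logger "keepalived: transitioning to MASTER, starting application"
systemctl start myapp.service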
Pros:
- Simple to configure (1-2 hours setup)
- Minimal resource overhead
Cons:
- Limited to IP failover
- Requires custom scripts for application handling
Handling Application State
Since you're using rsync, consider this cron approach:
# Every 5 minutes sync application data
*/5 * * * * rsync -az --delete /app/data/ standby-server:/app/data/
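If a sync run ever outlasts the five-minute interval, overlapping rsync processes will compete with each other; wrapping the job in flock (lock file path is arbitrary) serializes runs:

# Skip this run if the previous rsync is still in flight
*/5 * * * * flock -n /var/lock/appdata-rsync.lock rsync -az --delete /app/data/ standby-server:/app/data/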
Split-Brain Prevention
Essential fencing configuration for Pacemaker:
# Example STONITH configuration (IPMI address and credentials are
# placeholders). Create one fence device per node, each pointing at
# that node's own BMC:
sudo pcs stonith create fence_node1 fence_ipmilan \
    pcmk_host_list="node1" \
    ipaddr="10.0.0.1" \
    login="admin" \
    passwd="password" \
    action="reboot"
For enterprise environments:
- Red Hat Cluster Suite (Pacemaker with commercial support)
- SUSE Linux Enterprise High Availability
- Veritas Cluster Server for Linux (familiar if coming from Solaris)
Commercial options typically cost $1,000-$5,000 per node annually but offer:
- Professional support
- GUI management tools
- Pre-built application agents
Critical post-setup steps:
# Test failover manually
sudo pcs node standby node1
# Verify resources moved to node2
sudo pcs status
# Bring node1 back online
sudo pcs node unstandby node1
Implement monitoring for:
- Cluster status (pacemaker/corosync)
- Resource availability
- Synchronization delays (rsync)
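A simple starting point for all three is a cron-driven script that alerts on failed resources or stale data; a sketch, with the mail address and data path as placeholders:

#!/bin/bash
# cluster_health.sh - run from cron; alerts on cluster or sync problems
alert() { echo "$1" | mail -s "HA alert" ops@example.com; }

# One-shot cluster status; any reported failures get flagged
crm_mon -1 | grep -qi "failed" && alert "pacemaker reports failures"

# Flag stale data: nothing under /app/data changed in 15 minutes
find /app/data -type f -mmin -15 | grep -q . || alert "rsync data may be stale"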