Linux High Availability Solutions: Failover Strategies for Non-Database Workloads


When building a high availability (HA) solution for Linux systems handling data collection or compute nodes, we need to consider several critical factors:

  • Ability to detect both hard crashes and system hangs
  • Minimal state transfer requirements (as storage isn't shared)
  • Automatic failover with clean recovery when the primary comes back online
  • Rsync-based synchronization between nodes

Here are the most robust options available for Linux, ordered by implementation complexity:

1. Keepalived + Custom Scripts (Lightweight)

Best for simple VIP failover scenarios:


# Sample keepalived.conf
vrrp_script chk_application {
    script "/usr/local/bin/check_app.sh"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    interface eth0
    state MASTER
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        192.168.1.100/24
    }
    track_script {
        chk_application
    }
}
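
The script referenced by track_script is yours to supply. A minimal sketch of /usr/local/bin/check_app.sh, assuming the application exposes a local HTTP health endpoint on port 8080 (the port and process name are assumptions):


#!/bin/bash
# Returns 0 when the application is healthy; keepalived adjusts this node's
# effective VRRP priority based on the exit status.

# The daemon must be running (replace 'myapp' with your process name)
pgrep -x myapp >/dev/null || exit 1

# ...and must answer on its local port within 2 seconds
curl -fsS --max-time 2 http://127.0.0.1:8080/health >/dev/null || exit 1

exit 0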

Pros: Extremely lightweight, easy to implement
Cons: Limited to IP failover, requires custom scripting for application checks
Effort: 2-3 days setup, minimal maintenance

2. Corosync + Pacemaker (Mid-range)

The de facto standard for Linux HA clustering:


# Sample corosync.conf
totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastport: 5405
        ttl: 1
    }
}
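
With udpu (unicast) transport, corosync also expects the cluster members to be listed explicitly, and a two-node cluster needs the two_node vote-quorum setting so losing one node doesn't stall the survivor. A minimal companion snippet for the same file (node names are examples):


nodelist {
    node {
        ring0_addr: node1
        nodeid: 1
    }
    node {
        ring0_addr: node2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}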

Pros: Mature solution, handles both node and application failures
Cons: Steeper learning curve
Effort: 1 week setup, ongoing maintenance required

3. Docker Swarm/Kubernetes (Modern approach)

For containerized workloads:


# Sample docker-compose.yml with restart policies
# (the deploy section takes effect under Swarm mode, i.e. 'docker stack deploy')
version: '3'
services:
  app:
    image: myapp:latest
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure

Pros: Built-in health checks and failover
Cons: Requires containerization
Effort: Varies based on existing infrastructure
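
The built-in health checking noted above is driven by a healthcheck stanza on the service; a minimal sketch, assuming the image ships curl and the application serves /health on port 8080 (both assumptions):


# Excerpt: healthcheck for the same 'app' service
services:
  app:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3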

To handle cases where the server is pingable but unresponsive:

  • Implement application-level health checks (not just ICMP)
  • Use STONITH (Shoot The Other Node In The Head) when available
  • Consider watchdog timers at both hardware and software levels (see the sketch below)
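
A software watchdog can reboot a hung node even when nothing else is able to run. A minimal sketch using the kernel's softdog module and the watchdog daemon (package and service names vary slightly between distributions, so treat this as an outline):


# Load the software watchdog module now and at every boot
sudo modprobe softdog
echo softdog | sudo tee /etc/modules-load.d/softdog.conf

# /etc/watchdog.conf (minimal):
#   watchdog-device = /dev/watchdog
#   interval        = 10
#   # optionally reboot when a local health check keeps failing:
#   test-binary     = /usr/local/bin/check_app.sh

# Start the watchdog daemon
sudo systemctl enable --now watchdog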

Pacemaker doesn't move the data itself; a practical pattern is to keep the directory rsync writes into under cluster control, so it is only active on one node, while the copy is still driven by rsync from the active side:


# crm shell: filesystem resource for the rsync target
# (device and fstype are examples -- point them at the volume that actually
#  holds the data, or skip this resource if the path is plain local disk)
primitive rsync_resource ocf:heartbeat:Filesystem \
    params device="/dev/vg_data/lv_app" directory="/mnt/rsync_target" fstype="ext4" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    op monitor interval="20" timeout="40"

Combine this with a custom monitor script that verifies application health beyond just process existence.
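
For a data-collection workload, "healthy" usually means fresh data as well as a live process. A minimal sketch of such a monitor (the process name, data path, and age threshold are all assumptions):


#!/bin/bash
# Health check for a data-collection service: process up AND data still flowing.
# Suitable for a Pacemaker monitor wrapper or a keepalived track_script.

PROC_NAME="collector"        # assumed daemon name
DATA_DIR="/app/data"         # assumed data directory
MAX_AGE_MIN=10               # fail if nothing was written in the last 10 minutes

pgrep -x "$PROC_NAME" >/dev/null || exit 1

# Fail if no file in DATA_DIR has been modified recently
if [ -z "$(find "$DATA_DIR" -type f -mmin -"$MAX_AGE_MIN" -print -quit)" ]; then
    exit 1
fi

exit 0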

Solution                        Cost      Setup Time  Maintenance
Keepalived                      Free      Low         Low
Corosync/Pacemaker              Free      Medium      Medium
Commercial (RH Cluster Suite)   $$        Medium      Low-Medium
Kubernetes                      Free/$$   High        High

For your specific case of non-database workloads with rsync-based sync, I'd recommend starting with Pacemaker/Corosync as it provides the right balance of features without being overly complex.


When dealing with critical Linux servers, we often need automatic failover solutions that can:

  • Detect server crashes or hangs (even if still pingable)
  • Migrate applications to standby servers
  • Prevent split-brain scenarios when the original server recovers
  • Work without shared storage (using rsync for state synchronization)

1. Pacemaker + Corosync (Open Source)

The most robust open-source solution, comparable to Veritas Cluster Server (VCS) on Solaris:


# Install and bootstrap on CentOS/RHEL 7
# (pcs 0.10+ on RHEL/CentOS 8 and later uses 'pcs host auth' and
#  'pcs cluster setup mycluster node1 node2' instead)
sudo yum install pacemaker corosync pcs
sudo systemctl start pcsd
sudo systemctl enable pcsd
sudo pcs cluster auth node1 node2
sudo pcs cluster setup --name mycluster node1 node2
sudo pcs cluster start --all

Pros:

  • Mature and feature-complete
  • Supports complex failover scenarios
  • Active/passive or active/active configurations (a minimal active/passive sketch follows below)

Cons:

  • Steeper learning curve (2-3 days setup time)
  • Requires careful configuration
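
To make the active/passive case concrete, here is a sketch of a floating IP grouped with the application so they always fail over together ('collector' is an assumed systemd unit name; the IP matches the keepalived examples):


# Floating IP managed by the cluster
sudo pcs resource create cluster_vip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s
# The application itself, wrapped as a systemd resource
sudo pcs resource create collector systemd:collector op monitor interval=30s
# Group members run on the same node, start in order, and move together
sudo pcs resource group add app_group cluster_vip collector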

2. Keepalived (Lightweight Option)

Excellent for simpler scenarios with VIP failover:


# Sample keepalived.conf
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
    notify_master "/path/to/start_scripts.sh"
    notify_backup "/path/to/stop_scripts.sh"
}

Pros:

  • Simple to configure (1-2 hours setup)
  • Minimal resource overhead

Cons:

  • Limited to IP failover
  • Requires custom scripts for application handling (a notify sketch follows below)
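
The notify_master/notify_backup hooks above point at scripts you write yourself. One pattern is a single handler attached with keepalived's generic notify option, which receives the new state as its third argument; a minimal sketch, assuming the application runs as a systemd unit named collector (an assumption):


#!/bin/bash
# Called by keepalived on VRRP state transitions:
#   $1 = "INSTANCE" or "GROUP", $2 = instance name, $3 = new state
STATE="$3"

case "$STATE" in
    MASTER)
        systemctl start collector    # this node became active
        ;;
    BACKUP|FAULT)
        systemctl stop collector     # this node stood down
        ;;
esac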

Handling Application State

Since you're using rsync, consider this cron approach:


# Sync application data to the standby every 5 minutes
*/5 * * * * rsync -az --delete /app/data/ standby-server:/app/data/
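
A bare cron entry also runs on the standby and can overlap itself when one sync is slow; a guarded wrapper avoids both problems (sketch; the VIP and paths follow the earlier examples, and flock ships with util-linux):


#!/bin/bash
# /usr/local/bin/sync_app_data.sh -- call this from cron on both nodes.
# Only the node holding the virtual IP pushes data, and flock prevents
# overlapping runs when a sync takes longer than the cron interval.

VIP="192.168.1.100"

# Do nothing unless this node currently owns the VIP
ip -4 addr show | grep -q "inet $VIP/" || exit 0

exec flock -n /var/lock/app_data_sync.lock \
    rsync -az --delete /app/data/ standby-server:/app/data/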

Split-Brain Prevention

Essential fencing configuration for Pacemaker:


# Example STONITH configuration
sudo pcs stonith create myfence fence_ipmilan \
    pcmk_host_list="node1 node2" \
    ipaddr="10.0.0.1" \
    login="admin" \
    passwd="password" \
    action="reboot"
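
Fencing is only honored when the stonith-enabled cluster property is on (it defaults to on, but is frequently disabled during testing), so set and verify it explicitly:


# Enable fencing cluster-wide and confirm the fence device is running
sudo pcs property set stonith-enabled=true
sudo pcs status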

For enterprise environments:

  • Red Hat High Availability Add-On (formerly Red Hat Cluster Suite; Pacemaker with commercial support)
  • SUSE Linux Enterprise High Availability
  • Veritas Cluster Server for Linux (familiar if coming from Solaris)

Commercial options typically cost $1,000-$5,000 per node annually but offer:

  • Professional support
  • GUI management tools
  • Pre-built application agents

Critical post-setup steps:


# Test failover manually
sudo pcs node standby node1
# Verify resources moved to node2
sudo pcs status
# Bring node1 back online
sudo pcs node unstandby node1

Implement monitoring for the following (a simple check script is sketched after the list):

  • Cluster status (pacemaker/corosync)
  • Resource availability
  • Synchronization delays (rsync)
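
A starting point covering the first two items, with the rsync-lag check left as a hint in a comment (the strings matched against crm_mon output are rough assumptions to adapt):


#!/bin/bash
# Minimal cluster health probe -- run it from cron or a monitoring agent.
# Exits non-zero with a short reason on the first problem found.

if ! out=$(crm_mon -1 2>&1); then
    echo "CRITICAL: cannot query cluster status"
    exit 2
fi

echo "$out" | grep -qi "offline" && { echo "WARNING: node(s) offline"; exit 1; }
echo "$out" | grep -qi "stopped" && { echo "WARNING: resource(s) stopped"; exit 1; }

# Sync lag: compare the newest file timestamp in /app/data on both nodes, e.g.
#   find /app/data -type f -printf '%T@\n' | sort -n | tail -1

exit 0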