When building a high availability (HA) solution for Linux systems handling data collection or compute nodes, we need to consider several critical factors:
- Ability to detect both hard crashes and system hangs
- Minimal state transfer requirements (as storage isn't shared)
- Automatic failover with clean recovery when the primary comes back online
- Rsync-based synchronization between nodes
Here are the most robust options available for Linux, ordered by implementation complexity:
1. Keepalived + Custom Scripts (Lightweight)
Best for simple VIP failover scenarios:
# Sample keepalived.conf
vrrp_script chk_application {
    script "/usr/local/bin/check_app.sh"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    interface eth0
    state MASTER
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        192.168.1.100/24
    }
    track_script {
        chk_application
    }
}
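The tracked script only needs to exit 0 when the application is healthy; with weight 2, success raises this node's VRRP priority. A minimal sketch of check_app.sh, assuming the application answers HTTP on localhost:8080 (port and path are placeholders):

#!/bin/bash
# check_app.sh - exit 0 if healthy; keepalived adds "weight" to this
# node's VRRP priority while the check succeeds
curl -fsS --max-time 2 http://127.0.0.1:8080/health >/dev/null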
Pros: Extremely lightweight, easy to implement
Cons: Limited to IP failover, requires custom scripting for application checks
Effort: a few hours to 2-3 days depending on how much custom scripting is needed; minimal maintenance
2. Corosync + Pacemaker (Mid-range)
The de facto standard for Linux HA clustering:
# Sample corosync.conf
totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastport: 5405
        ttl: 1
    }
}
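Note that with the unicast (udpu) transport, corosync also needs an explicit nodelist; a sketch with placeholder addresses:

# nodelist required for unicast transport
nodelist {
    node {
        ring0_addr: 192.168.1.11
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.12
        nodeid: 2
    }
}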
Pros: Mature solution, handles both node and application failures
Cons: Steeper learning curve
Effort: 1 week setup, ongoing maintenance required
3. Docker Swarm/Kubernetes (Modern approach)
For containerized workloads:
# Sample docker-compose.yml with restart policies
version: '3'
services:
  app:
    image: myapp:latest
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
Pros: Built-in health checks and failover
Cons: Requires containerization
Effort: Varies based on existing infrastructure
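The restart policy above only reacts when the container exits; a healthcheck lets the orchestrator also replace a container that is running but hung. A sketch, assuming the app serves a /health endpoint on port 8080 (both are placeholders):

# Added under services.app in the compose file above
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3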
To handle cases where the server is pingable but unresponsive:
- Implement application-level health checks (not just ICMP)
- Use STONITH (Shoot The Other Node In The Head) when available
- Consider watchdog timers at both hardware and software levels
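For the software route, the softdog kernel module plus the watchdog daemon will hard-reset a node whose userspace stops responding, even while it still answers pings. A minimal sketch; the test binary path is a placeholder:

# Load a software watchdog if no hardware watchdog is present
modprobe softdog

# /etc/watchdog.conf (excerpt)
watchdog-device = /dev/watchdog
interval        = 10
test-binary     = /usr/local/bin/check_app.sh

# Start the daemon that pets the watchdog
systemctl enable --now watchdog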
One way to wire rsync into Pacemaker is to run the sync job itself as a cluster resource so it always follows the active node. A sketch using the ocf:heartbeat:anything agent and a hypothetical rsync_loop.sh wrapper script:
# crm configure syntax: ocf:heartbeat:anything runs an arbitrary
# long-lived script and monitors it via its pid
primitive rsync_sync ocf:heartbeat:anything \
    params binfile="/usr/local/bin/rsync_loop.sh" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    op monitor interval="20" timeout="40"
Combine this with a custom monitor script that verifies application health beyond just process existence.
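The monitor can also check that collected data is actually fresh, which catches hangs that leave the process alive; a sketch, assuming the process name myapp and the /app/data path used in the cron example further below:

#!/bin/bash
# app_monitor.sh - healthy only if the process runs AND data is fresh
pgrep -x myapp >/dev/null || exit 1
# Unhealthy if nothing under /app/data changed in the last 10 minutes
find /app/data -type f -mmin -10 | grep -q . || exit 1
exit 0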
| Solution | Cost | Setup Time | Maintenance |
|---|---|---|---|
| Keepalived | Free | Low | Low |
| Corosync/Pacemaker | Free | Medium | Medium |
| Commercial (RH Cluster Suite) | $$ | Medium | Low-Medium |
| Kubernetes | Free/$$ | High | High |
For your specific case of non-database workloads with rsync-based sync, I'd recommend starting with Pacemaker/Corosync as it provides the right balance of features without being overly complex.
Going deeper on the two main open-source options, here are implementation details, plus notes on state synchronization, split-brain prevention, commercial alternatives, and post-setup testing.
1. Pacemaker + Corosync (Open Source)
The most robust open-source solution, comparable to Solaris VCS:
# Install on CentOS/RHEL 7 (pcs 0.9 syntax shown; on RHEL 8+ use
# "pcs host auth" and "pcs cluster setup mycluster node1 node2"):
sudo yum install pacemaker corosync pcs
sudo systemctl start pcsd
sudo systemctl enable pcsd
sudo pcs cluster auth node1 node2
sudo pcs cluster setup --name mycluster node1 node2
sudo pcs cluster start --all
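With the cluster running, resources are defined through pcs; a minimal sketch creating the floating IP used in the keepalived examples (same placeholder address):

# Virtual IP managed by the cluster
sudo pcs resource create cluster_vip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24 \
    op monitor interval=10s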
Pros:
- Mature and feature-complete
- Supports complex failover scenarios
- Active/passive or active/active configurations
Cons:
- Steeper learning curve (expect several days for initial setup and testing)
- Requires careful configuration
2. Keepalived (Lightweight Option)
Excellent for simpler scenarios with VIP failover:
# Sample keepalived.conf
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
    notify_master "/path/to/start_scripts.sh"
    notify_backup "/path/to/stop_scripts.sh"
}
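The notify hooks are where the application itself gets started and stopped on failover. A minimal sketch of the master-side script; the service name myapp is a placeholder:

#!/bin/bash
# start_scripts.sh - invoked by keepalived when this node becomes MASTER
logger "keepalived: transitioning to MASTER, starting application"
systemctl start myapp.service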
Pros:
- Simple to configure (1-2 hours setup)
- Minimal resource overhead
Cons:
- Limited to IP failover
- Requires custom scripts for application handling
Handling Application State
Since you're using rsync, consider this cron approach:
# Every 5 minutes sync application data
*/5 * * * * rsync -az --delete /app/data/ standby-server:/app/data/
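If a sync run ever outlasts the five-minute interval, overlapping rsync processes will compete with each other; wrapping the job in flock (lock file path is arbitrary) serializes runs:

# Skip this run if the previous rsync is still in flight
*/5 * * * * flock -n /var/lock/appdata-rsync.lock rsync -az --delete /app/data/ standby-server:/app/data/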
Split-Brain Prevention
Essential fencing configuration for Pacemaker:
# Example STONITH configuration (IPMI address and credentials are
# placeholders). Create one fence device per node, each pointing at
# that node's own BMC:
sudo pcs stonith create fence_node1 fence_ipmilan \
    pcmk_host_list="node1" \
    ipaddr="10.0.0.1" \
    login="admin" \
    passwd="password" \
    action="reboot"
For enterprise environments:
- Red Hat Cluster Suite (Pacemaker with commercial support)
- SUSE Linux Enterprise High Availability
- Veritas Cluster Server for Linux (familiar if coming from Solaris)
Commercial options typically cost $1,000-$5,000 per node annually but offer:
- Professional support
- GUI management tools
- Pre-built application agents
Critical post-setup steps:
# Test failover manually
sudo pcs node standby node1
# Verify resources moved to node2
sudo pcs status
# Bring node1 back online
sudo pcs node unstandby node1
Implement monitoring for:
- Cluster status (pacemaker/corosync)
- Resource availability
- Synchronization delays (rsync)
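A simple starting point for all three is a cron-driven script that alerts on failed resources or stale data; a sketch, with the mail address and data path as placeholders:

#!/bin/bash
# cluster_health.sh - run from cron; alerts on cluster or sync problems
alert() { echo "$1" | mail -s "HA alert" ops@example.com; }

# One-shot cluster status; any reported failures get flagged
crm_mon -1 | grep -qi "failed" && alert "pacemaker reports failures"

# Flag stale data: nothing under /app/data changed in 15 minutes
find /app/data -type f -mmin -15 | grep -q . || alert "rsync data may be stale"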