Unicast High Availability Alternatives: Modern Solutions Beyond Heartbeat/Pacemaker for EC2 Clusters



When building fault-tolerant systems on AWS EC2, traditional HA stacks such as Corosync run into trouble because they default to multicast for cluster communication. A standard VPC carries only unicast traffic between instances, so we need clustering technologies that work over unicast.

HashiCorp's Consul provides a robust alternative with native unicast support. Here's a basic service definition that registers HAProxy with a health check, which is the building block for failover:


services {
  name = "haproxy"
  port = 80
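  # Script checks (args) run only if the agent enables them,
  # for example with enable_local_script_checks = true.
  check {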
    args = ["curl", "-f", "http://localhost:80/health"]
    interval = "10s"
    timeout = "1s"
  }
}

Key advantages include:

  • Gossip-based membership protocol (Serf) works in unicast environments
  • Built-in DNS interface for service discovery
  • Automatic health checking and failover
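
To make the health-checking and failover point more concrete, here is a minimal Python sketch of how a client could select healthy HAProxy instances from Consul's HTTP health endpoint (`/v1/health/service/<name>?passing`). The agent address `http://127.0.0.1:8500` and the helper names `healthy_endpoints` and `fetch_healthy` are assumptions for illustration, not part of the configuration above.

```python
import json
import urllib.request

CONSUL_URL = "http://127.0.0.1:8500"  # assumption: local agent on the default HTTP port


def healthy_endpoints(entries):
    """Turn entries from /v1/health/service/<name> into (address, port) pairs."""
    endpoints = []
    for entry in entries:
        service = entry["Service"]
        # Service.Address is empty when the service did not register its own
        # address; fall back to the node address in that case.
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


def fetch_healthy(service_name, base_url=CONSUL_URL):
    """Ask Consul for the instances whose health checks are all passing."""
    url = f"{base_url}/v1/health/service/{service_name}?passing"
    with urllib.request.urlopen(url, timeout=2) as response:
        return healthy_endpoints(json.load(response))
```

A caller would use `fetch_healthy("haproxy")` and route traffic to any returned pair. An instance whose check fails simply drops out of the list on the next query.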

For simpler VIP failover scenarios, Keepalived offers a unicast-compatible solution. Example configuration for EC2:


vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    unicast_peer {
        10.0.1.5
        10.0.1.6
    }
    virtual_ipaddress {
        10.0.1.100/24
    }
}

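Note: This requires careful tuning of the VRRP timers for cloud environments. In a VPC, gratuitous ARP alone does not move an address between instances, so the failover also has to reassign the virtual IP through the EC2 API (for example from a notify script).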
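
As a rough guide for the timer tuning mentioned in the note, the sketch below computes how long a backup waits before taking over, assuming VRRPv2 timing as defined in RFC 3768 (master-down interval = 3 × advertisement interval + skew time). The function name is hypothetical, and the default of 1 second mirrors Keepalived's default `advert_int`.

```python
def master_down_interval(priority, advert_int=1.0):
    """Seconds a backup waits without advertisements before taking over.

    Assumes VRRPv2 (RFC 3768): skew time is (256 - priority) / 256 seconds.
    """
    if not 1 <= priority <= 254:
        raise ValueError("VRRP priority must be between 1 and 254")
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


if __name__ == "__main__":
    # Priority 100 with the default 1-second advertisement interval.
    print(master_down_interval(100))
```

Raising `advert_int` makes the cluster more tolerant of brief network jitter at the cost of slower failover, which is the trade-off to tune for.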

For AWS-specific deployments, consider these integrated options:

  • ALB/ELB: For stateless services, use target group health checks
  • Route 53: DNS-based failover with health checking
  • EKS/ECS: Container orchestration with built-in service recovery

When evaluating alternatives, test these critical factors:

Solution     Quorum Support   Automation Friendly   State Handling
Consul       Yes              Excellent             Limited
Keepalived   No               Moderate              Basic
Route 53     N/A              Excellent             None

For Solr specifically, consider combining ZooKeeper (which SolrCloud already uses for coordination) with custom health checks rather than traditional HA clustering.


When implementing high availability on AWS EC2, we immediately hit the multicast limitation. Stacks like Corosync, which default to multicast, become non-starters unless they are reconfigured. Heartbeat with Pacemaker does work over unicast, but it presents several operational challenges in dynamic cloud environments, chiefly that peer addresses must be hardcoded on every node:

# Typical Heartbeat config limitation (two-node only)
autojoin none
ucast eth0 10.0.1.2 # Hardcoded peer IP

After extensive testing across multiple AWS regions, these alternatives proved most effective for unicast environments:

1. Keepalived with VRRP

While traditionally used for L4 failover, Keepalived can manage service-level HA when combined with custom scripts:

vrrp_script chk_haproxy {
    script "killall -0 haproxy"
    interval 2
    weight 2
}

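vrrp_instance VI_1 {
    interface eth0
    state MASTER
    virtual_router_id 51
    priority 101
    # Without unicast_peer, Keepalived sends adverts via multicast,
    # which a standard VPC does not deliver.
    unicast_src_ip 10.0.1.5    # assumption: this node's private IP
    unicast_peer {
        10.0.1.6               # assumption: the other node's private IP
    }
    virtual_ipaddress {
        10.0.1.100/24 dev eth0
    }
    track_script {
        chk_haproxy
    }
}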
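
The interaction between `priority` and the script `weight` is easy to get wrong, so here is a small Python sketch of it. It assumes Keepalived's documented behaviour that a positive weight is added to the priority while the script succeeds and a negative weight is subtracted while it fails. The backup node with priority 100 is an assumption, since only the MASTER side is shown above.

```python
def effective_priority(base, weight, script_ok):
    """Priority a node advertises after applying a tracked script's weight."""
    if weight > 0 and script_ok:
        base += weight          # positive weight: bonus while the check passes
    elif weight < 0 and not script_ok:
        base += weight          # negative weight: penalty while the check fails
    # weight == 0 is not modelled here: Keepalived then puts the instance
    # into FAULT state on failure instead of adjusting the priority.
    return max(1, min(254, base))


def elect_master(nodes, weight):
    """Return the node name with the highest effective priority.

    nodes maps a name to (base_priority, script_ok). Ties are not modelled.
    """
    return max(nodes, key=lambda n: effective_priority(nodes[n][0], weight, nodes[n][1]))


if __name__ == "__main__":
    # Both HAProxy processes healthy: the node with priority 101 stays master.
    print(elect_master({"node1": (101, True), "node2": (100, True)}, weight=2))
    # HAProxy dies on node1: 101 against 100 + 2, so node2 takes over.
    print(elect_master({"node1": (101, False), "node2": (100, True)}, weight=2))
```

The takeaway is that the weight has to exceed the priority gap between the nodes. With a gap of 1 and `weight 2` the failover happens, whereas a gap of 5 would leave the failed node as master.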

2. Consul with Health Checks

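HashiCorp's Consul provides service discovery with native health checking capabilities. This example registers Solr with an HTTP health check. In many Solr setups the ping handler lives under a core or collection path, so adjust the URL to match yours: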

service {
  name = "solr"
  port = 8983
  check = {
    id = "solr-http"
    name = "Solr HTTP Check"
    http = "http://localhost:8983/solr/admin/ping"
    interval = "10s"
    timeout = "1s"
  }
}
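
To show what the HTTP check above decides, the following Python sketch mirrors Consul's documented status mapping for HTTP checks: any 2xx response is passing, 429 is a warning, and anything else (including a timeout or refused connection) is critical. The function names are hypothetical, and the probe is a simplified stand-in for what the agent does, not Consul's implementation.

```python
import urllib.error
import urllib.request


def consul_http_status(code):
    """Map an HTTP status code to the state Consul would report."""
    if 200 <= code <= 299:
        return "passing"
    if code == 429:
        return "warning"
    return "critical"


def probe(url, timeout=1.0):
    """Fetch the URL once and classify the result like an HTTP check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return consul_http_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx responses; the code is still available.
        return consul_http_status(err.code)
    except OSError:
        # Covers URLError, connection refused, and socket timeouts.
        return "critical"
```

With the 1-second timeout from the definition above, a Solr node that is alive but stalled in a long garbage-collection pause would be reported as critical, which is worth keeping in mind when choosing the timeout.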

3. Patroni for PostgreSQL-like Services

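Patroni itself manages only PostgreSQL, but its architecture, a leader lock with a TTL held in a distributed configuration store such as etcd, is a pattern that carries over to other stateful services. The configuration below follows Patroni's format: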

scope: solr_cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008

etcd:
  hosts: 10.0.1.1:2379,10.0.1.2:2379,10.0.1.3:2379

bootstrap:
  dcs:
    ttl: 30
    retry_timeout: 10
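
The `ttl` setting above is the heart of the pattern: the leader must keep renewing a key, and if it stops, the key expires and another node may take over. Below is a minimal Python sketch of that idea using an in-memory stand-in for the store. The class `LeaderLease` is hypothetical, and a real deployment would rely on the store's atomic compare-and-set operations rather than a local object.

```python
import time


class LeaderLease:
    """In-memory stand-in for a leader key with a TTL (illustration only)."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock       # injectable so the behaviour can be tested
        self._holder = None
        self._expires_at = 0.0

    def try_acquire(self, node):
        """Acquire or renew the lease; return True if node is now the leader."""
        now = self._clock()
        expired = self._holder is None or now >= self._expires_at
        if expired or self._holder == node:
            self._holder = node
            self._expires_at = now + self.ttl
            return True
        return False

    def leader(self):
        """Return the current leader, or None if the lease has expired."""
        if self._holder is not None and self._clock() < self._expires_at:
            return self._holder
        return None
```

Each node would call `try_acquire` on a loop shorter than the TTL. With `ttl: 30`, a crashed leader is replaced at most about 30 seconds after its last successful renewal.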

The true test of any HA solution in cloud environments is whether it can be fully automated. Terraform combined with cloud-init provides a robust approach:

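resource "aws_instance" "ha_node" {
  count         = 3
  ami           = var.ami_id        # assumption: AMI ID supplied as a variable
  instance_type = var.instance_type # assumption: instance type supplied as a variable

  # A resource cannot reference its own private_ip list (Terraform rejects the
  # self-reference), so peers should be discovered at runtime instead,
  # for example with Consul's retry_join.
  user_data = templatefile("${path.module}/cloud-init.yaml", {
    node_index         = count.index
    service_definition = file("${path.module}/solr.json") # assumption: JSON kept next to the module
  })
}

# cloud-init.yaml (in the real file, "#cloud-config" must be the first line)
#cloud-config
packages:
  - consul
  - haproxy

write_files:
- path: /etc/consul.d/solr.json
  content: |
    ${indent(4, service_definition)}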
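
The `${service_definition}` placeholder expects the contents of a Consul service definition in JSON. As a sketch of how that file could be generated rather than written by hand, the Python below builds the JSON equivalent of the Solr service shown earlier. The function names and the defaults are assumptions for illustration.

```python
import json


def solr_service_definition(port=8983, interval="10s", timeout="1s"):
    """Build a Consul service definition for Solr as a plain dict."""
    return {
        "service": {
            "name": "solr",
            "port": port,
            "check": {
                "id": "solr-http",
                "name": "Solr HTTP Check",
                "http": f"http://localhost:{port}/solr/admin/ping",
                "interval": interval,
                "timeout": timeout,
            },
        }
    }


def render(definition):
    """Serialize the definition deterministically so diffs stay small."""
    return json.dumps(definition, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Redirect this output to the JSON file that the template reads.
    print(render(solr_service_definition()))
```

Generating the file keeps the port and check settings in one place, which helps when the same definition is rendered for several environments.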

Each solution presents different characteristics:

Solution     Service Types   Node Limit   Learning Curve
Keepalived   L4/L7           255          Low
Consul       Any             Unlimited    Medium
Patroni      Stateful        Unlimited    High

The choice ultimately depends on your specific service requirements and operational constraints. For most AWS deployments, Consul-based solutions provide the best balance of flexibility and reliability.