When building resilient systems on AWS EC2 without relying on managed services like Auto Scaling, many teams end up developing custom automation scripts. A typical workflow looks like this:
# Python/Boto example for failover monitoring
import boto.ec2
from socket import socket, AF_INET, SOCK_DGRAM

def monitor_instances():
    # Listen for UDP heartbeats; each datagram carries an instance ID
    sock = socket(AF_INET, SOCK_DGRAM)
    sock.bind(('0.0.0.0', 9999))
    while True:
        data, addr = sock.recvfrom(1024)
        instance_id = data.decode('utf-8').strip()
        # check_instance_health() is application-specific, e.g. an HTTP probe
        if not check_instance_health(instance_id):
            handle_failover(instance_id)
def handle_failover(instance_id):
    conn = boto.ec2.connect_to_region('us-west-2')
    instance = conn.get_all_instances([instance_id])[0].instances[0]
    # Snapshot the attached volumes before they are lost
    for vol in instance.block_device_mapping.values():
        conn.create_snapshot(vol.volume_id)
    # Launch a replacement with the same AMI, type, and security groups
    # (boto expects group names, not Group objects)
    new_instance = conn.run_instances(
        instance.image_id,
        instance_type=instance.instance_type,
        security_groups=[g.name for g in instance.groups]
    )
    # Clean up the failed instance
    conn.terminate_instances([instance_id])
Solutions like Pacemaker and Heartbeat were designed for physical servers with static IPs and persistent storage. EC2's ephemeral nature creates several challenges:
- IP address changes during failover (see the Elastic IP sketch below)
- Storage detachment/reattachment delays
- Metadata service dependencies
- Region/availability zone constraints
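One common mitigation for the IP problem is to hold an Elastic IP and repoint it at the replacement instance during failover. A minimal sketch using legacy boto; the region and the helper name move_elastic_ip are placeholders:

# Repoint an Elastic IP at a replacement instance (illustrative sketch)
import boto.ec2

conn = boto.ec2.connect_to_region('us-west-2')

def move_elastic_ip(public_ip, new_instance_id):
    # EC2-Classic addresses associate by public IP; VPC addresses
    # should be passed via allocation_id instead
    conn.associate_address(instance_id=new_instance_id, public_ip=public_ip)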
While comprehensive solutions remain scarce, these projects show promise for EC2 automation:
// Example using Netflix's Edda for tracking instance state.
// Edda itself exposes a REST API; 'edda-client' here stands in for a
// thin HTTP wrapper around it and is illustrative, not an official package.
const eddaClient = require('edda-client');

async function getFailedInstances() {
  const records = await eddaClient.getChangeRecords();
  return records.filter(r =>
    r.state === 'terminated' &&
    r.reason.code === 'Server.InternalError'
  );
}
A robust EC2 failover system should validate its own environment before taking any action. For example:
#!/bin/bash
# Pre-failover sanity checks

check_network_connectivity() {
    # The metadata service is only reachable over a healthy instance network
    curl -sf http://169.254.169.254/latest/meta-data/instance-id >/dev/null || return 1
}

check_storage_health() {
    # Fail if any block device has gone read-only (RO column == 1)
    lsblk -rno RO | grep -q 1 && return 1
    return 0
}
While avoiding full Auto Scaling, you can still leverage specific AWS features (a Lambda sketch follows the list):
- CloudWatch Events for state changes
- SNS notifications for alerting
- Lambda functions for lightweight automation
- Systems Manager for run commands
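As an illustration of the first two items, a CloudWatch Events rule matching EC2 instance state-change notifications can invoke a small Lambda function that publishes to SNS. A sketch using boto3; the topic ARN is a placeholder:

# Lambda handler for EC2 state-change events (sketch)
import boto3

sns = boto3.client('sns')

def handler(event, context):
    detail = event['detail']
    if detail['state'] == 'terminated':
        # Alert operators; heavier recovery logic could be triggered here
        sns.publish(
            TopicArn='arn:aws:sns:us-west-2:123456789012:failover-alerts',
            Subject='EC2 instance terminated',
            Message='Instance %s entered state %s'
                    % (detail['instance-id'], detail['state'])
        )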
When building resilient systems on AWS EC2 without managed services, engineers face a fundamental challenge: implementing robust failover mechanisms. The EC2 API provides the building blocks, but assembling them into production-grade automation requires careful design.
Managed and commercial solutions like AWS Auto Scaling or RightScale often impose constraints on:
- Supported software versions
- Configuration flexibility
- Customization capabilities
Traditional HA tools (Pacemaker, Heartbeat) struggle with EC2's ephemeral nature and lack native cloud integration.
Here's one workable architecture using Python's boto library (the legacy pre-boto3 SDK):
import boto.ec2
from time import sleep

class EC2FailoverManager:
    def __init__(self, region, access_key, secret_key):
        self.conn = boto.ec2.connect_to_region(
            region,
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key
        )
        self.failed_instances = set()

    def monitor_instances(self, instance_ids):
        while True:
            # include_all_instances=True is required; otherwise
            # non-running instances are omitted from the results
            current_status = {
                s.id: s.state_name for s in
                self.conn.get_all_instance_status(
                    instance_ids=instance_ids,
                    include_all_instances=True
                )
            }
            for instance_id, state in current_status.items():
                if state == 'terminated' and instance_id not in self.failed_instances:
                    self.handle_failure(instance_id)
            sleep(60)

    def handle_failure(self, instance_id):
        print(f"Instance {instance_id} failed. Initiating recovery...")
        # 1. Locate original instance's volumes
        volumes = self.conn.get_all_volumes(
            filters={'attachment.instance-id': instance_id}
        )
        # 2. Create snapshots
        snapshots = [
            self.conn.create_snapshot(v.id)
            for v in volumes
        ]
        # 3. Wait for snapshot completion
        for snap in snapshots:
            while snap.status != 'completed':
                sleep(5)
                snap.update()
        # 4. Launch replacement with the original's parameters
        original = self.conn.get_only_instances([instance_id])[0]
        new_instance = self.conn.run_instances(
            image_id=original.image_id,
            instance_type=original.instance_type,
            # In a VPC, security groups must be passed by ID
            security_group_ids=[g.id for g in original.groups],
            subnet_id=original.subnet_id,
            # Additional launch parameters...
        )
        self.failed_instances.add(instance_id)
        return new_instance
Keep-Alive Monitoring: Implement UDP heartbeat checks between instances and your monitoring system. This catches network partitions that EC2 status checks might miss.
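A minimal sender sketch to pair with the UDP listener from the first example; the monitor hostname is a placeholder, and the instance ID is read from the metadata service:

# UDP heartbeat sender, run on each monitored instance (sketch)
import time
import urllib.request
from socket import socket, AF_INET, SOCK_DGRAM

MONITOR = ('monitor.internal.example.com', 9999)  # placeholder address

# Read this instance's ID from the metadata service (returns bytes)
instance_id = urllib.request.urlopen(
    'http://169.254.169.254/latest/meta-data/instance-id', timeout=2
).read()

sock = socket(AF_INET, SOCK_DGRAM)
while True:
    sock.sendto(instance_id, MONITOR)  # silence here signals failure
    time.sleep(10)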
Data Consistency: For stateful services, consider:
- Using EBS multi-attach for certain workloads
- Implementing distributed consensus protocols (Raft/Paxos)
- Scheduled snapshots with application-consistent quiescing (see the sketch below)
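For the last item, one approach is to freeze the filesystem just long enough to initiate the snapshot. A sketch assuming a Linux host with fsfreeze and root privileges; the region, volume ID, and mount point are placeholders:

# Application-consistent EBS snapshot via filesystem freeze (sketch)
import subprocess
import boto.ec2

conn = boto.ec2.connect_to_region('us-west-2')

def consistent_snapshot(volume_id, mount_point):
    # Flush and freeze writes so the snapshot captures a clean state
    subprocess.check_call(['fsfreeze', '--freeze', mount_point])
    try:
        # EBS copies blocks lazily, so we can unfreeze as soon as
        # the create call returns
        return conn.create_snapshot(volume_id, description='quiesced backup')
    finally:
        subprocess.check_call(['fsfreeze', '--unfreeze', mount_point])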
While custom scripts offer maximum flexibility, these projects provide useful components:
- Netflix's Edda: Configuration tracking for audit trails
- Spotify's Luigi: Pipeline management for recovery workflows
- ZooKeeper/etcd: Distributed coordination for cluster awareness (election sketch below)
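As an example of the coordination piece, ZooKeeper's kazoo client ships a leader-election recipe that ensures only one standby runs recovery at a time. A minimal sketch; the connection string, path, and identifier are placeholders:

# Leader election with kazoo so only one node runs recovery (sketch)
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1.internal:2181')  # placeholder connection string
zk.start()

def act_as_failover_leader():
    # Runs only on the node that wins the election
    print('Elected leader; taking over failover duties')

# Blocks until this node is elected, then invokes the callback
election = zk.Election('/failover/election', identifier='node-1')
election.run(act_as_failover_leader)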
Some hard-won lessons from production deployments:
- Tag all resources with recovery groups (see the tagging sketch after this list)
- Implement exponential backoff for retries
- Maintain warm standby pools in different AZs
- Test failure modes during off-peak hours
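To make the first lesson concrete, here is a sketch of tagging resources into recovery groups and finding them again during recovery; the tag key recovery-group is an arbitrary convention, not an AWS requirement:

# Tag resources into recovery groups and look them up later (sketch)
import boto.ec2

conn = boto.ec2.connect_to_region('us-west-2')

def tag_recovery_group(resource_ids, group_name):
    # Works for instances, volumes, snapshots, and most taggable resources
    conn.create_tags(resource_ids, {'recovery-group': group_name})

def instances_in_group(group_name):
    # Everything carrying the tag can be rebuilt as a unit
    return conn.get_only_instances(
        filters={'tag:recovery-group': group_name}
    )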