How to Configure Amazon CloudWatch Alerts for EC2 Instance Downtime Monitoring



When running critical web services on AWS EC2, instance availability monitoring becomes crucial. Traditional metrics like CPU usage or network traffic don't directly indicate whether your server is actually down or just experiencing low traffic.

For comprehensive downtime detection, focus on these key metrics:

StatusCheckFailed
StatusCheckFailed_Instance
StatusCheckFailed_System

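StatusCheckFailed reports 1 when either of the other two checks fails. StatusCheckFailed_Instance covers problems inside the instance (for example, a misconfigured network or exhausted memory), while StatusCheckFailed_System covers problems with the underlying AWS host, such as hardware, power, or network failures.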
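To make the relationship between the three metrics concrete, here is a minimal JavaScript sketch. It assumes you have already fetched the latest value (0 or 1) of the two specific checks. The function name `classifyStatusChecks` and the returned labels are illustrative and not part of any AWS API.

```javascript
// Hypothetical helper: interprets the latest datapoints of the two
// specific status-check metrics (each is 0 = passed, 1 = failed).
function classifyStatusChecks({ instanceFailed, systemFailed }) {
    // StatusCheckFailed reports 1 when either of the two checks fails.
    const statusCheckFailed = instanceFailed || systemFailed ? 1 : 0;

    let cause = 'healthy';
    if (instanceFailed && systemFailed) {
        cause = 'instance-and-system';
    } else if (systemFailed) {
        cause = 'system';   // problem with the underlying AWS host
    } else if (instanceFailed) {
        cause = 'instance'; // problem inside the instance itself
    }

    return { statusCheckFailed, cause };
}

console.log(classifyStatusChecks({ instanceFailed: 0, systemFailed: 1 }));
```

Alarming on the combined metric catches both kinds of failure, while the two specific metrics tell you whether the fix is likely on your side or on the AWS side.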

Here's how to set up a basic alarm through AWS CLI:

aws cloudwatch put-metric-alarm \
--alarm-name "EC2-Instance-Down" \
--alarm-description "Alarm when instance fails status checks" \
--metric-name "StatusCheckFailed" \
--namespace "AWS/EC2" \
--statistic "Sum" \
--period 300 \
--threshold 1 \
--comparison-operator "GreaterThanOrEqualToThreshold" \
--dimensions "Name=InstanceId,Value=i-1234567890abcdef0" \
--evaluation-periods 1 \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:my-sns-topic"
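The CLI flags above map one-to-one onto the parameters of a PutMetricAlarm request. The sketch below builds that request as a plain object, which could then be passed to the SDK's `putMetricAlarm` call. The helper names `buildStatusCheckAlarm` and `evaluationWindowSeconds` are hypothetical, and the default statistic is an assumption: because each status check datapoint is 0 or 1, `Sum >= 1` and `Maximum >= 1` fire under the same condition.

```javascript
// Hypothetical builder mirroring the flags of the CLI command above.
function buildStatusCheckAlarm({
    instanceId,
    topicArn,
    period = 300,          // seconds per evaluation period
    evaluationPeriods = 1, // consecutive breaching periods required
    statistic = 'Maximum', // assumption; 'Sum' behaves the same for a 0/1 metric
}) {
    return {
        AlarmName: `EC2-Instance-Down-${instanceId}`,
        AlarmDescription: 'Alarm when instance fails status checks',
        Namespace: 'AWS/EC2',
        MetricName: 'StatusCheckFailed',
        Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
        Statistic: statistic,
        Period: period,
        EvaluationPeriods: evaluationPeriods,
        Threshold: 1,
        ComparisonOperator: 'GreaterThanOrEqualToThreshold',
        AlarmActions: [topicArn],
    };
}

// Length of the window the alarm evaluates, in seconds.
function evaluationWindowSeconds(alarm) {
    return alarm.Period * alarm.EvaluationPeriods;
}

const exampleAlarm = buildStatusCheckAlarm({
    instanceId: 'i-1234567890abcdef0',
    topicArn: 'arn:aws:sns:us-east-1:123456789012:my-sns-topic',
});
console.log(exampleAlarm.AlarmName, evaluationWindowSeconds(exampleAlarm));
```

A shorter period narrows the evaluation window, at the cost of more sensitivity to brief failures.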

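Status checks do not tell you whether the web server process is actually responding. To cover that, publish a custom metric from an HTTP health check that runs on the instance itself: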

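#!/bin/bash
# Look up this instance's ID from the instance metadata service (IMDSv2).
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/instance-id)

# Report 1 when the local health endpoint answers "OK", otherwise 0.
if curl -s --max-time 5 http://localhost/health-check | grep -q "OK"; then
    VALUE=1
else
    VALUE=0
fi

aws cloudwatch put-metric-data \
    --metric-name "WebserverHealth" \
    --namespace "Custom" \
    --value "$VALUE" \
    --dimensions "InstanceId=$INSTANCE_ID"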
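One caveat with an on-instance script: when the instance itself is down, the script cannot run and no datapoints arrive at all. An alarm on the custom metric therefore probably wants to treat missing data as a failure. The sketch below builds such a PutMetricAlarm request. It assumes the script is scheduled once per minute; the helper name, the alarm name, and the choice of three evaluation periods are illustrative.

```javascript
// Hypothetical builder for an alarm on the custom WebserverHealth metric.
function buildWebserverHealthAlarm({ instanceId, topicArn }) {
    return {
        AlarmName: `Webserver-Unhealthy-${instanceId}`,
        Namespace: 'Custom',
        MetricName: 'WebserverHealth',
        Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
        Statistic: 'Minimum',           // any 0 within the period counts as unhealthy
        Period: 60,                     // assumes the script reports every minute
        EvaluationPeriods: 3,           // assumption: tolerate short blips
        Threshold: 1,
        ComparisonOperator: 'LessThanThreshold',
        TreatMissingData: 'breaching',  // no data means the reporter is down too
        AlarmActions: [topicArn],
    };
}

console.log(buildWebserverHealthAlarm({
    instanceId: 'i-1234567890abcdef0',
    topicArn: 'arn:aws:sns:us-east-1:123456789012:my-sns-topic',
}));
```

Without `TreatMissingData: 'breaching'`, a crashed instance would leave the alarm with insufficient data rather than triggering it.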

Set up multi-channel notifications through SNS:

# create-topic returns the topic ARN; capture it instead of hardcoding it.
TOPIC_ARN=$(aws sns create-topic --name "EC2-Downtime-Alerts" \
    --query "TopicArn" --output text)

# The email recipient must confirm the subscription before messages are delivered.
aws sns subscribe \
--topic-arn "$TOPIC_ARN" \
--protocol "email" \
--notification-endpoint "admin@example.com"
aws sns subscribe \
--topic-arn "$TOPIC_ARN" \
--protocol "sms" \
--notification-endpoint "+15551234567"
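When the list of recipients grows, it can help to generate the subscription requests from data instead of repeating the command. The following sketch builds one SNS Subscribe request per endpoint and rejects SMS numbers that are not in E.164 format (a leading `+` followed by up to 15 digits). The helper name `buildSubscriptions` and the input shape are assumptions made for illustration.

```javascript
// Hypothetical helper: turns a list of endpoints into SNS Subscribe requests.
function buildSubscriptions(topicArn, endpoints) {
    const e164 = /^\+[1-9]\d{1,14}$/;
    return endpoints.map(({ protocol, endpoint }) => {
        if (protocol === 'sms' && !e164.test(endpoint)) {
            throw new Error(`SMS endpoint must be in E.164 format: ${endpoint}`);
        }
        return { TopicArn: topicArn, Protocol: protocol, Endpoint: endpoint };
    });
}

const subscriptionRequests = buildSubscriptions(
    'arn:aws:sns:us-east-1:123456789012:EC2-Downtime-Alerts',
    [
        { protocol: 'email', endpoint: 'admin@example.com' },
        { protocol: 'sms', endpoint: '+15551234567' },
    ]
);
console.log(subscriptionRequests);
```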

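An on-instance script cannot report anything when the instance itself is unreachable. For distributed systems, add an external health check that probes the site from outside, for example from a Lambda function: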

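const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const https = require('https');

const cloudwatch = new AWS.CloudWatch();

// Resolve with the HTTP status code, or reject on a network error or timeout.
function fetchStatus(options) {
    return new Promise((resolve, reject) => {
        const req = https.request(options, (res) => {
            res.resume(); // discard the body so the socket is released
            resolve(res.statusCode);
        });
        // The `timeout` option only emits an event; abort the request explicitly.
        req.on('timeout', () => req.destroy(new Error('Request timed out')));
        req.on('error', reject);
        req.end();
    });
}

exports.handler = async (event) => {
    const options = {
        hostname: 'your-website.com',
        port: 443,
        path: '/health',
        method: 'GET',
        timeout: 5000
    };

    let available = 0;
    try {
        available = (await fetchStatus(options)) === 200 ? 1 : 0;
    } catch (error) {
        // Count connection failures and timeouts as downtime instead of skipping the datapoint.
        console.error('Health check failed:', error);
    }

    await cloudwatch.putMetricData({
        Namespace: 'ExternalHealthChecks',
        MetricData: [{
            MetricName: 'WebsiteAvailability',
            Dimensions: [{
                Name: 'Domain',
                Value: 'your-website.com'
            }],
            Value: available,
            Unit: 'None'
        }]
    }).promise();
};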

When running critical web servers on Amazon EC2, you need a reliable way to detect instance failures. CloudWatch's default metrics (CPU utilization, network I/O) provide operational insight, but they do not directly indicate that a server is down, so the monitoring has to target instance availability specifically.

The most effective approach combines CloudWatch with other AWS services. Start with a status check alarm evaluated over one-minute periods:

# Example AWS CLI command to create a status check alarm
# (StatusCheckFailed is a built-in EC2 metric, not a custom one)
aws cloudwatch put-metric-alarm \
--alarm-name "EC2-Instance-Down" \
--alarm-description "Alarm when instance status check fails" \
--metric-name "StatusCheckFailed" \
--namespace "AWS/EC2" \
--statistic "Maximum" \
--period 60 \
--threshold 1 \
--comparison-operator "GreaterThanOrEqualToThreshold" \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:my-sns-topic

For web servers, add HTTP health checks using CloudWatch Synthetics:

const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');

const basicCustomEntryPoint = async function () {
    let url = "https://your-website.com/health";

    let page = await synthetics.getPage();
    const response = await page.goto(url, {waitUntil: 'domcontentloaded', timeout: 30000});
    
    if (!response || response.status() !== 200) {
        throw new Error("Failed to load page!");
    }
    
    await synthetics.takeScreenshot('loaded', 'loaded');
    let pageTitle = await page.title();
    log.info('Page title: ' + pageTitle);
};

exports.handler = async () => {
    return await basicCustomEntryPoint();
};

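Set up an SNS topic with multiple notification endpoints. The snippet below subscribes an email address and a Lambda function; an SMS subscription works the same way, with Protocol set to sms and a phone number as the Endpoint. NotificationLambda is assumed to be defined elsewhere in the same template, and SNS also needs an AWS::Lambda::Permission resource before it can invoke the function: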

# CloudFormation snippet for notification setup
Resources:
  AlarmNotificationTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: EC2-Down-Notifications
      Subscription:
        - Protocol: email
          Endpoint: admin@yourdomain.com
        - Protocol: lambda
          Endpoint: !GetAtt NotificationLambda.Arn

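For critical instances, plan for automatic recovery. A launch template such as the one below does not recover anything by itself; it is the building block that an Auto Scaling group uses to launch a replacement when an instance fails: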

aws ec2 create-launch-template \
--launch-template-name AutoRecoveryTemplate \
--launch-template-data '{
    "InstanceType": "t3.medium",
    "ImageId": "ami-0123456789abcdef0",
    "UserData": "IyEvYmluL2Jhc2gKc2VydmljZSBodHRwZCByZXN0YXJ0",
    "SecurityGroupIds": ["sg-0123456789abcdef0"]
}'
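For a single instance that is not part of an Auto Scaling group, another option is a CloudWatch alarm on StatusCheckFailed_System whose action is the built-in EC2 recover action. The sketch below builds such a PutMetricAlarm request as a plain object. The helper name, the alarm name, and the two-period threshold are illustrative assumptions, and the recover action is only available on supported instance types.

```javascript
// Hypothetical builder for an alarm that triggers the EC2 recover action.
function buildRecoveryAlarm({ instanceId, region, topicArn }) {
    return {
        AlarmName: `EC2-Auto-Recover-${instanceId}`,
        Namespace: 'AWS/EC2',
        // Recovery targets failures of the underlying host, hence the system check.
        MetricName: 'StatusCheckFailed_System',
        Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
        Statistic: 'Maximum',
        Period: 60,
        EvaluationPeriods: 2, // assumption: require two failing minutes in a row
        Threshold: 1,
        ComparisonOperator: 'GreaterThanOrEqualToThreshold',
        AlarmActions: [
            `arn:aws:automate:${region}:ec2:recover`, // built-in recover action
            topicArn,                                 // also notify the on-call topic
        ],
    };
}

console.log(buildRecoveryAlarm({
    instanceId: 'i-1234567890abcdef0',
    region: 'us-east-1',
    topicArn: 'arn:aws:sns:us-east-1:123456789012:my-sns-topic',
}).AlarmActions);
```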

Additional best practices:

  • Set up multi-zone monitoring for critical applications
  • Implement escalating alert policies (e.g., first notification after 1 minute, second after 5 minutes)
  • Regularly test your monitoring configuration by intentionally failing instances
  • Consider using AWS Systems Manager for deeper instance health insights
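One way to approximate the escalating alert policy from the list above is to create several alarms on the same metric, each requiring a longer run of failing one-minute periods and notifying a different SNS topic. The sketch below generates one PutMetricAlarm request per tier. The helper name `buildEscalationAlarms`, the tier names, and the topic ARNs are hypothetical.

```javascript
// Hypothetical helper: one alarm per escalation tier on the same metric.
function buildEscalationAlarms({ instanceId, tiers }) {
    return tiers.map(({ name, afterMinutes, topicArn }) => ({
        AlarmName: `EC2-Down-${instanceId}-${name}`,
        Namespace: 'AWS/EC2',
        MetricName: 'StatusCheckFailed',
        Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
        Statistic: 'Maximum',
        Period: 60,
        // With one-minute periods, N periods means N consecutive failing minutes.
        EvaluationPeriods: afterMinutes,
        Threshold: 1,
        ComparisonOperator: 'GreaterThanOrEqualToThreshold',
        AlarmActions: [topicArn],
    }));
}

const escalationAlarms = buildEscalationAlarms({
    instanceId: 'i-1234567890abcdef0',
    tiers: [
        { name: 'first', afterMinutes: 1, topicArn: 'arn:aws:sns:us-east-1:123456789012:oncall-primary' },
        { name: 'second', afterMinutes: 5, topicArn: 'arn:aws:sns:us-east-1:123456789012:oncall-escalation' },
    ],
});
console.log(escalationAlarms.map((a) => [a.AlarmName, a.EvaluationPeriods]));
```

Because each tier is an independent alarm, the second notification only goes out if the failure persists for the full five minutes.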