Solving AWS ELB 503 Errors: Apache2 Backend Server Capacity Issues and Health Check Optimization


When your AWS infrastructure suddenly starts throwing 503 "Back-end server is at capacity" errors without triggering any CloudWatch alarms, it's time for some serious debugging. The situation becomes particularly puzzling when:

  • Direct Elastic IP access fails with "Connection reset by peer"
  • Apache logs show nothing unusual
  • Process counts appear normal (151 Apache processes in this case)
  • Resource utilization looks healthy:
    CPU: 7.45% avg (max 25.82%)
    Memory: 11.04% avg
    Disk: 62.18% avg usage on /
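
If you want to sanity-check those utilization numbers outside the console, CloudWatch can be queried directly. A minimal sketch for the CPU figure, assuming basic monitoring (5-minute periods) and a placeholder instance ID:

# Average and peak CPU over the last hour, in 5-minute buckets
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Average Maximum

Memory and disk usage are not published by EC2 by default, so figures like the ones above typically come from the CloudWatch agent or custom metrics.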
    

Through painful experience, we discovered that ELB health check settings can make or break your application's availability. Here's a proper health check configuration for Apache behind ELB:

# Example ELB Health Check Settings (AWS CLI)
aws elb configure-health-check \
    --load-balancer-name my-load-balancer \
    --health-check Target=HTTP:80/healthcheck.html,Interval=30,UnhealthyThreshold=5,HealthyThreshold=2,Timeout=10

The critical parameters are:

  • Interval: 30 seconds (frequent enough to catch failures without hammering the backend)
  • UnhealthyThreshold: 5 consecutive failures before marking unhealthy (avoids pulling instances over brief hiccups)
  • HealthyThreshold: 2 consecutive successes before marking healthy (recovered instances return to rotation quickly)
  • Timeout: 10 seconds (tolerates slow responses during load spikes)
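
Taken together, these values also bound how fast the ELB reacts: a failing instance keeps receiving traffic for roughly Interval × UnhealthyThreshold = 30 × 5 = 150 seconds before it is pulled, and a recovered instance rejoins after about 30 × 2 = 60 seconds of passing checks.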

These Apache settings helped stabilize our environment:

# In /etc/apache2/mods-available/mpm_prefork.conf
<IfModule mpm_prefork_module>
    StartServers            5
    MinSpareServers         5
    MaxSpareServers        10
    MaxRequestWorkers     150
    MaxConnectionsPerChild 10000
</IfModule>

# KeepAlive settings (in /etc/apache2/apache2.conf on Debian/Ubuntu)
KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 100
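
After changing MPM or KeepAlive settings, validate the configuration and reload gracefully so in-flight requests aren't dropped. A minimal sketch for a Debian/Ubuntu-style install (service name assumed):

# Check syntax before applying anything
sudo apachectl configtest

# Graceful reload re-reads the config without killing active connections
sudo systemctl reload apache2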

Implement these CloudWatch alarms to catch issues early:

aws cloudwatch put-metric-alarm \
    --alarm-name "High-5xx-Rate" \
    --metric-name "HTTPCode_Backend_5XX" \
    --namespace "AWS/ELB" \
    --statistic "Sum" \
    --period 60 \
    --threshold 10 \
    --comparison-operator "GreaterThanThreshold" \
    --evaluation-periods 3 \
    --alarm-actions "arn:aws:sns:us-east-1:123456789012:my-sns-topic"

Sometimes the solution is surprisingly simple - creating a new AMI from your problematic instance can resolve phantom issues. The process:

  1. Create a snapshot of the root volume
  2. Register a new AMI from the snapshot
  3. Launch a new instance from the fresh AMI
  4. Update your Auto Scaling Group or ELB configuration

This approach solved our issue when all other debugging failed, suggesting there might have been underlying filesystem or kernel-level problems in the original instance.
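
The same rebuild can be scripted with the AWS CLI. A rough sketch with placeholder IDs and names; note that create-image bundles steps 1 and 2, and will briefly reboot the instance unless you pass --no-reboot:

# Cut a fresh AMI directly from the problematic instance
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "web-rebuild-$(date +%Y%m%d)" \
    --description "Fresh AMI cut from the instance serving 503s"

# Once the AMI is available, launch a replacement from it
aws ec2 run-instances \
    --image-id ami-0abc1234def567890 \
    --instance-type m3.large \
    --key-name my-key \
    --security-group-ids sg-0123456789abcdef0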

Beyond the rebuild, a few preventive checks are worth keeping in the runbook:

  • Verify ELB health check endpoints respond quickly (under 200ms)
  • Check that Apache's MaxRequestWorkers (MaxClients on older versions) matches your instance size
  • Monitor TCP connection states: netstat -ant | awk '{print $6}' | sort | uniq -c
  • Review kernel parameters for connection handling (a persistence sketch follows this list):
    sysctl net.ipv4.tcp_tw_reuse
    sysctl net.core.somaxconn
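
If those kernel values turn out to be too conservative, they can be raised persistently. A hedged sketch; the numbers are common starting points, not values tuned for this workload:

# Persist the settings so they survive reboots (values are illustrative)
cat <<'EOF' | sudo tee /etc/sysctl.d/90-elb-tuning.conf
net.ipv4.tcp_tw_reuse = 1
net.core.somaxconn = 1024
EOF

# Apply immediately
sudo sysctl --system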
    

To recap how we got here: for two years, our AWS infrastructure ran smoothly until it suddenly began returning intermittent 503 "Service Unavailable: Back-end server is at capacity" errors. The strange part? CloudWatch showed no resource alarms triggering: CPU hovered around 7.45%, memory at 11.04%, and disk usage at 62.18%.

When bypassing the ELB via Elastic IP, we received connection resets:

HTTP request sent, awaiting response... 
Read error (Connection reset by peer) in headers. Retrying.

Apache logs showed nothing unusual, with 151 normal-looking httpd processes. Restarting Apache provided a temporary fix, but the root cause remained elusive.

During outages, local health checks succeeded:

curl http://localhost/server-status
# Returns 200 OK when ELB shows 503
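
To correlate ELB-reported 503s with what the instance sees locally, a simple loop that timestamps the local response code can be left running through an incident. A minimal sketch; /server-status assumes mod_status is enabled as above:

# Log the local health check status every 5 seconds for later correlation
while true; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://localhost/server-status)
    echo "$(date -u +%FT%TZ) local_status=$code" >> /tmp/local-healthcheck.log
    sleep 5
done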

This pointed to potential ELB health check misconfiguration. Our investigation revealed:

  • Inconsistent health check settings across ELBs
  • Overly aggressive thresholds (2 failures to mark unhealthy)
  • Short timeout periods (5 seconds)
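
Spotting that drift is much easier with all the settings side by side. A quick audit of every Classic ELB in the region (the --query expression just flattens the name and health check fields):

aws elb describe-load-balancers \
    --query 'LoadBalancerDescriptions[*].{Name:LoadBalancerName,Target:HealthCheck.Target,Interval:HealthCheck.Interval,Unhealthy:HealthCheck.UnhealthyThreshold,Healthy:HealthCheck.HealthyThreshold,Timeout:HealthCheck.Timeout}' \
    --output table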

We standardized health checks with this AWS CLI command:

aws elb configure-health-check \
    --load-balancer-name my-load-balancer \
    --health-check Target=HTTP:80/healthcheck,Interval=30,UnhealthyThreshold=5,HealthyThreshold=2,Timeout=10
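
Since the inconsistency spanned several load balancers, the same settings can be pushed to every Classic ELB in the region in one pass. A hedged sketch; review the list of names before running anything like this against production:

# Apply the standardized health check to every Classic ELB in the region
for lb in $(aws elb describe-load-balancers \
        --query 'LoadBalancerDescriptions[*].LoadBalancerName' --output text); do
    aws elb configure-health-check \
        --load-balancer-name "$lb" \
        --health-check Target=HTTP:80/healthcheck,Interval=30,UnhealthyThreshold=5,HealthyThreshold=2,Timeout=10
done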

Key parameters for traffic-spike resilience:

Interval: 30 seconds
UnhealthyThreshold: 5 consecutive failures
HealthyThreshold: 2 consecutive successes
Timeout: 10 seconds
Target path: /healthcheck (lightweight endpoint; see the sketch below)
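
The /healthcheck target itself should stay cheap so it responds quickly even when the application is busy. One hedged option is a small static file served straight off disk, avoiding any application or database work; the document root below is assumed to be the Debian default:

# Static health check target served by Apache without touching the application
echo "OK" | sudo tee /var/www/html/healthcheck > /dev/null

# Confirm it answers fast from the instance itself
curl -s -o /dev/null -w 'status=%{http_code} time=%{time_total}s\n' http://localhost/healthcheck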

To complement ELB changes, we optimized Apache's mpm_prefork settings:

<IfModule mpm_prefork_module>
    StartServers            5
    MinSpareServers         5  
    MaxSpareServers        10
    MaxRequestWorkers     150
    MaxConnectionsPerChild 10000
</IfModule>
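
MaxRequestWorkers should be sized against per-process memory so the instance cannot swap under full load. A rough way to measure the current Apache children (process name assumed to be apache2; use httpd on RHEL-style systems):

# Average and total resident memory of Apache worker processes, in MB
ps -o rss= -C apache2 | awk '{sum += $1; n++} END {if (n) printf "workers=%d avg=%.1f MB total=%.1f MB\n", n, sum/n/1024, sum/1024}'

Multiplying the average by MaxRequestWorkers (150 here) gives a memory ceiling to compare against the instance's available RAM.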

We implemented proper monitoring with this CloudWatch alarm setup:

aws cloudwatch put-metric-alarm \
    --alarm-name "High-5XX-Rate" \
    --metric-name HTTPCode_Backend_5XX \
    --namespace AWS/ELB \
    --statistic Sum \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 10 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:alert-topic

The complete solution involved:

  1. Creating a fresh AMI from the problematic instance
  2. Standardizing relaxed health check parameters
  3. Implementing Apache connection recycling
  4. Adding CloudWatch alarms for 5XX errors

This combination resolved our intermittent 503 issues while making the infrastructure more resilient to traffic spikes.