When your AWS infrastructure suddenly starts throwing 503 "Back-end server is at capacity" errors without triggering any CloudWatch alarms, it's time for some serious debugging. The situation becomes particularly puzzling when:
- Direct Elastic IP access fails with "Connection reset by peer"
- Apache logs show nothing unusual
- Process counts appear normal (151 Apache processes in this case)
- Resource utilization looks healthy:
  - CPU: 7.45% average (25.82% max)
  - Memory: 11.04% average
  - Disk: 62.18% used on /
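Those figures can be confirmed straight from the CLI by pulling the raw CloudWatch statistics; a sketch, assuming a hypothetical instance ID and GNU date for the time window:
# Average and peak CPU for the suspect instance over the last 24 hours
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc1234567890def \
--statistics Average Maximum \
--period 3600 \
--start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"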
Through painful experience, we discovered that ELB health check settings can make or break your application's availability. Here's the health check configuration that ultimately worked for Apache behind ELB:
# Example ELB Health Check Settings (AWS CLI)
aws elb configure-health-check \
--load-balancer-name my-load-balancer \
--health-check Target=HTTP:80/healthcheck,Interval=30,UnhealthyThreshold=5,HealthyThreshold=2,Timeout=10
The critical parameters are:
- Interval: 30 seconds (not too aggressive)
- UnhealthyThreshold: 5 consecutive failures before marking an instance unhealthy, so brief hiccups don't pull it out of service
- HealthyThreshold: 2 consecutive successes before returning an instance to service
- Timeout: 10 seconds, giving Apache room to respond under load
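After changing the settings, it's worth confirming that instances actually return to InService; a quick check, assuming the same load balancer name:
# List per-instance health as the ELB sees it
aws elb describe-instance-health --load-balancer-name my-load-balancer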
These Apache settings helped stabilize our environment:
# In /etc/apache2/mods-available/mpm_prefork.conf
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxRequestWorkers 150
MaxConnectionsPerChild 10000
</IfModule>
# KeepAlive settings
KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 100
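Before applying either change, a syntax check plus a graceful reload avoids dropping in-flight requests (commands assume a Debian/Ubuntu Apache layout):
# Validate the configuration, then reload without killing active connections
sudo apache2ctl configtest
sudo systemctl reload apache2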
Implement a CloudWatch alarm on backend 5XX counts to catch issues early:
aws cloudwatch put-metric-alarm \
--alarm-name "High-5xx-Rate" \
--metric-name "HTTPCode_Backend_5XX" \
--namespace "AWS/ELB" \
--dimensions Name=LoadBalancerName,Value=my-load-balancer \
--statistic "Sum" \
--period 60 \
--threshold 10 \
--comparison-operator "GreaterThanThreshold" \
--evaluation-periods 3 \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:my-sns-topic"
Sometimes the solution is surprisingly simple - creating a new AMI from your problematic instance can resolve phantom issues. The process:
- Create snapshot of the root volume
- Register new AMI from the snapshot
- Launch new instance from the fresh AMI
- Update your Auto Scaling Group or ELB configuration
This approach solved our issue when all other debugging failed, suggesting there might have been underlying filesystem or kernel-level problems in the original instance.
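A minimal sketch of that rebuild with the AWS CLI; create-image snapshots the root volume and registers the AMI in a single step (all IDs below are placeholders):
# Build an AMI from the live instance (snapshots the root volume and registers the image)
aws ec2 create-image \
--instance-id i-0abc1234567890def \
--name "web-rebuild-$(date +%Y%m%d)" \
--no-reboot
# Launch a replacement from the new image, then swap it into the ELB or Auto Scaling Group
aws ec2 run-instances \
--image-id ami-0123456789abcdef0 \
--instance-type m4.large \
--key-name my-key \
--subnet-id subnet-0abc1234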
If problems recur, a few additional checks are worth running:
- Verify that ELB health check endpoints respond quickly (under 200ms)
- Check Apache's MaxClients setting matches your instance size
- Monitor TCP connection states:
netstat -ant | awk '{print $6}' | sort | uniq -c
- Review kernel parameters for connection handling:
sysctl net.ipv4.tcp_tw_reuse
sysctl net.core.somaxconn
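If that review shows heavy TIME_WAIT buildup or a small listen backlog, persistent overrides look roughly like this (the values are illustrative, not a recommendation; measure first):
# Persist kernel tweaks for connection handling (illustrative values only)
echo "net.ipv4.tcp_tw_reuse = 1" | sudo tee -a /etc/sysctl.d/99-connection-tuning.conf
echo "net.core.somaxconn = 1024" | sudo tee -a /etc/sysctl.d/99-connection-tuning.conf
sudo sysctl --system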
For two years, our AWS infrastructure ran smoothly until we suddenly began encountering intermittent 503 "Service Unavailable: Back-end server is at capacity" errors. The strange part? No CloudWatch resource alarms were triggering: CPU hovered around 7.45%, memory at 11.04%, and disk usage at 62.18%.
When bypassing the ELB via Elastic IP, we received connection resets:
HTTP request sent, awaiting response...
Read error (Connection reset by peer) in headers. Retrying.
Apache logs showed nothing unusual, and the 151 httpd processes all looked normal. Restarting Apache provided a temporary fix, but the root cause remained elusive.
During outages, local health checks succeeded:
curl http://localhost/server-status
# Returns 200 OK when ELB shows 503
This pointed to potential ELB health check misconfiguration. Our investigation revealed:
- Inconsistent health check settings across ELBs
- Overly aggressive thresholds (2 failures to mark unhealthy)
- Short timeout periods (5 seconds)
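One way to spot that kind of drift is to list every Classic ELB's health check side by side (the JMESPath query is illustrative):
# Compare health check settings across all ELBs in the region
aws elb describe-load-balancers \
--query 'LoadBalancerDescriptions[*].[LoadBalancerName,HealthCheck]' \
--output table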
We standardized health checks with this AWS CLI command:
aws elb configure-health-check \
--load-balancer-name my-load-balancer \
--health-check Target=HTTP:80/healthcheck,Interval=30,UnhealthyThreshold=5,HealthyThreshold=2,Timeout=10
Key parameters for traffic-spike resilience:
- Interval: 30 seconds
- UnhealthyThreshold: 5 consecutive failures
- HealthyThreshold: 2 consecutive successes
- Timeout: 10 seconds
- RequestPath: /healthcheck (lightweight endpoint)
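The lightweight endpoint can be as simple as a static file, which keeps the probe independent of the application stack; a sketch, assuming the default Debian/Ubuntu DocumentRoot:
# Create a trivial static target for the ELB probe and confirm it answers quickly
echo "OK" | sudo tee /var/www/html/healthcheck > /dev/null
curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" http://localhost/healthcheck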
To complement ELB changes, we optimized Apache's mpm_prefork settings:
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxRequestWorkers 150
MaxConnectionsPerChild 10000
</IfModule>
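These directives only take effect when prefork is the active MPM, which is worth confirming first (Debian/Ubuntu command shown):
# Confirm the running MPM; the output should report "prefork"
apache2ctl -V | grep -i mpm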
We implemented proper monitoring with this CloudWatch alarm:
aws cloudwatch put-metric-alarm \
--alarm-name "High-5XX-Rate" \
--metric-name HTTPCode_Backend_5XX \
--namespace AWS/ELB \
--dimensions Name=LoadBalancerName,Value=my-load-balancer \
--statistic Sum \
--period 60 \
--evaluation-periods 3 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:alert-topic
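A companion alarm on UnHealthyHostCount catches the same failure mode from the ELB's point of view; a sketch using the same placeholder load balancer and topic:
# Alarm when any backend stays unhealthy for two consecutive periods
aws cloudwatch put-metric-alarm \
--alarm-name "Unhealthy-Backends" \
--metric-name UnHealthyHostCount \
--namespace AWS/ELB \
--dimensions Name=LoadBalancerName,Value=my-load-balancer \
--statistic Average \
--period 60 \
--evaluation-periods 2 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:alert-topic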
The complete solution involved:
- Creating a fresh AMI from the problematic instance
- Standardizing relaxed health check parameters
- Implementing Apache connection recycling
- Adding CloudWatch alarms for 5XX errors
This combination resolved our intermittent 503 issues while making the infrastructure more resilient to traffic spikes.