When deploying containers in Docker Swarm mode, the `HEALTHCHECK` directive serves as a critical liveness probe mechanism. Unlike standalone containers, where health status primarily affects the container lifecycle, Swarm mode introduces additional orchestration-layer behaviors.
From production observations and Docker source code analysis, here's what actually occurs when a Swarm task becomes unhealthy:
- Traffic Routing: Swarm's ingress network stops routing new requests to unhealthy tasks while maintaining existing connections
- Service Reconciliation: The orchestrator starts a replacement task while keeping the unhealthy one running during the grace period
- DNS Resolution: Swarm's internal DNS gradually removes unhealthy tasks from SRV records (see the sketch after this list)
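One way to observe the routing/DNS side of this is to resolve the per-task DNS name from a container attached to the same overlay network. A minimal sketch, assuming an attachable overlay network `appnet` and a service `web` (both placeholder names):

```bash
# Create an attachable overlay network and a throwaway service on it.
docker network create --driver overlay --attachable appnet
docker service create --name web --network appnet --replicas 3 nginx:alpine

# tasks.<service> returns one A record per task that Swarm considers available.
# Re-run this while a task is unhealthy: per the behavior described above,
# that task's IP should drop out of the answer.
docker run --rm --network appnet alpine nslookup tasks.web
```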
Here's a production-grade `HEALTHCHECK` configuration for a web service:
```dockerfile
FROM nginx:alpine

# curl must be available in the image; if your base image lacks it,
# install it or switch the probe to wget.
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
    --retries=3 CMD curl -f http://localhost/health || exit 1

COPY nginx.conf /etc/nginx/nginx.conf
```
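Once that Dockerfile is built into an image, the embedded health check applies to every task created from it. A deployment sketch, with `registry.example.com/web` and the service name `web` as placeholders:

```bash
# Build and push the image (required on multi-node swarms), then create the service.
docker build -t registry.example.com/web:1.0 .
docker push registry.example.com/web:1.0

# The HEALTHCHECK baked into the image is applied to every task of the service.
docker service create --name web --replicas 3 --publish 8080:80 \
  registry.example.com/web:1.0

# Watch the tasks converge to Running.
docker service ps web
```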
While the Swarm API doesn't expose task health directly, you can monitor it through:
```bash
# Container-level health inspection
docker inspect --format '{{json .State.Health}}' container_id

# Service-level health summary
docker service ps --format "{{.Name}} {{.CurrentState}}" service_name
```
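If you need health per task rather than per container, a small loop can bridge the gap. This sketch assumes it runs on the node hosting the tasks (or on a single-node swarm), with `web` as a placeholder service name:

```bash
# Resolve each running task of a service to its container, then read that
# container's health from the local engine.
SERVICE=web
for task in $(docker service ps -q --filter desired-state=running "$SERVICE"); do
  cid=$(docker inspect --format '{{.Status.ContainerStatus.ContainerID}}' "$task")
  # .State.Health only exists when the image or service defines a health check.
  [ -n "$cid" ] && docker inspect --format \
    '{{.Name}}: {{if .State.Health}}{{.State.Health.Status}}{{else}}no healthcheck{{end}}' "$cid"
done
```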
For stateful services, consider this circuit-breaker style pattern: while a maintenance flag file exists, the check always passes, so Swarm won't replace the task during planned work:
```dockerfile
# Pass unconditionally while /tmp/maintenance exists; otherwise probe the app.
HEALTHCHECK --interval=10s \
    CMD bash -c "[[ -f /tmp/maintenance ]] && exit 0 || \
    (curl -sf http://localhost/health || exit 1)"
```
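To use it during planned maintenance, create the flag file inside each affected task's container and remove it afterwards (the container name below is a placeholder; run the commands on the node hosting that task):

```bash
# Enter maintenance mode: the health check now always passes for this container.
docker exec my_service.1.abc123 touch /tmp/maintenance

# ... perform maintenance ...

# Leave maintenance mode: real health probes resume on the next interval.
docker exec my_service.1.abc123 rm /tmp/maintenance
```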
A few operational notes:
- Health check failures won't appear in `docker service logs` - check the container logs on the hosting node instead
- Use the `--health-cmd` flag on `docker service create` (or `docker service update`) for runtime overrides - see the example after this list
- Set `--health-start-period` longer than your service's warmup time
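For reference, a runtime override with the service-level health flags might look like this (service and image names are placeholders):

```bash
# Define or override the health check at service level, without rebuilding the image.
docker service create --name api \
  --health-cmd "curl -f http://localhost:8080/health || exit 1" \
  --health-interval 15s \
  --health-timeout 3s \
  --health-retries 3 \
  --health-start-period 60s \
  myorg/api:latest

# The same flags work on docker service update for an existing service.
docker service update --health-interval 30s api
```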
For a database cluster with a primary-replica setup:

```dockerfile
# For the primary node's image
HEALTHCHECK CMD pg_isready -U postgres -d primary_db

# For replica nodes' images: healthy only if Postgres accepts connections
# AND the instance is still in recovery (i.e. still acting as a replica).
HEALTHCHECK CMD bash -c 'pg_isready -U postgres -d replica_db && \
    [ "$(psql -U postgres -tAc "SELECT pg_is_in_recovery()")" = "t" ]'
```
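Before baking the replica probe into an image, it's worth running its two halves by hand inside a running replica container (`pg_replica` is a placeholder name):

```bash
# Both commands must succeed for the replica health check above to pass.
docker exec pg_replica pg_isready -U postgres -d replica_db
docker exec pg_replica psql -U postgres -tAc 'SELECT pg_is_in_recovery()'   # expect: t
```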
These health-check semantics carry over unchanged when services are defined in a Compose/stack file. One additional observation from testing: after a task is marked unhealthy, there is typically a delay of around 30 seconds before its replacement begins.
Here's how we implement robust health checks in production:
```yaml
version: '3.8'
services:
  webapp:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 15s
      timeout: 3s
      retries: 3
      start_period: 40s
    deploy:
      replicas: 3
      update_config:
        failure_action: rollback
      restart_policy:
        condition: any
```
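Deployed with `docker stack deploy`, the `start_period` above gives each replica 40 seconds of grace before failed probes count toward the retry budget. Stack and file names below are placeholders:

```bash
# Deploy the stack and watch the replicas converge.
docker stack deploy -c docker-compose.yml web
docker stack services web
docker service ps web_webapp
```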
For troubleshooting, these commands prove invaluable:
```bash
# Check container health status (run on the node hosting the task)
docker inspect --format='{{json .State.Health}}' container_name

# View swarm task states; health-related failures surface in the Error column
docker service ps --format "table {{.Name}}\t{{.CurrentState}}\t{{.Error}}" service_name
```
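Health transitions also appear in the engine's event stream, which helps when a task is flapping (run this on the node hosting the container):

```bash
# Stream container events and keep only health transitions
# (they look like "health_status: healthy" / "health_status: unhealthy").
docker events --filter type=container | grep health_status
```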
For stateful services requiring special handling:
```yaml
services:
  database:
    image: postgres:13
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      start_period: 2m
    deploy:
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: on-failure
        delay: 20s
```
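Once the stack is running, the engine keeps the most recent probe results (including the `pg_isready` output) in the container's health log. A quick way to read them, with the name filter below as a placeholder:

```bash
# Show the recent health probe results recorded for the database container.
# The name filter matches by substring; adjust it to your stack/service name.
docker inspect --format '{{json .State.Health.Log}}' \
  "$(docker ps -q --filter name=database)"
```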