Docker Swarm HEALTHCHECK: Behavior Analysis and Implementation in Production


When deploying containers in Docker Swarm mode, the HEALTHCHECK directive serves as a critical liveness probe mechanism. Unlike standalone containers where health status primarily affects container lifecycle, Swarm mode introduces additional orchestration-layer behaviors.

From production observations and Docker source code analysis, here's what actually occurs when a Swarm task becomes unhealthy:

  • Traffic Routing: Swarm's ingress network stops routing new requests to unhealthy tasks while maintaining existing connections
  • Service Reconciliation: The orchestrator initiates a replacement task while keeping the unhealthy one running during a grace period
  • DNS Resolution: Swarm's internal DNS removes unhealthy tasks from the DNS answers for tasks.<service> (observable with the sketch after this list)
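
The DNS behavior is straightforward to observe from a container attached to the same overlay network. A minimal sketch, where web, appnet, and myorg/web:1.0 are illustrative names and the image is assumed to define a HEALTHCHECK (such as the Dockerfile below):

# Create an attachable overlay network and attach a test service to it
docker network create --driver overlay --attachable appnet
docker service create --name web --network appnet --replicas 3 myorg/web:1.0

# From a throwaway container on the same network, list the task IPs;
# entries for unhealthy tasks drop out of this answer
docker run --rm --network appnet alpine nslookup tasks.web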

Here's a production-grade HEALTHCHECK configuration for a web service:

FROM nginx:alpine

# Probe the /health endpoint (expected to be defined in nginx.conf below);
# the task is marked unhealthy after 3 consecutive failures
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
  --retries=3 CMD curl -f http://localhost/health || exit 1

COPY nginx.conf /etc/nginx/nginx.conf
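
To build and roll this out as a Swarm service (a usage sketch; myorg/web:1.0 and web are placeholder names):

docker build -t myorg/web:1.0 .
docker service create --name web --replicas 3 --publish 80:80 myorg/web:1.0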

While the Swarm API doesn't expose task health directly, you can monitor it through:

# Container-level health inspection
docker inspect --format '{{json .State.Health}}' container_id

# Service-level health summary
docker service ps --format "{{.Name}} {{.CurrentState}}" service_name
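
To see the same container-level health for every task of a service running on a given node, you can filter by the label Swarm attaches to its containers. A sketch, assuming the service is named web and its image defines a healthcheck:

# List health status for all local containers belonging to the "web" service
docker ps --filter label=com.docker.swarm.service.name=web -q |
  xargs -r -n1 docker inspect --format '{{.Name}} {{.State.Health.Status}}'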

For stateful services, consider this circuit-breaker pattern:

# Skip the HTTP probe while a maintenance flag file is present, so planned
# downtime doesn't trigger task replacement (POSIX sh, so it also works in
# Alpine-based images without bash)
HEALTHCHECK --interval=10s \
  CMD sh -c "[ -f /tmp/maintenance ] && exit 0 || \
  curl -sf http://localhost/health || exit 1"
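
In practice the maintenance flag is toggled from outside the container; a usage sketch (the container ID is a placeholder):

# Enter maintenance mode: the health check keeps passing while the backend is down
docker exec <container_id> touch /tmp/maintenance
# ...perform the maintenance...
docker exec <container_id> rm /tmp/maintenance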

A few operational notes:

  • Health check failures won't appear in docker service logs - check the container logs (or docker inspect output) instead
  • Use the --health-cmd flag on docker service create for runtime overrides (see the sketch after this list)
  • Set --health-start-period longer than your service's warm-up time
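
A hedged sketch of such a runtime override, reusing the placeholder image from above (the flag values are illustrative):

docker service create --name web --replicas 3 \
  --health-cmd "curl -f http://localhost/health || exit 1" \
  --health-interval 15s \
  --health-retries 3 \
  --health-start-period 60s \
  myorg/web:1.0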

For a database cluster with primary-replica setup:

# For the primary node: accept connections on the primary database
HEALTHCHECK CMD pg_isready -U postgres -d primary_db

# For replica nodes: accept connections AND confirm the node is still in
# recovery (pg_is_in_recovery() returns 't' on a streaming replica)
HEALTHCHECK CMD pg_isready -U postgres -d replica_db && \
  [ "$(psql -U postgres -tAc 'SELECT pg_is_in_recovery()')" = "t" ]
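
Hard-coding the role into each image means a promoted replica immediately starts failing its check (pg_is_in_recovery() flips to 'f'). One alternative is a single script that reads the expected role from the environment; a minimal sketch, where PG_EXPECTED_ROLE is a hypothetical variable set per service and the script is assumed to be baked into the image and wired up with HEALTHCHECK CMD /usr/local/bin/healthcheck.sh:

#!/bin/sh
# healthcheck.sh -- role-aware Postgres probe shared by primary and replicas
set -e

# Basic connectivity check
pg_isready -U postgres

# 't' on a streaming replica, 'f' on the primary
IN_RECOVERY=$(psql -U postgres -tAc 'SELECT pg_is_in_recovery()')

case "$PG_EXPECTED_ROLE" in
  primary) [ "$IN_RECOVERY" = "f" ] ;;  # primary must not be in recovery
  replica) [ "$IN_RECOVERY" = "t" ] ;;  # replica must still be replaying WAL
  *)       true ;;                      # role not pinned: accept either
esac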

To recap before moving to stack-level deployment: in Swarm mode the HEALTHCHECK directive is a central piece of container lifecycle management, and it behaves differently than in standalone Docker. The baseline Dockerfile form again:


# Example HEALTHCHECK in Dockerfile
HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost:8080/health || exit 1

Through extensive testing and Docker source code analysis, I've observed these concrete behaviors:

  • Traffic Routing: Swarm's ingress network stops routing new requests to unhealthy tasks
  • Service Reconciliation: The swarm manager initiates new task creation when health checks fail
  • Grace Period: There's typically a delay of roughly 30 seconds before replacement begins (the transition can be watched live with the sketch below)
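
A hedged sketch for watching the unhealthy-to-replaced transition as it happens (web is a placeholder service name):

# Stream health_status events on the node running the task
docker events --filter event=health_status

# On a manager, poll the task list and watch the replacement task appear
watch -n 2 docker service ps web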

Here's how we implement robust health checks in production:


version: '3.8'

services:
  webapp:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 15s
      timeout: 3s
      retries: 3
      start_period: 40s
    deploy:
      replicas: 3
      update_config:
        failure_action: rollback
      restart_policy:
        condition: any
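
Deploying this as a stack and confirming the replicas converge (a usage sketch; webstack is a placeholder stack name):

docker stack deploy -c docker-compose.yml webstack
docker stack services webstack
docker service ps webstack_webapp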

For troubleshooting, these commands prove invaluable:


# Check container health status
docker inspect --format='{{json .State.Health}}' container_name

# View swarm task states for a service (health-check failures surface in the
# task's current state and error column)
docker service ps --format "table {{.Name}}\t{{.CurrentState}}\t{{.Error}}" service_name
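
The individual probe results (exit code and command output) are only recorded on the container itself; a sketch for dumping the recent entries on the node running it:

# Print the recorded health-probe log for a container
docker inspect --format \
  '{{range .State.Health.Log}}{{.End}} exit={{.ExitCode}} {{.Output}}{{end}}' \
  container_name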

For stateful services requiring special handling:


services:
  database:
    image: postgres:13
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      start_period: 2m
    deploy:
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: on-failure
        delay: 20s
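
To sanity-check that the 2m start_period is long enough, one option is to time how long a fresh task actually takes to report healthy. A sketch, assuming the stack is named dbstack (so the service is dbstack_database) and the task runs on the local node:

# Grab the local container for the database service
CID=$(docker ps --filter label=com.docker.swarm.service.name=dbstack_database -q | head -n1)

# Poll until the container reports healthy and print the elapsed time
START=$(date +%s)
until [ "$(docker inspect --format '{{.State.Health.Status}}' "$CID")" = "healthy" ]; do
  sleep 2
done
echo "healthy after ~$(( $(date +%s) - START ))s"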