Debugging Kubernetes CrashLoopBackOff: Why Your Pod Keeps Restarting and How to Fix It


When your Kubernetes pod enters a CrashLoopBackOff state, the container starts but then crashes repeatedly, and Kubernetes applies an exponentially increasing delay (capped at five minutes) between restart attempts. From your description, the pod has restarted 72 times in 5 hours, which means it has never stayed up long enough to become healthy.
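
To watch the restart cycle and the growing backoff delays as they happen, you can watch the pod directly:

# Stream status changes for this pod; Ctrl-C to stop
kubectl get pod quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 -w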

Before diving deep, run through the basic troubleshooting steps you should always perform:

# Get pod details
kubectl describe pod quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4

# Check container logs (even if they seem empty)
kubectl logs quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 --previous

# Check events at cluster level
kubectl get events --sort-by='.metadata.creationTimestamp'
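
If the cluster-wide event list is noisy, you can narrow it to just this pod with a field selector:

# Show only events whose involved object is this pod
kubectl get events \
  --field-selector involvedObject.name=quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 \
  --sort-by='.metadata.creationTimestamp'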

From your event logs, we can see the pattern:

  1. Container creates successfully
  2. Container starts successfully
  3. Then crashes shortly after
  4. Kubernetes attempts to restart with increasing delays

The key observation here is that the container starts but then exits. This typically indicates one of several common issues:
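
One quick way to tell whether the crash happens immediately or only after some work is to look at how long the last run lasted (this assumes the pod has a single container):

# Start and end timestamps of the most recent failed run
kubectl get pod quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.startedAt}{"  ->  "}{.status.containerStatuses[0].lastState.terminated.finishedAt}{"\n"}'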

1. Application Crashes Immediately

Your application might be throwing an uncaught exception or failing some startup check. Try:

# Run the container locally in debug mode
docker run -it --entrypoint=/bin/sh us.gcr.io/skywatch-app/quasar-api-staging:15.0
# Then manually start your application to see errors
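
If the image only misbehaves inside the cluster, a similar trick works in-cluster: start a throwaway pod from the same image with the entrypoint replaced by a shell (the pod name quasar-debug is just an example):

# One-off pod from the same image, dropped into a shell instead of the normal entrypoint
kubectl run quasar-debug --rm -it \
  --image=us.gcr.io/skywatch-app/quasar-api-staging:15.0 \
  --command -- /bin/sh
# Start the application by hand from this shell and watch where it fails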

2. Missing Dependencies or Configuration

The pod might be missing:

  • Environment variables
  • ConfigMaps or Secrets
  • Volume mounts

Check your deployment YAML for these requirements:

# Example of checking environment variables
kubectl set env deployment/quasar-api-staging --list
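
It is also worth confirming that every ConfigMap, Secret, and volume the pod spec references actually exists in the namespace. A sketch of how to check, assuming the deployment is named quasar-api-staging as above:

# Show what the pod spec references (output is empty if nothing is referenced)
kubectl get deployment quasar-api-staging \
  -o jsonpath='{.spec.template.spec.containers[0].envFrom}{"\n"}{.spec.template.spec.volumes}{"\n"}'

# Confirm the referenced objects exist in this namespace
kubectl get configmaps,secrets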

3. Resource Constraints

Your container might be getting OOMKilled. Check:

kubectl describe pod quasar-api-staging-... | grep -i "oom"
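
If the grep comes back empty, the last termination record and live resource usage are also worth a look (kubectl top requires metrics-server to be installed in the cluster):

# Reads "OOMKilled" if the kernel killed the container for exceeding its memory limit
kubectl get pod quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

# Compare live usage against the configured limits
kubectl top pod quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4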

Using Ephemeral Containers for Debugging

Kubernetes 1.18+ supports attaching an ephemeral debug container to a running pod (the feature is stable as of 1.25):

kubectl debug -it quasar-api-staging-... --image=busybox --target=quasar-api-staging
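
Because the original container keeps dying, it is often easier to debug a disposable copy of the pod whose command is replaced by a shell (the copy name quasar-debug-copy is just an example):

# Copy the pod, swap the container's command for a shell, and attach to it
kubectl debug quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 -it \
  --copy-to=quasar-debug-copy \
  --container=quasar-api-staging \
  -- /bin/sh
# Delete the copy when you are done
kubectl delete pod quasar-debug-copy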

Checking Container Exit Codes

The exit code can reveal why your application failed:

kubectl get pod quasar-api-staging-... -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
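
The same information appears in kubectl describe under "Last State"; the usual exit-code conventions are worth keeping in mind:

# Exit code, reason, and timestamps of the last termination
kubectl describe pod quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 | grep -A 5 "Last State"

# Common conventions (not specific to this application):
#   1       - generic application error; check the logs
#   126/127 - command not executable / not found in the image
#   137     - SIGKILL (128 + 9), usually the OOM killer
#   139     - SIGSEGV (128 + 11), segmentation fault
#   143     - SIGTERM (128 + 15), the container was asked to stop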

Preventing Future CrashLoopBackOff Issues

To avoid future CrashLoopBackOff scenarios:

  1. Implement proper logging in your application
  2. Add health checks (readiness and liveness probes)
  3. Set appropriate resource requests and limits
  4. Test your container images locally before deployment

Here's an example of a good deployment configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: quasar-api-staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: quasar-api
  template:
    metadata:
      labels:
        app: quasar-api
    spec:
      containers:
      - name: quasar-api-staging
        image: us.gcr.io/skywatch-app/quasar-api-staging:15.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
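
After updating the manifest, it helps to validate it and watch the rollout rather than waiting for the pod to start flapping again (the file name deployment.yaml is an assumption):

# Validate the manifest locally without changing the cluster
kubectl apply --dry-run=client -f deployment.yaml

# Apply it and watch the rollout; this surfaces failures quickly
kubectl apply -f deployment.yaml
kubectl rollout status deployment/quasar-api-staging --timeout=120s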

Putting It All Together

To recap: when a pod shows CrashLoopBackOff status, the container starts but crashes repeatedly, triggering Kubernetes' backoff timer. The pod described here (quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4) exhibits the classic symptoms:

NAME                                                        READY   STATUS             RESTARTS   AGE
quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4   0/1     CrashLoopBackOff   72         5h

First, gather detailed information about the failing pod:

# Get pod events
kubectl describe pod quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4

# Check container logs (even if they appear empty)
kubectl logs quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 --previous

# Get pod configuration
kubectl get pod quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 -o yaml

Based on your pod events, we can identify several potential issues:

  • Application crashes immediately after startup
  • Missing environment variables or configuration
  • Resource constraints (CPU/memory limits too low)
  • Dependency services unavailable

When regular logs don't show the error, try these approaches (note that kubectl exec only succeeds during the brief window while the crashing container is actually running; kubectl debug does not have that limitation):

# Attach an ephemeral debug container to the pod
kubectl debug -it quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 --image=busybox -- sh

# Check mounted volumes (the path is a placeholder)
kubectl exec quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 -- ls /path/to/mount

# Test connectivity to dependencies (requires curl in the application image)
kubectl exec quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 -- curl http://dependency-service
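
If curl is not available in the application image, a busybox debug container shares the pod's network namespace and can run the same checks (dependency-service is carried over from the example above; the port and plain HTTP are assumptions):

# DNS resolution as seen from inside the pod's network namespace
kubectl debug -it quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 \
  --image=busybox -- nslookup dependency-service

# Basic HTTP check using busybox wget (5-second timeout)
kubectl debug -it quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 \
  --image=busybox -- wget -qO- -T 5 http://dependency-service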

Here's an example deployment configuration that includes proper liveness probes and resource limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: quasar-api-staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: quasar-api-staging
  template:
    metadata:
      labels:
        app: quasar-api-staging
    spec:
      containers:
      - name: quasar-api-staging
        image: us.gcr.io/skywatch-app/quasar-api-staging:15.0
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        resources:
          limits:
            memory: "512Mi"
            cpu: "500m"
          requests:
            memory: "256Mi"
            cpu: "250m"

Since your application runs locally but fails in the cluster, consider these differences:

  • Environment variables (use kubectl set env to verify)
  • Network policies and service meshes
  • Volume mounts and permissions
  • Cluster-specific configurations

To compare environments, run:

# Get all environment variables
kubectl exec quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 -- env
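
A simple way to compare the two environments is to dump each to a file and diff them (run the second command on the machine where the application works; the file names are illustrative):

# Environment inside the cluster
kubectl exec quasar-api-staging-14c385ccaff2519688add0c2cb0144b2-3r7v4 -- env | sort > cluster.env

# Environment on the machine where the app runs fine
env | sort > local.env

# Variables that differ between the two
diff local.env cluster.env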