How to Diagnose and Monitor Unexpected Kubernetes Pod Restarts: Debugging Techniques & Alert Setup


When investigating unexpected pod restarts in Kubernetes, start by examining the pod's lifecycle events:

kubectl describe pod [pod-name] -n [namespace]

Look for sections like:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  5m                 default-scheduler  Successfully assigned default/nginx to gke-cluster
  Warning  Unhealthy  2m (x3 over 4m)    kubelet            Liveness probe failed: HTTP probe failed
  Normal   Killing    2m                 kubelet            Container nginx failed liveness probe, will be restarted

Based on your single-node GKE setup, these are likely culprits:

  • Resource constraints - Check if your pod is hitting memory or CPU limits
  • Failed health checks - Review your liveness/readiness probe configurations
  • Node pressure - Even in single-node clusters, system components can evict pods
  • Application crashes - Check application logs for uncaught exceptions
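
A quick first check for the resource and crash cases is each container's last termination state: a reason of OOMKilled points at the memory limit, while any other non-zero exit code usually means the application itself crashed. For example, using the same [pod-name]/[namespace] placeholders as above:

# Show each container's last termination reason and exit code
kubectl get pod [pod-name] -n [namespace] \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" exit="}{.lastState.terminated.exitCode}{"\n"}{end}'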

Create a Prometheus alert rule to detect frequent restarts:

groups:
- name: pod-restart-alerts
  rules:
  - alert: FrequentPodRestarts
    expr: increase(kube_pod_container_status_restarts_total[5m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} is restarting frequently"
      description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times in the last 5 minutes"

Combine multiple approaches for effective troubleshooting:

# Get previous container logs if pod crashed
kubectl logs [pod-name] --previous

# Check current resource usage (point-in-time values, requires metrics-server)
kubectl top pod [pod-name] --containers

# View OOM killer events (if memory related)
kubectl get events --field-selector=reason=OOMKilling
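
It also helps to narrow the event stream down to the pod you're debugging rather than scanning everything:

# Show only events that reference this pod
kubectl get events -n [namespace] --field-selector involvedObject.name=[pod-name]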

For your personal website deployment, add these safeguards:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: website
spec:
  replicas: 2  # Basic HA
  selector:
    matchLabels:
      app: website
  template:
    metadata:
      labels:
        app: website
    spec:
      containers:
      - name: web
        image: [your-website-image]
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
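
Before trusting the liveness probe, confirm that /healthz actually responds; a misconfigured probe is itself a common cause of restart loops. A quick check, assuming website-deployment.yaml as the file name for the manifest above:

kubectl apply -f website-deployment.yaml
kubectl rollout status deployment/website
# Hit the probe path through a temporary port-forward
kubectl port-forward deploy/website 8080:80 &
sleep 2
curl -i http://localhost:8080/healthz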

For immediate alerts, create a Kubernetes Event Exporter with Slack integration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
spec:
  selector:
    matchLabels:
      app: event-exporter
  template:
    metadata:
      labels:
        app: event-exporter
    spec:
      containers:
      - name: event-exporter
        image: ghcr.io/resmo/kubernetes-event-exporter:latest
        env:
        - name: SLACK_WEBHOOK_URL
          value: "https://hooks.slack.com/services/..."
        args:
        - --config=/etc/config.yaml
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config.yaml
          subPath: config.yaml
      volumes:
      - name: config-volume
        configMap:
          name: event-exporter-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-config
data:
  config.yaml: |
    logLevel: debug
    routes:
      - match:
          - reason: "Started"
          - reason: "Killing"
          - reason: "BackOff"
          - reason: "Unhealthy"
        sink: slack
    sinks:
      - name: slack
        slack:
          webhookurl: ${SLACK_WEBHOOK_URL}
          message: "Event: {reason}\nPod: {involvedObject.name}\nNamespace: {involvedObject.namespace}\nMessage: {message}"

When running applications in Kubernetes, unexpected container restarts can occur for various reasons. The key is to understand the root cause and implement proper monitoring. Here's how to investigate:

First, examine your pod's status and restart count:

kubectl get pods --all-namespaces
kubectl describe pod [POD_NAME]

Look for the Restart Count field and Last State in the container status section.

Examine both current and previous container logs:

kubectl logs [POD_NAME] --previous
kubectl logs [POD_NAME] --tail=50

Common causes you may uncover include:

  • OOMKilled (Out of Memory)
  • CrashLoopBackOff
  • Liveness probe failures
  • Node resource pressure
  • Manual pod eviction
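
To see at a glance which of these causes applies, list every pod with its restart count and last termination reason:

kubectl get pods -A -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount,LAST_REASON:.status.containerStatuses[*].lastState.terminated.reason'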

Create Prometheus alerts for container restarts:

groups:
- name: container-restarts
  rules:
  - alert: HighContainerRestarts
    expr: increase(kube_pod_container_status_restarts_total[5m]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting frequently"
      description: "Container {{ $labels.container }} in pod {{ $labels.pod }} has restarted {{ $value }} times in the last 5 minutes"

For OOM issues, check container memory limits:

kubectl get pod [POD_NAME] -o json | jq '.spec.containers[].resources'

For liveness probe failures:

kubectl describe pod [POD_NAME] | grep -A 10 "Liveness"

Enable Kubernetes events monitoring:

kubectl get events --sort-by='.metadata.creationTimestamp'
kubectl get events --field-selector involvedObject.kind=Pod
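
Add --watch to keep the stream open while you reproduce the problem:

kubectl get events --field-selector involvedObject.kind=Pod --watch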

For deeper investigation, check kubelet logs on the node:

journalctl -u kubelet --no-pager -n 100

To reduce the chance of future restarts:

  • Implement proper resource requests/limits
  • Configure appropriate liveness/readiness probes
  • Set up pod disruption budgets for critical workloads (see the command after this list)
  • Monitor node resource utilization
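
For the pod disruption budget, a minimal one can be created imperatively; the app=website selector below assumes the label used in the deployment example earlier:

kubectl create poddisruptionbudget website-pdb --selector=app=website --min-available=1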

Create a simple script to monitor and notify about restarts:

#!/bin/bash
# Poll every 60 seconds and report any pod whose containers have restarted
while true; do
  kubectl get pods -o json \
    | jq -r '.items[] | select(any(.status.containerStatuses[]?; .restartCount > 0)) | .metadata.name' \
    | while read -r pod; do
        echo "ALERT: $pod has restarted"
        # Add your notification logic here (email, Slack, etc.)
      done
  sleep 60
done