Debugging Kubernetes Pod Recreations: How to Investigate Unexpected Terminations in Scaled Deployments


When working with scaled Kubernetes deployments, unexpected Pod recreations can be particularly frustrating because the terminated Pods disappear completely from the system. Unlike container restarts (which preserve the Pod), complete Pod recreations wipe all evidence of the previous instance.

Here are the most effective ways to investigate Pod recreations:

# Check recent events cluster-wide, sorted oldest first
kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp'

# Check Pod-related events in a specific namespace
kubectl get events -n your-namespace --field-selector involvedObject.kind=Pod --sort-by='.metadata.creationTimestamp'
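
If you already know the name of the replacement Pod (or captured the old Pod's name before it disappeared), the same event query can be scoped to that one object; your-namespace and my-pod below are placeholders:

# Events recorded for one specific Pod, oldest first
kubectl get events -n your-namespace \
  --field-selector involvedObject.kind=Pod,involvedObject.name=my-pod \
  --sort-by='.metadata.creationTimestamp'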

For long-term monitoring, run a lightweight in-cluster exporter that captures events before they expire (Kubernetes only retains events for 1 hour by default):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
spec:
  selector:
    matchLabels:
      app: event-exporter
  template:
    metadata:
      labels:
        app: event-exporter
    spec:
      # NOTE: the Pod's service account needs RBAC permission to list events
      # cluster-wide (a ClusterRole with get/list/watch on events).
      containers:
      - name: event-exporter
        image: bitnami/kubectl
        command: ["/bin/sh"]
        args: ["-c", "while true; do kubectl get events -A --field-selector involvedObject.kind=Pod -o json >> /events/events.log; sleep 300; done"]
        volumeMounts:
        - name: events-volume
          mountPath: /events
      volumes:
      - name: events-volume
        emptyDir: {}  # ephemeral; use a PersistentVolumeClaim if the log must survive Pod restarts
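
To read what the exporter has collected, one option (a sketch that assumes the app=event-exporter label added above and jq available locally) is to pull the log out and flatten it:

# Grab the exporter Pod and copy out its accumulated event dumps
POD=$(kubectl get pods -l app=event-exporter -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- cat /events/events.log > events.log

# Each dump is a JSON List; print one line per event, de-duplicated across dumps
jq -r '.items[] | "\(.lastTimestamp) \(.reason) \(.involvedObject.namespace)/\(.involvedObject.name): \(.message)"' events.log | sort -u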

From production experience, these are the most frequent culprits:

  • Node pressure evictions (memory/disk)
  • Cluster autoscaler operations
  • Node drains and other voluntary disruptions (bounded by Pod Disruption Budgets)
  • Manual kubectl delete operations
  • Deployment rolling updates (even without configuration changes)
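
Several of these can be confirmed directly from events and logs. A sketch, assuming the cluster autoscaler runs in kube-system under the app=cluster-autoscaler label, with your-deployment and your-namespace as placeholders:

# Kubelet pressure evictions are recorded as Pod events with reason=Evicted
kubectl get events -A --field-selector reason=Evicted

# Cluster autoscaler scale-down decisions are logged by its own Pod
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=500 | grep -i "scale down"

# Rolling updates create a new ReplicaSet even when the change looks trivial
kubectl rollout history deployment/your-deployment -n your-namespace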

Create a PodDisruptionBudget to cap voluntary disruptions and keep a minimum number of replicas available during drains:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
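
Once the PDB exists, its status makes disruption pressure visible; for example, using the zk-pdb name from the manifest above:

# ALLOWED DISRUPTIONS shows how many voluntary evictions are currently permitted
kubectl get pdb zk-pdb

# Describe adds current/desired healthy counts and any related events
kubectl describe pdb zk-pdb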

Enable and examine audit logs to track who/what deleted Pods:

# Sample audit policy configuration
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]
  verbs: ["delete"]
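
With that policy in place, deleted Pods can be traced back to a caller by filtering the audit log. The path below assumes the API server writes to /var/log/kubernetes/audit/audit.log via --audit-log-path; adjust it for your cluster:

# Who deleted which Pod, and when (one JSON audit event per line)
jq -r 'select(.verb == "delete" and .objectRef.resource == "pods")
  | "\(.requestReceivedTimestamp) \(.user.username) deleted \(.objectRef.namespace)/\(.objectRef.name)"' \
  /var/log/kubernetes/audit/audit.log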

Check node conditions to rule out pressure evictions:

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

# Look for these critical conditions:
# MemoryPressure: True
# DiskPressure: True
# PIDPressure: True
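
To pull out only the nodes that are actually reporting pressure, the same output can be filtered with jq:

# Names of nodes with any *Pressure condition currently True
kubectl get nodes -o json | jq -r '.items[]
  | select(any(.status.conditions[]; (.type | endswith("Pressure")) and .status == "True"))
  | .metadata.name'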

For proactive monitoring, set up alerts for these metrics:

  • kube_pod_deletion_timestamp (exposed by kube-state-metrics)
  • kube_pod_status_ready (for ready-state transitions)
  • kube_node_status_condition (for node pressure states)
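
As a quick sanity check outside of alerting rules, the node condition metric can also be queried ad hoc through the Prometheus HTTP API; prometheus.example:9090 is a placeholder address, and the metric assumes kube-state-metrics is installed:

# Nodes currently reporting MemoryPressure (an empty result means none)
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=kube_node_status_condition{condition="MemoryPressure",status="true"} == 1'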

When working with scaled Kubernetes deployments, one of the most frustrating scenarios is discovering that some pods mysteriously restarted overnight without a clear audit trail. Unlike container crashes, which leave obvious logs, full pod recreations often happen silently in the background.

Here are the most effective approaches to uncover the root cause:

# List Pods that are not currently Running (Failed, Succeeded, Pending, Unknown)
kubectl get pods --all-namespaces --field-selector=status.phase!=Running -o wide

# Sort by creation time and filter out healthy Running Pods
kubectl get pods --all-namespaces --sort-by=.metadata.creationTimestamp | grep -v Running
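
To tell container restarts (same Pod, restart counter increments) apart from full recreations (new Pod, fresh creation timestamp), compare restart counts against Pod age; your-namespace is a placeholder:

# Young Pods with zero restarts were recreated rather than restarted in place
kubectl get pods -n your-namespace \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,RESTARTS:.status.containerStatuses[0].restartCount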

Several system components might hold clues about pod termination:

  • Deployment revision history (for rollout triggers)
  • Node system logs (for node pressure evictions)
  • Horizontal Pod Autoscaler activity
  • Cluster autoscaler decisions
  • Custom webhooks or admission controllers

This helper script collects relevant forensic data:

#!/bin/bash
NAMESPACE="your-namespace"
DEPLOYMENT="your-deployment"

# Get deployment rollout history
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE

# Check for node pressure events
kubectl get events -A --sort-by=.metadata.creationTimestamp \
  --field-selector involvedObject.kind=Node

# Verify HPA status
kubectl describe hpa -n $NAMESPACE

# Check for pod disruption budgets
kubectl get pdb -n $NAMESPACE
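
One more check worth appending to the script (same placeholder variables): if the Pods' owning ReplicaSet has changed, a rollout replaced them even though the Deployment spec may look untouched:

# Compare each Pod's owning ReplicaSet against the Deployment's ReplicaSet history
kubectl get pods -n $NAMESPACE \
  -o custom-columns=POD:.metadata.name,OWNER:.metadata.ownerReferences[0].name
kubectl get rs -n $NAMESPACE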

From production experience, these are frequent culprits:

  • Node drain/eviction: check Node events with kubectl get events -A | grep -i drain
  • Resource starvation (OOMKilled): check container state with kubectl describe pod <pod-name> | grep -i oom
  • Anti-affinity conflicts: look for scheduling failures with kubectl get events | grep -i "failed scheduling"

For persistent cases, consider these deeper investigations:

# Enable Kubernetes audit logging
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]

# Check kubelet logs on worker nodes
journalctl -u kubelet --since "24 hours ago" | grep -i killing

Implement these practices to minimize unexpected recreations:

  • Set appropriate resource requests/limits (see the example after this list)
  • Configure pod disruption budgets
  • Implement proper readiness gates
  • Monitor cluster autoscaler metrics
  • Review deployment update strategies
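
For the first item, requests and limits can be adjusted in place with kubectl; the deployment, container, and resource values below are placeholders to tune against observed usage:

# Placeholder names/values; size them from real metrics before applying
kubectl set resources deployment/your-deployment -n your-namespace \
  -c your-container --requests=cpu=100m,memory=256Mi --limits=cpu=500m,memory=512Mi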