Debugging Kubernetes Pod Recreations: How to Investigate Unexpected Terminations in Scaled Deployments


When working with scaled Kubernetes deployments, unexpected Pod recreations can be particularly frustrating because the terminated Pods disappear completely from the system. Unlike container restarts (which preserve the Pod), complete Pod recreations wipe all evidence of the previous instance.

Here are the most effective ways to investigate Pod recreations:

# Check recent events cluster-wide, sorted oldest first
kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp'

# Check Pod-related events in a specific namespace
kubectl get events -n your-namespace --field-selector involvedObject.kind=Pod --sort-by='.metadata.creationTimestamp'
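
If you already know the name of the replacement Pod (or captured the old Pod's name before it disappeared), the same event query can be scoped to that one object; your-namespace and my-pod below are placeholders:

# Events recorded for one specific Pod, oldest first
kubectl get events -n your-namespace \
  --field-selector involvedObject.kind=Pod,involvedObject.name=my-pod \
  --sort-by='.metadata.creationTimestamp'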

For long-term monitoring, run a lightweight in-cluster exporter that captures events before they expire (Kubernetes only retains events for 1 hour by default):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
spec:
  selector:
    matchLabels:
      app: event-exporter
  template:
    metadata:
      labels:
        app: event-exporter
    spec:
      # NOTE: the Pod's service account needs RBAC permission to list events
      # cluster-wide (a ClusterRole with get/list/watch on events).
      containers:
      - name: event-exporter
        image: bitnami/kubectl
        command: ["/bin/sh"]
        args: ["-c", "while true; do kubectl get events -A --field-selector involvedObject.kind=Pod -o json >> /events/events.log; sleep 300; done"]
        volumeMounts:
        - name: events-volume
          mountPath: /events
      volumes:
      - name: events-volume
        emptyDir: {}  # ephemeral; use a PersistentVolumeClaim if the log must survive Pod restarts
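
To read what the exporter has collected, one option (a sketch that assumes the app=event-exporter label added above and jq available locally) is to pull the log out and flatten it:

# Grab the exporter Pod and copy out its accumulated event dumps
POD=$(kubectl get pods -l app=event-exporter -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- cat /events/events.log > events.log

# Each dump is a JSON List; print one line per event, de-duplicated across dumps
jq -r '.items[] | "\(.lastTimestamp) \(.reason) \(.involvedObject.namespace)/\(.involvedObject.name): \(.message)"' events.log | sort -u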

From production experience, these are the most frequent culprits:

  • Node pressure evictions (memory/disk)
  • Cluster autoscaler operations
  • Node drains and other voluntary disruptions (bounded by Pod Disruption Budgets)
  • Manual kubectl delete operations
  • Deployment rolling updates (even without configuration changes)
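
Several of these can be confirmed directly from events and logs. A sketch, assuming the cluster autoscaler runs in kube-system under the app=cluster-autoscaler label, with your-deployment and your-namespace as placeholders:

# Kubelet pressure evictions are recorded as Pod events with reason=Evicted
kubectl get events -A --field-selector reason=Evicted

# Cluster autoscaler scale-down decisions are logged by its own Pod
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=500 | grep -i "scale down"

# Rolling updates create a new ReplicaSet even when the change looks trivial
kubectl rollout history deployment/your-deployment -n your-namespace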

Create a PodDisruptionBudget to cap voluntary disruptions and keep a minimum number of replicas available during drains:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
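
Once the PDB exists, its status makes disruption pressure visible; for example, using the zk-pdb name from the manifest above:

# ALLOWED DISRUPTIONS shows how many voluntary evictions are currently permitted
kubectl get pdb zk-pdb

# Describe adds current/desired healthy counts and any related events
kubectl describe pdb zk-pdb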

Enable and examine audit logs to track who/what deleted Pods:

# Sample audit policy configuration
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]
  verbs: ["delete"]
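
With that policy in place, deleted Pods can be traced back to a caller by filtering the audit log. The path below assumes the API server writes to /var/log/kubernetes/audit/audit.log via --audit-log-path; adjust it for your cluster:

# Who deleted which Pod, and when (one JSON audit event per line)
jq -r 'select(.verb == "delete" and .objectRef.resource == "pods")
  | "\(.requestReceivedTimestamp) \(.user.username) deleted \(.objectRef.namespace)/\(.objectRef.name)"' \
  /var/log/kubernetes/audit/audit.log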

Check node conditions to rule out pressure evictions:

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

# Look for these critical conditions:
# MemoryPressure: True
# DiskPressure: True
# PIDPressure: True
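
To pull out only the nodes that are actually reporting pressure, the same output can be filtered with jq:

# Names of nodes with any *Pressure condition currently True
kubectl get nodes -o json | jq -r '.items[]
  | select(any(.status.conditions[]; (.type | endswith("Pressure")) and .status == "True"))
  | .metadata.name'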

For proactive monitoring, set up alerts for these metrics:

  • kube_pod_deletion_timestamp (exposed by kube-state-metrics)
  • kube_pod_status_ready (for ready-state transitions)
  • kube_node_status_condition (for node pressure states)
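
As a quick sanity check outside of alerting rules, the node condition metric can also be queried ad hoc through the Prometheus HTTP API; prometheus.example:9090 is a placeholder address, and the metric assumes kube-state-metrics is installed:

# Nodes currently reporting MemoryPressure (an empty result means none)
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=kube_node_status_condition{condition="MemoryPressure",status="true"} == 1'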

When working with scaled Kubernetes deployments, one of the most frustrating scenarios is discovering that some pods mysteriously restarted overnight without a clear audit trail. Unlike container crashes, which leave obvious logs, full pod recreations often happen silently in the background.

Here are the most effective approaches to uncover the root cause:

# List Pods that are not currently Running (Failed, Succeeded, Pending, Unknown)
kubectl get pods --all-namespaces --field-selector=status.phase!=Running -o wide

# Sort by creation time and filter out healthy Running Pods
kubectl get pods --all-namespaces --sort-by=.metadata.creationTimestamp | grep -v Running
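
To tell container restarts (same Pod, restart counter increments) apart from full recreations (new Pod, fresh creation timestamp), compare restart counts against Pod age; your-namespace is a placeholder:

# Young Pods with zero restarts were recreated rather than restarted in place
kubectl get pods -n your-namespace \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,RESTARTS:.status.containerStatuses[0].restartCount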

Several system components might hold clues about pod termination:

  • Deployment revision history (for rollout triggers)
  • Node system logs (for node pressure evictions)
  • Horizontal Pod Autoscaler activity
  • Cluster autoscaler decisions
  • Custom webhooks or admission controllers

This helper script collects relevant forensic data:

#!/bin/bash
NAMESPACE="your-namespace"
DEPLOYMENT="your-deployment"

# Get deployment rollout history
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE

# Check for node pressure events
kubectl get events -A --sort-by=.metadata.creationTimestamp \
  --field-selector involvedObject.kind=Node

# Verify HPA status
kubectl describe hpa -n $NAMESPACE

# Check for pod disruption budgets
kubectl get pdb -n $NAMESPACE
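
One more check worth appending to the script (same placeholder variables): if the Pods' owning ReplicaSet has changed, a rollout replaced them even though the Deployment spec may look untouched:

# Compare each Pod's owning ReplicaSet against the Deployment's ReplicaSet history
kubectl get pods -n $NAMESPACE \
  -o custom-columns=POD:.metadata.name,OWNER:.metadata.ownerReferences[0].name
kubectl get rs -n $NAMESPACE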

From production experience, these are frequent culprits:

  • Node drain/eviction: check Node events with kubectl get events -A | grep -i drain
  • Resource starvation (OOMKilled): check container state with kubectl describe pod <pod-name> | grep -i oom
  • Anti-affinity conflicts: look for scheduling failures with kubectl get events | grep -i "failed scheduling"

For persistent cases, consider these deeper investigations:

# Enable Kubernetes audit logging
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]

# Check kubelet logs on worker nodes
journalctl -u kubelet --since "24 hours ago" | grep -i killing

Implement these practices to minimize unexpected recreations:

  • Set appropriate resource requests/limits (see the example after this list)
  • Configure pod disruption budgets
  • Implement proper readiness gates
  • Monitor cluster autoscaler metrics
  • Review deployment update strategies
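
For the first item, requests and limits can be adjusted in place with kubectl; the deployment, container, and resource values below are placeholders to tune against observed usage:

# Placeholder names/values; size them from real metrics before applying
kubectl set resources deployment/your-deployment -n your-namespace \
  -c your-container --requests=cpu=100m,memory=256Mi --limits=cpu=500m,memory=512Mi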