When working with scaled Kubernetes deployments, unexpected Pod recreations can be particularly frustrating because the terminated Pods disappear completely from the system. Unlike container restarts (which preserve the Pod), complete Pod recreations wipe all evidence of the previous instance.
Here are the most effective ways to investigate Pod recreations:
# Check recent events cluster-wide (only events from roughly the last hour are retained by default)
kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp'
# Pod-related events in a specific namespace
kubectl get events -n your-namespace --field-selector involvedObject.kind=Pod --sort-by='.metadata.creationTimestamp'
For long-term monitoring, implement this solution to capture events before they expire (Kubernetes only retains events for 1 hour by default):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: event-exporter
  template:
    metadata:
      labels:
        app: event-exporter
    spec:
      # The default ServiceAccount usually cannot list events cluster-wide;
      # the name below is an assumption and matches the RBAC sketch that follows.
      serviceAccountName: event-exporter
      containers:
      - name: event-exporter
        image: bitnami/kubectl
        command: ["/bin/sh"]
        args: ["-c", "while true; do kubectl get events -A --field-selector involvedObject.kind=Pod -o json >> /events/events.log; sleep 300; done"]
        volumeMounts:
        - name: events-volume
          mountPath: /events
      volumes:
      - name: events-volume
        emptyDir: {}  # ephemeral: lost with this pod; use a PVC or ship the log elsewhere for a durable history
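Note that kubectl inside that pod needs RBAC permission to list events cluster-wide. A minimal sketch, assuming the event-exporter ServiceAccount name used above and the default namespace (adjust both to your setup):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: event-exporter        # assumed name, must match serviceAccountName above
  namespace: default          # assumed namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: event-exporter-read-events
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-exporter-read-events
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: event-exporter-read-events
subjects:
- kind: ServiceAccount
  name: event-exporter
  namespace: default
Longer term, a dedicated event exporter that ships events to your logging backend is more robust than this shell loop, but the sketch above needs nothing beyond kubectl.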
From production experience, these are the most frequent culprits:
- Node pressure evictions (memory/disk pressure; see the quick check after this list)
- Cluster autoscaler operations
- Voluntary evictions during node drains (Pod Disruption Budgets bound how many happen at once, but do not prevent them)
- Manual kubectl delete operations
- Deployment rolling updates (even without configuration changes)
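A quick check for the eviction cases: evicted Pods linger with phase Failed and reason Evicted, and the eviction also shows up as an event while it is still within the retention window:
# Evicted pods that have not been cleaned up yet
kubectl get pods -A --field-selector=status.phase=Failed
# Eviction events (only within the event retention window, 1 hour by default)
kubectl get events -A --field-selector reason=Evicted --sort-by='.metadata.creationTimestamp'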
Create a PodDisruptionBudget to limit voluntary disruptions; its status (kubectl get pdb) also tells you how many disruptions are currently allowed:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
Enable and examine audit logs to track who/what deleted Pods:
# Sample audit policy configuration
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]
  verbs: ["delete"]
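The policy alone does nothing until the API server is started with audit flags. The flags below are the standard kube-apiserver options, but on managed clusters (EKS, GKE, AKS) audit logs are exposed through the provider instead; paths are illustrative:
# kube-apiserver flags (e.g. in the static pod manifest on self-managed control planes)
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log

# The log is JSON-lines; pull out pod deletions and who issued them
jq -c 'select(.verb=="delete" and .objectRef.resource=="pods")
       | {user: .user.username, ns: .objectRef.namespace, pod: .objectRef.name, time: .requestReceivedTimestamp}' \
  /var/log/kubernetes/audit.log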
# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
# Look for these critical conditions:
# MemoryPressure: True
# DiskPressure: True
# PIDPressure: True
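To reduce that to just the nodes currently reporting pressure, a jq filter along these lines works (a sketch; tweak the output format as needed):
kubectl get nodes -o json | jq -r '.items[]
  | .metadata.name as $node
  | .status.conditions[]
  | select((.type | test("Pressure")) and .status == "True")
  | "\($node): \(.type) since \(.lastTransitionTime)"'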
For proactive monitoring, set up alerts for these metrics:
- kube_pod_deletion_timestamp (exposed by kube-state-metrics)
- kube_pod_status_ready (for ready state transitions)
- kube_node_status_condition (for node pressure states)
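As a sketch of what the alert rules could look like, assuming kube-state-metrics is being scraped and you manage plain Prometheus rule files (alert names and thresholds are placeholders):
groups:
- name: pod-recreation-signals
  rules:
  - alert: NodeUnderPressure          # placeholder name
    expr: kube_node_status_condition{condition=~"MemoryPressure|DiskPressure|PIDPressure",status="true"} == 1
    for: 5m
    labels:
      severity: warning
  - alert: PodReadinessFlapping       # placeholder name and threshold
    expr: changes(kube_pod_status_ready{condition="true"}[30m]) > 4
    labels:
      severity: warning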
Even with the event-based checks above, some recreations still happen silently overnight with no obvious audit trail; unlike container crashes, which leave restart counts and logs behind, a full Pod recreation removes the Pod object itself.
Here are additional approaches to uncover the root cause:
# List pods that are not currently Running; fully recreated pods are gone, but stuck or failed replacements show up here
kubectl get pods --all-namespaces --field-selector=status.phase!=Running -o wide
# Sort by creation time to spot recently recreated pods, then drop the healthy ones
kubectl get pods --all-namespaces --sort-by=.metadata.creationTimestamp | grep -v Running
Several system components might hold clues about pod termination:
- Deployment revision history (for rollout triggers)
- Node system logs (for node pressure evictions)
- Horizontal Pod Autoscaler activity
- Cluster autoscaler decisions
- Custom webhooks or admission controllers
This helper script collects relevant forensic data:
#!/bin/bash
NAMESPACE="your-namespace"
DEPLOYMENT="your-deployment"

# Get deployment rollout history
kubectl rollout history "deployment/${DEPLOYMENT}" -n "${NAMESPACE}"

# Check for node pressure events
kubectl get events --sort-by=.metadata.creationTimestamp \
  --field-selector involvedObject.kind=Node

# Verify HPA status
kubectl describe hpa -n "${NAMESPACE}"

# Check for pod disruption budgets
kubectl get pdb -n "${NAMESPACE}"
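The script above leaves out the last two items from that list. Assuming the cluster autoscaler runs in-cluster under its usual name in kube-system (on managed clouds it may not be visible to you at all):
# Cluster autoscaler scale-down decisions (deployment name/namespace may differ in your setup)
kubectl -n kube-system logs deploy/cluster-autoscaler --since=24h | grep -iE "scale.?down|removing node"

# Admission webhooks that can mutate or reject pods
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations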
From production experience, these are frequent culprits:
| Cause | Detection Method | Example Command |
|---|---|---|
| Node Drain/Eviction | Node events | `kubectl get events -A \| grep -i drain` |
| Resource Starvation | Container status `OOMKilled` | `kubectl describe pod \| grep -i oom` |
| Anti-Affinity Conflicts | Pod scheduling failures | `kubectl get events \| grep -i failedscheduling` |
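For the resource-starvation row specifically, each container's last terminated state records the reason, which plain kubectl jsonpath can pull out across namespaces:
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oomkilled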
For persistent cases, consider these deeper investigations:
# Enable Kubernetes audit logging
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]
# Check kubelet logs on worker nodes
journalctl -u kubelet --since "24 hours ago" | grep -i killing
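The kubelet's eviction manager also logs why it reclaimed resources; the exact wording varies between versions, so a broader pattern is safer:
journalctl -u kubelet --since "24 hours ago" | grep -iE "evict|oom|pressure"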
Implement these practices to minimize unexpected recreations:
- Set appropriate resource requests/limits (see the sketch after this list)
- Configure pod disruption budgets
- Implement proper readiness gates
- Monitor cluster autoscaler metrics
- Review deployment update strategies
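For the first item, requests and limits sit on each container spec. A minimal sketch with placeholder names and an assumed /healthz endpoint; a readiness probe covers the everyday case, while readinessGates proper are a separate pod-spec field for external controllers:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-app                    # placeholder
spec:
  replicas: 3
  selector:
    matchLabels:
      app: your-app
  template:
    metadata:
      labels:
        app: your-app
    spec:
      containers:
      - name: app
        image: registry.example.com/your-app:1.0.0   # placeholder image
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            memory: "512Mi"         # size limits from observed usage, not guesses
        readinessProbe:
          httpGet:
            path: /healthz          # assumed health endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10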