When dealing with Linux systems, you might encounter processes marked as <defunct> in your process list. These are zombie processes - processes that have completed execution but still have an entry in the process table. The particular challenge comes when a zombie has parent PID 1 (init): there is no ordinary parent left to fix or restart, so the entry stays until init itself reaps it.
Running kill -9 on a zombie process has no effect:
# This won't work on zombie processes
kill -9 PID
kill -15 PID
Zombies are already dead processes - they're just entries waiting to be reaped by their parent. The kernel maintains them until the parent process reads their exit status.
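To see this for yourself, the sketch below creates a short-lived zombie on purpose and checks its state before and after SIGKILL. It assumes a Linux /proc filesystem and a sleep that accepts fractional seconds; the temporary file and the variable names are only illustrative.

```shell
#!/bin/sh
# Sketch: create a zombie on purpose and show that SIGKILL does not remove it.
tmp=$(mktemp)

# The inner shell starts a background child, records the child's PID, then
# replaces itself with `sleep 5`, which never calls wait() on that child.
sh -c 'sleep 0.3 & echo $! > "$0"; exec sleep 5' "$tmp" &
parent=$!
sleep 1                       # by now the child has exited and is unreaped
zombie=$(cat "$tmp")

# The third field of /proc/PID/stat is the state letter; Z means zombie.
state_before=$(awk '{print $3}' "/proc/$zombie/stat")

kill -9 "$zombie"             # kill succeeds, but nothing is left to terminate
sleep 0.2
state_after=$(awk '{print $3}' "/proc/$zombie/stat")

echo "zombie=$zombie before=$state_before after=$state_after"

kill "$parent"                # clean up the demo parent
rm -f "$tmp"
```

Both states should read Z. Once the demo parent is gone, the zombie is handed to init (or the nearest subreaper), which normally reaps it.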
Before attempting to clean up, verify the process state:
ps aux | grep defunct
ps -eo pid,ppid,stat,cmd | grep defunct
For our specific Bacula case, we can see:
[root@backup ~]# ps -ef | grep defunct
root 5825 1 0 Oct18 ? 00:00:00 [bacula-sd] <defunct>
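Grepping for "defunct" can also match the grep command itself. As an alternative, a small filter on the STAT column keeps only real zombies; zombies_only is a hypothetical helper name, not a standard tool.

```shell
# Keep the header line plus every row whose STAT column starts with Z.
# Expects input in the column order pid, ppid, stat, comm.
zombies_only() {
    awk 'NR == 1 || $3 ~ /^Z/'
}

ps -eo pid,ppid,stat,comm | zombies_only
```

The PPID column of the output tells you which process is responsible for reaping each zombie.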
Here are several approaches to handle zombie processes with parent PID 1:
Method 1: Kill the Parent Process (When Possible)
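If the parent isn't init (PID 1), terminating the parent makes init adopt the zombie and reap it. SIGHUP only works for this if the parent does not catch it (many daemons treat SIGHUP as "reload configuration"), so SIGTERM is the more reliable choice:
kill -TERM PPID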
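Before signalling anything, it helps to confirm who the parent actually is. A minimal sketch, assuming a Linux /proc filesystem; parent_of is a hypothetical helper, and the script uses its own PID as a stand-in for the zombie's PID.

```shell
# Print the parent PID of process $1, read from /proc/PID/stat.
# Everything up to the last ") " is stripped first, because the command
# name stored in that file may itself contain spaces.
parent_of() {
    sed 's/^.*) //' "/proc/$1/stat" | cut -d' ' -f2
}

target_pid=$$                 # placeholder: put the zombie's PID here
ppid=$(parent_of "$target_pid")
if [ "$ppid" -eq 1 ]; then
    echo "parent is init (PID 1); killing the parent is not an option"
else
    echo "parent is $ppid; signal that PID, not the zombie"
fi
```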
Method 2: Use SIGCHLD to Init
For parent PID 1 cases, try:
kill -CHLD 1
This asks init to run its SIGCHLD handler. It is harmless, but it rarely helps: a healthy init already reaps its children without being prompted.
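Method 3: Reboot (Last Resort)
The kernel offers no switch that reaps zombies on demand. The magic SysRq 'f' trigger (echo f > /proc/sysrq-trigger) that is sometimes suggested only invokes the OOM killer, which kills a live process and leaves zombies untouched, so avoid it here. A zombie holds nothing but a process-table slot, so it is usually safe to leave it in place until the next scheduled reboot.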
For Bacula specifically, consider these preventive measures:
# In bacula-sd.conf:
Maximum Concurrent Jobs = 20
Also make sure your own applications reap their children, either by calling wait()/waitpid() or by handling SIGCHLD, so that zombies are not created in the first place.
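In a shell script, the counterpart of waitpid() is the wait builtin. The sketch below starts a child that exits with an arbitrary status (3 is just an example value) and collects it, which is what removes the child's process-table entry.

```shell
#!/bin/sh
# Start a child in the background; it exits with status 3.
sh -c 'exit 3' &
child=$!

# wait blocks until the child has exited and collects its exit status.
# Without this call the child would linger as a zombie for as long as
# this script keeps running.
status=0
wait "$child" || status=$?
echo "child $child exited with status $status"
```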
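For persistent issues, use strace to see whether the parent ever calls wait(). Attach to the parent, not the zombie, because a zombie has no running code left to trace:
strace -p PPID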
Or examine kernel messages. The kernel does not log zombies as such, so look for blocked tasks instead:
dmesg | grep -i 'blocked for more than'
When you encounter a defunct process (also called a zombie process) that won't die even with SIGKILL (kill -9), it typically means the process has completed execution but still has an entry in the process table. This usually happens when:
- The parent process didn't properly wait() for its child
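- The parent process died and init, which adopted the child, has not collected its exit status
- The main thread has exited while other threads of the same process are still alive or blocked in the kernel; the process then shows as <defunct> yet still holds system resources (like in your case with TCP connections)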
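From your example, using kill -9 5825 didn't work because:
1. The process is already dead (zombie state)
2. The parent (init with PID 1) hasn't reaped it
3. There are lingering file descriptors (shown in your lsof output). A fully exited process has none, so this points to threads of the process that are still alive or stuck in the kernel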
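A true zombie has no open file descriptors, so when lsof still lists some it is worth checking whether other threads of the process are alive. A rough sketch, assuming a Linux /proc filesystem; thread_states is a hypothetical helper, and 5825 is the PID from the example above.

```shell
# Print "TID STATE" for every thread of process $1, read from /proc.
# State letters include Z (zombie), D (uninterruptible sleep), S (sleeping).
thread_states() {
    for t in /proc/"$1"/task/*; do
        [ -r "$t/stat" ] || continue
        # Strip everything up to the last ") "; the next field is the state.
        state=$(sed 's/^.*) //' "$t/stat" | cut -d' ' -f1)
        printf '%s %s\n' "${t##*/}" "$state"
    done
}

thread_states 5825
```

If the main thread shows Z while other thread IDs show D or S, the process is not fully dead, and the remaining threads have to finish or be unblocked before init can reap it.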
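Method 1: Ask Init to Re-execute (Do Not Kill It)
For processes with PPID 1 (init), we need a different approach. Init cannot be killed: the kernel discards any signal sent to PID 1 that init has not installed a handler for, including SIGKILL. What you can do is ask init to re-execute itself, which keeps the system running and gives it another chance to reap its children:
# SysV init (older RedHat releases)
sudo telinit u
# systemd
sudo systemctl daemon-reexec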
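Method 2: Release Resources Manually
From your lsof output, we see open TCP connections in CLOSE_WAIT state. CLOSE_WAIT means the remote side has closed the connection and the local process has not; the sockets disappear once the remaining threads of the process exit. /proc/net/tcp is read-only, so writing to it cannot close a connection.
# List all open TCP connections owned by bacula-sd
ss -tanp | grep "bacula-sd"
# List only the sockets in CLOSE_WAIT
ss -tan state close-wait
# On kernels built with CONFIG_INET_DIAG_DESTROY, ss -K can force matching sockets closed
sudo ss -K state close-wait dst REMOTE_IP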
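Method 3: Reap the Zombie via ptrace
Advanced method using gdb. Only the parent can reap a zombie, so attach to the parent (PPID), not to the zombie itself, and call waitpid() from there. Do not try this when the parent is PID 1: stopping init under a debugger can hang the whole system.
sudo gdb -p PPID
(gdb) call (int) waitpid(5825, 0, 0)
(gdb) detach
(gdb) quit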
For Bacula specifically, consider enabling the heartbeat so that dead peers are detected and their connections are closed:
# In bacula-sd.conf (Storage resource):
Heartbeat Interval = 1 minute
If using systemd (modern RedHat):
systemctl reset-failed bacula-sd
systemctl daemon-reload
systemctl restart bacula-sd
After cleanup, verify with:
ps aux | grep '\[bacula-sd\]'
ss -tanp | grep bacula
ls -la /proc/5825/fd 2>/dev/null || echo "Process cleaned"