How to Properly Clean Up and Kill a Zombie Process (Defunct) with Parent PID 1 in Linux



When dealing with Linux systems, you might encounter processes marked as <defunct> in your process list. These are zombie processes - processes that have completed execution but still have an entry in the process table. The particular challenge comes when these zombies have parent PID 1 (init), making them resistant to normal termination methods.

Running kill -9 on a zombie process has no effect, and neither does a polite SIGTERM:


# This won't work on zombie processes
kill -9 PID
kill -15 PID

Zombies are already dead processes - they're just entries waiting to be reaped by their parent. The kernel maintains them until the parent process reads their exit status.
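
To see this behaviour first-hand, the following is a minimal sketch that creates a throwaway zombie and sends it SIGKILL. It assumes a Linux system with procps `ps` and a `sleep` that accepts fractional seconds; the variable names are only illustrative.

```shell
#!/bin/sh
# Start a parent that never calls wait(): the subshell forks a short-lived
# child, then exec replaces the subshell with a long sleep.
(sleep 0.2 & exec sleep 30) &
parent=$!
sleep 1   # give the short-lived child time to exit

# The exited child is now a zombie under the non-waiting parent.
zombie=$(ps -o pid= --ppid "$parent" | tr -d ' ')
state_before=$(ps -o stat= -p "$zombie" | tr -d ' ')

# SIGKILL is accepted but changes nothing: there is no process left to kill.
kill -9 "$zombie"
state_after=$(ps -o stat= -p "$zombie" | tr -d ' ')
echo "before=$state_before after=$state_after"

# Clean up: once the parent dies, the zombie is handed to init and reaped.
kill "$parent"
wait "$parent" 2>/dev/null || true
```

Both states should start with Z, which shows that the signal was delivered to an entry that can no longer act on it.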

Before attempting to clean up, verify the process state:


# Match on the STAT column (Z = zombie) so that grep does not list itself
ps -eo pid,ppid,stat,cmd | awk 'NR == 1 || $3 ~ /^Z/'
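
If you need the zombie and its parent in a form that a script can consume, a small filter helps. This is a sketch; `filter_zombies` is a hypothetical helper name, and it assumes the column order `pid,ppid,stat,comm` passed to `ps` below.

```shell
#!/bin/sh
# Keep only rows whose STAT column (field 3) starts with Z, and print
# the zombie's PID, its parent's PID, and the command name.
filter_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Prints nothing when the system has no zombies.
ps -eo pid,ppid,stat,comm | filter_zombies
```

The second column of the output is the process you actually need to deal with, since only the parent can reap the entry.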

For our specific Bacula case, we can see:


[root@backup ~]# ps -ef | grep defunct
root      5825     1  0 Oct18 ?        00:00:00 [bacula-sd] <defunct>

Here are several approaches to handle zombie processes with parent PID 1:

Method 1: Kill the Parent Process (When Possible)

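If the parent isn't init (PID 1), stopping the parent is the usual fix: once it exits, its zombies are re-parented to init, which reaps them:


kill -HUP PPID     # many daemons only reload their configuration on SIGHUP
kill -TERM PPID    # ask the parent to exit if it is still running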

Method 2: Use SIGCHLD to Init

For parent PID 1 cases, try:


kill -CHLD 1

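Init reaps adopted children automatically, so this is at most a harmless nudge. If the entry persists afterwards, something else is keeping it alive, such as a thread of the same process that has not exited yet.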

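Method 3: Reboot (Last Resort)

As a last resort, reboot; this is the only step guaranteed to clear the process table. Do not use the magic SysRq f trigger (echo f > /proc/sysrq-trigger) for this: it invokes the OOM killer, which may kill an unrelated process and does not reap zombies.


shutdown -r now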

For Bacula specifically, consider these preventive measures:


# In bacula-sd.conf:
Maximum Concurrent Jobs = 20

Also ensure proper signal handling in your applications to prevent zombie creation.
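
As an illustration of what "proper handling" means, here is a minimal sketch of a wrapper that starts a child and always collects its exit status, so the child can never linger as a zombie. `run_and_reap` is a hypothetical name, not part of Bacula.

```shell
#!/bin/sh
# Start a command in the background, then wait() for it so the kernel
# can release its process-table entry as soon as it exits.
run_and_reap() {
    "$@" &
    child=$!
    wait "$child"
    status=$?
    echo "child $child exited with status $status"
    return "$status"
}

run_and_reap sh -c 'exit 3' || true
```

The same rule applies in any language: every fork needs a matching wait (or an explicit decision to ignore SIGCHLD).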

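For persistent issues, use strace on the parent to see whether it ever calls wait(). Attaching to the zombie itself fails because it no longer executes any code:


strace -f -e trace=wait4,waitid -p PPID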

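Or check the kernel log for tasks stuck in uninterruptible sleep, which can keep a defunct entry from being reaped (the kernel does not log zombies as such):


dmesg | grep -i 'blocked for more than'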

When you encounter a defunct process (also called a zombie process) that won't die even with SIGKILL (kill -9), it typically means the process has completed execution but still has an entry in the process table. This usually happens when:

  • The parent process didn't properly wait() for its child
  • The parent process died before collecting the child's exit status
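  • The entry is not a plain zombie: the main thread has exited while other threads of the process are still alive, often blocked in I/O. This is the likely explanation when lsof still shows open files or sockets (like in your case with TCP connections), since a true zombie holds none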

From your example, using kill -9 5825 didn't work because:

1. The process is already dead (zombie state)
2. The parent (init with PID 1) hasn't reaped it
3. lsof still lists file descriptors, so some thread of the process has probably not exited yet; a fully dead zombie holds none
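
One way to check the third point is to look at the per-thread state under /proc. The sketch below assumes a Linux /proc layout; `thread_states` is a hypothetical helper name.

```shell
#!/bin/sh
# Print "TID STATE" for every thread of the given PID, read from
# /proc/PID/task/TID/status (Z = zombie, D = uninterruptible sleep).
thread_states() {
    for t in /proc/"$1"/task/*; do
        [ -r "$t/status" ] || continue
        printf '%s %s\n' "${t##*/}" "$(awk '/^State:/ { print $2 }' "$t/status")"
    done
}

# Example on the current shell; substitute the defunct PID (e.g. 5825).
thread_states "$$"
```

If the main thread shows Z while another thread shows D, the entry cannot be reaped until that thread's I/O completes or fails, no matter which signals are sent.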

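Method 1: Signal the Parent (When Safe)

For processes with PPID 1 (init), killing the parent is not an option: the kernel discards signals that init has not installed a handler for, and a dead init means a kernel panic. The signals below are handled, but they do not terminate init:

# SIGTERM: systemd re-executes itself (like systemctl daemon-reexec);
# SysV init ignores it
sudo kill -15 1

# SIGHUP: systemd reloads its configuration; SysV init re-reads /etc/inittab
sudo kill -1 1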
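
How PID 1 reacts to a signal depends on which init system is running, so it is worth checking before sending anything. A minimal sketch, assuming /proc is mounted; `init_name` is a hypothetical helper and the advice strings are only suggestions.

```shell
#!/bin/sh
# Report the command name of PID 1 as the kernel sees it.
init_name() {
    cat /proc/1/comm
}

case "$(init_name)" in
    systemd) echo "systemd: prefer 'systemctl daemon-reexec' over raw signals" ;;
    init)    echo "init: could be SysV init, Upstart or BusyBox; check its man page" ;;
    *)       echo "PID 1 is '$(init_name)'; look up its signal handling first" ;;
esac
```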

Method 2: Release Resources Manually

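From your lsof output, we see open TCP connections in CLOSE_WAIT state. /proc/net/tcp is read-only and contains no PIDs, so sockets cannot be closed by writing to it. On recent kernels built with CONFIG_INET_DIAG_DESTROY, ss -K can ask the kernel to close them:

# List the daemon's TCP connections stuck in CLOSE_WAIT
ss -tanp state close-wait | grep "bacula-sd"

# CLOSE_WAIT sockets appear in /proc/net/tcp with state 08 (column 4)
awk '$4 == "08"' /proc/net/tcp

# Force close them (9103 is the default Storage daemon port; adjust to yours)
sudo ss -K state close-wait '( sport = :9103 )'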

Method 3: Reap the Zombie via ptrace

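Advanced method using gdb. Only the parent can reap a zombie, so attach to the parent (not to the zombie, which cannot be traced) and make it call waitpid() on the zombie's PID. Do not try this when the parent is PID 1:

sudo gdb -p PPID          # PID of the zombie's parent
(gdb) call (int) waitpid(5825, 0, 0)
(gdb) detach
(gdb) quit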

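For Bacula specifically, consider these configuration tweaks. Check each directive against the manual for your Bacula version (Heartbeat Interval is a documented Storage-resource directive) and validate the file with bacula-sd -t before restarting: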

# In bacula-sd.conf:
MaxConnections = 10
FDUseSocket = no
Heartbeat Interval = 1 minute

If using systemd (modern Red Hat releases):

systemctl reset-failed bacula-sd
systemctl daemon-reload
systemctl restart bacula-sd

After cleanup, verify with:

ps aux | grep '\[bacula-sd\]'
ss -tanp | grep bacula
ls -la /proc/5825/fd 2>/dev/null || echo "Process cleaned"
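
Reaping is not always immediate, so a one-shot check can give a false negative. The sketch below polls until the entry disappears; `wait_gone` is a hypothetical helper and the default timeout is arbitrary. A zombie still has a /proc/PID directory, so the loop only ends once the entry has really been reaped.

```shell
#!/bin/sh
# Return 0 once /proc/PID disappears, or 1 after TIMEOUT seconds.
wait_gone() {
    pid=$1
    timeout=${2:-10}
    while [ -d "/proc/$pid" ]; do
        [ "$timeout" -le 0 ] && return 1
        sleep 1
        timeout=$((timeout - 1))
    done
    return 0
}
```

For the example above you would run something like: wait_gone 5825 30 && echo "Process cleaned"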