How to Monitor and Kill Long-Running Processes by Name in Monit Without PID Files



When dealing with multiple process instances of the same executable, traditional monitoring approaches using PID files become problematic. Monit typically relies on PID files for process tracking, but when you have dynamically spawned processes (like worker processes or background tasks), you need a more flexible solution.

Monit can match processes by name and examine their runtime. Here's how to implement process killing based on runtime:

check process my_service
    matching "my_service_executable"
    if uptime > 2 minutes then exec "/bin/sh -c 'pgrep -f my_service_executable | xargs -r kill -9'"

Note that Monit's exec does not run its command through a shell, so the pgrep pipeline has to be wrapped in an explicit /bin/sh -c. This approach combines several Unix tools with Monit's capabilities:

  • matching - Monit's regular-expression matching against process names and command lines
  • uptime - tests how long the matched process has been running
  • pgrep -f - finds every PID whose full command line matches the pattern (dry-run it first, as shown below)
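
Before handing the kill to Monit, it helps to dry-run the pattern from a shell and confirm exactly which processes it would hit (assuming the procps versions of pgrep and ps):

# List PID plus full command line of everything the pattern matches
pgrep -af 'my_service_executable'

# Show elapsed runtime per match, to sanity-check the uptime threshold
for pid in $(pgrep -f 'my_service_executable'); do
    ps -o pid=,etime=,args= -p "$pid"
done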

For more complex scenarios, you might want to combine runtime with CPU usage. Monit evaluates each if test independently and does not support joining tests with an and in a single rule, so the closest native approximation is a pair of separate rules (the script approach below gives you true compound conditions):

check process my_service
    matching "my_service_executable"
    if cpu > 90% for 5 cycles then alert
    if uptime > 30 minutes then exec "/usr/bin/pkill -f my_service_executable"
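
After editing the control file, you can have Monit validate the syntax before reloading the running daemon:

# Syntax-check the control file, then reload
monit -t
monit reload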

For more control, you could create a custom script and have Monit execute it:

#!/bin/bash
# kill_long_running.sh - kill processes matching $1 that have run longer than $2 minutes
PROCESS_NAME=$1
MAX_MINUTES=$2

# ps reports etime as [[DD-]HH:]MM:SS, so split on both "-" and ":"
# note: comm is truncated to 15 characters; use "args" if you need to match longer names
ps -eo pid,etime,comm | awk -v proc="$PROCESS_NAME" -v max="$MAX_MINUTES" \
'$3 ~ proc {
    n = split($2, t, /[-:]/)
    if (n == 2)      mins = t[1] + t[2]/60
    else if (n == 3) mins = t[1]*60 + t[2] + t[3]/60
    else             mins = t[1]*1440 + t[2]*60 + t[3] + t[4]/60
    if (mins > max) print $1
}' | xargs -r kill -9   # -r: do nothing when no PID qualifies
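
To test the script without killing anything, you can temporarily remove the final xargs stage so it only prints the PIDs it would select:

chmod +x /path/to/kill_long_running.sh

# With the trailing "| xargs -r kill -9" removed, this just lists candidate PIDs
/path/to/kill_long_running.sh my_service_executable 5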

Then in the Monit control file:

check program kill_long_running with path "/path/to/kill_long_running.sh my_service_executable 5"
    if status != 0 then alert

A few operational points are worth keeping in mind:
  • Test your matching pattern thoroughly - some processes might have similar names
  • Consider system load when killing multiple processes
  • Log your actions for debugging purposes (see the syslog sketch after this list)
  • For production systems, implement gradual restart strategies
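
On the logging point, one option (a sketch, not part of the original script) is a variant of kill_long_running.sh that replaces the xargs stage with a loop and records each kill to syslog via logger before issuing it:

#!/bin/bash
# kill_long_running_logged.sh (sketch) - same selection logic, every kill logged first
PROCESS_NAME=$1
MAX_MINUTES=$2

ps -eo pid,etime,comm | awk -v proc="$PROCESS_NAME" -v max="$MAX_MINUTES" \
'$3 ~ proc {
    n = split($2, t, /[-:]/)
    if (n == 2)      mins = t[1] + t[2]/60
    else if (n == 3) mins = t[1]*60 + t[2] + t[3]/60
    else             mins = t[1]*1440 + t[2]*60 + t[3] + t[4]/60
    if (mins > max) print $1
}' | while read -r pid; do
    logger -t kill_long_running "killing PID $pid (ran longer than ${MAX_MINUTES}m)"  # syslog entry
    kill -9 "$pid"
done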

When managing multiple instances of an application, we often encounter processes that enter a bad state and consume excessive CPU indefinitely. These runaway processes (often loosely called "zombies", although a true zombie is an already-dead process that consumes no CPU at all) typically don't write PID files, making them difficult to track and manage through conventional monitoring tools.

Monit provides powerful process matching functionality that goes beyond simple PID file checking. We can leverage its pattern matching to identify and manage rogue processes:

check process my_app
    matching "my_app_pattern"
    if uptime > 2 minutes then exec "/bin/sh -c 'pgrep -f my_app_pattern | xargs -r kill -TERM'"

For more precise control, we can combine multiple conditions to target specific process instances:

check process long_runners
    matching "^/usr/bin/my_app --daemon"
    if cpu > 90% for 5 cycles then alert
    if uptime > 120 seconds then exec "/usr/bin/pkill -f ^/usr/bin/my_app.--daemon"
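
Two details here: pgrep and pkill treat the -f pattern as an extended regular expression matched against the full command line, so the ^ anchor behaves as expected, and the dot in the exec pattern stands in for the space so no nested shell quoting is needed inside Monit's exec string. You can confirm what the anchored pattern catches from a shell:

# Unanchored: matches any command line containing the substring
pgrep -af 'my_app --daemon'

# Anchored: only command lines that begin with the daemon's full path
pgrep -af '^/usr/bin/my_app --daemon'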

When dealing with process trees or worker pools, we need a more sophisticated approach. Keep in mind that Monit's matching tracks only the first process that matches the pattern, so it is the pkill in the exec action, not Monit itself, that sweeps the whole pool:

check process worker_pool
    matching "worker_.*"
    every 2 cycles
    if totalmem > 500 MB for 3 cycles then exec "/usr/bin/pkill -f worker_"
    if uptime > 300 seconds then exec "/usr/bin/pkill -f worker_"
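
To see what the worker pattern actually covers, and how much memory each worker is holding, a quick shell check helps (ps reports rss in kilobytes):

# Count matching workers
pgrep -fc 'worker_'

# PID, resident memory and elapsed runtime for each worker
for pid in $(pgrep -f 'worker_'); do
    ps -o pid=,rss=,etime=,comm= -p "$pid"
done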

Always include safeguards to prevent accidental killing of critical processes:

check process critical_app
    matching "/opt/critical/app"
    if uptime > 3600 seconds then alert
    if uptime > 7200 seconds then exec "/usr/local/bin/safe_kill.sh /opt/critical/app"
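
The safe_kill.sh referenced above is not shown here; a minimal sketch of such a guard script, under the assumption that "safe" means TERM first with a grace period and a guard against signalling init, might look like:

#!/bin/bash
# safe_kill.sh (sketch) - graceful kill with escalation; one possible implementation,
# not necessarily the script the configuration above refers to
PATTERN="$1"

# Ask each matching process to shut down cleanly
pgrep -f "$PATTERN" | while read -r pid; do
    [ "$pid" -le 1 ] && continue    # never signal PID 1
    kill -TERM "$pid"
done

sleep 10    # grace period for clean shutdown

# Escalate only for survivors
pgrep -f "$PATTERN" | xargs -r kill -9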

For complex scenarios, consider combining Monit with a simple watchdog script:

#!/bin/bash
# watchdog.sh - kill instances of $PROCESS running longer than MAX_UPTIME seconds
PROCESS="target_app"
MAX_UPTIME=120

# grep -v grep keeps the pipeline's own grep process out of the match list
ps -eo pid,etime,args | grep "$PROCESS" | grep -v grep | while read -r pid etime cmd; do
    # Convert ps etime ([[DD-]HH:]MM:SS) to seconds
    seconds=$(echo "$etime" | awk -F'[-:]' '{ if (NF == 2) print $1*60 + $2; else if (NF == 3) print $1*3600 + $2*60 + $3; else print $1*86400 + $2*3600 + $3*60 + $4 }')
    if [ "$seconds" -gt "$MAX_UPTIME" ]; then
        kill -9 "$pid"
    fi
done
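
Before wiring the script into Monit, the etime conversion can be sanity-checked by hand with a few hypothetical values (expected output: 321, 3723, 93784):

for t in "05:21" "1:02:03" "1-02:03:04"; do
    echo "$t" | awk -F'[-:]' '{ if (NF == 2) print $1*60 + $2; else if (NF == 3) print $1*3600 + $2*60 + $3; else print $1*86400 + $2*3600 + $3*60 + $4 }'
done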

Then configure Monit to run this script periodically:

check program watchdog with path "/usr/local/bin/watchdog.sh"
    every 5 cycles
    if status != 0 then alert