Nagios Plugin Development: Checking Service Process Status with check_procs



When managing enterprise infrastructure, ensuring critical services like sendmail, xinetd, automount, and samba remain operational is paramount. While these services are generally stable, unexpected crashes can occur, making monitoring essential.

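The standard Nagios Plugins package, usually installed alongside Nagios Core, includes a generic plugin called check_procs that counts running processes matching criteria such as command name (-C) or argument string (-a). A reusable command definition looks like this: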

define command {
    command_name    check_procs
    command_line    $USER1$/check_procs -w $ARG1$ -c $ARG2$ -C $ARG3$
}
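To make the macro substitution concrete, here is a rough Python sketch of how a check_command value is split on "!" and slotted into $ARG1$, $ARG2$, and so on. It only imitates what Nagios does internally: expand_command is a hypothetical helper, and the $USER1$ path is an assumed value (it is normally set in resource.cfg).

```python
def expand_command(command_line, check_command, user1="/usr/local/nagios/libexec"):
    """Imitate Nagios macro expansion for a single check_command value.

    Hypothetical helper for illustration only; Nagios performs this step itself.
    The default for user1 is an assumed plugin directory.
    """
    # "check_procs!1:1!1:!sendmail" -> name "check_procs", args ["1:1", "1:", "sendmail"]
    name, *args = check_command.split("!")
    expanded = command_line.replace("$USER1$", user1)
    for position, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{position}$", value)
    return name, expanded


template = "$USER1$/check_procs -w $ARG1$ -c $ARG2$ -C $ARG3$"
command_name, shell_line = expand_command(template, "check_procs!1:1!1:!sendmail")
print(shell_line)
```

The printed line is the command Nagios would end up executing for that service, which is handy to run by hand when a check behaves unexpectedly.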

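Here's how to configure service checks for various critical processes. In these examples the warning range 1:1 flags any count other than exactly one process, and the critical range 1: flags a count of zero. A bare 1 as the critical threshold would only go critical above one process, so a dead service would merely raise a warning. Widen the warning range for daemons that normally run several processes.

Monitoring Sendmail

define service {
    use                  generic-service
    host_name            mail-server
    service_description  Sendmail Process
    check_command        check_procs!1:1!1:!sendmail
}

Checking Samba Services

define service {
    use                  generic-service
    host_name            file-server
    service_description  Samba Process
    check_command        check_procs!1:1!1:!smbd
}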

For more specific scenarios, you can define dedicated commands with the thresholds and match criteria built in:

# Check OpenVPN: warn unless exactly one process, critical if none
define command {
    command_name    check_openvpn
    command_line    $USER1$/check_procs -w 1:1 -c 1: -C openvpn
}

# Check ClamAV by matching the argument string instead of the command name
define command {
    command_name    check_clamd
    command_line    $USER1$/check_procs -w 1:1 -c 1: -a '/usr/sbin/clamd'
}

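For services that spawn multiple processes (like xinetd), widen the warning range and make sure the critical range contains it. Here 3 to 5 processes is OK, any other non-zero count is a WARNING, and zero is CRITICAL:

define service {
    use                  generic-service
    host_name            network-server
    service_description  Xinetd Processes
    check_command        check_procs!3:5!1:!xinetd
}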
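The -w and -c arguments use the range notation from the Nagios plugin guidelines: an alert fires when the count falls outside the range. The sketch below shows that rule in Python, assuming the standard notation; parse_range, outside, and evaluate are illustrative names, and the inverted "@" form is left out for brevity.

```python
import math


def parse_range(spec):
    """Parse 'N', 'N:', ':M', '~:M' or 'N:M' into a (low, high) pair."""
    if ":" not in spec:
        return 0.0, float(spec)  # a bare "5" means 0..5
    low, high = spec.split(":", 1)
    if low == "~":
        low_value = -math.inf  # "~" means no lower bound
    else:
        low_value = float(low) if low else 0.0
    high_value = float(high) if high else math.inf  # "N:" means no upper bound
    return low_value, high_value


def outside(value, spec):
    """True when value lies outside the range, i.e. when an alert fires."""
    low, high = parse_range(spec)
    return value < low or value > high


def evaluate(count, warning, critical):
    """Return the Nagios status code; critical takes precedence over warning."""
    if outside(count, critical):
        return 2
    if outside(count, warning):
        return 1
    return 0
```

With these rules, evaluate(0, "1:1", "1:") is CRITICAL, whereas evaluate(0, "1:1", "1") is only a WARNING, because a bare "1" accepts any count from 0 to 1.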

For services not easily detectable via process name, consider writing custom plugins:

#!/bin/bash
# check_mcafee.sh
if pgrep -x "uvscan" >/dev/null
then
    echo "OK: McAfee processes running"
    exit 0
else
    echo "CRITICAL: McAfee not running"
    exit 2
fi

When monitoring numerous processes, consider:

  • Combining related checks into single service definitions
  • Using process group monitoring instead of individual checks
  • Adjusting check intervals based on service criticality
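As a rough illustration of the first point, a single plugin run can verify several related processes and report one combined status. This is a minimal sketch, not a drop-in plugin: running_process_names and combined_status are hypothetical helpers, the process list comes from POSIX `ps -e -o comm=`, and the smbd/nmbd pair is only an example.

```python
import subprocess
from collections import Counter


def running_process_names():
    """Return the command names of all processes via POSIX `ps -e -o comm=`."""
    output = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in output.splitlines() if line.strip()]


def combined_status(running_names, required):
    """Compare observed process counts against a {name: minimum} mapping.

    Returns (exit_code, message) using Nagios conventions: 0 OK, 2 CRITICAL.
    """
    counts = Counter(running_names)
    missing = [
        f"{name} ({counts[name]}/{minimum})"
        for name, minimum in required.items()
        if counts[name] < minimum
    ]
    if missing:
        return 2, "CRITICAL: too few processes: " + ", ".join(missing)
    return 0, "OK: " + ", ".join(f"{name}={counts[name]}" for name in required)
```

A plugin built on this would call combined_status(running_process_names(), {"smbd": 1, "nmbd": 1}), print the message, and exit with the returned code. The trade-off is that one service definition now covers several daemons, so the alert text has to name the missing one.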

When managing critical infrastructure, simply checking if a service is listening on a port isn't always sufficient. Many essential services like sendmail, xinetd, or openvpn can appear "up" while their actual processing components might have crashed. This is where process-level monitoring becomes crucial.

The Nagios Exchange hosts specialized plugins for popular services, but many critical yet less common services have no dedicated plugin. check_procs covers basic checks, but it can require long command lines and has no service-specific logic.

Here's a Python-based solution that provides more flexibility than the standard check_procs:

#!/usr/bin/env python3
"""Nagios plugin: count matching processes and compare the count to thresholds."""
import argparse
import os
import sys

import psutil  # third-party package: pip install psutil

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_process(process_name, min_count=1, max_count=None, exact_match=False):
    # Skip this script itself: its command line contains the search term
    # and would otherwise inflate a substring match by one.
    own_pid = os.getpid()
    needle = process_name.lower()
    count = 0
    for proc in psutil.process_iter(['name', 'cmdline']):
        if proc.pid == own_pid:
            continue
        # Attributes the current user may not read come back as None
        name = proc.info['name'] or ''
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if exact_match:
            if name == process_name:
                count += 1
        elif needle in cmdline.lower():
            count += 1

    if count < min_count:
        print(f"CRITICAL: Only {count} {process_name} processes (needs at least {min_count})")
        return CRITICAL
    # There is no upper limit unless --max was given
    if max_count is not None and count > max_count:
        print(f"WARNING: {count} {process_name} processes (should be at most {max_count})")
        return WARNING
    print(f"OK: {count} {process_name} processes running")
    return OK


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Check the number of running processes")
    parser.add_argument("-p", "--process", dest="process_name", required=True, help="Process name to check")
    parser.add_argument("-m", "--min", dest="min_count", default=1, type=int, help="Minimum required processes")
    parser.add_argument("-M", "--max", dest="max_count", default=None, type=int, help="Maximum allowed processes (default: no limit)")
    parser.add_argument("-e", "--exact", action="store_true", dest="exact_match", help="Exact process name match")

    try:
        options = parser.parse_args()
    except SystemExit as exc:
        # argparse exits with 2 on usage errors, which Nagios would read as
        # CRITICAL; report UNKNOWN instead (--help still exits with 0)
        sys.exit(UNKNOWN if exc.code else OK)

    sys.exit(check_process(
        options.process_name,
        options.min_count,
        options.max_count,
        options.exact_match
    ))

Key points of this script:

  • Flexible matching: Can match by exact process name or search command lines
  • Count thresholds: Set minimum and maximum allowed process counts
  • Python-based: Easier to maintain and extend than shell scripts
  • psutil library: More reliable than parsing ps output

Here are some example Nagios command definitions for common services:

# Check for at least 1 sendmail process
define command {
    command_name    check_sendmail_process
    command_line    /usr/local/nagios/libexec/check_process.py -p sendmail -e
}

# Check for exactly 3 clamd processes
define command {
    command_name    check_clamd_process
    command_line    /usr/local/nagios/libexec/check_process.py -p clamd -m 3 -M 3
}

# Check for samba processes (non-exact match)
define command {
    command_name    check_smbd_process
    command_line    /usr/local/nagios/libexec/check_process.py -p smbd -m 1
}

For environments with thousands of processes:

  • Cache the process list if checking multiple services
  • Consider running the check less frequently for non-critical services
  • On Linux systems, /proc parsing might be faster than psutil
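The caching and /proc ideas can be combined: read the process table once, then answer several lookups from that snapshot. The following is a Linux-only sketch under stated assumptions; snapshot_process_names is a hypothetical helper, and whether it actually beats psutil on a given host should be measured rather than assumed.

```python
import os
from collections import Counter


def snapshot_process_names(proc_root="/proc"):
    """Read every <proc_root>/<pid>/comm once and count the process names.

    proc_root is a parameter only so the function can be pointed at a test
    directory; on a real Linux system the default is what you want.
    """
    names = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories describe processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                names[handle.read().strip()] += 1
        except OSError:
            continue  # the process exited between listdir() and open()
    return names
```

A wrapper that checks several services would call snapshot_process_names() once and then look up each name, for example snapshot["smbd"] and snapshot["nmbd"]. Note that Linux truncates the comm value to 15 characters, so longer process names need to be compared in their truncated form.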

For production environments, you might also consider:

  1. Systemd-based checks using systemctl is-active
  2. Supervisor process monitoring
  3. Kernel audit system integration
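
For the first option, a plugin can simply wrap systemctl is-active and translate its exit status into a Nagios state. This is a minimal sketch, assuming systemctl is on the PATH of the Nagios user; check_unit is a hypothetical helper, and the systemctl parameter exists only so the binary can be swapped out.

```python
import subprocess


def check_unit(unit, systemctl="systemctl"):
    """Map `systemctl is-active --quiet <unit>` to (exit_code, message).

    is-active exits with 0 when the unit is active and non-zero otherwise.
    """
    try:
        result = subprocess.run([systemctl, "is-active", "--quiet", unit], check=False)
    except OSError as error:
        # systemctl is missing or not executable: the state is unknown
        return 3, f"UNKNOWN: cannot run {systemctl}: {error}"
    if result.returncode == 0:
        return 0, f"OK: {unit} is active"
    return 2, f"CRITICAL: {unit} is not active (systemctl exit code {result.returncode})"
```

A plugin script would call check_unit with a unit name such as sendmail.service (an example name), print the message, and exit with the returned code. Unlike a process count, this reflects systemd's own view of the unit, so it does not notice processes started outside systemd.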