Top Nagios Plugins for Infrastructure Monitoring: Essential NRPE and Performance Check Tools



When expanding Nagios 3 setups, start with check_load, the most fundamental performance-monitoring plugin. This NRPE-compatible tool reports the 1-, 5-, and 15-minute system load averages and checks each one against configurable warning and critical thresholds.
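To make the threshold semantics concrete, here is a minimal Python sketch of the comparison check_load performs: each load average is compared with its own warning and critical value, and the most severe result wins. The comma-separated triplet format mirrors check_load's -w WLOAD1,WLOAD5,WLOAD15 style; the function names are hypothetical and this is not check_load's source.

```python
import os

OK, WARNING, CRITICAL = 0, 1, 2


def parse_triplet(spec):
    """Parse a 'load1,load5,load15' threshold string into three floats."""
    values = [float(v) for v in spec.split(",")]
    if len(values) != 3:
        raise ValueError(f"expected three comma-separated values, got {spec!r}")
    return values


def evaluate_load(loads, warn_spec, crit_spec):
    """Return the worst state across the 1-, 5- and 15-minute load averages."""
    warn, crit = parse_triplet(warn_spec), parse_triplet(crit_spec)
    state = OK
    for load, w, c in zip(loads, warn, crit):
        if load > c:
            state = max(state, CRITICAL)
        elif load > w:
            state = max(state, WARNING)
    return state


if __name__ == "__main__":
    # os.getloadavg() returns the (1, 5, 15)-minute averages on Unix-like systems
    print(evaluate_load(os.getloadavg(), "5,4,3", "10,8,6"))
```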

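No infrastructure monitoring is complete without proper disk capacity tracking. The check_disk plugin is typically run through NRPE with a warning threshold, a critical threshold, and a path; passing these as arguments requires dont_blame_nrpe=1 on the remote NRPE daemon: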

define command {
    command_name    check_nrpe_disk
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk -a '-w $ARG1$ -c $ARG2$ -p $ARG3$'
}
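With percentage thresholds, check_disk -w 20% -c 10% warns when less than 20% of the filesystem is free and goes critical below 10%. A minimal Python sketch of that logic using the standard library's shutil.disk_usage (an illustration with hypothetical helper names, not check_disk's implementation):

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2


def free_percent(path="/"):
    """Percentage of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.free / usage.total


def evaluate_free(free_pct, warn_pct, crit_pct):
    """Alert when free space drops below the thresholds (crit_pct < warn_pct)."""
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK


if __name__ == "__main__":
    pct = free_percent("/")
    print(f"{pct:.1f}% free, state {evaluate_free(pct, 20, 10)}")
```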

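check_mem.pl (available on MonitoringExchange) provides more detailed memory analysis than basic free/used metrics. In this example, -u checks used memory, -C counts OS caches as free memory, and the check warns at 90% and goes critical at 95% used:

./check_mem.pl -u -w 90 -c 95 -C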
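The effect of counting caches as free can be sketched in Python by reading /proc/meminfo on Linux. This is an illustrative approximation, not check_mem.pl's code; MemTotal, MemFree, Buffers, and Cached are standard Linux meminfo fields (values in kB), and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Turn /proc/meminfo contents into a {field: value_in_kB} dict."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def used_percent(info, count_cache_as_free=True):
    """Percentage of RAM in use, optionally treating buffers/page cache as free."""
    free = info["MemFree"]
    if count_cache_as_free:
        free += info.get("Buffers", 0) + info.get("Cached", 0)
    return 100.0 * (info["MemTotal"] - free) / info["MemTotal"]


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            print(f"{used_percent(parse_meminfo(fh.read())):.1f}% used")
    except FileNotFoundError:
        print("/proc/meminfo not available (non-Linux system)")
```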

check_iftraffic stands out for interface monitoring with these capabilities:

  • Bandwidth thresholding
  • Error rate detection
  • Multi-interface support
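Traffic checks of this kind generally work by sampling an interface's octet counters twice and dividing the difference by the elapsed time. A minimal Python sketch of that arithmetic, assuming 32-bit counters (such as SNMP's ifInOctets) that can wrap at most once between samples; it is not check_iftraffic's code and the names are hypothetical:

```python
COUNTER32_WRAP = 2**32


def counter_delta(previous, current, wrap=COUNTER32_WRAP):
    """Difference between two counter samples, allowing one wrap-around."""
    if current >= previous:
        return current - previous
    return wrap - previous + current


def utilization_percent(prev_octets, cur_octets, seconds, link_bps):
    """Average link utilization over the sampling interval, in percent."""
    bits = counter_delta(prev_octets, cur_octets) * 8
    return 100.0 * bits / (seconds * link_bps)


if __name__ == "__main__":
    # 125,000,000 octets in 10 s on a 1 Gbit/s link
    print(utilization_percent(0, 125_000_000, 10, 1_000_000_000))
```

Using 64-bit counters (ifHCInOctets) avoids most wrap handling on fast links; pass wrap=2**64 in that case.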

For MySQL environments, check_mysql_health provides over 50 specialized checks:

define service {
    use                 generic-service
    host_name           db-server
    service_description MySQL Connections
    check_command       check_nrpe!check_mysql_health!--mode connection-usage
}
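Connection usage is usually expressed as the share of max_connections currently in use. A small Python sketch of that ratio, assuming you have already fetched MySQL's Threads_connected status variable and max_connections system variable (for example via SHOW GLOBAL STATUS and SHOW VARIABLES); the thresholds and function names are illustrative, not check_mysql_health's implementation.

```python
def connection_usage(threads_connected, max_connections):
    """Percentage of the configured connection limit currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * threads_connected / max_connections


def usage_state(pct, warn=80.0, crit=95.0):
    """Map a usage percentage to a Nagios state (0=OK, 1=WARNING, 2=CRITICAL)."""
    if pct >= crit:
        return 2
    if pct >= warn:
        return 1
    return 0


if __name__ == "__main__":
    pct = connection_usage(threads_connected=120, max_connections=151)
    print(f"{pct:.1f}% of max_connections in use, state {usage_state(pct)}")
```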

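check_procs offers granular process monitoring, including zombie detection (-s Z) and resource-usage tracking. Zombies and the total process count need very different thresholds, so define them as separate commands:

command[check_total_procs]=/usr/lib/nagios/plugins/check_procs -w 400 -c 600
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z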
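check_procs -s Z counts processes whose state is Z. On Linux the same information is exposed in /proc/<pid>/stat, so a minimal, Linux-specific Python sketch of a zombie count looks like this (illustrative only, not how check_procs is implemented):

```python
import glob


def process_state(stat_line):
    """Extract the one-letter state from a /proc/<pid>/stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or ')', so split after the *last* ')'.
    """
    after_comm = stat_line[stat_line.rindex(")") + 2:]
    return after_comm.split()[0]


def count_zombies():
    """Count processes currently in state 'Z' (returns 0 where /proc is absent)."""
    zombies = 0
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                if process_state(fh.read()) == "Z":
                    zombies += 1
        except OSError:
            continue  # the process exited between glob() and read()
    return zombies


if __name__ == "__main__":
    print(f"{count_zombies()} zombie process(es)")
```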

When creating custom NRPE plugins, follow these best practices:

  1. Implement proper exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)
  2. Include performance data after a pipe (|) in the output
  3. Use the standard threshold syntax: -w 80 -c 90
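The first two practices can be sketched as a tiny Python helper. The output shape follows the Nagios plugin development guidelines: status text, then a "|", then space-separated 'label'=value[UOM];warn;crit;min;max items, with exit codes 0 to 3 including UNKNOWN. The helper names are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def perfdata(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    """Format one item as 'label'=value[UOM];[warn];[crit];[min];[max]."""
    fields = [f"'{label}'={value}{uom}", str(warn), str(crit), str(minimum), str(maximum)]
    return ";".join(fields).rstrip(";")  # trailing empty fields may be dropped


def plugin_output(state, message, *perf_items):
    """Build the status line Nagios parses: text, then '|' and perfdata."""
    line = f"{STATE_NAMES[state]}: {message}"
    if perf_items:
        line += " | " + " ".join(perf_items)
    return line


if __name__ == "__main__":
    print(plugin_output(WARNING, "disk 85% used", perfdata("used", 85, "%", 80, 90, 0, 100)))
```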


In 15+ years of Nagios implementations, I've found that 80% of monitoring value comes from just 20% of plugins. The real power lies in combining essential system checks with custom business logic. Here's my battle-tested toolkit:

Why check_nrpe_plus is gold: the standard check_nrpe has limitations in SSL handling and timeout management, which this fork is meant to address.

# Sample command definition:
define command {
    command_name check_nrpe_plus
    command_line /usr/lib/nagios/plugins/check_nrpe_plus -H $HOSTADDRESS$ -t 30 -n -c $ARG1$ -a $ARG2$
}

Key advantage: Supports argument passing without messy sed hacks. I use it for custom DB checks:

./check_nrpe_plus -H mysql01 -c check_mysql_slave -a "--warning=10 --critical=30"
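On the remote side, arguments sent with -a are substituted into the $ARG1$, $ARG2$ (and so on) placeholders of the matching command[...] line in nrpe.cfg, which requires dont_blame_nrpe=1, and the daemon refuses arguments containing shell metacharacters. A simplified Python model of that substitution; the reject set below is illustrative and the function name is hypothetical, so this is a sketch rather than NRPE's C code.

```python
import re

# Illustrative reject list modeled on NRPE's shell-metacharacter filter;
# the real set depends on the NRPE version (nasty_metachars in nrpe.cfg).
NASTY_METACHARS = set("|`&><'\"\\[]{};\r\n")


def substitute_args(template, args):
    """Replace $ARG1$, $ARG2$, ... in a command template with client arguments."""
    for arg in args:
        bad = NASTY_METACHARS.intersection(arg)
        if bad:
            raise ValueError(f"argument {arg!r} contains forbidden characters {sorted(bad)}")

    def replace(match):
        index = int(match.group(1)) - 1
        # In this sketch, placeholders without a matching argument become empty.
        return args[index] if 0 <= index < len(args) else ""

    return re.sub(r"\$ARG(\d+)\$", replace, template)


if __name__ == "__main__":
    # Hypothetical remote nrpe.cfg line for the check_mysql_slave example
    template = "/usr/lib/nagios/plugins/check_mysql_slave $ARG1$"
    print(substitute_args(template, ["--warning=10 --critical=30"]))
```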

check_multi is the answer when several checks need to run together as one unit (e.g., during maintenance windows): it executes all of them and reports a single combined result:

# nrpe.cfg snippet:
command[check_apache_stack]=/usr/local/bin/check_multi -f /etc/nagios/apache_stack.cfg

# apache_stack.cfg (check_multi syntax: command [ tag ] = plugin):
command [ http_status ] = check_http -H localhost -u /server-status
command [ httpd_procs ] = check_procs -w 10:30 -c 5:50 -C httpd

Pro tip: Combine with check_disk and check_load for full service context.
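The core idea, running several plugins and collapsing their exit codes into one service state, can be sketched in Python. The severity order CRITICAL > WARNING > UNKNOWN > OK is an assumption for illustration; check_multi lets you define its own state rules, so treat this as a model rather than its actual behavior.

```python
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
# Assumed precedence, most severe first. check_multi's real state rules
# are configurable, so this ordering is illustrative only.
SEVERITY = [CRITICAL, WARNING, UNKNOWN, OK]


def run_check(command, timeout=30):
    """Run one plugin command line; return (state, first line of output)."""
    try:
        proc = subprocess.run(command, shell=True, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return UNKNOWN, "timed out"
    # Anything outside 0-3 is not a valid plugin state, so report UNKNOWN.
    state = proc.returncode if proc.returncode in SEVERITY else UNKNOWN
    first_line = proc.stdout.splitlines()[0] if proc.stdout.strip() else ""
    return state, first_line


def combined_state(states):
    """Return the most severe state present (OK for an empty list)."""
    for state in SEVERITY:
        if state in states:
            return state
    return OK


if __name__ == "__main__":
    results = [run_check(cmd) for cmd in ("echo 'OK: a'", "echo 'WARNING: b'; exit 1")]
    print(combined_state([state for state, _ in results]), results)
```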

check_logfiles goes beyond a simple tail with:

  • Native regex support
  • State retention across restarts
  • Multi-line pattern matching

define service {
    use                  generic-service
    host_name            appserver-*
    service_description  Error Log Monitor
    check_command        check_logfiles!--tag=apache_errors --logfile=/var/log/httpd/error_log --criticalpattern='(500|segmentation fault)' --warningpattern='(404|client denied)'
}
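The state-retention idea can be sketched in Python: remember the byte offset reached on the previous run, match only the new lines against a critical and a warning regex, and persist the offset. This is a simplified model with a hypothetical function and JSON state file, not check_logfiles' implementation, which also handles rotated archives and many more options.

```python
import json
import os
import re
import tempfile


def scan_log(logfile, state_file, critical, warning):
    """Scan lines appended since the last run and return (criticals, warnings).

    The byte offset reached is stored in `state_file`, so the next run (even
    after a restart) resumes where this one stopped. If the log is now shorter
    than the saved offset, it was truncated or rotated, so start from 0.
    """
    offset = 0
    if os.path.exists(state_file):
        with open(state_file) as fh:
            offset = json.load(fh).get("offset", 0)
    if os.path.getsize(logfile) < offset:
        offset = 0

    crit_re, warn_re = re.compile(critical), re.compile(warning)
    crit_hits, warn_hits = [], []
    with open(logfile, "rb") as fh:
        fh.seek(offset)
        for raw in fh:
            line = raw.decode("utf-8", errors="replace").rstrip("\n")
            if crit_re.search(line):
                crit_hits.append(line)
            elif warn_re.search(line):
                warn_hits.append(line)
        offset = fh.tell()

    with open(state_file, "w") as fh:
        json.dump({"offset": offset}, fh)
    return crit_hits, warn_hits


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log, state = os.path.join(tmp, "error_log"), os.path.join(tmp, "state.json")
        with open(log, "w") as fh:
            fh.write("startup ok\nGET /x 500\nclient denied by server configuration\n")
        patterns = (r"(500|segmentation fault)", r"(404|client denied)")
        print(scan_log(log, state, *patterns))
        print(scan_log(log, state, *patterns))  # second run: nothing new
```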

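check_icmp: the standard check_ping doesn't cut it for modern networks. check_icmp sends ICMP packets itself (so it must be installed setuid root) and thresholds on round-trip average and packet loss together, here warning at 50 ms or 2% loss and going critical at 100 ms or 5% loss:

./check_icmp -H router1 -w 50,2% -c 100,5%

Recent Monitoring Plugins releases also add jitter thresholds; run check_icmp --help on your version for the exact flags.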

Critical for VoIP and trading systems where jitter matters more than simple latency.
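Round-trip average, packet loss, and jitter are all derived from the same set of probe replies. A small Python sketch of the arithmetic, assuming you already have the per-packet RTTs in milliseconds; the jitter definition used here (mean absolute difference between consecutive RTTs) is one simple choice and may differ from what a given check_icmp version computes.

```python
from statistics import mean


def ping_stats(rtts_ms, sent):
    """Summarize one probe run; rtts_ms holds RTTs of the replies received.

    Jitter is the mean absolute difference between consecutive RTTs
    (RFC 3550 uses a smoothed variant of the same idea).
    """
    if sent <= 0 or len(rtts_ms) > sent:
        raise ValueError("sent must be positive and >= replies received")
    loss_pct = 100.0 * (sent - len(rtts_ms)) / sent
    rta = mean(rtts_ms) if rtts_ms else float("nan")
    diffs = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    jitter = mean(diffs) if diffs else 0.0
    return {"rta": rta, "loss": loss_pct, "jitter": jitter}


if __name__ == "__main__":
    print(ping_stats([10.2, 11.0, 35.4, 10.8], sent=5))
```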

When existing plugins don't fit, I use this Python template:

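#!/usr/bin/env python3
"""Minimal Nagios/NRPE plugin template."""
import argparse
import sys

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def get_metric():
    """Replace with your own check logic; must return a number."""
    raise NotImplementedError("implement get_metric()")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-w", "--warning", type=float, required=True)
    parser.add_argument("-c", "--critical", type=float, required=True)
    args = parser.parse_args()

    try:
        value = get_metric()
    except Exception as exc:  # a failed measurement is UNKNOWN, never OK
        print(f"UNKNOWN: {exc}")
        sys.exit(UNKNOWN)

    # Performance data after the pipe: label=value;warn;crit
    perf = f"value={value};{args.warning};{args.critical}"

    if value >= args.critical:
        print(f"CRITICAL: {value} exceeds threshold | {perf}")
        sys.exit(CRITICAL)
    elif value >= args.warning:
        print(f"WARNING: {value} exceeds threshold | {perf}")
        sys.exit(WARNING)
    else:
        print(f"OK: {value} within bounds | {perf}")
        sys.exit(OK)


if __name__ == "__main__":
    main()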

A few habits that keep custom plugins maintainable:

  • Version control all custom checks (git submodules work great)
  • Standardize on either Bash or Python for consistency
  • Implement a plugin test harness before production deployment
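A plugin test harness can start very small: run the plugin with known inputs and check the exit code and the status prefix of the first output line. A minimal Python sketch (function names are hypothetical; in practice you would call these from your test framework of choice):

```python
import subprocess
import sys

VALID_PREFIXES = {"OK", "WARNING", "CRITICAL", "UNKNOWN"}


def run_plugin(argv, timeout=10):
    """Run a plugin (argv list, no shell) and return (exit_code, stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout


def check_contract(argv, expected_code):
    """Raise AssertionError unless the plugin obeys the basic contract; return line 1."""
    code, out = run_plugin(argv)
    if code != expected_code:
        raise AssertionError(f"{argv}: exit {code}, expected {expected_code}")
    lines = out.splitlines()
    if not lines:
        raise AssertionError(f"{argv}: produced no output")
    prefix = lines[0].split(":", 1)[0]
    if prefix not in VALID_PREFIXES:
        raise AssertionError(f"{argv}: unexpected status prefix {prefix!r}")
    return lines[0]


if __name__ == "__main__":
    # Stand-in "plugin": a one-liner that behaves like a WARNING result.
    fake = [sys.executable, "-c", "import sys; print('WARNING: 85 | v=85'); sys.exit(1)"]
    print(check_contract(fake, expected_code=1))
```

Raising AssertionError explicitly, rather than using assert statements, keeps the checks active even when Python runs with -O.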