When expanding Nagios 3 setups, check_load consistently ranks as the most fundamental performance monitoring plugin. This NRPE-compatible tool reports system load averages over 1-, 5-, and 15-minute intervals and compares each against configurable thresholds.
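To make the per-interval threshold idea concrete, here is a minimal Python sketch of the comparison a load check performs. It is not the check_load source; the function names `evaluate_load` and `current_load` and the threshold values in the usage note are assumptions for illustration.

```python
import os

# Nagios plugin exit codes used in this sketch.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_load(loads, warn, crit):
    """Compare 1/5/15-minute load averages against per-interval thresholds.

    loads, warn and crit are 3-tuples; the worst state across intervals wins.
    """
    state = OK
    for load, warn_level, crit_level in zip(loads, warn, crit):
        if load >= crit_level:
            return CRITICAL
        if load >= warn_level:
            state = WARNING
    return state


def current_load():
    """Read the live load averages (Unix only)."""
    return os.getloadavg()
```

On a Unix host, `evaluate_load(current_load(), (4, 3, 2), (8, 6, 4))` would apply looser limits to the 1-minute figure than to the 15-minute one, which is a common way to tolerate short spikes.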
No infrastructure monitoring is complete without proper disk capacity tracking. The check_disk plugin can be called over NRPE with warning, critical, and partition arguments:
define command {
    command_name check_nrpe_disk
    command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk -a '-w $ARG1$ -c $ARG2$ -p $ARG3$'
}
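One detail that trips people up is that disk thresholds work on free space, so the alert fires when the value drops below the limit rather than above it. A rough Python sketch of that logic follows; `disk_state` and `check_path` are hypothetical helper names, and this is an illustration rather than how check_disk is implemented.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2


def disk_state(total_bytes, free_bytes, warn_free_pct, crit_free_pct):
    """Return (state, free_pct); alert when free space drops BELOW a threshold."""
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < crit_free_pct:
        return CRITICAL, free_pct
    if free_pct < warn_free_pct:
        return WARNING, free_pct
    return OK, free_pct


def check_path(path, warn_free_pct, crit_free_pct):
    """Apply disk_state() to a live mount point such as "/"."""
    usage = shutil.disk_usage(path)
    return disk_state(usage.total, usage.free, warn_free_pct, crit_free_pct)
```

For example, `check_path("/", 20, 10)` would warn below 20% free and go critical below 10% free.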
check_mem.pl (available on MonitoringExchange) provides detailed memory analysis beyond basic free/used metrics:
./check_mem.pl -u -w 90 -c 95 -C
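The reason to count caches as free is easier to see with numbers. The sketch below parses /proc/meminfo-style text and computes the used percentage both ways. The helper names (`parse_meminfo`, `used_percent`, `read_meminfo`) are my own, and the arithmetic is a simplified assumption, not a port of check_mem.pl.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            info[name.strip()] = int(rest.split()[0])
    return info


def used_percent(info, caches_as_free=True):
    """Percent of memory in use, optionally treating buffers/cache as free."""
    free = info["MemFree"]
    if caches_as_free:
        free += info.get("Buffers", 0) + info.get("Cached", 0)
    return 100.0 * (info["MemTotal"] - free) / info["MemTotal"]


def read_meminfo(path="/proc/meminfo"):
    """Load and parse the live file on a Linux host."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

A box with 1000 kB total, 200 kB free, and 300 kB in buffers and cache looks 80% used by the naive count but only 50% used once reclaimable caches are excluded.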
check_iftraffic stands out for interface monitoring with these capabilities:
- Bandwidth thresholding
- Error rate detection
- Multi-interface support
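Bandwidth thresholding boils down to turning two interface counter readings into a utilization percentage. Here is a small Python sketch of that calculation, including the counter wraparound that 32-bit octet counters hit on busy links. The function name `utilization_percent` and its parameters are assumptions for illustration, not part of check_iftraffic.

```python
def utilization_percent(prev_octets, cur_octets, seconds, link_bps, counter_bits=32):
    """Link utilization between two counter samples, as a percentage.

    The modulo handles a counter that wrapped past 2**counter_bits
    between the two samples (at most one wrap is assumed).
    """
    delta_octets = (cur_octets - prev_octets) % (2 ** counter_bits)
    bits_per_second = delta_octets * 8 / seconds
    return 100.0 * bits_per_second / link_bps
```

Sampling too rarely on a fast link lets the counter wrap more than once, which this sketch cannot detect; 64-bit counters avoid the problem where the device offers them.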
For MySQL environments, check_mysql_health provides over 50 specialized checks:
define service {
    use generic-service
    host_name db-server
    service_description MySQL Connections
    check_command check_nrpe!check_mysql_health!--mode connection-usage
}
check_procs offers granular process monitoring with zombie detection and resource usage tracking:
command[check_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
When creating custom NRPE plugins, follow these best practices:
- Implement proper exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)
- Include performance data after a | in the output line
- Use threshold syntax: -w 80 -c 90
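The three practices above fit together in a few lines. The sketch below parses Nagios-style threshold ranges (a bare number such as 90 alerts outside 0..90, "10:" alerts below 10, "@10:20" alerts inside the range), picks the exit code, and appends performance data. The function names are my own and the output wording is just one reasonable layout.

```python
import math

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def parse_range(spec):
    """Parse a threshold range string into (low, high, alert_inside)."""
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        low_text, high_text = "0", spec
    low = -math.inf if low_text == "~" else float(low_text or 0)
    high = math.inf if high_text == "" else float(high_text)
    return low, high, inside


def alerts(value, spec):
    """True when value should raise an alert for this range."""
    low, high, inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if inside else not in_range


def plugin_result(label, value, warn, crit, uom=""):
    """Return (exit_code, output_line) with perfdata after the pipe."""
    if alerts(value, crit):
        code = CRITICAL
    elif alerts(value, warn):
        code = WARNING
    else:
        code = OK
    perfdata = f"{label}={value}{uom};{warn};{crit}"
    return code, f"{STATE_NAMES[code]} - {label} is {value}{uom} | {perfdata}"
```

A real plugin would print the line and exit with the code, reserving UNKNOWN (3) for bad arguments or a metric that could not be read.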
In 15+ years of Nagios implementations, I've found that 80% of monitoring value comes from just 20% of plugins. The real power lies in combining essential system checks with custom business logic. Here's my battle-tested toolkit:
Why it's gold: The standard check_nrpe has limitations in SSL handling and timeout management, which the check_nrpe_plus fork addresses.
# Sample command definition:
define command {
    command_name check_nrpe_plus
    command_line /usr/lib/nagios/plugins/check_nrpe_plus -H $HOSTADDRESS$ -t 30 -n -c $ARG1$ -a $ARG2$
}
Key advantage: Supports argument passing without messy sed hacks. I use it for custom DB checks:
./check_nrpe_plus -H mysql01 -c check_mysql_slave -a "--warning=10 --critical=30"
When you need atomic execution of multiple checks (e.g., during maintenance windows), check_multi bundles them into a single service check:
# config.cfg snippet:
command[check_apache_stack] = /usr/local/bin/check_multi -f /etc/nagios/apache_stack.cfg
# apache_stack.cfg:
$ttl = 60
$command1 = check_http -H localhost -u /server-status
$command2 = check_procs -w 10:30 -c 5:50 -C httpd
Pro tip: Combine with check_disk and check_load for full service context.
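The core of any multi-check wrapper is collapsing several child results into one state. Here is a short Python sketch of a worst-state-wins rollup. This is my own illustration, not check_multi's code, and ranking UNKNOWN below WARNING is an assumption you may want to change.

```python
# Severity rank per exit code: OK < UNKNOWN < WARNING < CRITICAL (assumed order).
SEVERITY = {0: 0, 3: 1, 1: 2, 2: 3}
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def aggregate(results):
    """Collapse [(name, exit_code, message), ...] into (exit_code, summary)."""
    worst = max(
        (code for _, code, _ in results),
        key=lambda code: SEVERITY.get(code, 1),  # unexpected codes rank as UNKNOWN
        default=0,
    )
    failing = [name for name, code, _ in results if code != 0]
    summary = f"{STATE_NAMES.get(worst, 'UNKNOWN')} - {len(results)} checks, {len(failing)} not OK"
    if failing:
        summary += ": " + ", ".join(failing)
    return worst, summary
```

Naming the failing children in the summary line keeps the notification useful without opening the web UI.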
check_logfiles goes beyond a simple tail with:
- Native regex support
- State retention across restarts
- Multi-line pattern matching
define service {
    use generic-service
    host_name appserver-*
    service_description Error Log Monitor
    check_command check_logfiles!--tag=apache_errors --logfile=/var/log/httpd/error_log --criticalpattern='(500|segmentation fault)' --warningpattern='(404|client denied)'
}
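State retention is what separates a log check from a grep: each run should only look at lines written since the previous run. The Python sketch below shows the idea with a saved offset and the same two patterns. The function names and the one-number state file are assumptions for illustration; check_logfiles keeps richer state than this.

```python
import re


def scan_from_offset(data, offset, critical_re, warning_re):
    """Scan only the text appended since `offset`.

    Returns (new_offset, critical_hits, warning_hits). A saved offset larger
    than the data is treated as a rotated or truncated log and reset to 0.
    """
    if offset > len(data):
        offset = 0
    critical_hits = warning_hits = 0
    for line in data[offset:].splitlines():
        if re.search(critical_re, line):
            critical_hits += 1
        elif re.search(warning_re, line):
            warning_hits += 1
    return len(data), critical_hits, warning_hits


def load_offset(state_path):
    """Read the offset saved by the previous run, or 0 on first run."""
    try:
        with open(state_path) as handle:
            return int(handle.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_offset(state_path, offset):
    """Persist the offset so the next run skips lines already seen."""
    with open(state_path, "w") as handle:
        handle.write(str(offset))
```

The sketch reads the whole log as one string for simplicity; on large files you would seek to the saved offset instead.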
The standard ping check doesn't cut it for modern networks. This version of check_icmp adds packet loss, delay, and jitter thresholds:
./check_icmp -H router1 --loss=2,5 --delay=50,100 --jitter=20
Critical for VoIP and trading systems where jitter matters more than simple latency.
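To show why jitter and latency are different signals, here is a small Python sketch that derives both from a list of round-trip times. Jitter is taken as the mean absolute difference between consecutive samples, which is one common definition and not necessarily the one this plugin uses; the function names are hypothetical.

```python
def loss_percent(sent, received):
    """Packet loss as a percentage of packets sent."""
    return 100.0 * (sent - received) / sent


def mean_rtt(rtts_ms):
    """Average round-trip time in milliseconds."""
    return sum(rtts_ms) / len(rtts_ms)


def jitter(rtts_ms):
    """Mean absolute difference between consecutive round-trip times."""
    if len(rtts_ms) < 2:
        return 0.0
    diffs = [abs(later - earlier) for earlier, later in zip(rtts_ms, rtts_ms[1:])]
    return sum(diffs) / len(diffs)
```

A link that alternates between 10 ms and 50 ms has the same 30 ms average as a steady 30 ms link, but a jitter of 40 ms instead of 0, and that is the case a plain latency threshold misses.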
When existing plugins don't fit, I use this Python template:
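#!/usr/bin/env python3
"""Template for a custom Nagios/NRPE check."""
import argparse
import sys


def get_metric():
    """Placeholder: replace with your check logic and return a number."""
    raise NotImplementedError("implement get_metric() for your check")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-w", "--warning", type=float, required=True)
    parser.add_argument("-c", "--critical", type=float, required=True)
    args = parser.parse_args()

    try:
        value = get_metric()
    except Exception as err:
        # Anything unexpected maps to UNKNOWN (exit code 3).
        print(f"UNKNOWN: {err}")
        sys.exit(3)

    if value >= args.critical:
        print(f"CRITICAL: {value} exceeds threshold")
        sys.exit(2)
    elif value >= args.warning:
        print(f"WARNING: {value} exceeds threshold")
        sys.exit(1)
    else:
        print(f"OK: {value} within bounds")
        sys.exit(0)


if __name__ == "__main__":
    main()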
- Version control all custom checks (git submodules work great)
- Standardize on either Bash or Python for consistency
- Implement a plugin test harness before production deployment
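A test harness does not need to be elaborate. The Python sketch below runs a plugin as a subprocess and checks the basic contract: an exit code between 0 and 3 and a first output line that names the matching state. `FAKE_PLUGIN` is a stand-in one-liner so the example is self-contained, and requiring the state word in the output is my own house rule rather than a Nagios requirement.

```python
import subprocess
import sys

STATE_WORDS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Hypothetical stand-in for a real plugin path plus arguments.
FAKE_PLUGIN = [
    sys.executable,
    "-c",
    "print('WARNING: 85.0 exceeds threshold'); raise SystemExit(1)",
]


def run_plugin(argv, timeout=10):
    """Run a plugin and return (exit_code, stripped_stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout.strip()


def contract_violations(code, output):
    """Return a list of human-readable problems; empty means the plugin passed."""
    problems = []
    if code not in STATE_WORDS:
        problems.append(f"exit code {code} is outside 0-3")
    if not output:
        problems.append("plugin printed nothing on stdout")
    elif code in STATE_WORDS and STATE_WORDS[code] not in output.splitlines()[0]:
        problems.append(f"first line does not mention {STATE_WORDS[code]}")
    return problems
```

Swap `FAKE_PLUGIN` for the real command line and run the same two calls in CI for each threshold combination you care about.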