Optimizing Nagios Server Performance: Scalability Analysis for Service Checks on Resource-Constrained Systems



Your monitoring setup consists of:

  • 42 services across 8 hosts
  • Primary check type: check_http (5-min and 1-min intervals)
  • Concurrent Cacti operations: 6 hosts polled every minute

With 400MB RAM being the critical constraint, consider these metrics:

# Sample check_http memory usage (per process)
$ ps -o rss= -p "$(pgrep -d, -f check_http)" | awk '{sum+=$1} END {print sum/NR " kB"}'
1264 kB

# Total concurrent check processes during peak
$ pgrep -c -f 'check_'
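A rough concurrency ceiling follows directly: divide available memory by per-process RSS. A minimal sketch using the 1264 kB sample above (MemAvailable needs a reasonably modern kernel; substitute MemFree on older ones):

# How many ~1264 kB check processes fit in available RAM?
$ awk '/MemAvailable/ {print int($2/1264), "concurrent checks fit in RAM"}' /proc/meminfo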

Based on empirical data from production deployments:

Hardware                      Service Capacity   Recommended Interval
Single-core 2GHz + 1GB RAM    ~80 checks         5-minute baseline
Dual-core 2GHz + 2GB RAM      150-200 checks     1-minute for critical services
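As a sanity check on those figures, expected concurrency is roughly checks-per-second times average check duration. Assuming checks average about 10 seconds end to end (an assumption for illustration, not a measurement):

# 12 svc/min + 30 svc/5min = 0.3 checks/sec; × 10 s ≈ 3 checks in flight
$ echo "scale=1; (12/60 + 30/300) * 10" | bc
3.0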

Implement these tweaks before hardware upgrades:

# /etc/nagios/nagios.cfg
# Conservative concurrency cap for a 400MB box
max_concurrent_checks=12
# Reap check results every 2 seconds so they don't queue up
check_result_reaper_frequency=2
# Kill hung checks before they pile up
service_check_timeout=30
# Don't export macros to the environment; saves memory on every fork
enable_environment_macros=0
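After any config change, validate first and then watch check latency. nagiostats ships with Nagios, though binary paths vary by distro:

$ nagios -v /etc/nagios/nagios.cfg     # sanity-check the config before restarting
$ nagiostats | grep -i latency         # climbing latency means checks are queuing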

# Distributed monitoring example:
define host {
    host_name          remote_satellite
    address            192.168.1.100
    check_command      check_nrpe!load_satellite
    max_check_attempts 3    ; required directive unless inherited from a template
}
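The check_nrpe!load_satellite shorthand assumes a check_nrpe command object exists; the conventional definition ($USER1$ is the stock plugin-directory macro) looks like:

define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}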

For environments exceeding 50 checks:

  1. NRPE Satellites:
    # Sample NRPE configuration (passing $ARG$ values requires dont_blame_nrpe=1)
    command[check_http]=/usr/lib/nagios/plugins/check_http -I $ARG1$ -u $ARG2$
    
  2. Modular Setup:
    # Load distribution with NSCA (results arrive on stdin; see the example below)
    /usr/sbin/send_nsca -H nagios_master -c /etc/nagios/send_nsca.cfg
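send_nsca reads one result per line on stdin, tab-delimited as host, service description, return code, plugin output. A hand-rolled submission, with a hypothetical host web01 and service HTTP:

$ printf 'web01\tHTTP\t0\tHTTP OK - 0.12s response time\n' | \
    /usr/sbin/send_nsca -H nagios_master -c /etc/nagios/send_nsca.cfg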
    

Priority order for resource allocation:

  1. RAM upgrade to 2GB (immediate 400% capacity increase; verify swapping first, as shown below)
  2. SSD storage for check result processing
  3. Additional CPU cores for concurrent check processing
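Before spending on RAM (item 1), confirm the box is actually swapping; sustained nonzero si/so columns in vmstat indicate thrashing:

$ vmstat 1 5      # watch the si/so columns over five seconds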

To recap: your Nagios setup (2.0GHz CPU, 400MB RAM, RAID10) monitoring 42 services across 8 hosts at 5-minute intervals (some at 1-minute), plus Cacti polling 6 hosts every minute, is pushing the hardware's limits. Sustained load averages in the 4-6 range on a single core are a clear sign of resource contention.

# Example check_interval impact analysis:
Hosts: 8 | Services: 42
1-minute checks: 12 services → 720 checks/hour
5-minute checks: 30 services → 360 checks/hour
Total: 1080 checks/hour + Cacti (6 hosts × 60 = 360 polls/hour)
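To reproduce those totals in the shell:

$ echo "Nagios: $((12*60 + 30*12)) checks/hour, Cacti: $((6*60)) polls/hour"
Nagios: 1080 checks/hour, Cacti: 360 polls/hour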

If load stays high after the earlier tweaks, tune further in nagios.cfg:

# Increase check processing parallelism (raise the conservative value from
# above only after confirming memory headroom)
max_concurrent_checks=50
service_check_timeout=30
check_result_reaper_frequency=2

# Preserve state across restarts; note these defaults don't themselves reduce
# disk I/O (for that, see the tmpfs note below)
use_retained_program_state=1
interval_length=60
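What actually cuts check-result disk I/O on Nagios 3+ is putting the result spool on a RAM disk. The paths below are illustrative; match check_result_path to your install:

# nagios.cfg
check_result_path=/var/nagios/spool/checkresults

# /etc/fstab: small tmpfs for the spool (tighten mode, or add numeric uid=/gid=)
tmpfs  /var/nagios/spool/checkresults  tmpfs  size=16m,mode=1777  0  0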

For hardware-constrained environments, consider NSCA passive checks:

# On remote host:
define command {
    command_name    submit_to_nsca
    command_line    /usr/lib/nagios/plugins/check_dummy 0 "Result submitted" | /usr/sbin/send_nsca -H nagios.server -c /etc/nagios/send_nsca.cfg
}
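On the central server, the matching service disables active checks, accepts passive results, and flags stale ones. Names here are illustrative, and the freshness fallback assumes a check_dummy command object is defined:

define service {
    use                      generic-service
    host_name                remote_satellite
    service_description      HTTP
    active_checks_enabled    0
    passive_checks_enabled   1
    check_freshness          1
    freshness_threshold      600
    check_command            check_dummy!2!"No passive result in 10 minutes"
}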

If optimizations fail, prioritize upgrades in this order:

  • RAM: 400MB → 2GB (reduces swap thrashing)
  • CPU: 2.0GHz → 3.0GHz+ multi-core (helps parallel checks)
  • Storage: RAID10 SSD (faster check result processing)

A typical Nagios server handling 100+ services would have:

- 4GB RAM
- 4 CPU cores @ 2.5GHz+
- SSD storage
- Load averages between 1.5 and 3.0
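Load average is only meaningful relative to core count; a quick comparison (output illustrative for the single-core box above):

$ echo "load: $(cut -d' ' -f1 /proc/loadavg) / cores: $(nproc)"
load: 4.85 / cores: 1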

Use this bash script to simulate check loads:

#!/bin/bash
# Fire 100 short-lived dummy checks, staggered 0.1s apart, to approximate
# the fork/exec pattern of a busy Nagios scheduler.
for i in {1..100}; do
    /usr/lib/nagios/plugins/check_dummy 0 "Test $i" &
    sleep 0.1
done
wait    # block until all background checks have exited
echo "Load test completed"
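While it runs, a second terminal shows how far concurrency actually climbs:

$ watch -n1 'pgrep -c -f check_dummy'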