Optimal Warning and Critical Thresholds for check_load in Nagios: A Data-Driven Approach



The load average represents the system load over specific time intervals (1, 5, and 15 minutes). In Nagios monitoring, we need to set appropriate thresholds that reflect actual server capacity rather than arbitrary values.

A more scientific approach involves basing thresholds on the number of CPU cores:

# Formula:
# threshold_value = (cores * percentage) / 100
#
# Example for 4-core server:
# Warning at 90% of capacity for 1-minute load (the first value passed to -w):
# (4 * 90) / 100 = 3.6

# 4-core server thresholds
command[check_load]=/usr/local/nagios/libexec/check_load -w 3.6,3.2,2.8 -c 4.0,3.6,3.2
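
The formula above can be wrapped in a small helper so that the same arithmetic is applied to all six values. This is a minimal sketch: the function names `load_threshold` and `check_load_args` are hypothetical, and `awk` is used only to get one decimal place.

```shell
#!/bin/bash
# load_threshold CORES PERCENT
# Prints CORES * PERCENT / 100 with one decimal place.
load_threshold() {
    awk -v cores="$1" -v pct="$2" 'BEGIN { printf "%.1f\n", cores * pct / 100 }'
}

# check_load_args CORES W1 W5 W15 C1 C5 C15
# Takes the core count plus six percentages (warning and critical for the
# 1-, 5-, and 15-minute averages) and prints the -w/-c arguments.
check_load_args() {
    local cores=$1
    printf -- '-w %s,%s,%s -c %s,%s,%s\n' \
        "$(load_threshold "$cores" "$2")" "$(load_threshold "$cores" "$3")" \
        "$(load_threshold "$cores" "$4")" "$(load_threshold "$cores" "$5")" \
        "$(load_threshold "$cores" "$6")" "$(load_threshold "$cores" "$7")"
}

# 4-core server, 90/80/70% warning and 100/90/80% critical
check_load_args 4 90 80 70 100 90 80
```

For the 4-core case this reproduces the arguments used in the command line above.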

Based on production experience with various server types:

General Purpose Servers

# 8-core web servers
command[check_load]=/usr/local/nagios/libexec/check_load -w 7.2,6.4,5.6 -c 8.0,7.2,6.4

Database Servers

# 16-core database servers (more conservative thresholds)
command[check_load]=/usr/local/nagios/libexec/check_load -w 12.8,11.2,9.6 -c 14.4,12.8,11.2

For hyper-threaded systems, some administrators base the thresholds on the logical core count:

# Run this as a shell script and paste the printed line into nrpe.cfg.
# nrpe.cfg is not a shell script, so the variable assignments must live outside it.
TOTAL_CORES=$(grep -c ^processor /proc/cpuinfo)   # logical cores, hyper-threads included
pct() { echo "$TOTAL_CORES * $1" | bc; }          # scale the core count by a factor
# Values are ordered 1-minute, 5-minute, 15-minute
echo "command[check_load]=/usr/local/nagios/libexec/check_load -w $(pct 0.85),$(pct 0.75),$(pct 0.65) -c $(pct 0.95),$(pct 0.85),$(pct 0.75)"
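
Whether to size thresholds on logical or physical cores is a judgment call, so it can help to see both numbers side by side. The sketch below counts physical cores from the parseable output of `lscpu -p=CORE,SOCKET`. The helper name `count_physical_cores` is hypothetical, and the approach assumes `lscpu` from util-linux is installed.

```shell
#!/bin/bash
# count_physical_cores
# Reads "lscpu -p=CORE,SOCKET" output on stdin and prints the number of
# distinct (core, socket) pairs, i.e. physical cores. Lines starting with
# '#' are header comments and are skipped.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Only query the live system when lscpu is available.
if command -v lscpu >/dev/null 2>&1; then
    echo "physical cores: $(lscpu -p=CORE,SOCKET | count_physical_cores)"
    echo "logical cores:  $(grep -c ^processor /proc/cpuinfo)"
fi
```

On a host with hyper-threading enabled, the logical count is typically twice the physical count, which roughly doubles every threshold derived from it.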

For a 24-core production server handling mixed workloads:

#!/bin/bash
CORES=24
WARN_1=$(echo "$CORES * 0.9" | bc -l | awk '{printf "%.1f", $1}')
WARN_5=$(echo "$CORES * 0.8" | bc -l | awk '{printf "%.1f", $1}')
WARN_15=$(echo "$CORES * 0.7" | bc -l | awk '{printf "%.1f", $1}')

CRIT_1=$(echo "$CORES * 1.0" | bc -l | awk '{printf "%.1f", $1}')
CRIT_5=$(echo "$CORES * 0.9" | bc -l | awk '{printf "%.1f", $1}')
CRIT_15=$(echo "$CORES * 0.8" | bc -l | awk '{printf "%.1f", $1}')

# Print the nrpe.cfg entry; the line is config syntax, not a command bash can run
echo "command[check_load]=/usr/local/nagios/libexec/check_load -w ${WARN_1},${WARN_5},${WARN_15} -c ${CRIT_1},${CRIT_5},${CRIT_15}"

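Before setting thresholds, it helps to understand what the Linux load average measures: the number of tasks that are running, waiting for a CPU, or blocked in uninterruptible sleep (typically disk I/O), smoothed as an exponentially decaying average. The three numbers cover different windows: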

  • 1-minute load average
  • 5-minute load average
  • 15-minute load average
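
Because the raw numbers only mean something relative to the CPU count, a quick way to read them is to divide each by the number of cores. The sketch below does that for a line in `/proc/loadavg` format. The helper name `per_core_load` and the sample values are hypothetical.

```shell
#!/bin/bash
# per_core_load "LOADAVG_LINE" CORES
# Takes a line in /proc/loadavg format (1-, 5-, 15-minute averages first)
# and prints each average divided by the core count.
per_core_load() {
    echo "$1" | awk -v cores="$2" \
        '{ printf "%.2f %.2f %.2f\n", $1 / cores, $2 / cores, $3 / cores }'
}

# Sample line with made-up values, interpreted for a 4-core host
per_core_load "3.60 2.80 2.00 2/512 12345" 4
```

On a live Linux host the same helper can be called as `per_core_load "$(cat /proc/loadavg)" "$(nproc)"`. A per-core value near 1.0 means roughly one runnable or I/O-blocked task per core.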

For a 4-core system, warning thresholds at 90/70/50% and critical thresholds at 100/80/60% of the core count translate to:

# 1-min  5-min  15-min
-w 3.6,  2.8,   2.0   # Warning
-c 4.0,  3.2,   2.4   # Critical

After monitoring hundreds of production systems, I recommend these formulas:

# Warning = (cores * 0.9), (cores * 0.7), (cores * 0.5)
# Critical = (cores * 1.0), (cores * 0.8), (cores * 0.6)

# For 4-core system:
command[check_load]=/usr/local/nagios/libexec/check_load -w 3.6,2.8,2.0 -c 4.0,3.2,2.4
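
An alternative that avoids per-host arithmetic is the `-r` (`--percpu`) option of check_load, which divides the load averages by the number of CPUs before comparing them. The fragment below is a sketch that assumes the installed plugin supports this option (`check_load --help` will show whether it does). The multipliers from the formulas above then become the thresholds directly.

```
# Assumes a check_load build with -r/--percpu; thresholds are per-CPU values
command[check_load]=/usr/local/nagios/libexec/check_load -r -w 0.9,0.7,0.5 -c 1.0,0.8,0.6
```

With this form the same line can be deployed to hosts with different core counts.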

Here's a bash script to generate thresholds dynamically:

#!/bin/bash
# Print a check_load entry for nrpe.cfg scaled to this host's CPU count.
CORES=$(nproc)

# scale FACTOR: CORES * FACTOR with one decimal place, so fractions such as 3.6 survive
scale() { awk -v c="$CORES" -v f="$1" 'BEGIN { printf "%.1f", c * f }'; }

WARN="$(scale 0.9),$(scale 0.7),$(scale 0.5)"
CRIT="$(scale 1.0),$(scale 0.8),$(scale 0.6)"

echo "command[check_load]=/usr/local/nagios/libexec/check_load -w $WARN -c $CRIT"

For different server configurations:

# 8-core web server (handles spikes well)
-w 7.2,5.6,4.0 -c 8.0,6.4,4.8

# 2-core database server (needs headroom)
-w 1.6,1.2,0.8 -c 2.0,1.6,1.2

# 16-core batch processing (expect high load)
-w 12.8,10.0,7.2 -c 16.0,12.8,9.6

Consider modifying these values when:

  • Running burstable cloud instances
  • Hosting latency-sensitive applications
  • Using containers with CPU limits
  • Operating hybrid physical/virtual environments

For containerized environments, use this modified check:

# Integer arithmetic rounds down, so hosts with fewer than 4 CPUs get some thresholds of 0
command[check_load]=/usr/local/nagios/libexec/check_load -w $(( $(nproc) * 60 / 100 )),$(( $(nproc) * 40 / 100 )),$(( $(nproc) * 30 / 100 )) -c $(( $(nproc) * 80 / 100 )),$(( $(nproc) * 60 / 100 )),$(( $(nproc) * 40 / 100 ))
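
Depending on the coreutils version, `nproc` inside a container may report the CPUs visible to the process rather than the CPU quota the container is limited to. Under cgroup v2 the quota is exposed in a `cpu.max` file as `QUOTA PERIOD` (or `max PERIOD` when unlimited). The sketch below derives an effective CPU count from that format. The helper name `effective_cpus` is hypothetical, and the path `/sys/fs/cgroup/cpu.max` is an assumption that holds for a typical cgroup v2 mount but not for cgroup v1.

```shell
#!/bin/bash
# effective_cpus "CPU_MAX_LINE" FALLBACK
# CPU_MAX_LINE is the content of a cgroup v2 cpu.max file: "QUOTA PERIOD"
# or "max PERIOD". Prints QUOTA / PERIOD with one decimal place, or
# FALLBACK when no quota is set.
effective_cpus() {
    echo "$1" | awk -v fallback="$2" '{
        if ($1 == "max" || $2 <= 0) printf "%.1f\n", fallback
        else                        printf "%.1f\n", $1 / $2
    }'
}

# Assumed cgroup v2 location; skip quietly when it is not present.
CPU_MAX_FILE=/sys/fs/cgroup/cpu.max
if [ -r "$CPU_MAX_FILE" ]; then
    effective_cpus "$(cat "$CPU_MAX_FILE")" "$(nproc)"
fi
```

The printed value can then replace the core count in the threshold formulas, for example 2.0 for a container limited to two CPUs on a much larger host.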