Implementing Adaptive Check Intervals in Nagios for Disk Thrashing Detection Based on SI Metric



When dealing with performance monitoring, disk thrashing is one of those subtle issues that can cripple a system silently. Traditional fixed-interval checks often miss critical transitions or create unnecessary load. The key challenge is implementing adaptive check intervals that respond dynamically to system state changes.

Here's how to implement this in Nagios using a check_vmstat plugin (a third-party plugin, so the exact flags depend on the version you install) with separate normal and retry intervals:

define service {
    host_name               linux-server
    service_description     Disk Thrashing
    check_command           check_vmstat!-w 10 -c 20 --si
    normal_check_interval   20
    retry_check_interval    3
    max_check_attempts      5
    notification_interval   30
    notification_options    w,c,r
    contact_groups          linux-admins
}

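These three parameters control the timing. Intervals are counted in units of interval_length, which defaults to 60 seconds:

  • normal_check_interval 20 - Check every 20 minutes while the service is OK or in a hard problem state
  • retry_check_interval 3 - Recheck every 3 minutes while the service is in a soft non-OK state
  • max_check_attempts 5 - Number of consecutive non-OK results before the state becomes hard

Once the problem reaches a hard state, Nagios goes back to the normal interval, so this configuration only speeds up checks during the retry window. Nagios 3 and later name these directives check_interval and retry_interval; the older names used here are kept as aliases.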
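
How the three parameters interact is easier to see in a small simulation. The following is a minimal sketch of a simplified scheduling model, not Nagios' actual scheduler: it assumes a non-OK result is rechecked at the retry interval until max_check_attempts is reached, after which the service is in a hard state and the normal interval applies again. The function name next_check_delays is hypothetical.

```python
def next_check_delays(results, check_interval=20, retry_interval=3,
                      max_check_attempts=5):
    """Return the delay (in minutes) scheduled after each check result.

    `results` is a sequence of booleans, True for OK and False for non-OK.
    Simplified model: soft non-OK states use retry_interval, while OK and
    hard non-OK states use check_interval.
    """
    delays = []
    attempt = 0  # consecutive non-OK results, capped at max_check_attempts
    for ok in results:
        if ok:
            attempt = 0
            delays.append(check_interval)
        else:
            attempt = min(attempt + 1, max_check_attempts)
            if attempt < max_check_attempts:
                delays.append(retry_interval)   # soft state: retry quickly
            else:
                delays.append(check_interval)   # hard state: back to normal
    return delays


if __name__ == "__main__":
    # One OK result, six failures in a row, then a recovery
    print(next_check_delays([True] + [False] * 6 + [True]))
```

Under this model the 3-minute interval only covers the first four failed attempts. A problem that persists is checked every 20 minutes again, which is why a longer-lived fast interval needs something like an event handler.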

For more sophisticated thrashing detection, you might need a custom script. Here's a Python example:

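#!/usr/bin/env python3
import subprocess
import sys

def check_thrashing():
    try:
        # "vmstat 1 2" prints two samples taken one second apart
        result = subprocess.run(['vmstat', '1', '2'],
                                stdout=subprocess.PIPE,
                                text=True, check=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        print("UNKNOWN: could not run vmstat ({})".format(exc))
        return 3

    lines = [line for line in result.stdout.splitlines() if line.strip()]
    try:
        # The first sample is the average since boot, so use only the last
        # line. si is the 7th column (index 6).
        si = int(lines[-1].split()[6])
    except (IndexError, ValueError):
        print("UNKNOWN: unexpected vmstat output")
        return 3

    if si > 20:
        print("CRITICAL: Severe disk thrashing (si={})".format(si))
        return 2
    elif si > 10:
        print("WARNING: Moderate disk thrashing (si={})".format(si))
        return 1
    else:
        print("OK: No significant thrashing (si={})".format(si))
        return 0

if __name__ == "__main__":
    sys.exit(check_thrashing())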
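
The parsing step is the part most likely to break across vmstat versions, so it helps to exercise it against captured output without running vmstat. A minimal sketch, assuming the usual procps layout with a header row that names the si column; the helper name parse_si and the sample text are illustrative, not real measurements.

```python
# Illustrative vmstat-style text (made-up numbers, for testing only)
SAMPLE_OUTPUT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 100000  20000 300000    0    0     5     3   50   80  1  1 98  0  0
 0  0      0 100000  20000 300000   42    7     0     0  120  200  2  1 97  0  0
"""


def parse_si(vmstat_output):
    """Return the si value from the last sample in vmstat output.

    The column is located by its header name instead of a fixed index,
    so extra or reordered columns do not silently shift the result.
    """
    lines = [ln for ln in vmstat_output.splitlines() if ln.strip()]
    header = next((ln.split() for ln in lines if "si" in ln.split()), None)
    if header is None:
        raise ValueError("no header row containing 'si' found")
    column = header.index("si")
    # The last line is the most recent sample
    return int(lines[-1].split()[column])


if __name__ == "__main__":
    print(parse_si(SAMPLE_OUTPUT))
```

Looking the column up by name costs one extra pass over the output but avoids hard-coding the position.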

For remote checks using NRPE, register the script in nrpe.cfg on the monitored host:

command[check_thrashing]=/usr/local/nagios/libexec/check_thrashing.py

When implementing adaptive checks:

  • Balance detection speed with system load
  • Consider using passive checks for critical systems
  • Implement service dependencies to prevent cascading checks

When monitoring disk thrashing via vmstat's si (swap-in) metric, we need a dynamic check interval strategy in Nagios:

  • Default interval: 20 minutes during normal operation
  • Problem-state interval: 3 minutes once the Warning or Critical threshold is breached
  • Automatic reversion to normal interval when service returns to OK state
  • Preserve existing 5-minute intervals for other services

The solution involves three key Nagios configuration files:

# thrashing_service.cfg
define service {
    host_name               monitored-server
    service_description     Disk_Thrashing
    check_command           check_thrashing
    normal_check_interval   20      ; Default 20-minute interval
    retry_check_interval    3       ; Only applies to soft states, so unused with max_check_attempts 1
    max_check_attempts      1       ; Immediate hard state change (no soft states)
    notification_interval   0       ; Notify once per problem, with no re-notification
    event_handler           adjust_thrashing_interval!3!20
}

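Create an event handler script (/usr/local/nagios/libexec/eventhandlers/adjust_thrashing_interval), make it executable by the Nagios user, and define a matching adjust_thrashing_interval command in commands.cfg whose command_line passes $ARG1$ and $ARG2$ to it. The script reads the NAGIOS_* environment variables, which requires enable_environment_macros=1 in nagios.cfg: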

#!/bin/bash
# Arguments: $1=problem_interval $2=normal_interval

SERVICE=$NAGIOS_SERVICEDESC
HOST=$NAGIOS_HOSTNAME

case "$NAGIOS_SERVICESTATE" in
    "OK")
        NEW_INTERVAL=$2
        ;;
    "WARNING"|"CRITICAL")
        NEW_INTERVAL=$1
        ;;
    *)
        exit 0
        ;;
esac

# Update the check interval via command file
printf "[%lu] CHANGE_NORMAL_SVC_CHECK_INTERVAL;%s;%s;%s\n" \
    $(date +%s) "$HOST" "$SERVICE" "$NEW_INTERVAL" \
    > /usr/local/nagios/var/rw/nagios.cmd

exit 0
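
The handler's two decisions, which interval to pick and how to format the external command line, can be checked in isolation before wiring anything into Nagios. A minimal Python sketch of the same logic; the function names interval_for_state and build_command are hypothetical, and nothing is written to the command file here.

```python
import time

PROBLEM_STATES = {"WARNING", "CRITICAL"}


def interval_for_state(state, problem_interval, normal_interval):
    """Mirror the handler's case statement: return the interval, or None to skip."""
    if state == "OK":
        return normal_interval
    if state in PROBLEM_STATES:
        return problem_interval
    return None  # e.g. UNKNOWN: leave the interval unchanged


def build_command(host, service, interval, now=None):
    """Format a CHANGE_NORMAL_SVC_CHECK_INTERVAL external command line."""
    timestamp = int(time.time()) if now is None else int(now)
    return "[{}] CHANGE_NORMAL_SVC_CHECK_INTERVAL;{};{};{}\n".format(
        timestamp, host, service, interval)


if __name__ == "__main__":
    new_interval = interval_for_state("CRITICAL", 3, 20)
    if new_interval is not None:
        # Print instead of writing to nagios.cmd so the line can be inspected
        print(build_command("monitored-server", "Disk_Thrashing", new_interval),
              end="")
```

Printing the line first makes it easy to compare against what the bash handler writes to nagios.cmd.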

Ensure Nagios processes external commands:

# nagios.cfg
check_external_commands=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd

Define the actual thrashing check in commands.cfg:

define command {
    command_name    check_thrashing
    command_line    /usr/local/nagios/libexec/check_thrashing.sh
}

Sample check script (check_thrashing.sh):

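#!/bin/bash
# Swap-in threshold check (vmstat reports si in KiB/s by default)
WARN=100
CRIT=500

# Use the last line of "vmstat 1 2": the first sample is the average since boot
si=$(vmstat 1 2 | tail -1 | awk '{print $7}')

# Return UNKNOWN if vmstat failed or printed something unexpected
if ! [[ "$si" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN: could not read si from vmstat"
    exit 3
fi

if [ "$si" -ge "$CRIT" ]; then
    echo "CRITICAL: Disk thrashing detected - $si KiB/s swapped in"
    exit 2
elif [ "$si" -ge "$WARN" ]; then
    echo "WARNING: Disk thrashing developing - $si KiB/s swapped in"
    exit 1
else
    echo "OK: Normal swap activity - $si KiB/s"
    exit 0
fi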

Key testing scenarios:

  1. Force Warning state: Verify check interval changes to 3 minutes
  2. Simulate Critical state: Confirm increased check frequency
  3. Return to OK state: Validate automatic reversion to 20-minute interval
  4. Check other services: Ensure their 5-minute intervals remain unaffected

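Nagios logs each submitted external command when log_external_commands=1, so you can confirm the interval changes with:

grep 'CHANGE_NORMAL_SVC_CHECK_INTERVAL' /usr/local/nagios/var/nagios.log | grep 'Disk_Thrashing'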
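
The interval currently in effect can also be read from the status file. This is a rough sketch under two assumptions that should be verified on your installation: that the status file (status_file in nagios.cfg, often status.dat under /usr/local/nagios/var) contains servicestatus { ... } blocks of key=value lines, and that those blocks include a check_interval field. The function name and the sample text are illustrative.

```python
# Illustrative status-file excerpt (assumed format, made-up values)
SAMPLE_STATUS = """\
hoststatus {
    host_name=monitored-server
    check_interval=5.000000
    }

servicestatus {
    host_name=monitored-server
    service_description=Disk_Thrashing
    check_interval=3.000000
    }
"""


def service_check_interval(status_text, host, service):
    """Return check_interval for the given service as a float, or None."""
    block = None  # key/value pairs of the servicestatus block being read
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("servicestatus") and line.endswith("{"):
            block = {}
        elif line == "}" and block is not None:
            if (block.get("host_name") == host
                    and block.get("service_description") == service):
                value = block.get("check_interval")
                return float(value) if value is not None else None
            block = None
        elif block is not None and "=" in line:
            key, _, value = line.partition("=")
            block[key] = value
    return None


if __name__ == "__main__":
    print(service_check_interval(SAMPLE_STATUS, "monitored-server",
                                 "Disk_Thrashing"))
```

To run it against a live system, read the status file into a string and pass it in place of SAMPLE_STATUS.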