When dealing with performance monitoring, disk thrashing is one of those subtle issues that can cripple a system silently. Traditional fixed-interval checks often miss critical transitions or create unnecessary load. The key challenge is implementing adaptive check intervals that respond dynamically to system state changes.
Here's how to implement this in Nagios using the check_vmstat plugin with variable check intervals:
define service {
    host_name               linux-server
    service_description     Disk Thrashing
    check_command           check_vmstat!-w 10 -c 20 --si
    normal_check_interval   20
    retry_check_interval    3
    max_check_attempts      5
    notification_interval   30
    notification_options    w,c,r
    contact_groups          linux-admins
}
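The magic happens through these parameters:
- normal_check_interval 20: Default 20-minute interval
- retry_check_interval 3: 3-minute interval while the service is non-OK and still being retried (soft state)
- max_check_attempts 5: Number of check attempts before the problem becomes a hard state
Once the hard state is reached, Nagios schedules checks at normal_check_interval again, so the 3-minute interval only applies during the retries.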
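To see what these three values mean in practice, the timing can be worked out by hand. The sketch below is only an illustration: the function name minutes_to_hard_state is made up here, and it assumes the default interval_length of 60 seconds, so that one interval unit equals one minute.

```python
def minutes_to_hard_state(max_check_attempts, retry_interval):
    """Minutes from the first non-OK result to the hard state.

    The first failing check counts as attempt 1; every further attempt
    is scheduled one retry interval after the previous one.
    """
    return (max_check_attempts - 1) * retry_interval


# Values taken from the service definition above.
print(minutes_to_hard_state(max_check_attempts=5, retry_interval=3))  # 12
```

With these settings a thrashing episode can go unnoticed for up to 20 minutes (the normal interval) and then needs another 12 minutes of retries before notifications go out.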
For more sophisticated thrashing detection, you might need a custom script. Here's a Python example:
#!/usr/bin/env python3
import subprocess
import sys


def check_thrashing():
    # Take two samples one second apart.
    result = subprocess.run(['vmstat', '1', '2'],
                            stdout=subprocess.PIPE,
                            text=True)
    lines = result.stdout.split('\n')
    # Skip the two header lines and the first sample, which reports
    # averages since boot rather than current activity.
    si_values = [int(line.split()[6]) for line in lines[3:] if line.strip()]
    if not si_values:
        print("UNKNOWN: Could not parse vmstat output")
        return 3
    avg_si = sum(si_values) / len(si_values)
    if avg_si > 20:
        print("CRITICAL: Severe disk thrashing (si={})".format(avg_si))
        return 2
    elif avg_si > 10:
        print("WARNING: Moderate disk thrashing (si={})".format(avg_si))
        return 1
    else:
        print("OK: No significant thrashing (si={})".format(avg_si))
        return 0


if __name__ == "__main__":
    sys.exit(check_thrashing())
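The script above picks the si column by its fixed position. A slightly more defensive variant, sketched here as an assumption rather than as part of the original script, looks the column up by its header name. The function name parse_si and the SAMPLE text are invented for illustration; the numbers in SAMPLE are not real measurements.

```python
def parse_si(vmstat_output):
    """Return the swap-in (si) value from the last sample of `vmstat N M` output."""
    rows = [line.split() for line in vmstat_output.splitlines() if line.strip()]
    # The column-name row is the one that contains the "si" label.
    header = next(row for row in rows if "si" in row)
    column = header.index("si")
    # The last row is the most recent sample; the first sample row holds
    # averages since boot and is ignored here.
    return int(rows[-1][column])


# Made-up sample output, shaped like `vmstat 1 2` on Linux.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  10240 812345  20480 409600    2    1    15    30  100  200  5  2 92  1  0
 0  0  10240 811900  20480 409700   35   12   400   120  350  600 10  8 70 12  0
"""

print(parse_si(SAMPLE))  # 35
```

Looking the column up by name keeps the check working if a vmstat build adds or reorders columns.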
For remote checks using NRPE:
command[check_thrashing]=/usr/local/nagios/libexec/check_thrashing.py
When implementing adaptive checks:
- Balance detection speed with system load
- Consider using passive checks for critical systems
- Implement service dependencies to prevent cascading checks
When monitoring disk thrashing via vmstat's si (swap-in) metric, we need a dynamic check interval strategy in Nagios:
- Default interval: 20 minutes during normal operation
- Critical state interval: 3 minutes when Warning/Critical thresholds are breached
- Automatic reversion to normal interval when service returns to OK state
- Preserve existing 5-minute intervals for other services
The solution involves three key Nagios configuration files:
# thrashing_service.cfg
define service {
    host_name               monitored-server
    service_description     Disk_Thrashing
    check_command           check_thrashing
    normal_check_interval   20   ; Default 20-minute interval
    retry_check_interval    3    ; Not used while max_check_attempts is 1
    max_check_attempts      1    ; Immediate state change (no soft states)
    notification_interval   0    ; Notify once per problem, no re-notifications
    event_handler           adjust_thrashing_interval!3!20
}
Create an event handler script (/usr/local/nagios/libexec/eventhandlers/adjust_thrashing_interval):
#!/bin/bash
# Arguments: $1=problem_interval $2=normal_interval
# Relies on the NAGIOS_* environment macros (enable_environment_macros=1 in nagios.cfg).
SERVICE="$NAGIOS_SERVICEDESC"
HOST="$NAGIOS_HOSTNAME"

case "$NAGIOS_SERVICESTATE" in
    "OK")
        NEW_INTERVAL="$2"
        ;;
    "WARNING"|"CRITICAL")
        NEW_INTERVAL="$1"
        ;;
    *)
        # UNKNOWN or anything else: leave the interval untouched
        exit 0
        ;;
esac

# Update the check interval via command file
printf "[%lu] CHANGE_NORMAL_SVC_CHECK_INTERVAL;%s;%s;%s\n" \
    "$(date +%s)" "$HOST" "$SERVICE" "$NEW_INTERVAL" \
    > /usr/local/nagios/var/rw/nagios.cmd
exit 0
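The line written to the command file has a fixed shape: a Unix timestamp in square brackets, the command name, then semicolon-separated arguments. The following Python sketch builds the same line so the format can be inspected without a running Nagios. The helper name build_interval_command and its now parameter are made up for this example.

```python
import time


def build_interval_command(host, service, interval, now=None):
    """Format a CHANGE_NORMAL_SVC_CHECK_INTERVAL external command line."""
    # `now` lets a caller pin the timestamp; by default the current time is used.
    timestamp = int(time.time() if now is None else now)
    return "[{}] CHANGE_NORMAL_SVC_CHECK_INTERVAL;{};{};{}\n".format(
        timestamp, host, service, interval)


line = build_interval_command("monitored-server", "Disk_Thrashing", 3,
                              now=1700000000)
print(line, end="")
# [1700000000] CHANGE_NORMAL_SVC_CHECK_INTERVAL;monitored-server;Disk_Thrashing;3
```

Host and service names must match the object definitions exactly, otherwise Nagios ignores the command.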
Ensure Nagios processes external commands:
# nagios.cfg
check_external_commands=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
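Before relying on the event handler, it can help to confirm that the command file really exists as a named pipe and that the user running the handler may write to it. This is a minimal preflight sketch, not part of Nagios itself; the function name command_file_ready is an assumption made for this example.

```python
import os
import stat


def command_file_ready(path):
    """True if `path` is an existing named pipe that this user may write to."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing file or no permission to stat it.
        return False
    return stat.S_ISFIFO(mode) and os.access(path, os.W_OK)


print(command_file_ready("/usr/local/nagios/var/rw/nagios.cmd"))
```

The pipe only exists while the Nagios daemon is running, so a False result on a stopped instance is expected.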
Define the actual thrashing check in commands.cfg:
define command {
    command_name    check_thrashing
    command_line    /usr/local/nagios/libexec/check_thrashing.sh
}
Sample check script (check_thrashing.sh):
#!/bin/bash
# Swap-in threshold check (vmstat reports si in KiB/s by default)
WARN=100
CRIT=500

# Two samples one second apart; the last line is the current rate
si=$(vmstat 1 2 | tail -1 | awk '{print $7}')

if ! [[ "$si" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN: Could not read si from vmstat"
    exit 3
fi

if [ "$si" -ge "$CRIT" ]; then
    echo "CRITICAL: Disk thrashing detected - $si KiB/s swapped in"
    exit 2
elif [ "$si" -ge "$WARN" ]; then
    echo "WARNING: Disk thrashing developing - $si KiB/s swapped in"
    exit 1
else
    echo "OK: Normal swap activity - $si KiB/s"
    exit 0
fi
Key testing scenarios:
- Force Warning state: Verify check interval changes to 3 minutes
- Simulate Critical state: Confirm increased check frequency
- Return to OK state: Validate automatic reversion to 20-minute interval
- Check other services: Ensure their 5-minute intervals remain unaffected
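The expected outcome of these scenarios can be written down as a small simulation before touching a live server. The sketch below mirrors the case statement of the event handler; interval_for_state and replay are hypothetical helper names, and the 3 and 20 minute defaults are the values passed to the handler in the service definition.

```python
def interval_for_state(state, problem_interval=3, normal_interval=20):
    """Mirror of the event handler's case statement; None means 'leave as is'."""
    if state == "OK":
        return normal_interval
    if state in ("WARNING", "CRITICAL"):
        return problem_interval
    return None


def replay(states, start_interval=20):
    """Return the check interval in effect after each state in the sequence."""
    current = start_interval
    history = []
    for state in states:
        new_interval = interval_for_state(state)
        if new_interval is not None:
            current = new_interval
        history.append(current)
    return history


print(replay(["OK", "WARNING", "CRITICAL", "UNKNOWN", "OK"]))
# [20, 3, 3, 3, 20]
```

Note that UNKNOWN leaves the interval at whatever the previous state selected, which matches the handler exiting early for any state other than OK, WARNING, or CRITICAL.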
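Confirm that the interval changes were applied by searching the log for the external command (this requires log_external_commands=1 in nagios.cfg):
grep 'CHANGE_NORMAL_SVC_CHECK_INTERVAL' /usr/local/nagios/var/nagios.log | grep 'Disk_Thrashing'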