Many Zabbix administrators face notification storms where the system generates dozens of "PROBLEM" and "OK" alerts for temporarily unreachable hosts. This typically happens when:
- Network latency exceeds Zabbix's default timeout thresholds
- Hosts experience brief connectivity issues
- Agent checks fail during system maintenance
Here are the most effective ways to reduce false positives:
1. Increase Agent Timeout Values
Modify the agent configuration file (zabbix_agentd.conf):
Timeout=30
The server enforces its own Timeout (in zabbix_server.conf), so raise it to match, then verify the agent responds within the new limit:
zabbix_get -s hostname -k agent.ping
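A minimal sketch of that server-side change (the restart step assumes systemd-managed services; unit names may differ on your distribution):
# In zabbix_server.conf
Timeout=30
# Apply the change (restart the agent on the monitored host as well)
systemctl restart zabbix-server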
2. Implement Trigger Dependencies
Create a master trigger that detects loss of basic connectivity; it fires when the last three pings fail:
{Template Module ICMP Ping:icmpping.max(#3)}=0
Then make other triggers dependent on this trigger, so they stay silent while the host itself is unreachable.
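As an example of the dependent side (the host name and item key are illustrative; the dependency itself is set on the dependent trigger's Dependencies tab):
# Dependent trigger: HTTP service down, suppressed while the host is unreachable
{web-01:net.tcp.service[http].max(#3)}=0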
3. Adjust Trigger Functions
Use nodata() with an appropriate time period:
{host:agent.ping.nodata(5m)}=1
This waits 5 minutes before declaring a host unreachable.
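For reference, the same check in the newer (Zabbix 5.4+) expression syntax:
nodata(/host/agent.ping,5m)=1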
4. Custom Alert Escalation
Configure different operations depending on how long the problem has persisted. In pseudocode, the idea is:
if {TRIGGER.VALUE}=1 and {EVENT.AGE} > 900 then send_critical_alert()
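In Zabbix this is expressed as escalation steps inside the action rather than as a script; a sketch of how it maps onto the action's Operations tab (step layout and recipients are illustrative):
# Operations tab of the action
# Step 1: starts at 0s, step duration 900s -> notify the ops chat only
# Step 2: starts after 900s                -> notify the on-call engineer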
5. Use Active Checks
Active agents handle network issues better than passive checks:
ServerActive=zabbix.example.com
Hostname=client.example.com
RefreshActiveChecks=120
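Active agents also buffer collected values while the server is unreachable; the relevant zabbix_agentd.conf settings look like this (values shown are illustrative):
# Flush the buffer at least every 5 seconds
BufferSend=5
# Keep up to 100 values in memory if they cannot be sent
BufferSize=100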
For distributed environments:
- Place proxies closer to monitored hosts
- Enable local buffering so data collected during an outage is retained and forwarded once connectivity returns (see the sketch after this list)
- Use heartbeat items for connection state tracking
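A minimal sketch of that proxy-side buffering (zabbix_proxy.conf; retention values are illustrative):
# Keep collected data for 24 hours if the server is unreachable
ProxyOfflineBuffer=24
# Do not additionally keep already-synced data locally
ProxyLocalBuffer=0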
Remember to test changes in a staging environment before production deployment. The optimal configuration depends on your specific network characteristics and monitoring requirements.
When I first deployed Zabbix across my infrastructure with about 6-7 monitored VPS instances, the constant barrage of "PROBLEM/OK" notifications for host reachability became unbearable. The server would fire alerts like clockwork:
Subject: PROBLEM: Server web-01 is unreachable
Subject: OK: Server web-01 is unreachable
This created alert fatigue where real issues could easily get lost in the noise. Through trial and error, I discovered several tuning approaches.
The root cause lies in overly sensitive trigger expressions. The key expression to modify:
# Original problematic trigger expression
{zabbix.server:icmpping.count(30m,0)}>=1
This checks for just a single failed ping. Here's my improved version:
# More resilient trigger expression
{zabbix.server:icmpping.count(5m,0)}>=3 and {zabbix.server:icmpping.count(1h,0)}>=5
This requires multiple failures within both a short and a longer window before triggering.
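For reference, the same logic in the newer (Zabbix 5.4+) expression syntax would read roughly:
count(/zabbix.server/icmpping,5m,"eq","0")>=3 and count(/zabbix.server/icmpping,1h,"eq","0")>=5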
These server settings will not throttle notifications by themselves, but they keep the alerter processes and caches from falling behind during a burst; the throttling itself comes from the action configuration below:
# In zabbix_server.conf
# Location of custom alert scripts
AlertScriptsPath=/usr/lib/zabbix/alertscripts
# Pre-forked alerter processes that send notifications
StartAlerters=3
# Rows removed per housekeeper cycle
MaxHousekeeperDelete=5000
# Configuration cache size
CacheSize=8M
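After editing, restart the server so the new values take effect (assuming a systemd-managed service; the unit name may differ by packaging):
systemctl restart zabbix-server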
For action configurations:
- Delay the first escalation step so nothing is sent unless the problem lasts at least 5 minutes
- Enable recovery messages so every PROBLEM gets a matching OK notification
- Spread later escalation steps over increasing delays
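If you customise the operation messages, the standard macros make PROBLEM/OK pairs easy to correlate; the wording below is illustrative:
Subject: {TRIGGER.STATUS}: {TRIGGER.NAME} on {HOST.NAME}
Body: Started {EVENT.DATE} {EVENT.TIME}, severity {TRIGGER.SEVERITY}, age {EVENT.AGE}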
Here's a complete trigger configuration that worked for my environment:
{
    "description": "Host unreachable (stable detection)",
    "expression": "last(/Host A/icmpping)=0 and count(/Host A/icmpping,5m,\"eq\",\"0\")>3",
    "priority": "4",
    "recovery_mode": "1",
    "recovery_expression": "min(/Host A/icmpping,#2)=1",
    "type": "0"
}
This combines the latest ping state with a failure-count threshold (priority 4 corresponds to High), and the problem only resolves after two consecutive successful pings.
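If you prefer to create such a trigger through the API rather than the frontend, a minimal sketch with curl (the URL and $ZBX_TOKEN are placeholders, and the expression is simplified here to avoid nested quoting):
curl -s -X POST https://zabbix.example.com/zabbix/api_jsonrpc.php \
  -H 'Content-Type: application/json-rpc' \
  -d '{"jsonrpc":"2.0","method":"trigger.create","id":1,"auth":"'"$ZBX_TOKEN"'",
       "params":{"description":"Host unreachable (stable detection)",
                 "expression":"max(/Host A/icmpping,#3)=0",
                 "priority":"4","recovery_mode":"1",
                 "recovery_expression":"min(/Host A/icmpping,#2)=1"}}'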
For critical infrastructure, implement a trigger hierarchy:
# Parent trigger (less sensitive: SSH down for the full 5 minutes)
{zabbix.server:net.tcp.service[ssh].max(5m)}=0
# Child trigger (more sensitive: SSH down for 1 minute, early warning)
{zabbix.server:net.tcp.service[ssh].max(1m)}=0
Make the child trigger dependent on the parent and give it a lower severity; this creates a two-stage verification system: a quick low-severity warning first, then a single high-severity alert once the outage is confirmed.