How to Reduce Excessive Zabbix Notifications for Unreachable Hosts


Many Zabbix administrators face notification storms where the system generates dozens of "PROBLEM" and "OK" alerts for temporarily unreachable hosts. This typically happens when:

  • Network latency exceeds Zabbix's default timeout thresholds
  • Hosts experience brief connectivity issues
  • Agent checks fail during system maintenance

Here are the most effective ways to reduce false positives:

1. Increase Agent Timeout Values

Modify the agent configuration file (zabbix_agentd.conf):

Timeout=30

Also raise the Timeout parameter on the server side so it waits as long as the agent does, and verify that the agent responds from the server:

zabbix_get -s hostname -k agent.ping --timeout 30
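
For reference, the matching server-side setting is a single line (a minimal sketch; the rest of zabbix_server.conf stays as it is):

# zabbix_server.conf -- how long the server waits for an agent response
# before the check is considered failed (accepted range 1-30 seconds)
Timeout=30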

2. Implement Trigger Dependencies

Create a master trigger that fires when basic ICMP connectivity is lost:

{Template Module ICMP Ping:icmpping.max(#3)}=0

Then make other triggers dependent on this basic connectivity check.
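
As a sketch (the host name and item keys are placeholders that must exist in your setup), the dependent trigger keeps its own expression and simply lists the master trigger on its Dependencies tab:

# Master trigger: the host itself is unreachable
{web-01:icmpping.max(#3)}=0

# Dependent trigger: HTTP service check; add the master trigger under
# "Dependencies" so it stays silent while the whole host is down
{web-01:net.tcp.service[http].max(#3)}=0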

3. Adjust Trigger Functions

Use nodata() with appropriate time periods:

{host:agent.ping.nodata(5m)}=1

This waits 5 minutes before declaring a host unreachable.
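
If the host is also pinged, you can go one step further and alert only when the agent is silent and ICMP fails at the same time (a sketch; both item keys must exist on the host):

# No agent data for 5 minutes AND every ping in the last 5 minutes failed
{host:agent.ping.nodata(5m)}=1 and {host:icmpping.max(5m)}=0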

4. Custom Alert Escalation

Send the critical alert only once a problem has persisted for a while:

# Pseudocode -- in Zabbix this is expressed as escalation steps, not a script
if the problem is still active after 15 minutes (900 s)
then send the critical alert
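
In the action itself this maps onto operation steps, roughly like the following sketch (the user groups and media types are placeholders):

# Action "Host unreachable" -- Operations tab, default step duration 5m
Step 1       send a normal alert to the Ops group
Steps 4-0    send the critical alert to the on-call admin
             (step 4 begins after 3 x 5m = 15 minutes; "0" means repeat until resolved)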

5. Use Active Checks

Active agents tolerate brief network problems better than passive checks, because the agent initiates the connection and buffers collected values locally until the server is reachable again:

ServerActive=zabbix.example.com
Hostname=client.example.com
RefreshActiveChecks=120
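
To make the most of that buffering, the agent's local buffer can be enlarged (a sketch; the defaults are usually fine for a handful of items):

# zabbix_agentd.conf -- hold more values locally while the server is unreachable
# BufferSize: number of values kept in memory (default 100)
BufferSize=1000
# BufferSend: flush interval in seconds (default 5)
BufferSend=10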

For distributed environments:

  • Place proxies closer to the monitored hosts
  • Let proxies buffer collected data locally during server outages
  • Use heartbeat items to track connection state (see the sketch below)
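
A minimal sketch of the last two points, assuming a proxy named proxy-eu1:

# zabbix_proxy.conf -- keep unsent data for up to 24 hours if the server is unreachable
ProxyOfflineBuffer=24

# Heartbeat: internal item on the Zabbix server
zabbix[proxy,proxy-eu1,lastaccess]

# Trigger: fire if the proxy has not reported for more than 10 minutes
{zabbix.server:zabbix[proxy,proxy-eu1,lastaccess].fuzzytime(600)}=0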

Remember to test changes in a staging environment before production deployment. The optimal configuration depends on your specific network characteristics and monitoring requirements.


When I first deployed Zabbix across my infrastructure with about 6-7 monitored VPS instances, the constant barrage of "PROBLEM/OK" notifications for host reachability became unbearable. The server would fire alerts like clockwork:

Subject: PROBLEM: Server web-01 is unreachable
Subject: OK: Server web-01 is unreachable

This created alert fatigue where real issues could easily get lost in the noise. Through trial and error, I discovered several tuning approaches.

The root cause lies in the trigger's sensitivity. The expression I started with looked like this:

# Original problematic trigger expression
{zabbix.server:icmpping.count(30m,0)}>=1

This fires on a single failed ping anywhere in a 30-minute window. Here's my improved version:

# More resilient trigger expression
{zabbix.server:icmpping.count(5m,0)}>=3 and {zabbix.server:icmpping.count(1h,0)}>=5

This requires several failures within both a short and a longer window before triggering.
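
To stop the matching "OK" storm on recovery, the same trigger can be paired with a recovery expression that demands several successful pings before the problem closes (a sketch in the same classic syntax):

# Recovery expression: at least 3 successful pings in the last 5 minutes
{zabbix.server:icmpping.count(5m,1)}>=3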

The throttling itself is configured in the actions (next list); on the server side I only adjusted a few general parameters:

# In zabbix_server.conf
AlertScriptsPath=/usr/lib/zabbix/alertscripts
StartAlerters=3
MaxHousekeeperDelete=5000
CacheSize=8M

In the action configuration itself:

  • Delay the first notification by roughly 5 minutes so short blips never page anyone (see the sketch below)
  • Enable the "Recovery message" option so you are notified when the problem clears
  • Configure escalation steps with delays between them
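
One common way to get the 5-minute grace period is to start the first operation at a later escalation step (a sketch, with a placeholder user group):

# Action operations (sketch) -- default operation step duration: 5m
Steps 2-0    send message to "VPS admins"
             (nothing is sent during step 1, i.e. the first 5 minutes)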

Here's a complete trigger configuration that worked for my environment:

{
    "description": "Host unreachable (stable detection)",
    "expression": "last(/Host A/icmpping.seq)<>last(/Host A/icmpping.seq,#2) and count(/Host A/icmpping,5m,\"eq\",\"0\")>3",
    "priority": "4",
    "recovery_mode": "1",
    "recovery_expression": "last(/Host A/icmpping.seq)=last(/Host A/icmpping.seq,#2)",
    "type": "0"
}

This combines sequence checking with failure count thresholds.

For critical infrastructure, implement a trigger hierarchy:

# Parent trigger (less sensitive, higher severity): SSH unreachable for a full 5 minutes
{zabbix.server:net.tcp.service[ssh].max(5m)}=0

# Child trigger (more sensitive, lower severity): SSH unreachable for the last minute
{zabbix.server:net.tcp.service[ssh].max(1m)}=0

This creates a two-stage verification system: the child trigger gives an early, low-severity warning, and the parent escalates only if the outage persists.