How to Reduce Excessive Zabbix Notifications for Unreachable Hosts



Many Zabbix administrators face notification storms where the system generates dozens of "PROBLEM" and "OK" alerts for temporarily unreachable hosts. This typically happens when:

  • Network latency exceeds Zabbix's default timeout thresholds
  • Hosts experience brief connectivity issues
  • Agent checks fail during system maintenance

Here are the most effective ways to reduce false positives:

1. Increase Agent Timeout Values

Modify the agent configuration file (zabbix_agentd.conf):

Timeout=30

The server's own Timeout must allow for this as well; you can verify that the agent now responds within the new limit using zabbix_get (the --timeout option is available in recent Zabbix versions):

zabbix_get -s hostname -k agent.ping --timeout 30
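
The server side has its own Timeout parameter (capped at 30 seconds); if it stays lower than the agent's value, slow checks will still time out on the server. A minimal sketch for zabbix_server.conf:

# /etc/zabbix/zabbix_server.conf
# Allowed range is 1-30; keep it >= the agent's Timeout
Timeout=30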

2. Implement Trigger Dependencies

Create a master trigger that fires when basic ICMP connectivity is lost:

{Template Module ICMP Ping:icmpping.max(5m)}=0

Then make other triggers dependent on this basic connectivity check.
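
For example, with the ICMP trigger living on a hypothetical gateway host gw-01, an agent-availability trigger on a host behind it (web-01) can list the gateway trigger under its Dependencies tab, so it stays silent while the gateway itself is down:

# Master trigger on the gateway (hypothetical host gw-01)
{gw-01:icmpping.max(5m)}=0

# Dependent trigger on web-01; add the gateway trigger as its dependency
{web-01:agent.ping.nodata(5m)}=1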

3. Adjust Trigger Functions

Use nodata() with appropriate time periods:

{host:agent.ping.nodata(5m)}=1

This waits 5 minutes before declaring a host unreachable.
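
On Zabbix 5.4 and later the same check is written in the newer expression syntax (host name is a placeholder):

nodata(/host/agent.ping,5m)=1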

4. Configure Custom Alert Escalation

Configure different actions based on problem duration:

# Pseudocode only; the real configuration uses escalation steps (see the sketch below)
if {TRIGGER.VALUE}=1 and {EVENT.AGE} > 900
then send_critical_alert()
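
In the frontend this maps to escalation steps inside the action rather than a script; one possible layout, with a 5-minute default step duration and placeholder recipients:

# Action > Operations, default operation step duration: 5m
# Step 1     -> notify the ops user group              (immediately)
# Steps 4-0  -> notify the on-call admin as critical   (from 15 minutes on, i.e. EVENT.AGE > 900)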

5. Use Active Checks

Active checks are initiated by the agent and buffer data locally, so they tolerate brief network issues better than passive checks:

ServerActive=zabbix.example.com
Hostname=client.example.com
RefreshActiveChecks=120

For distributed environments:

  • Place proxies closer to monitored hosts
  • Rely on the agent's local buffering during outages (see the sketch after this list)
  • Use heartbeat items for connection state tracking
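
The local buffering mentioned above is controlled by the agent's buffer parameters; a sketch for zabbix_agentd.conf (values are illustrative):

# Queue values in memory while the server or proxy is unreachable
BufferSize=1000
BufferSend=5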

Remember to test changes in a staging environment before production deployment. The optimal configuration depends on your specific network characteristics and monitoring requirements.


When I first deployed Zabbix across my infrastructure with about 6-7 monitored VPS instances, the constant barrage of "PROBLEM/OK" notifications for host reachability became unbearable. The server would fire alerts like clockwork:

Subject: PROBLEM: Server web-01 is unreachable
Subject: OK: Server web-01 is unreachable

This created alert fatigue where real issues could easily get lost in the noise. Through trial and error, I discovered several tuning approaches.

The root cause lies in Zabbix's overly sensitive default trigger expressions. The key thing to change is the trigger expression itself:

# Original problematic trigger expression
{zabbix.server:icmpping.count(30m,0)}>=1

This fires after a single failed ping anywhere in the 30-minute window. Here's my improved version:

# More resilient trigger expression
{zabbix.server:icmpping.count(5m,0)}>=3 and {zabbix.server:icmpping.count(1h,0)}>=5

This requires multiple failures within the evaluation windows before triggering.
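
For reference, the same idea in the newer (Zabbix 5.4+) expression syntax, assuming the same icmpping item, looks like this:

count(/zabbix.server/icmpping,5m,"eq","0")>=3 and count(/zabbix.server/icmpping,1h,"eq","0")>=5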

Server-side tuning keeps alert delivery responsive during event storms, while the actual notification throttling happens in the action configuration described below:

# In zabbix_server.conf
AlertScriptsPath=/usr/lib/zabbix/alertscripts
StartAlerters=3
MaxHousekeeperDelete=5000
CacheSize=8M

For action configurations:

  • Set a minimum problem duration of about 5 minutes (one way to do this is sketched after this list)
  • Enable "Recovery message" option
  • Configure escalation steps with delays
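
One way (among others) to get that minimum problem duration is to delay the first notification by one escalation step, so nothing is sent if the trigger recovers within the first interval:

# Action > Operations, default operation step duration: 5m
# Steps 2-0 -> send the notification
# (step 1 sends nothing, so problems that clear within ~5 minutes never page anyone)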

Here's a complete trigger configuration, expressed as a Zabbix API trigger object (priority 4 = High), that worked for my environment:

{
    "description": "Host unreachable (stable detection)",
    "expression": "last(/Host A/icmpping.seq)<>last(/Host A/icmpping.seq,#2) and count(/Host A/icmpping,5m,\"eq\",\"0\")>3",
    "priority": "4",
    "recovery_mode": "1",
    "recovery_expression": "last(/Host A/icmpping.seq)=last(/Host A/icmpping.seq,#2)",
    "type": "0"
}

This combines sequence checking with failure count thresholds.
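
If you manage triggers through the Zabbix API instead of the frontend, the same object can be pushed with trigger.create; a minimal sketch, with the server URL and API token as placeholders:

curl -s -X POST https://zabbix.example.com/api_jsonrpc.php \
  -H 'Content-Type: application/json-rpc' \
  -d '{"jsonrpc": "2.0", "method": "trigger.create", "id": 1, "auth": "<api-token>",
       "params": {"description": "Host unreachable (stable detection)",
                  "expression": "last(/Host A/icmpping.seq)<>last(/Host A/icmpping.seq,#2) and count(/Host A/icmpping,5m,\"eq\",\"0\")>3",
                  "recovery_mode": "1",
                  "recovery_expression": "last(/Host A/icmpping.seq)=last(/Host A/icmpping.seq,#2)",
                  "priority": "4"}}'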

For critical infrastructure, implement a trigger hierarchy:

# Parent trigger (less sensitive)
{zabbix.server:net.tcp.service[ssh].max(5m)}=0

# Child trigger (more sensitive)
{zabbix.server:net.tcp.service[ssh].max(1m)}=0

This creates a two-stage verification system.
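
One way to use the hierarchy (among others) is to give the two triggers different severities and only match the less sensitive one in the notification action, so the quick check stays visible on dashboards while messages go out only for confirmed outages:

# Child  (max(1m)=0) : severity Warning  -> dashboard visibility only
# Parent (max(5m)=0) : severity High     -> matched by the notification action's severity condition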