Instant Systemd Service Failure/Hung Monitoring with Network Notification Triggers



When running critical services with systemd, we often need immediate notification when things go wrong - whether it's a full crash or a silent hang. While systemd provides excellent monitoring capabilities through features like WatchdogSec, getting real-time alerts requires some clever configuration.

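The obvious approach would be FailureAction= in the service unit, but it only allows a fixed set of actions such as reboot or poweroff. OnFailure= can start another unit, but a service that uses Restart= enters the failed state only after its start limit is reached, so the alert may arrive late. For custom network notifications on every failure, we need to get more creative with systemd's event system.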

Here's a comprehensive approach that handles both crash and hang detection:

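[Unit]
Description=My Critical Service
After=network.target

[Service]
# Type=notify: the service itself must send READY=1, then WATCHDOG=1 at least every 30s
Type=notify
ExecStart=/usr/bin/my-service
WatchdogSec=30s
Restart=on-failure

# Crash and hang detection: ExecStopPost= runs after every stop, and
# $SERVICE_RESULT is "watchdog" when the watchdog timeout fired
ExecStopPost=/bin/sh -c 'case "$$SERVICE_RESULT" in success) ;; watchdog) /usr/local/bin/notify-hang.sh ;; *) /usr/local/bin/notify-crash.sh ;; esac'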
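
Note that WatchdogSec= can only detect a hang if the service itself pings systemd: it has to send WATCHDOG=1 to the socket named in $NOTIFY_SOCKET more often than the watchdog interval. If my-service does not do this yet, a minimal Python sketch of the protocol might look like the following. The names sd_notify, watchdog_interval and run are illustrative and not part of any library, and do_work stands in for the service's real work.

```python
import os
import socket
import time


def sd_notify(state: str) -> bool:
    """Send one state string (e.g. "READY=1") to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was not
    started by systemd with notifications enabled.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        address = b"\0" + path[1:].encode()
    else:
        address = path.encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), address)
    return True


def watchdog_interval(default: float = 10.0) -> float:
    """Seconds between pings: half of WATCHDOG_USEC, which systemd derives
    from WatchdogSec=. Falls back to `default` outside systemd."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2 / 1_000_000 if usec else default


def run(do_work, iterations=None):
    """Hypothetical main loop: one unit of work, then a keepalive ping."""
    sd_notify("READY=1")
    interval = watchdog_interval()
    count = 0
    while iterations is None or count < iterations:
        do_work()
        sd_notify("WATCHDOG=1")  # if do_work() hangs, this ping never arrives
        time.sleep(interval)
        count += 1
```

Pinging at half the configured interval leaves some slack for scheduling delays; the exact ratio is a design choice rather than a requirement.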

Create two scripts - one for crash detection, one for hangs:

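#!/bin/bash
# /usr/local/bin/notify-crash.sh
# The payload is double-quoted so that the shell expands the timestamp
timestamp="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
curl -sS --max-time 10 -X POST https://api.monitoring.example.com/alerts \
  -H "Content-Type: application/json" \
  -d "{\"service\":\"myservice\",\"status\":\"crashed\",\"timestamp\":\"${timestamp}\"}"

#!/bin/bash
# /usr/local/bin/notify-hang.sh
# The payload is double-quoted so that the shell expands the timestamp
timestamp="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
curl -sS --max-time 10 -X POST https://api.monitoring.example.com/alerts \
  -H "Content-Type: application/json" \
  -d "{\"service\":\"myservice\",\"status\":\"hung\",\"timestamp\":\"${timestamp}\"}"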

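For more complex scenarios, you can create a path unit that starts a notifier when a watched path changes. Keep in mind that path units react to filesystem changes (via inotify), not to individual journal entries, and that journal files normally live in a per-machine subdirectory of /var/log/journal, so treat this as a coarse trigger at best: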

[Unit]
Description=Watch for service failures

[Path]
PathChanged=/var/log/journal
MakeDirectory=yes
Unit=notify-failure.service

[Install]
WantedBy=multi-user.target

This approach has minimal overhead as it leverages systemd's native event system rather than polling. The notification scripts should be lightweight - consider queueing mechanisms if network calls might block.
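
One way to read the queueing suggestion: the notification hook only writes the alert to a spool directory and returns, while a separate job delivers the queued alerts. The sketch below illustrates that idea under assumptions of my own; the function names, the file naming scheme and the spool directory are hypothetical, and send is whatever callable actually performs the network call.

```python
import json
import os
import tempfile
import time


def enqueue_alert(spool_dir: str, alert: dict) -> str:
    """Write the alert into spool_dir atomically and return the file path."""
    os.makedirs(spool_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(alert, handle)
    # Timestamp prefix keeps files roughly in arrival order; the random part
    # of the temporary name avoids collisions between concurrent writers.
    unique = os.path.basename(tmp_path)[:-4]
    final_path = os.path.join(spool_dir, f"{time.time_ns()}-{unique}.json")
    os.replace(tmp_path, final_path)  # atomic rename within one filesystem
    return final_path


def drain(spool_dir: str, send) -> int:
    """Deliver queued alerts in file-name order; stop at the first failure."""
    delivered = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".json"):
            continue  # skip half-written temporary files
        path = os.path.join(spool_dir, name)
        with open(path) as handle:
            alert = json.load(handle)
        try:
            send(alert)
        except Exception:
            break  # keep this alert and the remaining ones for the next run
        os.remove(path)
        delivered += 1
    return delivered
```

The hook then costs one small file write, and a slow or unreachable endpoint only delays delivery instead of blocking the service's stop or restart.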

For ultimate control, you can use journalctl with persistent monitoring:

journalctl -u my-service -f -o json | grep --line-buffered '"MESSAGE_ID":"..."' | while read -r line
do
  # Parse "$line" and send notifications
  :
done
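
Matching JSON with grep is brittle, so the same loop can also be written with a real JSON parser. The following is a sketch only: parse_entry, matches and follow are names chosen for illustration, and the MESSAGE_ID values to watch for are left to the caller because they depend on which events you care about.

```python
import json
import subprocess


def parse_entry(line: str):
    """Decode one line of `journalctl -o json` output; None if it is not a JSON object."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    return entry if isinstance(entry, dict) else None


def matches(entry: dict, message_ids: set) -> bool:
    """True when the entry carries one of the wanted MESSAGE_ID values."""
    value = entry.get("MESSAGE_ID")
    # Journal fields are usually strings but may also be null or arrays.
    return isinstance(value, str) and value in message_ids


def follow(unit: str, message_ids: set, handle) -> None:
    """Stream the unit's journal and call handle(entry) for every match."""
    cmd = ["journalctl", "-u", unit, "-f", "-o", "json"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            entry = parse_entry(line)
            if entry is not None and matches(entry, message_ids):
                handle(entry)
```

A call such as follow("my-service", wanted_ids, send_notification) would block and run until journalctl exits; wanted_ids and send_notification are placeholders for your own values.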

When running critical services under systemd, we often need immediate notification when things go wrong - whether from crashes (failed state) or hangs (WatchdogSec timeout). While systemd's built-in FailureAction provides some options, it lacks the flexibility to trigger custom network alerts.

We'll leverage systemd's OnFailure directive combined with a dedicated notifier service. This creates an event-driven approach without log parsing or polling delays.

1. Create Notification Service

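First, implement a template service (notify-service@.service, so that %i receives the name of the failed unit) that handles the network messaging:


# /etc/systemd/system/notify-service@.service
[Unit]
Description=Service Failure Notifier for %i

[Service]
# oneshot: the notifier sends its alert and exits
Type=oneshot
ExecStart=/usr/local/bin/send-service-alert %i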

2. Configure Main Service

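Modify your monitored service unit file. Note that with Restart= set, the unit enters the failed state (and OnFailure= fires) only once the start limit has been reached: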


[Unit]
OnFailure=notify-service@%n.service

[Service]
WatchdogSec=30s
Restart=on-failure

3. Alert Script Implementation

Here's a sample Python notifier script (/usr/local/bin/send-service-alert):


#!/usr/bin/env python3
import socket
import sys

service_name = sys.argv[1]
message = f"ALERT: Service {service_name} failed"

# UDP broadcast example
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
sock.sendto(message.encode(), ('255.255.255.255', 9999))
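
To check that the broadcast actually arrives, something has to listen on UDP port 9999 on another machine in the same broadcast domain. A minimal listener sketch follows; open_listener and read_alert are illustrative names, and the 4096-byte buffer is an arbitrary assumption.

```python
import socket


def open_listener(port: int = 9999, host: str = "") -> socket.socket:
    """Bind a UDP socket; an empty host listens on all interfaces."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock


def read_alert(sock: socket.socket):
    """Block until one datagram arrives; return (sender_ip, message_text)."""
    data, (sender_ip, _sender_port) = sock.recvfrom(4096)
    return sender_ip, data.decode(errors="replace")
```

Calling read_alert(open_listener()) in a loop and printing the result is enough for a quick smoke test.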

To distinguish between crashes and hangs, check $SERVICE_RESULT in an ExecStopPost= command; systemd sets it to "watchdog" when the watchdog timeout fired:


[Service]
WatchdogSec=30s
# Runs after every stop and starts the notifier instance matching the stop reason
ExecStopPost=/bin/sh -c 'case "$$SERVICE_RESULT" in success) ;; watchdog) systemctl start --no-block "notify-service@hang.%n.service" ;; *) systemctl start --no-block "notify-service@crash.%n.service" ;; esac'
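
systemd reports why a service stopped in $SERVICE_RESULT, which is passed to ExecStopPost= commands; to my knowledge, systemd 251 and later also pass $MONITOR_SERVICE_RESULT to units started through OnFailure=. A notifier can use whichever is present to tell a hang from a crash. The sketch below assumes that only the "watchdog" result counts as a hang; failure_type is a hypothetical helper name.

```python
import os


def failure_type(environ=None) -> str:
    """Classify a stop reason as "hang", "crash", "none" or "unknown"."""
    if environ is None:
        environ = os.environ
    result = (environ.get("SERVICE_RESULT")
              or environ.get("MONITOR_SERVICE_RESULT")
              or "")
    if not result:
        return "unknown"  # not started by systemd with a result attached
    if result == "success":
        return "none"     # clean stop, nothing to report
    if result == "watchdog":
        return "hang"     # the service stopped sending WATCHDOG=1
    return "crash"        # e.g. "signal", "core-dump", "exit-code"
```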

For modern environments, consider HTTP notifications:


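#!/usr/bin/env python3
import sys

import requests

# The instance name carries the failure type, e.g. "hang.my-service.service"
alert = {
    "service": sys.argv[1],
    "failure_type": "hang" if "hang" in sys.argv[1] else "crash"
}

# json= serializes the payload and sets the Content-Type header;
# the timeout keeps an unreachable endpoint from blocking the notifier
response = requests.post("https://monitoring.example.com/alerts",
    json=alert,
    timeout=10)
response.raise_for_status()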
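
If installing the third-party requests package on the monitored host is not an option, the same alert can be sent with the standard library. This is a sketch under the same assumptions as the script above (same endpoint, same payload fields); build_request and send_alert are illustrative names, and the 5 second timeout is an arbitrary choice.

```python
import json
import urllib.request

ALERT_URL = "https://monitoring.example.com/alerts"


def build_request(instance: str, url: str = ALERT_URL) -> urllib.request.Request:
    """Build the POST request for one alert without sending it."""
    alert = {
        "service": instance,
        "failure_type": "hang" if "hang" in instance else "crash",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(alert).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(instance: str, timeout: float = 5.0) -> bool:
    """Send the alert; return False instead of raising on network errors."""
    try:
        with urllib.request.urlopen(build_request(instance), timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False
```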

Test your setup by forcing failures:


# Simulate crash
systemctl kill -s SIGSEGV your-service.service

# Simulate hang (watchdog timeout)
systemctl kill -s SIGSTOP your-service.service
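
After forcing a failure, it helps to confirm what systemd recorded for the unit. The sketch below reads the Result, NRestarts and ActiveState properties through systemctl show and parses its Key=Value output; parse_show and unit_state are hypothetical helper names, and your-service.service is the placeholder unit used above.

```python
import subprocess


def parse_show(text: str) -> dict:
    """Turn `systemctl show` output (one Key=Value per line) into a dict."""
    properties = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if separator:
            properties[key] = value
    return properties


def unit_state(unit: str) -> dict:
    """Query a few properties that reveal how the last run ended."""
    cmd = ["systemctl", "show", unit,
           "-p", "Result", "-p", "NRestarts", "-p", "ActiveState"]
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_show(completed.stdout)
```

After the simulated hang, unit_state("your-service.service") should eventually report a Result of "watchdog" if the watchdog is wired up as intended.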