When running critical services with systemd, we often need immediate notification when things go wrong - whether it's a full crash or a silent hang. While systemd provides excellent monitoring capabilities through features like WatchdogSec, getting real-time alerts requires some additional configuration.
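The obvious built-in option is FailureAction, but it only allows a fixed set of actions such as reboots. OnFailure is more flexible, but it takes unit names rather than commands, and for a service with Restart= it fires only once the start limit has been reached. For custom network notifications on every failure, we need to get more creative with systemd's event system.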
Here's a comprehensive approach that handles both crash and hang detection:
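[Unit]
Description=My Critical Service
After=network.target
[Service]
# Type=notify: the service itself must send READY=1 and periodic WATCHDOG=1 via sd_notify()
Type=notify
ExecStart=/usr/bin/my-service
WatchdogSec=30s
Restart=on-failure
# ExecStopPost runs after every stop; systemd sets SERVICE_RESULT for it.
# "watchdog" means a missed keep-alive (hang); any other non-success result is treated as a crash.
ExecStopPost=/bin/sh -c 'case "${SERVICE_RESULT}" in success) ;; watchdog) exec /usr/local/bin/notify-hang.sh ;; *) exec /usr/local/bin/notify-crash.sh ;; esac'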
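WatchdogSec only detects hangs if the service itself pings systemd regularly. As a rough sketch of what that looks like, assuming the service is written in Python and started with NotifyAccess permitting its main process, the notification protocol is just a datagram sent to the socket named in NOTIFY_SOCKET. The function names sd_notify, watchdog_interval, and serve_forever are illustrative, not part of any library:

```python
import os
import socket
import time


def sd_notify(state: str) -> bool:
    """Send a state string such as "READY=1" to systemd's notification socket.

    Returns False when NOTIFY_SOCKET is unset (not started by systemd).
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract socket, addressed with a NUL byte.
    address = b"\0" + path[1:].encode() if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), address)
    return True


def watchdog_interval(default: float = 15.0) -> float:
    """Return half of the watchdog timeout systemd exported, in seconds."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2_000_000 if usec else default


def serve_forever() -> None:
    """Hypothetical main loop: report readiness, then ping between work units."""
    sd_notify("READY=1")
    while True:
        time.sleep(watchdog_interval())  # placeholder for the real work
        sd_notify("WATCHDOG=1")
```

Pinging at half the configured timeout leaves headroom for a slow iteration. If the loop stalls, the pings stop and systemd fails the service with the watchdog result.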
Create two scripts - one for crash detection, one for hangs:
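#!/bin/bash
# /usr/local/bin/notify-crash.sh
# Compute the timestamp first: $(...) is not expanded inside single quotes
timestamp="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
curl --max-time 10 -X POST https://api.monitoring.example.com/alerts \
  -H "Content-Type: application/json" \
  -d "{\"service\":\"myservice\",\"status\":\"crashed\",\"timestamp\":\"${timestamp}\"}"
#!/bin/bash
# /usr/local/bin/notify-hang.sh
# Compute the timestamp first: $(...) is not expanded inside single quotes
timestamp="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
curl --max-time 10 -X POST https://api.monitoring.example.com/alerts \
  -H "Content-Type: application/json" \
  -d "{\"service\":\"myservice\",\"status\":\"hung\",\"timestamp\":\"${timestamp}\"}"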
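For more complex scenarios, you can create path units that trigger on journal activity. Treat this as a coarse trigger: path units rely on inotify, which is not recursive, so the unit reacts to changes directly inside the watched directory rather than to individual journal entries, and the triggered service still has to query the journal to find out what happened: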
[Unit]
Description=Watch for service failures
[Path]
PathChanged=/var/log/journal
MakeDirectory=yes
Unit=notify-failure.service
[Install]
WantedBy=multi-user.target
This approach has minimal overhead as it leverages systemd's native event system rather than polling. The notification scripts should be lightweight - consider queueing mechanisms if network calls might block.
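One way to picture the queueing idea, assuming a long-running notifier process rather than the one-shot scripts above, is a small in-memory queue drained by a background thread. The class name AlertQueue and the send callable are hypothetical; send stands in for whatever actually performs the network call:

```python
import queue
import threading
from typing import Callable


class AlertQueue:
    """Hand alerts to a background thread so callers never block on the network."""

    def __init__(self, send: Callable[[dict], None], maxsize: int = 100) -> None:
        self._send = send
        self._queue = queue.Queue(maxsize=maxsize)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, alert: dict) -> bool:
        """Enqueue an alert without blocking; return False if the queue is full."""
        try:
            self._queue.put_nowait(alert)
            return True
        except queue.Full:
            return False

    def close(self) -> None:
        """Deliver everything already queued, then stop the worker."""
        self._queue.put(None)  # sentinel marking the end of the stream
        self._worker.join()

    def _run(self) -> None:
        while True:
            alert = self._queue.get()
            if alert is None:
                return
            try:
                self._send(alert)
            except Exception as exc:  # a failed delivery must not kill the worker
                print(f"alert delivery failed: {exc}")
```

The bounded queue is a deliberate choice in this sketch: if the monitoring endpoint is down for a long time, alerts are dropped (submit returns False) instead of growing memory without limit.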
For ultimate control, you can use journalctl with persistent monitoring:
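# JSON output separates keys and values with ":" (optionally surrounded by spaces), not "="
journalctl -u my-service -f -o json | grep --line-buffered -E '"MESSAGE_ID" ?: ?"..."' | while IFS= read -r line
do
    # Parse and send notifications
    :
done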
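The "parse" step of that loop can be sketched in Python instead of grep, which avoids matching on the exact JSON spacing. This is only an illustration: the function name matching_entries is made up, and the MESSAGE_ID to look for is left as a parameter because the value depends on which event you want to catch:

```python
import json
from typing import Iterable, Iterator


def matching_entries(lines: Iterable[str], message_id: str) -> Iterator[dict]:
    """Yield journal entries (one JSON object per line) with the given MESSAGE_ID."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if isinstance(entry, dict) and entry.get("MESSAGE_ID") == message_id:
            yield entry
```

In practice the lines argument would be sys.stdin, with the script placed at the end of a journalctl -u my-service -f -o json pipeline.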
When running critical services under systemd, we often need immediate notification when things go wrong - whether from crashes (failed state) or hangs (WatchdogSec timeout). While systemd's built-in FailureAction provides some options, it lacks the flexibility to trigger custom network alerts.
We'll leverage systemd's OnFailure directive combined with a dedicated notifier service. This creates an event-driven approach without log parsing or polling delays.
1. Create Notification Service
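First, implement a template service that handles the network messaging. Save it as notify-service@.service so that the instance name (%i) carries the name of the failed unit:
[Unit]
Description=Service Failure Notifier for %i
[Service]
# The alert script runs once and exits
Type=oneshot
ExecStart=/usr/local/bin/send-service-alert %i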
2. Configure Main Service
Modify your monitored service unit file:
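[Unit]
# %n expands to the full unit name, which becomes the notifier's instance name
OnFailure=notify-service@%n.service
[Service]
# Only effective if the service sends WATCHDOG=1 keep-alives via sd_notify()
WatchdogSec=30s
# With Restart= set, the unit enters the failed state (and OnFailure fires) only once the start limit is reached
Restart=on-failure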
3. Alert Script Implementation
Here's a sample Python notifier script (/usr/local/bin/send-service-alert):
#!/usr/bin/env python3
import socket
import sys
service_name = sys.argv[1]
message = f"ALERT: Service {service_name} failed"
# UDP broadcast example
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
sock.sendto(message.encode(), ('255.255.255.255', 9999))
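To check that the broadcast actually arrives, a matching listener can run on any machine in the same network segment. This is a minimal sketch, assuming the port 9999 from the sender above; the function names open_listener and receive_alert are illustrative:

```python
import socket


def open_listener(host: str = "", port: int = 9999) -> socket.socket:
    """Bind a UDP socket; an empty host listens on all interfaces."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock


def receive_alert(sock: socket.socket, timeout: float = 5.0) -> str:
    """Wait for one datagram and return it as text.

    Raises a timeout error if nothing arrives within `timeout` seconds.
    """
    sock.settimeout(timeout)
    data, _sender = sock.recvfrom(4096)
    return data.decode(errors="replace")
```

Keep in mind that UDP gives no delivery guarantee, so a lost datagram means a lost alert; the listener is useful for testing, but it does not make the channel reliable.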
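To distinguish between crashes and hangs, start a differently named notifier instance depending on SERVICE_RESULT, which systemd sets for ExecStopPost commands:
[Service]
WatchdogSec=30s
# ExecStopPost runs after every stop. SERVICE_RESULT is "watchdog" after a missed keep-alive (hang);
# any other non-success result (signal, core-dump, exit-code, ...) is reported as a crash.
ExecStopPost=/bin/sh -c 'case "${SERVICE_RESULT}" in success) ;; watchdog) systemctl start --no-block "notify-service@hang.%n.service" ;; *) systemctl start --no-block "notify-service@crash.%n.service" ;; esac'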
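A notifier can also classify the failure itself from systemd's result string rather than from its instance name. The sketch below assumes the result is available in the environment: SERVICE_RESULT is set for ExecStopPost commands, and recent systemd versions are understood to export MONITOR_SERVICE_RESULT to units started through OnFailure, which is worth verifying against your version. The function names are hypothetical:

```python
import os
from typing import Mapping


def classify_failure(result: str) -> str:
    """Map a systemd service result string to "hang", "crash", "ok", or "unknown"."""
    if result == "watchdog":
        return "hang"  # the keep-alive deadline was missed
    if result == "success":
        return "ok"
    if not result:
        return "unknown"  # no result was provided
    return "crash"  # e.g. "signal", "core-dump", "exit-code"


def result_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Read the result from whichever variable is present, else return ""."""
    return env.get("SERVICE_RESULT") or env.get("MONITOR_SERVICE_RESULT", "")
```

Treating every non-success, non-watchdog result as a crash is a simplification; results such as a start timeout could reasonably be reported as their own category.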
For modern environments, consider HTTP notifications:
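#!/usr/bin/env python3
import sys
import requests

# The instance name looks like "hang.<unit>" or "crash.<unit>"
instance = sys.argv[1]
alert = {
    "service": instance,
    # Match the prefix only, so a unit whose name merely contains "hang" is not misclassified
    "failure_type": "hang" if instance.startswith("hang.") else "crash",
}
# json= serializes the payload and sets the Content-Type header itself
response = requests.post("https://monitoring.example.com/alerts", json=alert, timeout=10)
response.raise_for_status()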
Test your setup by forcing failures:
# Simulate crash
systemctl kill -s SIGSEGV your-service.service
# Simulate hang (watchdog timeout)
systemctl kill -s SIGSTOP your-service.service