When dealing with critical system services, we often configure aggressive restart policies in systemd (since systemd 230 the rate-limit settings are documented as StartLimitIntervalSec= and StartLimitBurst= in the [Unit] section; the older [Service] spellings used below are still accepted for compatibility):
[Service]
Restart=always
RestartSec=5s
StartLimitInterval=300s
StartLimitBurst=5
The problem emerges when we need custom behavior at the start limit boundary: systemd only offers extreme actions like reboot or poweroff through StartLimitAction.
We'll implement a two-layer monitoring system:
- Primary service with standard restart policy
- Failure detection unit that triggers only at limit threshold
1. Configure the Main Service (/etc/systemd/system/mysql-maintenance.service)
[Unit]
Description=MySQL Maintenance Service
OnFailure=notify-failure@%p.service
[Service]
ExecStart=/usr/local/bin/mysql-maintenance
Restart=always
RestartSec=30s
StartLimitInterval=5min
StartLimitBurst=3
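Once systemd has reloaded the unit, the restart counter and a tripped start limit can be inspected and cleared by hand; these are standard systemctl commands, only the unit name is specific to this example:
systemctl daemon-reload
systemctl show -p NRestarts,ActiveState,SubState mysql-maintenance
systemctl reset-failed mysql-maintenance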
2. Create the Threshold Detection Service (/etc/systemd/system/notify-failure@.service, so the OnFailure= reference resolves)
[Unit]
Description=Service Failure Threshold Notifier %I
[Service]
Type=oneshot
ExecStart=/usr/local/bin/threshold-handler %i
User=alertuser
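A quick sanity check is to confirm that the OnFailure= dependency resolved to the instantiated template name (again, the unit name here is just this example's):
systemctl show -p OnFailure mysql-maintenance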
3. The Threshold Handler Script (/usr/local/bin/threshold-handler):
#!/bin/bash
SERVICE="$1"
FAIL_COUNT=$(systemctl show -p NRestarts "$SERVICE" | cut -d= -f2)
LIMIT_BURST=$(systemctl show -p StartLimitBurst "$SERVICE" | cut -d= -f2)

if [ "$FAIL_COUNT" -ge "$LIMIT_BURST" ]; then
    /usr/local/bin/send-alert --service "$SERVICE" \
        --failure-count "$FAIL_COUNT" \
        --hostname "$(hostname)"
fi
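Before relying on OnFailure=, the handler can be exercised by hand; mysql-maintenance is this example's unit and send-alert stands in for whatever notification tool you actually use:
sudo -u alertuser /usr/local/bin/threshold-handler mysql-maintenance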
To avoid false positives, enhance the handler script with state verification:
#!/bin/bash
SERVICE="$1"

# Get systemd properties
ACTIVE_STATE=$(systemctl show -p ActiveState "$SERVICE" | cut -d= -f2)
SUB_STATE=$(systemctl show -p SubState "$SERVICE" | cut -d= -f2)
FAIL_COUNT=$(systemctl show -p NRestarts "$SERVICE" | cut -d= -f2)
LIMIT_BURST=$(systemctl show -p StartLimitBurst "$SERVICE" | cut -d= -f2)

# Only trigger if both conditions are met:
# 1. Service is in failed state
# 2. Restart attempts exhausted
if [ "$ACTIVE_STATE" = "failed" ] && \
   [ "$SUB_STATE" = "failed" ] && \
   [ "$FAIL_COUNT" -ge "$LIMIT_BURST" ]; then
    # Custom handling logic here, e.g. the alert call from step 3
    /usr/local/bin/send-alert --service "$SERVICE" \
        --failure-count "$FAIL_COUNT" \
        --hostname "$(hostname)"
fi
For more complex scenarios, consider watching the service's runtime unit file with a path unit:
[Unit]
Description=Watch for service failure threshold
[Path]
PathModified=/run/systemd/system/mysql-maintenance.service
Unit=threshold-handler.service
[Install]
WantedBy=multi-user.target
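Saved under a name of your choice (mysql-maintenance-watch.path is just a placeholder here), the path unit is enabled like any other:
systemctl daemon-reload
systemctl enable --now mysql-maintenance-watch.path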
The handler service would then query the unit's current state (for example with systemctl show, as in the scripts above) whenever the runtime unit file under /run/systemd/system/ changes.
A few refinements worth adding:
- Add rate limiting to the alert mechanism
- Include service logs in notifications (see the journalctl sketch below)
- Monitor handler script execution time
- Implement proper error handling for the handler
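Attaching recent log lines to the notification is a one-liner with journalctl; the --body flag on send-alert is a hypothetical example, substitute whatever your alerting tool expects:
LOGS=$(journalctl -u "$SERVICE" -n 50 --no-pager)
/usr/local/bin/send-alert --service "$SERVICE" --body "$LOGS"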
When designing robust systemd services, we often use Restart=always to maintain service availability. However, this creates a new problem: how to distinguish between temporary failures (where auto-restart makes sense) and permanent failures that require human intervention.
[Service]
Restart=always
StartLimitInterval=60s
StartLimitBurst=5
Out of the box, systemd only provides nuclear options when the start limit is reached:
[Service]
StartLimitAction=reboot|poweroff|none
What we really want is to execute custom logic - sending alerts, running cleanup scripts, or triggering other services.
Here's how to implement a sophisticated failure handling system:
[Unit]
OnFailure=notify-email@%p.service
[Service]
Restart=always
RestartSec=5s
StartLimitInterval=300s
StartLimitBurst=5
StartLimitAction=none
# This script checks if we hit start limit
ExecStopPost=/usr/local/bin/check-start-limit.sh %n
Create /usr/local/bin/check-start-limit.sh:
#!/bin/bash
SERVICE="$1"
MAX_RETRIES=5

RESTARTS=$(systemctl show -p NRestarts "$SERVICE" | cut -d= -f2)
if [ "$RESTARTS" -ge "$MAX_RETRIES" ]; then
    systemctl start "start-limit-handler@${SERVICE}"
fi
Add a template service for limit handling, saved as /etc/systemd/system/start-limit-handler@.service so the systemctl start call above resolves to it:
[Unit]
Description=Start limit handler for %i
[Service]
Type=oneshot
ExecStart=/usr/local/bin/handle-start-limit.sh %i
To prevent duplicate alerts, add rate limiting to your handler (/usr/local/bin/handle-start-limit.sh):
#!/bin/bash
SERVICE="$1"
LOCK_FILE="/tmp/${SERVICE}_limit_alert.lock"

# Only send alert if lock doesn't exist or is older than 1 hour
if [ ! -f "$LOCK_FILE" ] || [ $(($(date +%s) - $(stat -c %Y "$LOCK_FILE"))) -gt 3600 ]; then
    touch "$LOCK_FILE"
    # Your alert logic here
    echo "Service $SERVICE hit start limit" | mail -s "Alert" admin@example.com
fi
For more complex scenarios, you can monitor service states programmatically:
#!/usr/bin/python3
from systemd import journal

reader = journal.Reader()
# Messages PID 1 logs about a unit carry the UNIT= field
reader.add_match(UNIT="your-service.service")
reader.seek_tail()
reader.get_previous()

while True:
    reader.wait()
    for entry in reader:
        # systemd emits this message when the start limit is hit
        if 'Start request repeated too quickly' in entry.get('MESSAGE', ''):
            # Trigger your custom action here
            pass
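If you go this route, the watcher is easiest to run as its own small always-on service; the unit and script names below are placeholders for whatever you call yours:

[Unit]
Description=Start limit journal watcher

[Service]
ExecStart=/usr/local/bin/start-limit-watcher.py
Restart=on-failure

[Install]
WantedBy=multi-user.target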
Here's a complete implementation I've used in production:
[Unit]
Description=Critical DB Worker Service
OnFailure=notify-email@%p.service
[Service]
ExecStart=/usr/local/bin/db-worker
Restart=always
RestartSec=10s
StartLimitInterval=1h
StartLimitBurst=3
StartLimitAction=none
ExecStopPost=/usr/local/bin/service-monitor post-stop %n $SERVICE_RESULT $EXIT_CODE $EXIT_STATUS
[Install]
WantedBy=multi-user.target
The monitor script handles various failure scenarios differently; the check_start_limit and log_failure helpers below are minimal sketches to adapt to your environment:
#!/bin/bash
ACTION="$1"
SERVICE="$2"
RESULT="$3"
EXIT_CODE="$4"
EXIT_STATUS="$5"

# Hand off to the limit handler once restart attempts are exhausted
check_start_limit() {
    local restarts limit
    restarts=$(systemctl show -p NRestarts "$1" | cut -d= -f2)
    limit=$(systemctl show -p StartLimitBurst "$1" | cut -d= -f2)
    if [ "$restarts" -ge "$limit" ]; then
        systemctl start "start-limit-handler@$1"
    fi
}

# Record the failure details in the journal
log_failure() {
    logger -t service-monitor "$SERVICE result=$1 exit_code=$2 exit_status=$EXIT_STATUS"
}

case "$ACTION" in
    post-stop)
        check_start_limit "$SERVICE"
        log_failure "$RESULT" "$EXIT_CODE"
        ;;
esac
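For completeness, the notify-email@.service template referenced by OnFailure= above can stay minimal. The mail command and recipient address are assumptions, so swap in your own alerting tool; save it as /etc/systemd/system/notify-email@.service:

[Unit]
Description=Failure notification email for %i

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'systemctl status --no-pager %i | mail -s "%i failed on %H" admin@example.com'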