When dealing with critical system services, we often configure aggressive restart policies in systemd (since systemd 230 the rate-limit settings are documented as StartLimitIntervalSec= and StartLimitBurst= in the [Unit] section; the older [Service] spellings used below are still accepted for compatibility):
[Service]
Restart=always
RestartSec=5s
StartLimitInterval=300s
StartLimitBurst=5
The problem emerges when we need custom behavior at the start limit boundary: systemd only offers extreme actions like reboot or poweroff through StartLimitAction.
We'll implement a two-layer monitoring system:
- Primary service with standard restart policy
- Failure detection unit that triggers only at limit threshold
1. Configure the Main Service (/etc/systemd/system/mysql-maintenance.service)
[Unit]
Description=MySQL Maintenance Service
OnFailure=notify-failure@%p.service
[Service]
ExecStart=/usr/local/bin/mysql-maintenance
Restart=always
RestartSec=30s
StartLimitInterval=5min
StartLimitBurst=3
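Once systemd has reloaded the unit, the restart counter and a tripped start limit can be inspected and cleared by hand; these are standard systemctl commands, only the unit name is specific to this example:
systemctl daemon-reload
systemctl show -p NRestarts,ActiveState,SubState mysql-maintenance
systemctl reset-failed mysql-maintenance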
2. Create the Threshold Detection Service (/etc/systemd/system/notify-failure@.service, so the OnFailure= reference resolves)
[Unit]
Description=Service Failure Threshold Notifier %I
[Service]
Type=oneshot
ExecStart=/usr/local/bin/threshold-handler %i
User=alertuser
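A quick sanity check is to confirm that the OnFailure= dependency resolved to the instantiated template name (again, the unit name here is just this example's):
systemctl show -p OnFailure mysql-maintenance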
3. The Threshold Handler Script (/usr/local/bin/threshold-handler):
#!/bin/bash
SERVICE="$1"
FAIL_COUNT=$(systemctl show -p NRestarts "$SERVICE" | cut -d= -f2)
LIMIT_BURST=$(systemctl show -p StartLimitBurst "$SERVICE" | cut -d= -f2)

if [ "$FAIL_COUNT" -ge "$LIMIT_BURST" ]; then
    /usr/local/bin/send-alert --service "$SERVICE" \
        --failure-count "$FAIL_COUNT" \
        --hostname "$(hostname)"
fi
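Before relying on OnFailure=, the handler can be exercised by hand; mysql-maintenance is this example's unit and send-alert stands in for whatever notification tool you actually use:
sudo -u alertuser /usr/local/bin/threshold-handler mysql-maintenance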
To avoid false positives, enhance the handler script with state verification:
#!/bin/bash
SERVICE="$1"

# Get systemd properties
ACTIVE_STATE=$(systemctl show -p ActiveState "$SERVICE" | cut -d= -f2)
SUB_STATE=$(systemctl show -p SubState "$SERVICE" | cut -d= -f2)
FAIL_COUNT=$(systemctl show -p NRestarts "$SERVICE" | cut -d= -f2)
LIMIT_BURST=$(systemctl show -p StartLimitBurst "$SERVICE" | cut -d= -f2)

# Only trigger if both conditions are met:
# 1. Service is in failed state
# 2. Restart attempts exhausted
if [ "$ACTIVE_STATE" = "failed" ] && \
   [ "$SUB_STATE" = "failed" ] && \
   [ "$FAIL_COUNT" -ge "$LIMIT_BURST" ]; then
    # Custom handling logic here, e.g. the alert call from step 3
    /usr/local/bin/send-alert --service "$SERVICE" \
        --failure-count "$FAIL_COUNT" \
        --hostname "$(hostname)"
fi
For more complex scenarios, consider watching the service's runtime unit file with a path unit:
[Unit]
Description=Watch for service failure threshold
[Path]
PathModified=/run/systemd/system/mysql-maintenance.service
Unit=threshold-handler.service
[Install]
WantedBy=multi-user.target
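Saved under a name of your choice (mysql-maintenance-watch.path is just a placeholder here), the path unit is enabled like any other:
systemctl daemon-reload
systemctl enable --now mysql-maintenance-watch.path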
The handler service would then query the unit's current state (for example with systemctl show, as in the scripts above) whenever the runtime unit file under /run/systemd/system/ changes.
A few refinements worth adding:
- Add rate limiting to the alert mechanism
- Include service logs in notifications (see the journalctl sketch below)
- Monitor handler script execution time
- Implement proper error handling for the handler
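Attaching recent log lines to the notification is a one-liner with journalctl; the --body flag on send-alert is a hypothetical example, substitute whatever your alerting tool expects:
LOGS=$(journalctl -u "$SERVICE" -n 50 --no-pager)
/usr/local/bin/send-alert --service "$SERVICE" --body "$LOGS"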
When designing robust systemd services, we often use Restart=always to maintain service availability. However, this creates a new problem: how to distinguish between temporary failures (where auto-restart makes sense) and permanent failures that require human intervention.
[Service]
Restart=always
StartLimitInterval=60s
StartLimitBurst=5
Out of the box, systemd only provides nuclear options when the start limit is reached:
[Service]
StartLimitAction=reboot|poweroff|none
What we really want is to execute custom logic - sending alerts, running cleanup scripts, or triggering other services.
Here's how to implement a sophisticated failure handling system:
[Unit]
OnFailure=notify-email@%p.service
[Service]
Restart=always
RestartSec=5s
StartLimitInterval=300s
StartLimitBurst=5
StartLimitAction=none
# This script checks if we hit start limit
ExecStopPost=/usr/local/bin/check-start-limit.sh %n
Create /usr/local/bin/check-start-limit.sh:
#!/bin/bash
SERVICE="$1"
MAX_RETRIES=5

RESTARTS=$(systemctl show -p NRestarts "$SERVICE" | cut -d= -f2)
if [ "$RESTARTS" -ge "$MAX_RETRIES" ]; then
    systemctl start "start-limit-handler@${SERVICE}"
fi
Add a template service for limit handling, saved as /etc/systemd/system/start-limit-handler@.service so the systemctl start call above resolves to it:
[Unit]
Description=Start limit handler for %i
[Service]
Type=oneshot
ExecStart=/usr/local/bin/handle-start-limit.sh %i
To prevent duplicate alerts, add rate limiting to your handler (/usr/local/bin/handle-start-limit.sh):
#!/bin/bash
SERVICE="$1"
LOCK_FILE="/tmp/${SERVICE}_limit_alert.lock"

# Only send alert if lock doesn't exist or is older than 1 hour
if [ ! -f "$LOCK_FILE" ] || [ $(($(date +%s) - $(stat -c %Y "$LOCK_FILE"))) -gt 3600 ]; then
    touch "$LOCK_FILE"
    # Your alert logic here
    echo "Service $SERVICE hit start limit" | mail -s "Alert" admin@example.com
fi
For more complex scenarios, you can monitor service states programmatically:
#!/usr/bin/python3
from systemd import journal

reader = journal.Reader()
# Messages PID 1 logs about a unit carry the UNIT= field
reader.add_match(UNIT="your-service.service")
reader.seek_tail()
reader.get_previous()

while True:
    reader.wait()
    for entry in reader:
        # systemd emits this message when the start limit is hit
        if 'Start request repeated too quickly' in entry.get('MESSAGE', ''):
            # Trigger your custom action here
            pass
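If you go this route, the watcher is easiest to run as its own small always-on service; the unit and script names below are placeholders for whatever you call yours:

[Unit]
Description=Start limit journal watcher

[Service]
ExecStart=/usr/local/bin/start-limit-watcher.py
Restart=on-failure

[Install]
WantedBy=multi-user.target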
Here's a complete implementation I've used in production:
[Unit]
Description=Critical DB Worker Service
OnFailure=notify-email@%p.service
[Service]
ExecStart=/usr/local/bin/db-worker
Restart=always
RestartSec=10s
StartLimitInterval=1h
StartLimitBurst=3
StartLimitAction=none
ExecStopPost=/usr/local/bin/service-monitor post-stop %n $SERVICE_RESULT $EXIT_CODE $EXIT_STATUS
[Install]
WantedBy=multi-user.target
The monitor script handles various failure scenarios differently; the check_start_limit and log_failure helpers below are minimal sketches to adapt to your environment:
#!/bin/bash
ACTION="$1"
SERVICE="$2"
RESULT="$3"
EXIT_CODE="$4"
EXIT_STATUS="$5"

# Hand off to the limit handler once restart attempts are exhausted
check_start_limit() {
    local restarts limit
    restarts=$(systemctl show -p NRestarts "$1" | cut -d= -f2)
    limit=$(systemctl show -p StartLimitBurst "$1" | cut -d= -f2)
    if [ "$restarts" -ge "$limit" ]; then
        systemctl start "start-limit-handler@$1"
    fi
}

# Record the failure details in the journal
log_failure() {
    logger -t service-monitor "$SERVICE result=$1 exit_code=$2 exit_status=$EXIT_STATUS"
}

case "$ACTION" in
    post-stop)
        check_start_limit "$SERVICE"
        log_failure "$RESULT" "$EXIT_CODE"
        ;;
esac
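For completeness, the notify-email@.service template referenced by OnFailure= above can stay minimal. The mail command and recipient address are assumptions, so swap in your own alerting tool; save it as /etc/systemd/system/notify-email@.service:

[Unit]
Description=Failure notification email for %i

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'systemctl status --no-pager %i | mail -s "%i failed on %H" admin@example.com'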