How to Execute Custom Commands When a systemd Service Hits Its Start Limit Threshold


When dealing with critical system services, we often configure aggressive restart policies in systemd:

[Service]
Restart=always
RestartSec=5s
StartLimitInterval=300s
StartLimitBurst=5
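
Note that on newer systemd versions (roughly v230 and later), these rate-limit settings are [Unit]-section options named StartLimitIntervalSec= and StartLimitBurst=; the [Service]-section spellings used throughout this article are still accepted for backwards compatibility. The modern equivalent of the snippet above:

[Unit]
StartLimitIntervalSec=300s
StartLimitBurst=5

[Service]
Restart=always
RestartSec=5s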

The problem emerges when we need custom behavior at the start-limit boundary - through StartLimitAction, systemd only offers drastic built-in responses such as reboot or poweroff, with no hook for running our own command.

We'll implement a two-layer monitoring system:

  1. Primary service with standard restart policy
  2. Failure detection unit that triggers only at limit threshold

1. Configure the Main Service

[Unit]
Description=MySQL Maintenance Service
# %n expands to the full unit name (mysql-maintenance.service); %i would be
# empty here because this is not a template unit
OnFailure=notify-failure@%n.service

[Service]
ExecStart=/usr/local/bin/mysql-maintenance
Restart=always
RestartSec=30s
StartLimitInterval=5min
StartLimitBurst=3

2. Create the Threshold Detection Service (/etc/systemd/system/notify-failure@.service)

[Unit]
Description=Service Failure Threshold Notifier %I

[Service]
Type=oneshot
ExecStart=/usr/local/bin/threshold-handler %i
User=alertuser
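
After saving both units under /etc/systemd/system/ and marking the handler script executable, reload systemd so it picks them up (unit and script names as used in this article):

chmod +x /usr/local/bin/threshold-handler
systemctl daemon-reload
systemctl restart mysql-maintenance.service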

3. The Threshold Handler Script (/usr/local/bin/threshold-handler):

#!/bin/bash
SERVICE="$1"

# Compare the unit's restart counter against its configured burst limit
FAIL_COUNT=$(systemctl show -p NRestarts "$SERVICE" | cut -d= -f2)
LIMIT_BURST=$(systemctl show -p StartLimitBurst "$SERVICE" | cut -d= -f2)

if [ "$FAIL_COUNT" -ge "$LIMIT_BURST" ]; then
  /usr/local/bin/send-alert --service "$SERVICE" \
    --failure-count "$FAIL_COUNT" \
    --hostname "$(hostname)"
fi
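
You can sanity-check the two properties the script reads before wiring anything up:

systemctl show -p NRestarts mysql-maintenance.service
systemctl show -p StartLimitBurst mysql-maintenance.service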

To avoid false positives, enhance the handler script with state verification:

#!/bin/bash
SERVICE="$1"

# Get systemd properties
ACTIVE_STATE=$(systemctl show -p ActiveState "$SERVICE" | cut -d= -f2)
SUB_STATE=$(systemctl show -p SubState "$SERVICE" | cut -d= -f2)
FAIL_COUNT=$(systemctl show -p NRestarts "$SERVICE" | cut -d= -f2)
LIMIT_BURST=$(systemctl show -p StartLimitBurst "$SERVICE" | cut -d= -f2)

# Only trigger if both conditions are met:
# 1. Service is in failed state
# 2. Restart attempts exhausted
if [ "$ACTIVE_STATE" = "failed" ] && \
   [ "$SUB_STATE" = "failed" ] && \
   [ "$FAIL_COUNT" -ge "$LIMIT_BURST" ]; then
    # Custom handling logic goes here, e.g. the alert from the basic version
    /usr/local/bin/send-alert --service "$SERVICE" \
      --failure-count "$FAIL_COUNT" \
      --hostname "$(hostname)"
fi
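
Because notify-failure@.service is an ordinary template, you can exercise the alert path by hand before trusting it; the instance name is simply the service the handler should inspect:

systemctl start notify-failure@mysql-maintenance.service
systemctl status notify-failure@mysql-maintenance.service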

For more complex scenarios, you can also trigger the handler from a path unit that watches the service's runtime unit file:

[Unit]
Description=Watch for service failure threshold

[Path]
PathModified=/run/systemd/system/mysql-maintenance.service
Unit=threshold-handler.service

[Install]
WantedBy=multi-user.target

When the path unit fires, the handler service can then inspect the service's current state (for example via systemctl show, as in the scripts above) and decide whether the threshold has been reached.
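
A minimal sketch of the threshold-handler.service unit referenced by the path unit, reusing the handler script from step 3 with the unit name passed explicitly:

[Unit]
Description=Handle mysql-maintenance start-limit threshold

[Service]
Type=oneshot
ExecStart=/usr/local/bin/threshold-handler mysql-maintenance.service
User=alertuser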

To harden this setup further:

  • Add rate limiting to the alert mechanism
  • Include recent service logs in the notifications (see the sketch below)
  • Monitor the handler script's execution time
  • Implement proper error handling in the handler
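
For the second item, the alert call inside threshold-handler can attach recent journal output; a sketch (the --body flag on send-alert is hypothetical - pass the logs however your alerting tool expects):

LOGS=$(journalctl -u "$SERVICE" -n 50 --no-pager)
/usr/local/bin/send-alert --service "$SERVICE" \
  --failure-count "$FAIL_COUNT" \
  --hostname "$(hostname)" \
  --body "$LOGS"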

Stepping back to the underlying problem: when designing robust systemd services, we often use Restart=always to maintain availability. This creates a new question, though - how do we distinguish temporary failures (where auto-restart makes sense) from permanent failures that require human intervention?

[Service]
Restart=always
StartLimitInterval=60s
StartLimitBurst=5

Out of the box, systemd only provides drastic actions when the start limit is reached:

[Service]
StartLimitAction=reboot|poweroff|none

What we really want is to execute custom logic - sending alerts, running cleanup scripts, or triggering other services.

Here's how to implement a sophisticated failure handling system:

[Unit]
OnFailure=notify-email@%n.service

[Service]
Restart=always
RestartSec=5s
StartLimitInterval=300s
StartLimitBurst=5
StartLimitAction=none

# This script checks if we hit start limit
ExecStopPost=/usr/local/bin/check-start-limit.sh %n
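
If the service you are extending ships with your distribution, these additions don't have to go into the vendor unit file; running systemctl edit your-service.service creates a drop-in at /etc/systemd/system/your-service.service.d/override.conf where the same directives can live:

[Unit]
OnFailure=notify-email@%n.service

[Service]
StartLimitAction=none
ExecStopPost=/usr/local/bin/check-start-limit.sh %n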

Create /usr/local/bin/check-start-limit.sh:

#!/bin/bash
SERVICE="$1"
MAX_RETRIES=5

# Start the handler once the restart counter reaches the retry limit
RESTARTS=$(systemctl show -p NRestarts "$SERVICE" | cut -d= -f2)
if [ "${RESTARTS:-0}" -ge "$MAX_RETRIES" ]; then
    systemctl start "start-limit-handler@${SERVICE%.service}.service"
fi

Add a template service for limit handling (/etc/systemd/system/start-limit-handler@.service):

[Unit]
Description=Start limit handler for %i

[Service]
Type=oneshot
ExecStart=/usr/local/bin/handle-start-limit.sh %i

To prevent duplicate alerts, add rate limiting to the handler (/usr/local/bin/handle-start-limit.sh):

#!/bin/bash
SERVICE="$1"
LOCK_FILE="/tmp/${SERVICE}_limit_alert.lock"

# Only send alert if lock doesn't exist or is older than 1 hour
if [ ! -f "$LOCK_FILE" ] || [ $(($(date +%s) - $(stat -c %Y "$LOCK_FILE"))) -gt 3600 ]; then
    touch "$LOCK_FILE"
    # Your alert logic here
    echo "Service $SERVICE hit start limit" | mail -s "Alert" admin@example.com
fi

For more complex scenarios, you can monitor service states programmatically:

#!/usr/bin/python3
# Follow the journal and react when a unit hits its start limit.
# Note: systemd's own messages about a unit carry the UNIT= field
# (not _SYSTEMD_UNIT=, which is the unit of the logging process).
from systemd import journal

reader = journal.Reader()
reader.add_match(UNIT="your-service.service")
reader.seek_tail()
reader.get_previous()

while True:
    if reader.wait() != journal.APPEND:
        continue
    for entry in reader:
        message = entry.get('MESSAGE', '')
        if 'start-limit-hit' in message or 'Start request repeated too quickly' in message:
            # Trigger your custom action here
            pass
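
If you go this route, the watcher is itself a good candidate for a small always-on service (the script path here is just an example):

[Unit]
Description=Start-limit journal watcher

[Service]
ExecStart=/usr/local/bin/start-limit-watcher.py
Restart=on-failure

[Install]
WantedBy=multi-user.target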

Here's a complete implementation I've used in production:

[Unit]
Description=Critical DB Worker Service
OnFailure=notify-email@%n.service

[Service]
ExecStart=/usr/local/bin/db-worker
Restart=always
RestartSec=10s
StartLimitInterval=1h
StartLimitBurst=3
StartLimitAction=none
ExecStopPost=/usr/local/bin/service-monitor post-stop %n $SERVICE_RESULT $EXIT_CODE $EXIT_STATUS

[Install]
WantedBy=multi-user.target

The monitor script handles various failure scenarios differently:

#!/bin/bash
# service-monitor: dispatched from ExecStopPost with the hook name first
ACTION="$1"
SERVICE="$2"
RESULT="$3"
EXIT_CODE="$4"
EXIT_STATUS="$5"

check_start_limit() {
    local restarts burst
    restarts=$(systemctl show -p NRestarts "$1" | cut -d= -f2)
    burst=$(systemctl show -p StartLimitBurst "$1" | cut -d= -f2)
    [ "${burst:-0}" -gt 0 ] && [ "${restarts:-0}" -ge "$burst" ] && \
        systemctl start "start-limit-handler@${1%.service}.service"
}

log_failure() {
    logger -t service-monitor "service=$SERVICE result=$1 exit_code=$2 exit_status=$EXIT_STATUS"
}

case "$ACTION" in
    post-stop)
        check_start_limit "$SERVICE"
        log_failure "$RESULT" "$EXIT_CODE"
        ;;
esac
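
To verify the whole chain end to end, force the worker to fail repeatedly (for example by temporarily breaking its binary), then check the counters and the handler (unit names match the example above):

systemctl restart db-worker.service
systemctl show -p NRestarts -p StartLimitBurst db-worker.service
journalctl -u start-limit-handler@db-worker.service -n 20
systemctl reset-failed db-worker.service   # clear the failed state when done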