How to Handle Non-Zero Exit Codes in Systemd Timers Without Failing Service State



When running scheduled backups via systemd timers, a common frustration is that the backup script completes its primary function but returns a non-zero exit code because of non-critical warnings. The backup itself succeeded, yet systemd marks the service as failed, which can potentially prevent subsequent timer executions.

Systemd treats any non-zero exit code as a failure by default. This becomes problematic with a unit such as:

[Unit]
Description=Backup Service
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/bin/docker run --rm backup-image

The service enters the failed state when the container exits with warnings, even though the backup completed successfully.

1. Exit Code Normalization in Entrypoint

Modify your container's entrypoint to always return success. Note that this also hides genuine failures from systemd:

#!/bin/bash
/backup/script.sh || true  # Forces exit code 0
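
One drawback of a bare "|| true" is that the real exit code is lost. A minimal sketch of an entrypoint helper that still forces success but writes the original status to stderr first; the function name run_forced_success is hypothetical, and /backup/script.sh is the path used above.

```shell
#!/bin/bash
# Hypothetical helper: run a command, log its real exit status, return 0.
run_forced_success() {
    rfs_status=0
    "$@" || rfs_status=$?
    if [ "$rfs_status" -ne 0 ]; then
        # Keep the original code visible in the container logs.
        echo "command exited with status $rfs_status; reporting success" >&2
    fi
    return 0
}

# In the entrypoint this would be called as:
#   run_forced_success /backup/script.sh
```

The trade-off is the same as with "|| true": systemd never sees a failure, so the log line is the only trace of one.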

2. Systemd Service Configuration Options

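Use these directives in your service unit. Unit files do not support trailing comments (the text would become part of the value), so comments go on their own lines:

[Service]
# Accept exit codes 1 and 2 as success, in addition to 0
SuccessExitStatus=0 1 2
# An empty value clears the list of exit codes that force a restart
RestartForceExitStatus=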

3. Post-Execution Cleanup

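Add a reset mechanism. ExecStopPost= commands run before the unit settles into its final state, so check with systemctl status that the failed state is actually cleared on your systemd version: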

[Service]
ExecStopPost=/bin/bash -c "systemctl reset-failed %n"

4. Timer-Specific Configuration

Ensure your timer unit includes:

[Timer]
Unit=backup.service
# Catch up on a run missed while the timer was inactive (OnCalendar= timers only)
Persistent=true

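For more sophisticated monitoring, record the script's exit code in a status file. Because echo is the last command, the unit itself exits 0 whatever the script returned: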

[Service]
ExecStart=/usr/bin/bash -c '/backup/script.sh; echo $? > /run/backup.status'
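
A monitoring job could then read that file back. A minimal sketch, assuming the file holds the single integer written by the echo above; the function name read_backup_status and the choice to treat code 1 as a warning are assumptions, not part of the original setup.

```shell
#!/bin/bash
# Hypothetical reader for the status file written by the unit above.
# Prints one of: ok, warning, failed, unknown.
read_backup_status() {
    status_file=${1:-/run/backup.status}
    if [ ! -r "$status_file" ]; then
        echo unknown
        return 0
    fi
    code=
    read -r code < "$status_file" || true
    case "$code" in
        0) echo ok ;;
        1) echo warning ;;            # assumption: 1 means non-critical warnings
        ''|*[!0-9]*) echo unknown ;;  # empty or non-numeric content
        *) echo failed ;;
    esac
}
```

Calling read_backup_status with no argument reads /run/backup.status; passing a path makes it easy to try against a scratch file.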

Maintain visibility in the journal while handling exit codes:

[Service]
StandardOutput=journal
StandardError=journal
# Drop messages less severe than warning (notice, info, debug)
LogLevelMax=warning
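
With output going to the journal, a script can mark the priority of each line using the "<N>" prefix that systemd parses from stdout and stderr by default (controlled by SyslogLevelPrefix=). A small sketch with hypothetical helper names; under LogLevelMax=warning, the "<6>" info line should be dropped while the "<4>" and "<3>" lines are kept.

```shell
#!/bin/bash
# Hypothetical logging helpers using sd-daemon style priority prefixes.
# The numbers are syslog priorities: 3 = err, 4 = warning, 6 = info.
log_err()     { printf '<3>%s\n' "$*" >&2; }
log_warning() { printf '<4>%s\n' "$*" >&2; }
log_info()    { printf '<6>%s\n' "$*"; }

# Example calls inside a backup script:
#   log_info "backup started"
#   log_warning "skipped unreadable file"
```

Lines without a prefix are logged at the unit's default level (info unless SyslogLevel= says otherwise), so unprefixed output would also be filtered by the setting above.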

The same problem shows up when backup containers are managed by systemd and fleet: the backup script completes successfully but returns a non-zero exit code because of non-critical warnings. The service enters a failed state, which can then prevent the associated timer from triggering subsequent executions.

Treating every non-zero exit code as a failure makes sense for most services, but it becomes problematic for backup operations where:

  • Warnings about non-critical files are common
  • Partial success is still valuable
  • We want the timer to continue triggering regardless

1. Custom Exit Code Handling in Service Unit

Modify your backup.service to treat specific exit codes as success:

[Service]
ExecStart=/usr/bin/docker run --rm backup-container
SuccessExitStatus=0 1 2
Restart=on-failure
RestartSec=60s

2. Wrapper Script Approach

Create a wrapper script that handles exit code conversion:

#!/bin/bash
/backup/actual-script.sh
exit_status=$?
if [ "$exit_status" -eq 1 ]; then
    # Known warning condition: report success to systemd
    exit 0
else
    # Anything else is passed through unchanged
    exit "$exit_status"
fi
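
If more than one exit code counts as a warning, the comparison can be pulled into a small function. A sketch under assumptions: the name map_exit_status is hypothetical, and the code list "1 2" is only an example, so substitute the codes your backup tool documents as warnings.

```shell
#!/bin/bash
# Hypothetical helper: print 0 if the status appears in a space-separated
# list of known warning codes, otherwise print the status unchanged.
map_exit_status() {
    mes_status=$1
    mes_warning_codes=$2
    for mes_code in $mes_warning_codes; do
        if [ "$mes_status" -eq "$mes_code" ]; then
            echo 0
            return 0
        fi
    done
    echo "$mes_status"
}

# Usage in the wrapper (the list "1 2" is an assumption):
#   /backup/actual-script.sh
#   exit_status=$?
#   exit "$(map_exit_status "$exit_status" "1 2")"
```

Keeping the list in one place makes it easier to keep the wrapper and any SuccessExitStatus= setting in agreement.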

3. Automatic State Reset

Use ExecStopPost to clear the failed state:

[Service]
ExecStart=/usr/bin/docker run --rm backup-container
ExecStopPost=/usr/bin/systemctl reset-failed backup.service

For fleet environments, you might need to combine these approaches. The most robust solution is often:

[Unit]
Description=Backup Service
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup-wrapper
SuccessExitStatus=0 1
ExecStopPost=/bin/sh -c "systemctl reset-failed %n"

[Install]
WantedBy=multi-user.target

For critical backups, consider implementing explicit success tracking with a marker file that is created only when the script exits 0:

[Service]
ExecStart=/bin/bash -c '/backup/script.sh && touch /var/run/backup.success'
ExecStopPost=/bin/bash -c 'if [ -f /var/run/backup.success ]; then rm /var/run/backup.success; exit 0; else exit 1; fi'