How to Handle Non-Zero Exit Codes in Systemd Timers Without Failing Service State

When running scheduled backups via systemd timers, we often encounter a frustrating scenario: the backup script completes its primary function but returns a non-zero exit code due to non-critical warnings. Although the backup itself succeeded, systemd marks the service as failed, which can prevent subsequent timer executions.

Systemd treats any non-zero exit code as a failure by default. This becomes problematic with:

[Unit]
Description=Backup Service
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/bin/docker run --rm backup-image

The service enters the failed state whenever the container exits with a non-zero code, even if the backup completed and the code only signals warnings.

1. Exit Code Normalization in Entrypoint

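Modify your container's entrypoint to always return success. Note that this also masks genuine failures, so use it only when the script's exit code carries no useful information: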

#!/bin/bash
/backup/script.sh || true  # Forces exit code 0
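A variant that still returns 0 but keeps the original status visible can help when failures need to be diagnosed later. The following is a minimal sketch, not a definitive implementation: the function name run_tolerant is hypothetical, and /backup/script.sh is the path used in the example above.

```shell
#!/bin/bash
# Hypothetical helper: run a command, always return 0, but report the
# real exit status on stderr so it is not silently lost.
run_tolerant() {
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "warning: '$*' exited with status $status; reporting success" >&2
    fi
    return 0
}

# Intended use in the entrypoint (path taken from the example above):
# run_tolerant /backup/script.sh
```

Because the message goes to stderr, it should appear in the container's output and therefore in the journal when systemd runs the container in the foreground.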

2. Systemd Service Configuration Options

Use these directives in your service unit:

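[Service]
# systemd does not support trailing comments, so each comment gets its own line.
# Treat exit codes 1 and 2 as success in addition to 0.
SuccessExitStatus=0 1 2
# Do not restart automatically (the default). An empty RestartForceExitStatus= only clears that list.
Restart=no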

3. Post-Execution Cleanup

Add a reset mechanism:

[Service]
ExecStopPost=/bin/bash -c "systemctl reset-failed %n"

4. Timer-Specific Configuration

Ensure your timer unit includes:

[Timer]
Unit=backup.service
# Catch up on a run missed while the timer was inactive (only affects OnCalendar= schedules).
Persistent=true

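For more sophisticated monitoring, record the script's exit status in a file. The command line then ends with echo, so the service succeeds whenever the status file can be written, and the real result must be read from that file: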

[Service]
ExecStart=/usr/bin/bash -c '/backup/script.sh; echo $? > /run/backup.status'
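A monitoring job can then read the recorded status. The sketch below is one possible way to classify it; the function name classify_backup_status is hypothetical, the meaning of status 1 as "finished with warnings" is an assumption about the backup script, and the return codes follow the common monitoring-plugin convention (0 OK, 1 warning, 2 critical).

```shell
#!/bin/bash
# Hypothetical check: classify the exit status recorded in a status file.
# Assumes status 1 means "finished with warnings" for this backup script.
classify_backup_status() {
    local file=$1 status=""
    if [ ! -r "$file" ]; then
        echo "CRITICAL: no status file at $file"
        return 2
    fi
    # read returns non-zero if the file lacks a trailing newline; ignore that.
    read -r status < "$file" || true
    case "$status" in
        0) echo "OK: backup succeeded"; return 0 ;;
        1) echo "WARNING: backup finished with warnings (status 1)"; return 1 ;;
        *) echo "CRITICAL: backup failed (status $status)"; return 2 ;;
    esac
}

# Intended use (path taken from the unit above):
# classify_backup_status /run/backup.status
```

A missing status file is treated as critical because it suggests the service never ran or could not write to /run.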

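To keep warnings visible in the journal while tolerating their exit codes, send output to the journal. Be careful with LogLevelMax=: plain stdout and stderr lines are logged at info priority by default, so they are dropped as well unless the program sets a priority or SyslogLevel= is raised:

[Service]
StandardOutput=journal
StandardError=journal
# Drop messages below warning priority (notice, info, debug).
LogLevelMax=warning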

When running backup services in containers managed by systemd and fleet, we frequently encounter a frustrating scenario: the backup script completes successfully but returns non-zero exit codes due to non-critical warnings. This causes the service to enter a failed state, which then prevents the associated timer from triggering subsequent executions.

Systemd treats any non-zero exit code as a failure by default. While this makes sense for most services, it becomes problematic for backup operations where:

Warnings about non-critical files are common
Partial success is still valuable
We want the timer to continue triggering regardless

1. Custom Exit Code Handling in Service Unit

Modify your backup.service to treat specific exit codes as success:

[Service]
ExecStart=/usr/bin/docker run --rm backup-container
SuccessExitStatus=0 1 2
Restart=on-failure
RestartSec=60s
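To confirm that SuccessExitStatus= has the intended effect, the unit's last result can be inspected after a run. The sketch below parses the output of systemctl show; the function name summarize_unit_result is hypothetical, while Result and ExecMainStatus are properties that systemd reports for service units.

```shell
#!/bin/bash
# Hypothetical helper: summarize `systemctl show` properties read from stdin.
# Returns 0 only if systemd recorded the last run as a success.
summarize_unit_result() {
    local line result="" status=""
    while IFS= read -r line; do
        case "$line" in
            Result=*) result=${line#Result=} ;;
            ExecMainStatus=*) status=${line#ExecMainStatus=} ;;
        esac
    done
    echo "result=${result:-unknown} exit_status=${status:-unknown}"
    [ "$result" = "success" ]
}

# Intended use on a host running systemd:
# systemctl show -p Result -p ExecMainStatus backup.service | summarize_unit_result
```

If exit code 1 is listed in SuccessExitStatus=, a run that exits with 1 should be reported with a successful result while still showing the original exit status.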

2. Wrapper Script Approach

Create a wrapper script that handles exit code conversion:

#!/bin/bash
/backup/actual-script.sh
exit_status=$?
if [ "$exit_status" -eq 1 ]; then
    # Known warning condition
    exit 0
else
    exit "$exit_status"
fi
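When more than one exit code signals a harmless warning, the same idea can be generalized. The following is a sketch under assumptions: the function name run_with_tolerated_codes is hypothetical, and which codes count as warnings depends on the backup tool in use.

```shell
#!/bin/bash
# Hypothetical helper: treat the exit codes listed in $1 (space-separated)
# as success and pass every other exit code through unchanged.
run_with_tolerated_codes() {
    local tolerated=$1 status=0 code
    shift
    "$@" || status=$?
    for code in $tolerated; do
        if [ "$status" -eq "$code" ] && [ "$status" -ne 0 ]; then
            echo "tolerated exit status $status from '$*'" >&2
            return 0
        fi
    done
    return "$status"
}

# Intended use in the wrapper (script path taken from the example above):
# run_with_tolerated_codes "1 2" /backup/actual-script.sh
# exit $?
```

Keeping the tolerated list explicit means an unexpected exit code still reaches systemd and marks the unit as failed.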

3. Automatic State Reset

Use ExecStopPost to clean the failed state:

[Service]
ExecStart=/usr/bin/docker run --rm backup-container
ExecStopPost=/usr/bin/systemctl reset-failed backup.service

For fleet environments, you might need to combine these approaches. The most robust solution is often:

[Unit]
Description=Backup Service
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup-wrapper
SuccessExitStatus=0 1
ExecStopPost=/bin/sh -c "systemctl reset-failed %n"

[Install]
WantedBy=multi-user.target

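For critical backups, consider implementing explicit success tracking. The main command creates a marker file only when the script succeeds, and ExecStopPost removes the marker or fails the unit if it is missing: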

[Service]
ExecStart=/bin/bash -c '/backup/script.sh && touch /var/run/backup.success'
ExecStopPost=/bin/bash -c 'if [ -f /var/run/backup.success ]; then rm /var/run/backup.success; exit 0; else exit 1; fi'
