When working with critical services in Linux, we often need them to automatically recover from failures. The standard systemd approach of using Restart=always
with rate limiting works well for immediate restarts, but presents a gap when we want services to resume automatically after the rate limit window expires.
The standard configuration you've implemented:
[Service]
Restart=always
StartLimitInterval=90
StartLimitBurst=3
correctly limits restarts to 3 attempts within 90 seconds. However, once that limit is hit the unit is marked failed and systemd makes no further restart attempts, even after the interval expires - this is by design, but it is not always the desired behavior.
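Until that failed state is cleared (or the unit is started by hand), nothing will bring the service back. A quick manual recovery looks like this, assuming your unit is named your-service.service:
# Clear the failed state and the start-limit counter
sudo systemctl reset-failed your-service.service
# Start the service again
sudo systemctl start your-service.service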
To achieve persistent automatic restarts even after rate limiting, we need to modify our approach:
[Unit]
StartLimitIntervalSec=90
StartLimitBurst=3
[Service]
Restart=always
RestartSec=10
The key differences are moving the rate-limiting parameters to the [Unit] section, where current systemd versions expect them (the older [Service] spellings are still accepted as aliases), and adding RestartSec to control the delay between restart attempts.
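If you would rather not edit the packaged unit file directly, the same settings can go into a drop-in override; the unit name here is just a placeholder:
# Creates and opens /etc/systemd/system/your-service.service.d/override.conf
sudo systemctl edit your-service.service
Then add the following and save:
[Unit]
StartLimitIntervalSec=90
StartLimitBurst=3
[Service]
Restart=always
RestartSec=10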
For more sophisticated control, consider this enhanced configuration:
[Unit]
Description=My Resilient Service
StartLimitIntervalSec=90
StartLimitBurst=3
[Service]
ExecStart=/usr/bin/my-service
Restart=always
RestartSec=10
# Allow up to 30 seconds for startup before systemd treats it as failed
TimeoutStartSec=30
# Allow 10 seconds for normal shutdown
TimeoutStopSec=10
This configuration adds explicit start and stop timeouts, so a hung startup or shutdown is detected and treated as a failure rather than left to the longer defaults.
After implementing these changes, verify the behavior:
# Reload systemd configuration
sudo systemctl daemon-reload
# Restart your service
sudo systemctl restart your-service.service
# Monitor the service journal
journalctl -u your-service.service -f
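To exercise the restart path, you can kill the service's main process so that the exit looks like a crash rather than a clean stop; the unit name is again a placeholder:
# Send SIGKILL to the service's processes and watch systemd restart it
sudo systemctl kill -s SIGKILL your-service.service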
For even more control, you could implement a watchdog timer:
[Unit]
Description=Service with Watchdog
ConditionACPower=true
[Service]
ExecStart=/usr/bin/my-service
Restart=on-failure
WatchdogSec=30
[Install]
WantedBy=multi-user.target
This approach gives you additional monitoring capabilities beyond simple restart logic.
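Note that WatchdogSec only helps if the service itself sends periodic keep-alive notifications. As a rough sketch of what that looks like from a shell script (this assumes Type=notify in the unit and, because systemd-notify sends from a short-lived child process, NotifyAccess=all):
#!/bin/bash
# Hypothetical /usr/bin/my-service: feeds the systemd watchdog while it works
systemd-notify --ready
while true; do
    # Keep-alive ping; must arrive more often than WatchdogSec
    systemd-notify WATCHDOG=1
    # ... do one unit of real work here ...
    sleep 10
done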
If you encounter problems:
- Check systemctl status your-service for the current state
- Verify that systemctl show your-service | grep StartLimit shows the expected values
- Ensure your service properly implements systemd's notification protocol if you rely on features that need it, such as Type=notify or WatchdogSec
In production environments, consider:
- Implementing proper logging for restart events
- Setting up monitoring for repeated restart cycles (one lightweight option is shown below)
- Considering higher-level orchestration tools if restart logic becomes too complex
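One lightweight way to watch for repeated restart cycles is systemd's own restart counter, which a cron job or monitoring agent can poll (the NRestarts property is available on reasonably recent systemd versions):
# How many times systemd has restarted the service
systemctl show your-service -p NRestarts
# When the unit last changed state
systemctl show your-service -p StateChangeTimestamp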
When configuring resilient services with systemd, we often face a tricky situation where the service stops attempting restarts after hitting the StartLimitBurst
threshold. While this prevents runaway restart loops, sometimes we need the service to automatically resume attempts after the cooldown period.
With this configuration:
[Service]
Restart=always
StartLimitInterval=90
StartLimitBurst=3
The service will:
- Restart automatically on failure (good)
- Stop after 3 quick failures (expected)
- Not automatically resume after 90 seconds (problematic; you can confirm this state as shown below)
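You can confirm the last of these from the unit's Result property; once the limit has tripped, it reports a start-limit condition rather than the underlying crash:
# Show why the unit last entered the failed state
systemctl show your-service -p Result
# Typically prints Result=start-limit-hit in this situation (exact wording can vary by systemd version)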
To achieve true automatic recovery after the rate limit window, we need to combine several directives:
[Unit]
StartLimitIntervalSec=90
StartLimitBurst=3
[Service]
Restart=always
RestartSec=5s
StartLimitInterval=90
StartLimitBurst=3
StartLimitAction=none
The notable addition is StartLimitAction=none, which tells systemd to take no further action (such as rebooting or powering off) when the limit is hit - it is in fact the default, but stating it explicitly documents the intent. Combined with the unit-level rate-limit settings and a sensible RestartSec, this allows the service to resume restart attempts after the interval expires.
Here's a complete service file for a Node.js application:
[Unit]
Description=Node API Service
After=network.target
StartLimitIntervalSec=90
StartLimitBurst=3
[Service]
Type=simple
User=nodeuser
WorkingDirectory=/opt/node-app
ExecStart=/usr/bin/node index.js
Restart=always
RestartSec=10s
StartLimitInterval=90
StartLimitBurst=3
StartLimitAction=none
Environment=NODE_ENV=production
[Install]
WantedBy=multi-user.target
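Assuming this is saved as /etc/systemd/system/node-api.service (the file name is just an example), load and start it with:
# Pick up the new unit file
sudo systemctl daemon-reload
# Enable at boot and start immediately
sudo systemctl enable --now node-api.service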
To test this configuration:
- Force your service to crash 3 times in quick succession (a sketch of one way to do this follows this list)
- Check the status with systemctl status yourservice
- Wait 90+ seconds
- Verify renewed restart attempts with journalctl -u yourservice -f
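One way to force those three quick failures is to kill the main process, wait slightly longer than RestartSec so the replacement instance comes up, and repeat; the unit name is a placeholder:
# Crash the service three times within the 90-second window
for i in 1 2 3; do
    sudo systemctl kill -s SIGKILL yourservice.service
    sleep 15   # a bit longer than RestartSec=10s so the next instance is running
done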
For more complex scenarios, consider using a watchdog timer:
[Service]
WatchdogSec=30
Restart=on-watchdog
This restarts the service whenever it stops sending its periodic WATCHDOG=1 keep-alive, giving you liveness monitoring on top of ordinary crash-triggered restarts.