When dealing with critical services managed by Upstart, we often face a dilemma: aggressive respawn attempts can overwhelm a failing service, while giving up entirely isn't acceptable for mission-critical processes. The default Upstart behavior has two limitations: respawns happen immediately with no delay, and once the limit is exceeded the job stays stopped for good:
# Problematic: no delay between attempts, and the job is left
# stopped permanently after 5 respawns within 10 seconds
respawn
respawn limit 5 10
Here's how to implement exponential backoff while still retrying indefinitely:
# /etc/init/my-service.conf
description "Service with exponential backoff"
start on runlevel [2345]
stop on runlevel [016]
respawn
respawn limit unlimited
# Exponential backoff configuration (1 hour maximum). Values set with
# env are static for the job, so the current delay is persisted in a
# state file rather than by reassigning a variable.
env BACKOFF_FILE=/var/run/my-service.backoff
env MAX_BACKOFF=3600

post-stop script
    # Only delay when Upstart is about to respawn us; a goal of "stop"
    # in the status output means the job was stopped deliberately
    if status "$UPSTART_JOB" | grep -q "stop/"; then
        rm -f "$BACKOFF_FILE"
        exit 0
    fi
    backoff=1
    if [ -f "$BACKOFF_FILE" ]; then
        backoff=$(cat "$BACKOFF_FILE")
    fi
    # Double the delay for next time, up to the cap
    next=$((backoff * 2))
    if [ "$next" -gt "$MAX_BACKOFF" ]; then
        next=$MAX_BACKOFF
    fi
    echo "$next" > "$BACKOFF_FILE"
    sleep "$backoff"
end script
exec /path/to/your/service
The configuration achieves both requirements through:
- Unlimited retries: the respawn limit unlimited directive ensures the service will never stop attempting to restart
- Exponential backoff: the post-stop script implements the backoff algorithm, doubling the delay after each failure up to the maximum threshold
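The doubling rule itself can be sanity-checked outside Upstart. Here is a minimal POSIX-shell sketch of the same computation (the 1-second starting delay and one-hour cap mirror the configuration above; the backoff function name is just for illustration):

```shell
# Standalone sketch of the doubling rule: the delay starts at 1 second,
# doubles on every failure, and is capped at 3600 seconds (1 hour).
backoff() {
    # $1 = failure number (1-based); prints the delay in seconds
    delay=1
    i=1
    while [ "$i" -lt "$1" ]; do
        delay=$((delay * 2))
        if [ "$delay" -ge 3600 ]; then
            delay=3600
            break
        fi
        i=$((i + 1))
    done
    echo "$delay"
}

for n in 1 2 3 4 5 13; do backoff "$n"; done
# prints 1 2 4 8 16 3600 (one value per line)
```

The loop form is deliberate: Upstart runs scripts with /bin/sh (dash on Ubuntu), which has no ** exponentiation operator.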
For simpler cases where you just want increasing delays without unlimited retries:
# 10 attempts within 1 hour, then give up for good
respawn
respawn limit 10 3600
After implementing, verify the backoff behavior:
$ initctl status my-service
$ tail -f /var/log/upstart/my-service.log
# To monitor respawn attempts:
$ grep "my-service" /var/log/syslog | grep respawn
When deploying this pattern:
- Set appropriate MAX_BACKOFF based on your service requirements
- Combine with alerting to notify when respawn attempts reach maximum backoff
- Consider implementing a circuit breaker pattern for services that may need temporary suspension
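As a sketch of the alerting idea, a periodic check could count respawn messages in the system log. The log path, match string, threshold, and function name below are all assumptions to adapt to your environment:

```shell
# Hypothetical watchdog: alert when the log records too many respawns.
# Run it from cron and wire the alert branch into your notification tool.
respawn_alert() {
    # $1 = log file to scan, $2 = respawn-count threshold
    count=$(grep -c "my-service.*respawning" "$1" 2>/dev/null) || true
    if [ "${count:-0}" -ge "$2" ]; then
        echo "ALERT: my-service respawned $count times"
        return 0
    fi
    return 1
}
```

For example, respawn_alert /var/log/syslog 10 exits zero (and prints a line) once ten respawn messages have accumulated, so the alert branch can be chained with && to whatever notifies your team.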
A variant of the same pattern keeps the delay in the start-up path instead. Remember that an unlimited respawn stanza is not enough on its own, because Upstart restarts the process immediately:
# Not sufficient by itself - respawns immediately, with no delay
respawn
respawn limit unlimited
Here's how to implement progressive delays between respawn attempts:
# /etc/init/my-service.conf
description "My critical service with smart respawn"
start on runlevel [2345]
stop on runlevel [016]
respawn
respawn limit unlimited
# Backoff configuration
env INITIAL_DELAY=1
env MAX_DELAY=3600
env BACKOFF_FACTOR=2
script
    # Delay start-up according to how many times we have already respawned.
    # Upstart runs script sections with /bin/sh (dash on Ubuntu), which has
    # no ** operator, so the power is computed with a loop.
    if [ -f /var/run/my-service.respawn_count ]; then
        COUNT=$(cat /var/run/my-service.respawn_count)
        DELAY=$INITIAL_DELAY
        i=1
        while [ "$i" -lt "$COUNT" ] && [ "$DELAY" -lt "$MAX_DELAY" ]; do
            DELAY=$((DELAY * BACKOFF_FACTOR))
            i=$((i + 1))
        done
        if [ "$DELAY" -gt "$MAX_DELAY" ]; then
            DELAY=$MAX_DELAY
        fi
        sleep "$DELAY"
        echo $((COUNT + 1)) > /var/run/my-service.respawn_count
    else
        echo 1 > /var/run/my-service.respawn_count
    fi

    # Actual service execution
    exec /usr/bin/my-service
end script
Upstart does not export a respawn counter of its own (UPSTART_INSTANCE holds the instance name, not a retry count), so some state has to live outside the process. If you would rather take the delay in post-stop instead of at start-up, reuse the same counter file there:
post-stop script
    # Read the counter maintained by the script section
    COUNT=1
    if [ -f /var/run/my-service.respawn_count ]; then
        COUNT=$(cat /var/run/my-service.respawn_count)
    fi
    # Calculate exponential delay; dash has no ** operator, but << gives
    # powers of two (2^12 = 4096 already exceeds the one-hour cap)
    if [ "$COUNT" -gt 12 ]; then
        DELAY=3600
    else
        DELAY=$((1 << (COUNT - 1)))
    fi
    # Log the delay for debugging
    logger -t my-service "Respawn attempt $COUNT, delaying for $DELAY seconds"
    # Implement the delay
    sleep $DELAY
end script
Use this in place of the sleep in the script section, not in addition to it, or each restart will wait twice.
Check the behavior using:
initctl list | grep my-service
tail -f /var/log/upstart/my-service.log
The log should show increasing intervals between restart attempts, capped at one hour.
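One way to eyeball those intervals is to print the gap between consecutive respawn entries in the log. A small sketch, assuming classic syslog lines whose third field is an HH:MM:SS timestamp and that all entries fall on the same day (the gaps name and match string are illustrative):

```shell
# Print the gap in seconds between consecutive "respawning" log entries.
# Assumes syslog-style lines ("Jan 10 12:00:05 host init: ...") with all
# entries on the same day; adapt the match string to your actual log.
gaps() {
    grep "respawning" "$1" | awk '{
        split($3, t, ":")
        s = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) print s - prev
        prev = s
    }'
}
```

Running gaps /var/log/syslog should print a growing sequence (1, 2, 4, ...) if the backoff is working.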
For particularly stubborn services, consider resetting the counter once the service has proven itself. Note that removing the file immediately in post-start would defeat the backoff entirely, because post-start runs as soon as the main process is spawned, crash loop or not:
# Reset counter only after the service has stayed up for a while
post-start script
    # Grace period before clearing the counter; this also delays the
    # job's "started" event by the same amount
    sleep 60
    if [ -f /var/run/my-service.respawn_count ]; then
        rm /var/run/my-service.respawn_count
    fi
end script