Configuring Upstart Exponential Backoff for Process Respawn with Unlimited Retries


When dealing with critical services managed by Upstart, we often face a dilemma: aggressive respawn attempts can overwhelm a failing service, while giving up entirely isn't acceptable for mission-critical processes. The stock respawn behavior has two limitations: there is no delay at all between restart attempts, and once the respawn limit is exceeded the job is abandoned for good:


# Typical respawn stanza (problematic for critical services)
respawn
# Give up permanently if the job respawns more than 5 times within 10 seconds
respawn limit 5 10

Here's how to implement exponential backoff while maintaining persistence:


# /etc/init/my-service.conf
description "Service with exponential backoff"

start on runlevel [2345]
stop on runlevel [016]

respawn
respawn limit unlimited

# Exponential backoff configuration.  Variables changed inside post-stop
# do not survive to the next respawn, so the current delay is persisted
# in a small state file instead.
env INITIAL_BACKOFF=1
# Cap the delay at 1 hour
env MAX_BACKOFF=3600
env BACKOFF_FILE=/var/run/my-service.backoff

post-stop script
    # Only delay when Upstart is about to respawn the job; after a manual
    # "stop" (or shutdown) the goal is "stop", so skip the sleep and reset
    if ! initctl status "$UPSTART_JOB" | grep -q "start/"; then
        rm -f "$BACKOFF_FILE"
        exit 0
    fi

    current_backoff=$INITIAL_BACKOFF
    if [ -f "$BACKOFF_FILE" ]; then
        current_backoff=$(cat "$BACKOFF_FILE")
    fi

    # Double the delay for the next failure, capped at MAX_BACKOFF
    next_backoff=$((current_backoff * 2))
    if [ "$next_backoff" -gt "$MAX_BACKOFF" ]; then
        next_backoff=$MAX_BACKOFF
    fi
    echo "$next_backoff" > "$BACKOFF_FILE"

    sleep "$current_backoff"
end script

exec /path/to/your/service

The configuration achieves both requirements through:

  • Unlimited retries: the respawn limit unlimited stanza ensures Upstart never gives up on the job
  • Exponential backoff: the post-stop script sleeps before Upstart restarts the job, doubling the persisted delay after each failure up to MAX_BACKOFF (see the sketch below)
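
To sanity-check the schedule this produces, a quick standalone shell loop (assuming the same one-second initial delay and one-hour cap as above) prints the delays that would be applied on successive failures:


# Print the backoff schedule: 1, 2, 4, ... capped at 3600 seconds
backoff=1
max=3600
attempt=1
while [ "$attempt" -le 15 ]; do
    echo "attempt $attempt: sleep ${backoff}s"
    backoff=$((backoff * 2))
    if [ "$backoff" -gt "$max" ]; then
        backoff=$max
    fi
    attempt=$((attempt + 1))
done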

For simpler cases where a bounded number of retries is acceptable, the built-in limit alone is enough; note that it adds no delay between attempts, it only caps how many failures Upstart tolerates:


respawn
# Stop retrying after more than 10 failures within 1 hour
respawn limit 10 3600
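
With this variant, once the job fails more than 10 times within the hour Upstart stops it for good, and it stays down until someone brings it back by hand:


# Restart the job manually after the respawn limit has been exceeded
$ sudo initctl start my-service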

After implementing, verify the backoff behavior:


$ initctl status my-service
$ tail -f /var/log/upstart/my-service.log

# To monitor respawn attempts:
$ grep "my-service" /var/log/syslog | grep respawn
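
Since the configuration above persists the current delay in a state file (/var/run/my-service.backoff in this example), you can also watch it double after each failure:


# Watch the persisted delay grow (path taken from the config above)
$ watch -n 5 cat /var/run/my-service.backoff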

When deploying this pattern:

  • Set appropriate MAX_BACKOFF based on your service requirements
  • Combine with alerting to notify when respawn attempts reach maximum backoff (a sketch follows this list)
  • Consider implementing a circuit breaker pattern for services that may need temporary suspension
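
One lightweight way to wire in that alerting is to fire a notification once the delay hits the cap. A minimal sketch, appended to the end of the post-stop script shown earlier; /usr/local/bin/notify-oncall is a hypothetical hook, so substitute whatever your alerting actually uses:


# At the end of post-stop, after the cap has been applied
if [ "$current_backoff" -ge "$MAX_BACKOFF" ]; then
    logger -t my-service "respawn delay has reached ${MAX_BACKOFF}s"
    # hypothetical notification hook - replace with your alerting command
    if [ -x /usr/local/bin/notify-oncall ]; then
        /usr/local/bin/notify-oncall "my-service is stuck in a respawn loop"
    fi
fi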

An alternative implementation keeps the respawn count in a state file that the job reads at start-up. The motivation is the same: with critical services managed by Upstart we often encounter scenarios where:

  • Immediate respawn attempts may overwhelm the system during temporary failures
  • Traditional fixed-interval retries either wait too long or retry too aggressively

# Bad example - respawns immediately, with no delay between attempts
respawn
respawn limit unlimited

Here's how to implement progressive delays between respawn attempts:

# /etc/init/my-service.conf
description "My critical service with smart respawn"

start on runlevel [2345]
stop on runlevel [016]

respawn
respawn limit unlimited

# Backoff configuration
env INITIAL_DELAY=1
env MAX_DELAY=3600
env BACKOFF_FACTOR=2

script
    # Calculate a dynamic delay before (re)starting.  Upstart runs this
    # under /bin/sh, which has no ** operator, so the power is built up
    # by repeated multiplication.
    if [ -f /var/run/my-service.respawn_count ]; then
        COUNT=$(cat /var/run/my-service.respawn_count)
        DELAY=$INITIAL_DELAY
        i=1
        while [ "$i" -lt "$COUNT" ] && [ "$DELAY" -lt "$MAX_DELAY" ]; do
            DELAY=$((DELAY * BACKOFF_FACTOR))
            i=$((i + 1))
        done
        DELAY=$(( DELAY < MAX_DELAY ? DELAY : MAX_DELAY ))
        sleep $DELAY
        echo $((COUNT + 1)) > /var/run/my-service.respawn_count
    else
        echo 1 > /var/run/my-service.respawn_count
    fi

    # Actual service execution
    exec /usr/bin/my-service
end script

Upstart does not export a respawn counter to job scripts (UPSTART_INSTANCE, for example, holds the instance name, not an attempt count), so some external state is unavoidable. If you would rather apply the delay after each exit instead of before each start, move the counting and the sleep into a post-stop script and keep the script stanza as a plain exec:

post-stop script
    # Read the respawn count from the same state file; Upstart itself does
    # not track respawn attempts for job scripts
    COUNT=1
    if [ -f /var/run/my-service.respawn_count ]; then
        COUNT=$(cat /var/run/my-service.respawn_count)
    fi
    echo $((COUNT + 1)) > /var/run/my-service.respawn_count

    # Calculate exponential delay (minimum 1 second), capped at 1 hour
    SHIFT=$COUNT
    if [ "$SHIFT" -gt 13 ]; then
        SHIFT=13    # 1 << 12 = 4096s, already beyond the one-hour cap
    fi
    DELAY=$((1 << (SHIFT - 1)))
    DELAY=$(( DELAY < 3600 ? DELAY : 3600 ))

    # Log the delay for debugging
    logger -t my-service "Respawn attempt $COUNT, delaying for $DELAY seconds"

    # Implement the delay
    sleep $DELAY
end script

Check the behavior using:

initctl list | grep my-service
tail -f /var/log/upstart/my-service.log

The log should show increasing intervals between restart attempts, capped at one hour.
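
Because the post-stop variant above logs each delay with logger, the schedule can also be pulled straight from syslog:

grep "Respawn attempt" /var/log/syslog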

To return to fast restarts once the service recovers, consider resetting the counter after a successful start:

# Reset the counter once the service is up again (note: post-start fires as
# soon as the main process starts, even if it crashes again shortly after)
post-start script
    if [ -f /var/run/my-service.respawn_count ]; then
        rm /var/run/my-service.respawn_count
    fi
end script