How to Configure Supervisord to Delay Process Restart with a Custom Interval



When working with message queue workers (like RabbitMQ consumers), you might encounter situations where immediate restarts aren't ideal. Supervisord's default behavior is to restart processes immediately after they exit, which can cause:

  • Resource contention if the worker needs time to release connections
  • Message processing storms when many messages are queued
  • Potential race conditions in distributed systems

The startsecs parameter in your program configuration controls how long a process must stay running before supervisord considers it "successfully started". On its own it does not delay restarts, but it interacts with supervisord's built-in backoff: if the process exits before startsecs elapses, the start counts as a failure and supervisord waits progressively longer (roughly one extra second per attempt) before retrying, up to startretries. That backoff is the closest thing supervisord offers to a restart delay out of the box:

[program:my_worker]
command=/path/to/your/worker
autostart=true
autorestart=true
startsecs=5  ; Wait 5 seconds before considering start successful

For more sophisticated control, combine these parameters:

[program:delayed_worker]
command=/path/to/worker
autostart=true
autorestart=true
startretries=3     ; Failed starts allowed before the process goes FATAL
startsecs=10       ; Must run 10 seconds to count as successfully started
stopwaitsecs=15    ; Grace period after the stop signal before SIGKILL
exitcodes=0,2      ; "Expected" exit codes (only consulted with autorestart=unexpected)

Sometimes it's cleaner to handle the delay within your worker code. Here's a Python example:

import time
import sys

def main():
    try:
        # Your normal worker logic here
        process_messages()
    except Exception as e:
        print(f"Error occurred: {e}", file=sys.stderr)
        time.sleep(10)  # Delay happens here, before the process exits
        sys.exit(1)  # Non-zero exit; supervisord restarts it (with autorestart=true or unexpected)

if __name__ == "__main__":
    main()
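
If you pair this pattern with autorestart=unexpected, a clean exit (code 0) keeps the worker down, while the error path above triggers a restart only after the in-process sleep has elapsed. A minimal matching program entry might look like this (the program name and path are placeholders):

[program:delayed_exit_worker]
command=/path/to/your/worker
autostart=true
autorestart=unexpected  ; Restart only on exit codes not listed in exitcodes
exitcodes=0             ; Exit 0 is treated as an intentional shutdown
startsecs=5             ; Must stay up 5 seconds to count as started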

For maximum flexibility, create a wrapper script that handles the delay:

#!/bin/bash

# Worker process
/path/to/real/worker "$@"
exit_code=$?

# Only sleep if the exit was non-zero
if [ $exit_code -ne 0 ]; then
    sleep 15
fi

exit $exit_code

Then configure supervisord to use the wrapper:

[program:wrapped_worker]
command=/path/to/wrapper_script.sh
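
One caveat with the wrapper approach: supervisord sends its stop signal to the wrapper shell rather than directly to the real worker, so a supervisorctl stop may leave the worker running until stopwaitsecs expires. Enabling group signalling in the same program section avoids that (a sketch; adjust to your setup):

[program:wrapped_worker]
command=/path/to/wrapper_script.sh
stopasgroup=true   ; Send the stop signal to the whole process group
killasgroup=true   ; SIGKILL the whole group if it ignores the stop signal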

When implementing restart delays, ensure proper logging:

[program:logged_worker]
command=/path/to/worker
stdout_logfile=/var/log/worker.out.log
stderr_logfile=/var/log/worker.err.log
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=10
stderr_logfile_maxbytes=50MB
stderr_logfile_backups=10

To restate the problem in the RabbitMQ context: with autorestart=true, supervisord restarts a consumer immediately after it exits, which can lead to:

  • Resource contention if the worker exits due to temporary system load
  • Message processing storms when workers fail frequently
  • Inefficient backoff patterns during temporary outages

Supervisord's program sections provide two parameters that govern how start attempts and failures are handled:

[program:my_worker]
command=/path/to/worker_script.py
autorestart=true
startsecs=10  ; Wait 10 seconds before considering start successful
startretries=3 ; Number of serial failure attempts before giving up

However, this doesn't fully solve the problem: beyond the incremental backoff described above, we still need a controllable delay between restart attempts.

Here are three effective approaches:

1. Using stopwaitsecs

This controls how long supervisord waits after sending the stop signal before force-killing the process with SIGKILL. It only applies when supervisord itself stops the process, so it acts as a shutdown grace period that spaces out managed restart cycles rather than a pause after an unexpected crash:

[program:my_worker]
command=/path/to/worker_script.py
autorestart=true
stopwaitsecs=30  ; Grace period between the stop signal and SIGKILL
exitcodes=0,2    ; "Expected" exit codes (only consulted with autorestart=unexpected)
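
In other words, the extra 30 seconds only show up during managed restarts such as:

supervisorctl restart my_worker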

2. Exit Code Strategy with Sleep

Modify your worker to include an exit code that triggers a delayed restart:

# In your worker script:
import sys, time

try:
    # Normal processing
    process_message()
except TemporaryError:  # Placeholder for your own retryable error type
    time.sleep(15)  # Worker-controlled delay before handing control back
    sys.exit(75)    # Deliberately NOT listed in exitcodes, so it counts as unexpected

# supervisord config:
[program:my_worker]
command=/path/to/worker_script.py
autorestart=unexpected  ; Restart only on exit codes not listed in exitcodes
exitcodes=0             ; Exit 0 stays down; exit 75 is "unexpected" and gets restarted

3. Combined Approach with startsecs

A robust baseline combines several parameters in one program section:

[program:rabbitmq_worker]
command=/path/to/worker
autorestart=true
startsecs=15       ; Minimum time before considering start successful
stopwaitsecs=10    ; Grace period for shutdown
startretries=5     ; Max attempts before giving up
exitcodes=0,2      ; "Expected" exit codes (only consulted with autorestart=unexpected)
user=workeruser
directory=/tmp
environment=RABBITMQ_CONSUMER=true
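
Keep in mind that after startretries consecutive failed starts, supervisord puts the process into the FATAL state and leaves it down until you intervene, for example with:

supervisorctl start rabbitmq_worker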

For more complex control, use Supervisord's event listener system:

[eventlistener:delayed_restart]
command=/path/to/delayed_restart_listener.py
events=PROCESS_STATE_EXITED
autorestart=true

[program:my_worker]
command=/path/to/worker_script.py
autorestart=false  ; Let listener handle restarts

The listener script would handle the timing logic:

#!/usr/bin/env python
import os
import sys
import time
from supervisor import childutils

def main():
    # Talk to supervisord over its RPC interface (uses the environment
    # variables supervisord sets for event listeners).
    rpc = childutils.getRPCInterface(os.environ)
    while True:
        headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
        if headers['eventname'] == 'PROCESS_STATE_EXITED':
            # The first payload line holds processname, groupname, pid, etc.
            pheaders = childutils.get_headers(payload.split('\n')[0])
            process = pheaders['processname']
            if process == 'my_worker':
                # Note: sleeping here blocks handling of other events, which is
                # acceptable when the listener only watches a single worker.
                time.sleep(30)  # Custom delay before restarting
                rpc.supervisor.startProcess(process)
        # Always acknowledge so supervisord keeps sending events.
        childutils.listener.ok(sys.stdout)

if __name__ == '__main__':
    main()

After implementing restart delays, monitor with:

supervisorctl tail my_worker
supervisorctl status
journalctl -u supervisord -f   # the unit may be named 'supervisor' on some distros

Key metrics to watch:

  • Restart frequency (should decrease; a rough check is sketched below)
  • Message processing throughput
  • System resource usage during spikes
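
For a rough view of restart frequency, one option is to count how often supervisord logs the worker exiting in its main log (the path below is a common default and the program name is an example; adjust both, and note the exact log wording can vary by version):

grep -c "exited: my_worker" /var/log/supervisor/supervisord.log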