When working with message queue workers (like RabbitMQ consumers), you might encounter situations where immediate restarts aren't ideal. Supervisord's default behavior is to restart processes immediately after they exit, which can cause:
- Resource contention if the worker needs time to release connections
- Message processing storms when many messages are queued
- Potential race conditions in distributed systems
The startsecs parameter in your program configuration controls how long supervisord waits before considering a started process "successfully started". It is not a restart delay as such, but we can use it creatively to approximate one:
[program:my_worker]
command=/path/to/your/worker
autostart=true
autorestart=true
startsecs=5 ; Wait 5 seconds before considering start successful
For more sophisticated control, combine these parameters:
[program:delayed_worker]
command=/path/to/worker
autostart=true
autorestart=true
startretries=3
startsecs=10
stopwaitsecs=15
exitcodes=0,2
Sometimes it's cleaner to handle the delay within your worker code. Here's a Python example:
import time
import sys

def main():
    try:
        # Your normal worker logic here
        process_messages()
    except Exception as e:
        print(f"Error occurred: {e}", file=sys.stderr)
        time.sleep(10)  # Wait 10 seconds before exiting
        sys.exit(1)     # Non-zero exit triggers supervisord restart

if __name__ == "__main__":
    main()
For maximum flexibility, create a wrapper script that handles the delay:
#!/bin/bash
# Worker process
/path/to/real/worker "$@"
exit_code=$?

# Only sleep if the exit was non-zero
if [ "$exit_code" -ne 0 ]; then
    sleep 15
fi

exit "$exit_code"
Then configure supervisord to use the wrapper:
[program:wrapped_worker]
command=/path/to/wrapper_script.sh
When implementing restart delays, ensure proper logging:
[program:logged_worker]
command=/path/to/worker
stdout_logfile=/var/log/worker.out.log
stderr_logfile=/var/log/worker.err.log
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=10
To restate the core problem before looking at restart timing in more detail: with autorestart=true, supervisord restarts an exited process immediately, which can lead to:
- Resource contention if the worker exits due to temporary system load
- Message processing storms when workers fail frequently
- Inefficient backoff patterns during temporary outages
Supervisord provides two key configuration parameters to control restart timing:
[program:my_worker]
command=/path/to/worker_script.py
autorestart=true
startsecs=10 ; Wait 10 seconds before considering start successful
startretries=3 ; Number of serial failure attempts before giving up
However, this doesn't fully solve our problem - we need to add a delay between restart attempts.
Here are three effective approaches:
1. Using stopwaitsecs
This controls how long supervisord waits after sending a stop signal before killing the process, so it only lengthens restarts that go through an explicit stop (for example, supervisorctl restart); it does not delay the respawn after a crash:
[program:my_worker]
command=/path/to/worker_script.py
autorestart=true
stopwaitsecs=30 ; Grace period between the stop signal and SIGKILL
exitcodes=0,2 ; "Expected" exit codes (only consulted when autorestart=unexpected)
2. Exit Code Strategy with Sleep
Modify your worker to include an exit code that triggers a delayed restart:
# In your worker script:
import sys, time

try:
    # Normal processing
    process_message()
except TemporaryError:
    time.sleep(15)  # Worker-controlled delay
    sys.exit(75)    # Special exit code
# supervisord config:
[program:my_worker]
command=/path/to/worker_script.py
autorestart=unexpected
exitcodes=0 ; 75 is not listed, so it is "unexpected" and triggers a restart
3. Combined Approach with startsecs
The most reliable solution combines multiple parameters:
[program:rabbitmq_worker]
command=/path/to/worker
autorestart=true
startsecs=15 ; Minimum time before considering start successful
stopwaitsecs=10 ; Grace period for shutdown
startretries=5 ; Max attempts before giving up
exitcodes=0,2 ; "Expected" exit codes (only consulted when autorestart=unexpected)
user=workeruser
directory=/tmp
environment=RABBITMQ_CONSUMER=true
For more complex control, use Supervisord's event listener system:
[eventlistener:delayed_restart]
command=/path/to/delayed_restart_listener.py
events=PROCESS_STATE_EXITED
autorestart=true
[program:my_worker]
command=/path/to/worker_script.py
autorestart=false ; Let listener handle restarts
The listener script would handle the timing logic:
#!/usr/bin/env python
import sys
import time
from supervisor import childutils

def main():
    while True:
        headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
        if headers['eventname'] == 'PROCESS_STATE_EXITED':
            time.sleep(30)  # Custom delay
            # Add logic to decide whether to restart
        childutils.listener.ok(sys.stdout)

if __name__ == '__main__':
    main()
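As written, this listener only sleeps; with autorestart=false on the worker, nothing ever starts it again. One way to close the loop is to call back into supervisord over its XML-RPC interface. The sketch below is only an illustration under stated assumptions, not supervisord's own mechanism: it assumes the worker program is named my_worker, that a fixed 30-second delay is acceptable, and that supervisord's RPC interface is enabled (a unix_http_server or inet_http_server section plus the default [rpcinterface:supervisor] block), which also makes supervisord set SUPERVISOR_SERVER_URL in the listener's environment:

#!/usr/bin/env python3
# Sketch of a delayed-restart listener. WATCHED_PROCESS and RESTART_DELAY are
# assumptions: the name must match your [program:x] section, the delay is arbitrary.
import os
import sys
import time
import xmlrpc.client

from supervisor import childutils

WATCHED_PROCESS = 'my_worker'  # hypothetical program name
RESTART_DELAY = 30             # seconds to wait before restarting

def main():
    # getRPCInterface reads SUPERVISOR_SERVER_URL from the environment,
    # which supervisord sets for its child processes.
    rpc = childutils.getRPCInterface(os.environ)
    while True:
        headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
        if headers['eventname'] == 'PROCESS_STATE_EXITED':
            # The payload is a single "key:value" token line describing the
            # process that exited; get_headers parses it into a dict.
            pheaders = childutils.get_headers(payload)
            if pheaders.get('processname') == WATCHED_PROCESS:
                time.sleep(RESTART_DELAY)  # blocks the listener during the delay
                try:
                    rpc.supervisor.startProcess(WATCHED_PROCESS)
                except xmlrpc.client.Fault:
                    pass  # e.g. ALREADY_STARTED if something else revived it
        # Acknowledge the event so supervisord sends the next one
        childutils.listener.ok(sys.stdout)

if __name__ == '__main__':
    main()

Register it exactly like the simpler listener above (events=PROCESS_STATE_EXITED) and keep autorestart=false on the worker, so the listener remains the only thing that restarts it.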
After implementing restart delays, monitor with:
supervisorctl tail my_worker
supervisorctl status
journalctl -u supervisord -f
Key metrics to watch:
- Restart frequency (should decrease; see the sketch after this list for one way to measure it)
- Message processing throughput
- System resource usage during spikes
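One rough way to quantify restart frequency is to count how often supervisord logged an exit for the worker. The sketch below only illustrates the idea, assuming the log lives at /var/log/supervisor/supervisord.log and the program is named my_worker (both assumptions to adjust), and relying on supervisord's usual "INFO exited: <name> ..." log lines:

#!/usr/bin/env python3
# Rough sketch: count worker exits logged by supervisord in the last hour.
# LOG_PATH and PROGRAM are assumptions; adjust them for your installation.
import re
from datetime import datetime, timedelta

LOG_PATH = "/var/log/supervisor/supervisord.log"  # assumed log location
PROGRAM = "my_worker"                             # assumed program name

# supervisord writes lines like:
# 2024-05-01 12:00:00,123 INFO exited: my_worker (exit status 1; not expected)
pattern = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ INFO exited: " + re.escape(PROGRAM) + r"\b"
)

cutoff = datetime.now() - timedelta(hours=1)
count = 0
with open(LOG_PATH) as log:
    for line in log:
        match = pattern.match(line)
        if match and datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S") >= cutoff:
            count += 1

print(f"{PROGRAM} exited {count} time(s) in the last hour")

Run it periodically (for example from cron) before and after introducing the delay; the count should drop noticeably once the worker stops restart-looping.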