How to Configure Monit Check Interval for Slow-Starting Processes (Resque PID File Case)



When monitoring a service such as Resque with Monit, startup is a common trouble spot: the check interval may be too short for a process that needs a long time to initialize and write its PID file. With a 30-second check cycle, Monit can conclude prematurely that the service failed to start and trigger unnecessary restart attempts.

The key configuration directives for this scenario are:

check process resque_worker
  with pidfile /path/to/resque.pid
  start program = "/usr/bin/env RAILS_ENV=production /app/script/resque start"
  stop program = "/usr/bin/env RAILS_ENV=production /app/script/resque stop"
  if changed pid then alert
  every 5 cycles  # Critical adjustment: run this check only on every 5th cycle

For a Resque worker that typically takes 2 minutes to fully initialize:

# Adjust check interval (Monit runs checks every 30s by default)
set daemon 60  # Check every 60 seconds instead of 30

check process resque_worker
  with pidfile /var/run/resque.pid
  start program = "/bin/bash -lc 'cd /app && bundle exec rake resque:work QUEUE=*'"
  stop program = "/bin/bash -lc 'kill -QUIT $(cat /var/run/resque.pid)'"
  if does not exist for 3 cycles then start  # 3 minutes grace period
  if 3 restarts within 5 cycles then timeout
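
The grace period in the comment above is roughly the cycle count multiplied by the `set daemon` interval. A minimal sketch of that arithmetic follows; the helper names `grace_seconds` and `cycles_needed` are hypothetical and not part of Monit, and the result is an approximation because Monit counts consecutive failed checks rather than wall-clock time.

```shell
#!/bin/bash
# grace_seconds CYCLES INTERVAL
# Hypothetical helper: approximate seconds covered by a rule that says
# "for CYCLES cycles" when "set daemon" is INTERVAL seconds.
grace_seconds() {
  local cycles=$1 interval=$2
  echo $(( cycles * interval ))
}

# cycles_needed STARTUP_SECONDS INTERVAL
# Smallest cycle count whose total duration covers the measured startup
# time (integer ceiling division).
cycles_needed() {
  local startup=$1 interval=$2
  echo $(( (startup + interval - 1) / interval ))
}
```

For example, `grace_seconds 3 60` prints 180, and `cycles_needed 120 60` prints 2, which suggests that 3 cycles at a 60-second interval leaves some margin for a worker that needs about 2 minutes.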

Another approach is to have the startup script write the PID file itself as soon as the worker has been launched, instead of waiting for the worker to do it:

#!/bin/bash
# Start the worker in the background
bundle exec rake resque:work QUEUE='*' &

# Record the worker's PID ($!, not the script's own $$) right away
echo $! > /var/run/resque.pid.tmp

# Rename into place so Monit never reads a half-written file
mv /var/run/resque.pid.tmp /var/run/resque.pid

After making changes, test your setup:

monit -t        # Check syntax
monit reload    # Apply changes
monit status    # Verify new intervals

If issues persist, consider these diagnostic steps:

  • Check Monit logs: tail -f /var/log/monit.log
  • Verify PID file permissions
  • Test startup time manually: time your_start_command
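
To put a number on the startup delay, one option is to poll for the PID file and report how long it took to appear. This is only a sketch: `wait_for_pidfile` is a hypothetical helper, and the one-second polling step is an arbitrary choice.

```shell
#!/bin/bash
# wait_for_pidfile PIDFILE TIMEOUT
# Polls once per second until PIDFILE exists and is non-empty, then prints
# the elapsed seconds. Returns 1 if TIMEOUT seconds pass first.
wait_for_pidfile() {
  local pidfile=$1 timeout=$2 elapsed=0
  while [ ! -s "$pidfile" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "no PID file after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$(( elapsed + 1 ))
  done
  echo "$elapsed"
}
```

Run your start command, then call something like `wait_for_pidfile /var/run/resque.pid 300`; the printed value gives a rough lower bound for the grace period Monit needs.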

When dealing with services like Resque that take significant time to initialize and create their PID files, Monit's default check interval (typically 30 seconds) can cause duplicate process spawning. This occurs because:

  • Monit performs its first check before PID file exists
  • The service appears "not started" to Monit
  • Monit initiates a second start attempt
  • Original process finally creates PID file
  • Now you have duplicate processes running
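
Independently of Monit's timing, the start script can refuse a second start while the first is still booting. The sketch below uses a lock directory, relying on `mkdir` being atomic; the function names and the lock path in the usage note are assumptions for illustration, and a stale lock left behind by a crash would need to be cleaned up separately.

```shell
#!/bin/bash
# try_start_lock LOCKDIR
# mkdir succeeds for exactly one caller, so a second start attempt that
# arrives while the first is still initializing gets a non-zero status.
try_start_lock() {
  mkdir "$1" 2>/dev/null
}

# release_start_lock LOCKDIR
# Call once the worker has written its PID file (or has failed to start).
release_start_lock() {
  rmdir "$1" 2>/dev/null
}
```

A start wrapper could begin with `try_start_lock /var/run/resque.start.lock || exit 0` and call `release_start_lock` after the PID file is in place.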

For Resque specifically, we need to adjust two key parameters in the Monit configuration:

check process resque_worker
  with pidfile /path/to/resque.pid
  start program = "/usr/bin/env rake resque:work QUEUE=*" 
  stop program = "/bin/sh -c 'kill -TERM $(cat /path/to/resque.pid)'"
  # Critical timing parameters:
  every 5 cycles               # Check every 5 monitoring cycles (default 30s ×5 = 150s)
  if changed pid then alert    # Only alert on PID changes
  if does not exist for 2 cycles then start  # Require 2 consecutive failed checks before starting
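
Because the stop command has to read the PID from a file, it can be worth moving that logic into a small script that validates the value before signalling. This is a sketch under assumptions: `stop_worker` is a hypothetical helper, and the default of QUIT is simply the signal used in the earlier example.

```shell
#!/bin/bash
# stop_worker PIDFILE [SIGNAL]
# Hypothetical helper: sends SIGNAL (default QUIT) to the PID stored in
# PIDFILE, refusing to act on a missing file or a malformed value.
stop_worker() {
  local pidfile=$1 signal=${2:-QUIT} pid
  [ -r "$pidfile" ] || { echo "missing PID file: $pidfile" >&2; return 1; }
  pid=$(cat "$pidfile")
  # Accept only a plain positive integer; 0 would signal the whole group.
  case $pid in
    ''|0|*[!0-9]*) echo "invalid PID in $pidfile" >&2; return 1 ;;
  esac
  kill -s "$signal" "$pid"
}
```

Saved as an executable script that calls `stop_worker "$@"`, it could be referenced from `stop program` with the PID file path as its argument.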

Sometimes adjusting Monit isn't enough. Here are additional approaches:

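# Option 1: Delay in startup script
# (Monit does not run commands through a shell, so "&&" needs an explicit sh -c;
#  the start timeout is raised so that it covers the delay)
start program = "/bin/sh -c 'sleep 30 && /usr/bin/env rake resque:work QUEUE=*'"
    with timeout 90 seconds

# Option 2: Use process matching instead of PID file
# (the pattern is an example; test it with "monit procmatch" first)
check process resque_worker matching "resque-[0-9]"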

For different service types, consider these timing guidelines:

Service Type               Recommended Check Interval   Start Timeout
Lightweight services       1 cycle (30s)                1 cycle
Medium services (Resque)   3-5 cycles                   2 cycles
Heavy JVM services         10+ cycles                   5 cycles

To verify your configuration is working:

# Check Monit's perspective
monit status resque_worker

# View timing logs
tail -f /var/log/monit.log | grep resque

# Manual timing test (the worker runs until interrupted, so stop it once it is ready)
time RAILS_ENV=production rake resque:work QUEUE=*