Optimal URL Monitoring with Monit: Solving Nginx/PHP-FPM Crash Recovery


When running nginx with PHP-FPM, service degradation often manifests as 502 Bad Gateway errors while the processes themselves linger on in a zombie-like state. Traditional process monitoring falls short because:

  • Processes may appear running while being non-functional
  • Multiple interdependent services (nginx + PHP-FPM) require coordinated recovery
  • Different failure thresholds demand graduated responses

Initial attempts used process monitoring with layered URL checks:

check process webserver with pidfile /var/run/nginx.pid
   if failed url https://example.com/healthcheck then alert
   if failed url https://example.com/healthcheck for 2 cycles then restart
   if failed url https://example.com/healthcheck for 4 cycles then exec "/sbin/reboot"

This had three key weaknesses:

  1. Multiple HTTP requests per monitoring cycle
  2. Process-oriented when we really care about service availability
  3. Reboot conditions could trigger infinite loops

The refined solution shifts to host monitoring with state tracking:

check host webserver with address 127.0.0.1
  if failed
    port 443 protocol https
    request "/healthcheck"
    content = "healthy"
    with timeout 20 seconds
    for 2 cycles
  then restart
  if 2 restarts within 5 cycles then exec "/usr/local/bin/escalate-recovery"
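
For the restart above to actually do anything on a check host entry, Monit needs start and stop programs declared for it (the fuller configuration later in this post does exactly that). A minimal sketch, assuming systemd-managed php-fpm and nginx units:

  # paths and unit names are assumptions -- adjust for your distro
  start program = "/usr/bin/systemctl start php-fpm nginx"
  stop program  = "/usr/bin/systemctl stop nginx php-fpm"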

Key lessons from production deployments:

  • Healthcheck Endpoint: Create a dedicated URL that tests both nginx routing and PHP execution:
    <?php
    // healthcheck.php
    header('Cache-Control: no-cache');
    die('healthy');
  • Recovery Scripts: Coordinate service restarts properly:
    #!/bin/bash
    # /usr/local/bin/webserver-recover
    # Stop both services, force-kill any processes that ignored the stop,
    # then start them back up together.
    systemctl stop php-fpm nginx
    pkill -9 php-fpm
    pkill -9 nginx
    systemctl start php-fpm nginx
  • State Management: Avoid Monit replaying stale state and queued events after a recovery by pointing its id and state files at a location that is cleared on reboot:
    # In monitrc
    set idfile /tmp/monit.id
    set statefile /tmp/monit.state
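
How long "2 cycles" or "5 cycles" actually takes depends on Monit's poll interval, which is also set in monitrc; a typical setting (the 30-second interval here is only an example) would be:

# In monitrc
set daemon 30   # poll every 30 seconds, so "for 2 cycles" means roughly a minute of sustained failure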

For critical systems, implement a multi-tier approach:

# Basic process monitoring
check process nginx with pidfile /var/run/nginx.pid

# Service-level monitoring
check host webserver with address 127.0.0.1
  if failed port 443 then alert
  if failed port 443 protocol https request "/healthcheck" then restart

# Synthetic transaction monitoring
check program api-test with path /usr/local/bin/api-smoketest
  if status != 0 for 2 cycles then alert

Common pitfalls to avoid:

  • Over-aggressive rebooting: Use escalating responses (alert → restart → failover → reboot)
  • Single monitoring point: Monitor both localhost and external DNS endpoints
  • No post-mortem hooks: Always log state before recovery actions (see the sketch below)
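
As a post-mortem hook, a state snapshot can be wired in via exec before any restart. The script below is only a sketch; the path, log directory and unit names are assumptions:

#!/bin/bash
# /usr/local/bin/pre-recovery-snapshot (hypothetical path)
# Capture service state before Monit tears anything down.
LOG_DIR=/var/log/monit-snapshots
mkdir -p "$LOG_DIR"
SNAP="$LOG_DIR/$(date +%Y%m%d-%H%M%S).log"
{
    date
    systemctl status nginx php-fpm --no-pager
    ss -tlnp                                    # which sockets are still listening?
    journalctl -u nginx -u php-fpm -n 100 --no-pager
} > "$SNAP" 2>&1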

When running nginx with PHP-FPM, service failures often manifest as 502 Bad Gateway errors while the processes themselves remain running. Traditional process monitoring alone isn't sufficient because:

  • PHP-FPM might hang while still showing as running
  • Nginx might fail to proxy requests properly
  • Ports might be open while service is unresponsive

Here's an optimized configuration that combines host checking with URL monitoring:

CHECK HOST webserver WITH ADDRESS 127.0.0.1
  START PROGRAM = "/etc/monit/webserver.start.sh"
  STOP PROGRAM = "/etc/monit/webserver.stop.sh"
  
  IF FAILED PING THEN ALERT
  IF FAILED PORT 443 PROTOCOL HTTPS THEN ALERT
  
  IF FAILED
    PORT 443 PROTOCOL HTTPS
    REQUEST "/healthcheck"
    HTTP HEADERS [Host: www.mydomain.com, Connection: close]
    CONTENT = "OK"
    WITH TIMEOUT 15 SECONDS
  FOR 2 CYCLES THEN RESTART
  
  IF 3 RESTARTS WITHIN 10 CYCLES THEN EXEC "/usr/local/bin/escalate-alert.sh"
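
The two wrapper scripts referenced by START PROGRAM and STOP PROGRAM are not shown here; a minimal sketch, assuming systemd-managed php-fpm and nginx units, could be as simple as:

#!/bin/bash
# /etc/monit/webserver.start.sh
# Start PHP-FPM first so its socket exists before nginx comes up.
systemctl start php-fpm
systemctl start nginx

#!/bin/bash
# /etc/monit/webserver.stop.sh
# Stop nginx first so no new requests reach PHP-FPM while it shuts down.
systemctl stop nginx
systemctl stop php-fpm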

The healthcheck endpoint should be a simple PHP script that:

  1. Returns quickly (no database queries)
  2. Verifies PHP execution
  3. Includes basic system checks

Example healthcheck.php:

<?php
header('Content-Type: text/plain');
try {
    // Verify PHP can execute
    if (!function_exists('version_compare')) {
        throw new Exception('PHP core functions missing');
    }
    
    // Simple file system check
    if (!is_writable('/tmp')) {
        throw new Exception('Temp directory not writable');
    }
    
    echo "OK";
} catch (Exception $e) {
    header('HTTP/1.1 503 Service Unavailable');
    echo "ERROR: " . $e->getMessage();
}
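
The Monit tests above request "/healthcheck" while the script is healthcheck.php, so nginx must map that path to the script. A minimal sketch of the location block (socket path and document root are assumptions for illustration):

# Inside the server {} block for www.mydomain.com
location = /healthcheck {
    access_log off;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME /var/www/html/healthcheck.php;
    fastcgi_pass unix:/run/php/php-fpm.sock;
}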

To prevent the reboot-loop problem described earlier, implement these safeguards:

#!/bin/bash
# /usr/local/bin/escalate-alert.sh

# Only reboot if previous reboot was >30 minutes ago
if [ -f /var/run/last_reboot ] && \
   [ $(($(date +%s) - $(date -r /var/run/last_reboot +%s))) -lt 1800 ]; then
    echo "Recent reboot detected - not rebooting again" | mail -s "Server Alert" admin@example.com
    exit 0
fi

touch /var/run/last_reboot
/sbin/reboot

For more complex scenarios, consider:

  • Adding secondary monitoring with check program scripts
  • Implementing socket connection tests
  • Using Monit's depends directive for service relationships
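
For instance, a unixsocket test on PHP-FPM combined with a depends on relationship (pid file and socket paths are assumptions) might look like:

check process php-fpm with pidfile /run/php/php-fpm.pid
  start program = "/usr/bin/systemctl start php-fpm"
  stop program  = "/usr/bin/systemctl stop php-fpm"
  if failed unixsocket /run/php/php-fpm.sock then restart

check process nginx with pidfile /var/run/nginx.pid
  start program = "/usr/bin/systemctl start nginx"
  stop program  = "/usr/bin/systemctl stop nginx"
  depends on php-fpm
  if failed port 443 protocol https then restart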

Example program check:

CHECK PROGRAM php-fpm-health WITH PATH "/usr/local/bin/check_php_fpm.sh"
  IF STATUS != 0 FOR 2 CYCLES THEN ALERT

Where check_php_fpm.sh might query PHP-FPM's FastCGI ping endpoint with the cgi-fcgi tool, for example:

#!/bin/bash
# Query PHP-FPM's FastCGI ping endpoint via cgi-fcgi (from the libfcgi
# package). Adjust the socket path to match your pool's "listen" setting.
FPM_SOCKET="/run/php/php-fpm.sock"
if ! SCRIPT_NAME=/ping SCRIPT_FILENAME=/ping REQUEST_METHOD=GET \
     timeout 2 cgi-fcgi -bind -connect "$FPM_SOCKET" 2>/dev/null | grep -qi pong; then
    exit 1
fi
exit 0
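
This assumes the pool exposes PHP-FPM's built-in FastCGI ping endpoint, enabled in the pool configuration (the file path below is just an example):

; e.g. /etc/php/8.2/fpm/pool.d/www.conf
ping.path = /ping
ping.response = pong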