How to Automatically Restart Crashed Varnish Processes: Monitoring Solutions with Monit and Alternatives


When a critical process such as Varnish crashes intermittently, production suffers. A common failure mode is that the monitoring tool (here, Monit) detects the failure but its restart attempts fail, leaving the service down.

Your existing setup is structured sensibly but needs some adjustments:

check process varnish with pidfile /var/run/varnish.pid
    start program = "/etc/init.d/varnish start" with timeout 60 seconds
    stop program = "/etc/init.d/varnish stop"
    if failed host 192.168.1.100 port 80 protocol http
        and request "/healthcheck" then restart
    if 3 restarts within 5 cycles then timeout
    group server

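Several factors can cause restart failures:

  • A stale PID file that was not cleaned up
  • Port conflicts when the new instance tries to bind
  • Resource starvation preventing a new instance from starting
  • An improper shutdown sequence that leaves old processes behind
  • The "then timeout" rule itself: after 3 restarts within 5 cycles, Monit stops monitoring the service until monitoring is re-enabled manually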
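The stale PID file case is easy to test for before a start attempt. As a minimal sketch (the function name pidfile_is_stale is hypothetical, not part of Monit or Varnish), this checks whether a PID file names a process that no longer exists:

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) when the PID file exists but the
# process it names is gone, i.e. the file is stale.
pidfile_is_stale() {
    local pidfile="$1" pid
    # No PID file at all: nothing stale to clean up.
    [ -f "$pidfile" ] || return 1
    pid=$(tr -d '[:space:]' < "$pidfile")
    # Empty or non-numeric content is treated as stale.
    case $pid in
        ''|*[!0-9]*) return 0 ;;
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # Run this as root (as Monit does) so a permission error is not
    # mistaken for a dead process.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

A start wrapper could then run something like pidfile_is_stale /var/run/varnish.pid && rm -f /var/run/varnish.pid before calling the init script.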

Try this more robust configuration that handles edge cases better:

check process varnish with pidfile /var/run/varnish.pid
    # Run the init script as root (Monit's default): it must bind port 80 and write the PID file
    start program = "/bin/bash -c '/etc/init.d/varnish stop; sleep 2; /etc/init.d/varnish start'"
        with timeout 90 seconds
    stop program = "/etc/init.d/varnish stop"
        with timeout 30 seconds
    if failed host 127.0.0.1 port 80
        protocol http request "/healthcheck"
        with timeout 10 seconds for 3 times within 5 cycles
        then restart
    if 5 restarts within 10 cycles then exec "/usr/local/bin/alert_admin.sh"
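    depends on varnish_bin
    group cache_services

# "depends on" needs a matching check for the service it names
check file varnish_bin with path /usr/sbin/varnishd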
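The configuration calls /usr/local/bin/alert_admin.sh, but its contents are not shown. Here is a minimal sketch of what such a script might look like. The ALERT_TO variable, the message wording, and the fallback to logger are all assumptions; the MONIT_* variables are the ones Monit exports to programs it executes.

```shell
#!/bin/bash
# Sketch of an alert script for Monit's "exec" action.
# Monit exports MONIT_SERVICE, MONIT_HOST, MONIT_EVENT and MONIT_DESCRIPTION
# to the program it runs; the fallbacks below are only for manual testing.

build_alert() {
    printf '%s on %s: %s (%s)\n' \
        "${MONIT_SERVICE:-varnish}" \
        "${MONIT_HOST:-$(uname -n)}" \
        "${MONIT_EVENT:-restart limit reached}" \
        "${MONIT_DESCRIPTION:-no description}"
}

send_alert() {
    # ALERT_TO is a hypothetical variable naming the mail recipient.
    if [ -n "${ALERT_TO:-}" ] && command -v mail >/dev/null 2>&1; then
        build_alert | mail -s "Varnish alert" "$ALERT_TO"
    else
        # Without a recipient or a mail client, write to syslog instead.
        build_alert | logger -t varnish-alert
    fi
}

# Send only when invoked by Monit, which always sets MONIT_SERVICE.
if [ -n "${MONIT_SERVICE:-}" ]; then
    send_alert
fi
```

Keeping the message formatting separate from delivery makes it possible to check the output by hand before relying on it in production.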

If Monit continues to be problematic, consider these alternatives:

Systemd Service Recovery

For systems using systemd, add these directives to your service unit file:

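# On systemd 230 and later the start-limit settings belong in [Unit];
# older versions use StartLimitInterval= in [Service] instead
[Unit]
StartLimitIntervalSec=100s
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5s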
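Rather than editing the packaged unit file, these settings can go in a drop-in so that package upgrades do not overwrite them. The sketch below assumes the unit is named varnish.service and a recent systemd, where the start-limit settings live in [Unit]; the function name and the file name restart.conf are arbitrary choices.

```shell
#!/bin/bash
# Write a restart-policy drop-in into the given directory.
# On a real system the directory would be
# /etc/systemd/system/varnish.service.d (assuming the unit is varnish.service).
write_restart_dropin() {
    local dir="$1"
    mkdir -p "$dir" || return 1
    cat > "$dir/restart.conf" <<'EOF'
[Unit]
StartLimitIntervalSec=100s
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5s
EOF
}
```

After running write_restart_dropin /etc/systemd/system/varnish.service.d as root, systemctl daemon-reload picks up the change.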

Supervisor Approach

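Supervisord restarts a program whenever it exits unexpectedly. It can only supervise processes that stay in the foreground, so varnishd needs the -F flag:

[program:varnish]
; Started as root so it can bind port 80; the unix jail drops privileges to the varnish user
command=/usr/sbin/varnishd -j unix,user=varnish -F -a :80 -f /etc/varnish/default.vcl -s malloc,256m
autostart=true
autorestart=true
startretries=3
stderr_logfile=/var/log/varnish.err.log
stdout_logfile=/var/log/varnish.out.log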

When automatic restarts fail, check these components:

  1. Examine /var/log/varnish.log for startup errors
  2. Verify permissions on PID file directory
  3. Test manual start/stop sequences
  4. Check for port conflicts with netstat -tulnp | grep ':80 ' (or ss -tlnp where netstat is not installed)
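The first step, scanning the log, can be scripted so that it runs the same way every time. This is a rough sketch: the function name is hypothetical, and the keyword list is a guess rather than an exhaustive set of Varnish messages.

```shell
#!/bin/bash
# Print lines from the last N lines of a log that look like startup problems.
# Exit status: 0 if something matched, 1 if nothing matched, 2 if unreadable.
recent_log_errors() {
    local logfile="$1" lines="${2:-200}"
    [ -r "$logfile" ] || { echo "cannot read $logfile" >&2; return 2; }
    tail -n "$lines" "$logfile" | grep -iE 'error|panic|fail|denied'
}
```

For example, recent_log_errors /var/log/varnish.log 500 looks at the last 500 lines of the log named in step 1.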

A more comprehensive health check can prevent false positives:

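#!/bin/bash
# Healthy if the HTTP check returns 200 or, failing that, if the
# Varnish management interface still answers a ping.
# --max-time stops a hung connection from stalling the check.
response=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" http://localhost/healthcheck)
if [ "$response" = "200" ]; then
    exit 0
elif varnishadm ping 2>/dev/null | grep -q "PONG"; then
    exit 0
else
    exit 1
fi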

Varnish, the high-performance HTTP accelerator, sometimes crashes unexpectedly, and a standard Monit configuration often fails to restart it, leaving the site down. Here is a closer look at why this happens and how to fix it.

The typical Monit configuration has several potential failure points when dealing with Varnish:

check process varnish with pidfile /var/run/varnish.pid
    start program = "/etc/init.d/varnish start" with timeout 60 seconds
    stop program = "/etc/init.d/varnish stop"
    if failed host 127.0.0.1 port 80 protocol http
        and request "/blank.html" then restart
    if 3 restarts within 5 cycles then timeout

Common issues include:

  • The pidfile location might be incorrect (varies by distro)
  • Init scripts might not properly clean up stale processes
  • Port 80 checks might fail even when Varnish is technically running
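Because the PID file location varies, it helps to confirm which path actually exists before pointing Monit at it. A small sketch, with a hypothetical function name:

```shell
#!/bin/bash
# Print the first existing regular file among the candidate paths given
# as arguments; fail if none of them exists.
first_existing() {
    local path
    for path in "$@"; do
        if [ -f "$path" ]; then
            printf '%s\n' "$path"
            return 0
        fi
    done
    return 1
}
```

Running first_existing /var/run/varnish.pid /run/varnish.pid while Varnish is up tests the two paths used in this article; other distro-specific candidates can be appended to the argument list.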

Option 1: Enhanced Monit Configuration

Try this more robust configuration that adds additional checks and proper cleanup:

check process varnish with pidfile /var/run/varnish.pid
    start program = "/bin/bash -c '/etc/init.d/varnish stop; sleep 2; /etc/init.d/varnish start'"
    stop program = "/etc/init.d/varnish stop"
    if failed host 127.0.0.1 port 80 protocol http 
        and request "/blank.html" with timeout 15 seconds for 3 times within 4 cycles then restart
    if 5 restarts within 5 cycles then exec "/bin/bash -c 'echo \"Varnish keeps crashing\" | mail -s \"Varnish Alert\" admin@example.com'"
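    # varnish_bin must be defined elsewhere, e.g. "check file varnish_bin with path /usr/sbin/varnishd"
    depends on varnish_bin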
    group varnish

Option 2: Systemd-based Solution

For systems using systemd, create a service unit with automatic restart:

[Unit]
Description=Varnish HTTP accelerator
After=network.target

[Service]
# -F keeps varnishd in the foreground, so the service type is simple and no PIDFile is needed
Type=simple
Restart=always
RestartSec=5
ExecStart=/usr/sbin/varnishd -j unix,user=varnish -F -a :80 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,256m
ExecReload=/usr/sbin/varnishreload

[Install]
WantedBy=multi-user.target

Option 3: Supervisord Alternative

For more control, consider using Supervisord:

[program:varnish]
command=/usr/sbin/varnishd -j unix,user=varnish -F -a :80 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,256m
autostart=true
autorestart=true
startretries=5
stderr_logfile=/var/log/varnish/supervisor_err.log
stdout_logfile=/var/log/varnish/supervisor_out.log

When troubleshooting Varnish crashes:

  • Check shared memory log activity with varnishstat -1 -f 'MAIN.shm_*'
  • Monitor worker threads: varnishstat -1 -f 'MAIN.threads*'
  • Verify backend health: varnishlog -g raw -i Backend_health
  • Examine recent panics: varnishadm panic.show, or journalctl -u varnish | grep -i panic
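To act on these counters from a script, a single value can be pulled out of the one-shot varnishstat output. The sketch below assumes each line has the form "NAME VALUE RATE DESCRIPTION"; the function name is hypothetical.

```shell
#!/bin/bash
# Read "varnishstat -1" style output on stdin and print the value of one
# counter. Fails (exit 1) if the counter is not present.
counter_value() {
    awk -v name="$1" '$1 == name { print $2; found = 1 } END { exit !found }'
}
```

For example, varnishstat -1 | counter_value MAIN.threads_limited prints how often the thread pool hit its maximum, which is worth watching before a crash.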

To reduce crashes:

  • Implement proper VCL error handling
  • Set conservative timeouts for backends
  • Monitor memory usage and adjust malloc allocation
  • Regularly check for and install security updates
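For the memory point, a simple threshold test can feed an alert. This is a sketch only: the function name is hypothetical, and the 150% headroom in the example is an arbitrary assumption, chosen because resident memory in practice often sits above the -s malloc figure.

```shell
#!/bin/bash
# Succeed when resident memory (in kB) exceeds a percentage of a limit (in MB).
# All three arguments must be integers.
rss_over_limit() {
    local rss_kb="$1" limit_mb="$2" pct="$3"
    [ "$rss_kb" -gt $(( $limit_mb * 1024 * $pct / 100 )) ]
}
```

With -s malloc,256m, rss_over_limit 400000 256 150 succeeds because 400000 kB is above 150% of 256 MB (393216 kB). The resident size itself can be read from the RSS column of ps for the Varnish child process.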