Implementing High Availability for Cron Jobs on Debian: Failover Strategies Without Heartbeat



When running critical cron jobs across redundant Debian servers, we face a common challenge: ensuring a job runs only once despite having multiple servers. Traditional approaches like file locking don’t handle server failures elegantly. Here’s a lightweight HA solution without relying on Heartbeat/Pacemaker.

The idea is to leverage an atomic filesystem operation, symlink creation, to "claim" cron execution rights. Note that the lock file must live on a path both servers can see (shared storage), or the two nodes will each happily claim their own local lock. This script (/usr/local/bin/cron-leader) tries to claim the lock and reports which server holds it:

#!/bin/bash
LOCK_FILE="/etc/cron.d/ha-cron.lock"
THIS_SERVER=$(hostname -s)

# Attempt to claim the lock
if ln -s "$THIS_SERVER" "$LOCK_FILE" 2>/dev/null; then
  echo "[$(date)] Lock acquired by $THIS_SERVER"
  exit 0
else
  echo "[$(date)] Lock held by $(readlink "$LOCK_FILE")"
  exit 1
fi

In /etc/cron.d/ha-job:

# Run every hour, but only on the leader
0 * * * * root /usr/local/bin/cron-leader && /path/to/actual/script.sh

For more robust failure detection, use Consul’s session-based locks:

consul lock -child-exit-code -verbose cron-jobs /path/to/script.sh
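In a cron entry, the same command wraps the job so only the session-lock holder runs it. A sketch, assuming the `/etc/cron.d` file path is yours to choose and that your Consul version supports the `-timeout` flag to bound how long standby nodes wait:

```shell
# /etc/cron.d/ha-consul-job (hypothetical path)
# Only the node that wins the Consul session lock runs the job; the
# other nodes give up after 10 seconds instead of queuing behind it.
0 * * * * root consul lock -timeout=10s -child-exit-code cron-jobs /path/to/script.sh
```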
Whichever mechanism you choose, keep these caveats in mind:

  • NFS Consideration: If using shared storage, ensure flock works across nodes (lock support depends on the NFS version and mount options)
  • Clock Drift: Synchronize time with chrony or ntpd, since timeout-based takeover compares timestamps across servers
  • Cleanup: Add trap handlers to remove stale locks
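The cleanup point can be sketched as a trap handler added to cron-leader. A minimal sketch: it removes the lock only if this host still owns it, so it never deletes a lock another server has since claimed.

```shell
#!/bin/bash
LOCK_FILE="/etc/cron.d/ha-cron.lock"

# Remove the lock on normal exit or interruption, but only if we own it.
cleanup() {
  if [ "$(readlink "$LOCK_FILE" 2>/dev/null)" = "$(hostname -s)" ]; then
    rm -f "$LOCK_FILE"
  fi
}
trap cleanup EXIT INT TERM
```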

The basic symlink approach has a gap: the lock is never released or expired, so if the leader dies, nobody runs the job. The following variant adds a five-minute timeout so a surviving server can take over the lock, again without complex HA frameworks like Heartbeat/Pacemaker. The lock must live on shared storage (an NFS mount in this example).


#!/bin/bash
# /usr/local/bin/cron-failover.sh

LOCK_FILE="/mnt/nas-share/cron.lock"   # must be on storage shared by all nodes
SERVER_ID=$(hostname -s)

# Take the lock if any of the following holds:
#   1. nobody holds it (atomic symlink creation succeeds),
#   2. it is stale (not refreshed for 5+ minutes), or
#   3. we already hold it.
# Note the ln -sf refresh replaces the symlink (remove + recreate), which is
# not atomic, so two servers racing on a stale lock can briefly both claim it.
if ln -s "$SERVER_ID" "$LOCK_FILE" 2>/dev/null ||
   [ -n "$(find "$LOCK_FILE" -mmin +5 2>/dev/null)" ] ||
   [ "$(readlink "$LOCK_FILE")" = "$SERVER_ID" ]; then
    ln -sf "$SERVER_ID" "$LOCK_FILE"   # refresh the lock's timestamp
else
    exit 0   # another server holds a fresh lock
fi

# Your actual cron job commands here
/path/to/your/script.sh

# Optional: release the lock when done (otherwise it expires via the timeout)
# rm -f "$LOCK_FILE"

For the lock file mechanism to work reliably, consider these shared storage options:

  • NFS share mounted on both servers
  • GlusterFS distributed filesystem
  • DRBD block-level replication
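If the shared mount supports POSIX locking (see the NFS caveat above), flock(1) gives you atomic acquisition without the symlink race. A minimal sketch, assuming the same NFS share; `run_exclusive` is a hypothetical helper name:

```shell
#!/bin/bash
# Run a command under an exclusive flock on a file on the shared mount.
# If another node (or process) already holds the lock, give up immediately
# instead of queuing behind the holder.
run_exclusive() (
  lockfile="$1"; shift
  exec 9>"$lockfile" || return 1   # open (and create) the lock file on fd 9
  flock -n 9 || return 1           # non-blocking: fail fast if already held
  "$@"                             # lock is released when the subshell exits
)

# Cron usage (sketch):
# 0 * * * * root run_exclusive /mnt/nas-share/cron.flock /path/to/your/script.sh
```

The function body is a subshell, so fd 9 closes (and the lock releases) as soon as the wrapped command finishes, even if it crashes.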

A shared database can serve as the lock store instead of a shared filesystem. The query below (assuming a cron_locks table keyed on job_name) takes over the row only when the holder's timestamp is more than five minutes old; after executing it, SELECT the server column and run the job only if it matches this host:

# MySQL/MariaDB implementation example
# ON DUPLICATE KEY UPDATE assignments evaluate left to right, so the
# timestamp refresh sees the (possibly taken-over) server value.
LOCK_QUERY="INSERT INTO cron_locks (job_name, server, timestamp)
            VALUES ('${JOB_NAME}', '${SERVER_ID}', NOW())
            ON DUPLICATE KEY UPDATE
            server = IF(timestamp < DATE_SUB(NOW(), INTERVAL 5 MINUTE),
                        VALUES(server), server),
            timestamp = IF(server = VALUES(server),
                           VALUES(timestamp), timestamp)"

# /etc/cron.d/HA-job
# Runs every minute; cron-failover.sh itself decides whether this node leads
* * * * * root /usr/local/bin/cron-failover.sh >> /var/log/ha-cron.log 2>&1

Implement these checks to ensure failover reliability:


#!/bin/bash
# Monitoring script example: alert if the lock has not been refreshed recently.
# stat without -L reports the symlink's own mtime, which ln -sf refreshes;
# a missing lock file yields age 0 and also triggers the alert.
LOCK_AGE=$(stat -c %Y /mnt/nas-share/cron.lock 2>/dev/null || echo 0)
CURRENT_TIME=$(date +%s)
MAX_AGE=600 # 10 minutes

if [ $((CURRENT_TIME - LOCK_AGE)) -gt "$MAX_AGE" ]; then
    echo "CRITICAL: cron lock file missing or stale" | mail -s "Cron Failover Alert" admin@example.com
fi
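To make the check continuous, the monitor itself can run from cron on both nodes. A sketch; the script path and file name are hypothetical:

```shell
# /etc/cron.d/ha-cron-monitor (hypothetical file)
# Check lock freshness every 5 minutes on every server, so the alert
# fires even if the former leader is the node that died.
*/5 * * * * root /usr/local/bin/check-cron-lock.sh
```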