Debugging Sporadic Cron Job Failures in CentOS 6.6: When crond Silently Skips Specific Jobs


On several CentOS 6.6 servers running cronie-1.4.4, crond occasionally skips one job while running others scheduled for the same minute. The backup script /pg_backup.sh, scheduled daily at 21:00, is sometimes absent from /var/log/cron.log for a given day, and no error message is logged.

OS: CentOS 6.6
Packages:
crontabs-1.10-33.el6.noarch
cronie-1.4.4-12.el6.x86_64
cronie-anacron-1.4.4-12.el6.x86_64
kernel-2.6.32-504.3.3.el6.x86_64

The failing job appears as the last entry in the crontab:

# tail -2 /var/spool/cron/postgres
*  * * * * OTHERJOB
0 21 * * * /pg_backup.sh

Logs show inconsistent execution patterns:

Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19394]: (root) CMD (OTHERJOB)
Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19418]: (postgres) CMD (/pg_backup.sh)
Apr  1 21:00:02 SERVERNAME [cron.info] CROND[31349]: (root) CMD (OTHERJOB)
# Missing pg_backup.sh execution on Apr 1
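
To see how often this happens, the log can be scanned for days on which crond logged activity but never launched the backup. The following is a minimal sketch, assuming the log format shown above; the function name missing_days is hypothetical, and the log path passed to it must match your syslog setup.

```shell
#!/bin/bash
# Print each day ("Mon D") that has CROND entries but no CMD line for the job.
# Assumes syslog-style lines beginning with "Mon D HH:MM:SS", as in the excerpt above.
missing_days() {
    local log="$1" job="$2"
    awk -v job="$job" '
        / CROND\[/ {
            day = $1 " " $2
            # Remember days in the order they first appear
            if (!(day in seen)) { seen[day] = 0; order[++n] = day }
            # Literal substring match, so slashes and dots in the job need no escaping
            if (index($0, "CMD (" job ")")) seen[day] = 1
        }
        END { for (i = 1; i <= n; i++) if (!seen[order[i]]) print order[i] }
    ' "$log"
}
```

Calling missing_days /var/log/cron /pg_backup.sh lists the affected days. The current day also shows up if the check runs before 21:00, so it is best run late in the evening or against rotated logs.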

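Immediate Fix: Make sure the last job line in the crontab is terminated by a newline character. Adding one often resolves the issue:

# Before (the file ends right after the command, with no newline)
0 21 * * * /pg_backup.sh

# After (the job line is followed by a newline; a trailing comment line guarantees it)
0 21 * * * /pg_backup.sh
# end of crontab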
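
Whether a crontab really lacks the final newline is not visible in most editors, so it helps to check the last byte directly. Here is a small sketch that lists every file in a directory whose last byte is not a newline; the function name list_unterminated is hypothetical, and /var/spool/cron is assumed to be the spool directory (the CentOS default).

```shell
#!/bin/bash
# Print every regular, non-empty file in a directory whose last byte is not a newline.
list_unterminated() {
    local dir="$1" f
    for f in "$dir"/*; do
        [ -f "$f" ] && [ -s "$f" ] || continue
        # Command substitution strips a trailing newline, so a non-empty
        # result means the final byte is some other character.
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run as root, list_unterminated /var/spool/cron prints the affected crontabs, and prints nothing if all of them are properly terminated.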

Alternative Approaches:

  1. Create a wrapper script that adds redundancy. Note that this only helps when the job starts and then fails; it cannot recover a run that crond never launched:

    #!/bin/bash
    # /usr/local/bin/run_with_fallback.sh

    # Try running the original command
    if ! "$@"; then
        # "$*" joins the arguments into one string for the log message
        logger -t cronwrapper "Primary execution failed for: $*"
        # Second attempt after 60 seconds
        sleep 60
        "$@" || logger -t cronwrapper "Fallback execution failed for: $*"
    fi

  2. Modify your crontab to use the wrapper:

    0 21 * * * /usr/local/bin/run_with_fallback.sh /pg_backup.sh

  3. Implement cron monitoring with a dead man's switch:

    #!/bin/bash
    # /etc/cron.hourly/check_cron_execution

    MARKER_FILE="/var/run/last_backup.marker"

    # Run the backup if the marker is missing or older than 24 hours.
    # pg_backup.sh should also touch the marker after a successful scheduled
    # run; otherwise this check repeats the backup every day.
    if [ ! -f "$MARKER_FILE" ] || [ "$(find "$MARKER_FILE" -mtime +0)" ]; then
        /pg_backup.sh && touch "$MARKER_FILE"
    fi


While the exact cause remains unclear, several factors may contribute:

  • Cron's line parsing implementation in older versions of cronie
  • Race conditions when multiple jobs are scheduled simultaneously
  • Memory management issues during job queue processing

For mission-critical cron jobs in legacy environments:

# Implement a two-layer verification system
0 21 * * * /pg_backup.sh
30 21 * * * /verify_backup_ran.sh || /pg_backup.sh

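# Where verify_backup_ran.sh contains:
#!/bin/bash
# Succeeds only if the log is non-empty and contains a completion message.
# Caveat: this also matches a message left over from an earlier day unless
# pg_backup.sh truncates or rotates the log at the start of each run.
LOG="/var/log/backup.log"
[ -s "$LOG" ] && grep -q "Backup completed" "$LOG"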
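
A log grep can be satisfied by an old entry, so an alternative is to verify the artifact itself. The sketch below assumes the backup writes a dump named pg_YYYYMMDD.sql.gz into /backups, as the pg_dumpall example in this post does; the function name backup_ran_today is hypothetical.

```shell
#!/bin/bash
# Succeed only if a non-empty, intact gzip dump dated today exists.
# The default directory /backups is an assumption; pass another one as $1.
backup_ran_today() {
    local dir="${1:-/backups}"
    local dump="$dir/pg_$(date +%Y%m%d).sql.gz"
    # -s: file exists and is non-empty; gzip -t: archive passes its integrity check
    [ -s "$dump" ] && gzip -t "$dump" 2>/dev/null
}
```

Used as the body of /verify_backup_ran.sh (a final line calling backup_ran_today), the 21:30 fallback entry then fires only when today's dump is missing or truncated.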

After months of running backup scripts on CentOS 6.6 servers, we noticed that /pg_backup.sh would occasionally vanish from the execution logs without any error message. The cron daemon simply skipped it, while other jobs scheduled for the same minute (like OTHERJOB) ran normally.

# Typical log when working
Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19394]: (root) CMD (OTHERJOB)
Mar 31 21:00:02 SERVERNAME [cron.info] CROND[19418]: (postgres) CMD (/pg_backup.sh)

# Failure case - backup script missing
Apr  1 21:00:02 SERVERNAME [cron.info] CROND[31349]: (root) CMD (OTHERJOB)

The failing job was consistently the last entry in the crontab:

# /var/spool/cron/postgres contents
*  * * * * OTHERJOB
0 21 * * * /pg_backup.sh
# No trailing newline - potential red flag

The root cause is not confirmed, but strace logging of crond suggested that its crontab parsing becomes unreliable when:

  1. The crontab lacks a terminating newline
  2. Multiple jobs trigger simultaneously
  3. System load peaks during execution

Solution 1: Enforce Newline Termination

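# Fix crontab formatting: append a newline only if the last byte is not one already
CRONTAB=/var/spool/cron/postgres
[ -n "$(tail -c 1 "$CRONTAB")" ] && echo >> "$CRONTAB"
service crond restart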

Solution 2: Implement Lockfile Guarding

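#!/bin/bash
# /pg_backup.sh modified version
set -o pipefail   # report failure if pg_dumpall fails, not only if gzip fails
LOCKFILE=/tmp/pg_backup.lock

# With noclobber the redirect fails if the file exists, so check-and-create is atomic
if ! ( set -o noclobber; echo "$$" > "$LOCKFILE" ) 2>/dev/null; then
    echo "Backup already running" >> /var/log/pg_backup.log
    exit 1
fi

# Remove the lock on exit; installed only after the lock was acquired
trap 'rm -f "$LOCKFILE"' EXIT

# Actual backup logic here
pg_dumpall | gzip > /backups/pg_$(date +%Y%m%d).sql.gz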
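
A plain lockfile stays behind if the script is killed before its cleanup runs, which then blocks every later backup. As a possible alternative, here is a sketch using flock(1), assuming the utility is installed (it ships with util-linux); the kernel drops the lock when the process exits. The function name run_locked and the lock path in the usage note are hypothetical.

```shell
#!/bin/bash
# Run a command while holding an exclusive, non-blocking lock on a file.
# Returns 1 without running the command if another process holds the lock.
run_locked() {
    local lockfile="$1"; shift
    (
        # fd 9 is opened on the lock file by the redirection below
        flock -n 9 || { echo "already running: $*" >&2; exit 1; }
        "$@"
    ) 9>"$lockfile"
}
```

In the crontab this could wrap the job from a small script, for example run_locked /var/lock/pg_backup.lock /pg_backup.sh. The lock file itself is never deleted, which is harmless because only the lock held on it matters.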

Add this health check script to run hourly:

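#!/bin/bash
# check_backup_execution.sh
# Most recent scheduled run: today 21:00 if that has passed, otherwise yesterday 21:00
if [ "$(date +%-H)" -ge 21 ]; then DAY="today"; else DAY="yesterday"; fi
# %e pads single-digit days with a space, matching syslog ("Apr  1")
EXPECTED=$(date -d "$DAY" +"%b %e 21:00")

# Seconds vary from run to run, so match the timestamp down to the minute only
if ! grep "pg_backup.sh" /var/log/cron | grep -q "^${EXPECTED}:"; then
    echo "WARNING: Backup missed execution on ${EXPECTED}" | mail -s "Cron Alert" admin@example.com
fi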

For environments stuck on cronie-1.4.4:

  • Set up secondary monitoring outside crond, for example from a separate monitoring host (CentOS 6 uses Upstart, so systemd timers are not available)
  • Consider wrapping cron jobs in supervisor scripts
  • Log all cron executions to a dedicated file:
# /etc/rsyslog.d/cron.conf
cron.* /var/log/cron_audit.log