Debugging Cacti Poller Hanging Issue: Database Connection Drops During Process Execution

During my recent Cacti (0.8.7g) deployment on Debian 6.0.5, I encountered a particularly frustrating issue where the poller would hang indefinitely while waiting for non-existent child processes. Here's the technical deep dive into what I discovered.

The primary symptom was graphs not updating, with the poller consistently exceeding its maximum allowed runtime of 298 seconds (just under the 300-second cron interval). The key error message was:

POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.

Manual execution revealed the poller would:

  1. Process data normally for about 1 minute
  2. Begin looping "Waiting on 1 of 1 pollers"
  3. Eventually time out after 298 seconds

The root cause emerged when examining the database interaction pattern:

// Original problematic query
$finished_processes = db_fetch_cell("SELECT count(*) FROM cacti.poller_time 
WHERE poller_id=0 AND end_time>'0000-00-00 00:00:00'");

Debug output showed the query would initially work, then suddenly return NULL:

Finished: 0 - Started: 1
Waiting on 1 of 1 pollers.
Finished: 1 - Started: 1
Finished:  - Started: 1  // NULL value appearing
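
Those "Finished/Started" lines came from a quick debug print I dropped into poller.php's wait loop. Here's a sketch of the instrumentation; the $started_processes name is my assumption about the loop's counter variable:

// Debug print inside poller.php's wait loop
$finished_processes = db_fetch_cell("SELECT count(*) FROM cacti.poller_time
    WHERE poller_id=0 AND end_time>'0000-00-00 00:00:00'");
print "Finished: " . $finished_processes . " - Started: " . $started_processes . "\n";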

Direct MySQLi queries confirmed the data was available, suggesting an ADOdb issue:

$mysqli = new mysqli("localhost","cacti","cacti","cacti");
$result = $mysqli->query("SELECT COUNT(*) FROM poller_time 
WHERE poller_id=0 AND end_time>'0000-00-00 00:00:00';");
$row = $result->fetch_assoc();
echo $row['COUNT(*)'] . "\n"; // Prints a valid count while Cacti's query fails

Adding debug to ADOdb's _close() method revealed premature connection termination:

function _close() {
    @mysql_close($this->_connectionID);
    echo "!!!! CLOSED !!!!\n";
    debug_print_backtrace();
    $this->_connectionID = false;
}

The backtrace showed the connection was being closed during poller execution:

#0 ADODB_mysql->_close()
#1 ADOConnection->Close()
#2 db_close() called at [/usr/share/cacti/site/poller.php:455]

The fix involved modifying the connection handling in Cacti's database.php:

// Original problematic implementation
function db_close() {
    global $database_sessions;
    
    foreach($database_sessions as $id=>$conn) {
        $conn->Close();
    }
}

// Modified version with connection persistence
function db_close() {
    global $database_sessions;
    static $connection_cycles = 0;
    
    // Only close connections every 5 cycles
    if(++$connection_cycles >= 5) {
        foreach($database_sessions as $id=>$conn) {
            $conn->Close();
        }
        $connection_cycles = 0;
    }
}
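
A note on that static counter: it only persists within a single PHP process, so the every-5-cycles logic only matters when poller.php runs in its looped mode and calls db_close() once per inner cycle. Under plain cron scheduling each run starts the counter at zero, the explicit Close() is simply skipped, and PHP tears the connection down at process exit, which is harmless either way.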

Additionally, I modified the poller's timeout handling:

// In poller.php, around line 368
$timeout = 60; // seconds to wait for child processes
$start_time = microtime(true);

while (true) {
    // $started is set earlier in poller.php when the child pollers are launched
    $finished = db_fetch_cell("SELECT count(*) FROM poller_time
        WHERE poller_id=0 AND end_time>'0000-00-00 00:00:00'");
    if ($finished >= $started) break;

    if ((microtime(true) - $start_time) > $timeout) {
        cacti_log("WARNING: Poller child process timeout");
        break;
    }
    sleep(1);
}

When your Cacti poller gets stuck in a perpetual "Waiting on N of N pollers" loop for child processes that have already exited, the root cause often lies in unexpected database connection termination. Here's what's happening under the hood:

// Typical symptom in cacti.log
POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.
Waiting on 1 of 1 pollers. // This keeps repeating

The smoking gun is in the ADOdb MySQL driver's behavior. The connection drops mid-execution, yet Cacti keeps polling as if nothing happened:

// Debug output showing the disconnect
!!!! CLOSED !!!!
#0 ADODB_mysql->_close() called at [/usr/share/php/adodb/adodb.inc.php:2141]
#1 ADOConnection->Close() called at [/usr/share/cacti/site/lib/database.php:68]
#2 db_close() called at [/usr/share/cacti/site/poller.php:455]

Implement these modifications to lib/database.php to add connection verification:

function db_fetch_cell($sql, $colname = '') {
    static $connection_retries = 0;

    // Verify connection before query
    if (!db_connection_valid()) {
        if ($connection_retries++ < 3) {
            db_reconnect();
        } else {
            cacti_log("DB_CONNECTION: Failed after $connection_retries attempts");
            return NULL;
        }
    } else {
        // Reset the counter once the connection is healthy again
        $connection_retries = 0;
    }

    // Original fetch logic
    $result = db_fetch_row($sql);
    if (empty($result)) {
        return NULL;
    }
    if (empty($colname)) {
        return current($result);
    }
    return $result[$colname];
}

function db_connection_valid() {
    global $database_sessions;

    if (!isset($database_sessions['default'])) {
        return false;
    }

    try {
        // Execute() returns a recordset object on success, false on failure
        return (bool) $database_sessions['default']->Execute("SELECT 1");
    } catch (Exception $e) {
        return false;
    }
}
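
Both of the snippets above call db_reconnect(), which stock Cacti 0.8.7g does not ship. Here is a minimal sketch, assuming ADOdb's standard ADONewConnection()/Connect() API and the connection settings from include/config.php; treat it as a starting point rather than a drop-in:

// Hypothetical helper -- not part of stock Cacti 0.8.7g
function db_reconnect() {
    global $database_sessions, $database_type, $database_hostname,
           $database_username, $database_password, $database_default;

    // ADONewConnection()/Connect() are standard ADOdb calls
    $conn = ADONewConnection($database_type);
    if ($conn->Connect($database_hostname, $database_username,
            $database_password, $database_default)) {
        $database_sessions['default'] = $conn;
        cacti_log("DB_RECONNECT: Re-established connection");
        return true;
    }
    return false;
}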

Often the database disconnect coincides with RRDTool operations. Add this validation before critical operations:

// In poller.php, before rrdtool updates
if (!db_connection_valid()) {
    cacti_log("POLLER: Database connection lost before RRD updates");
    db_reconnect();
    
    // Re-validate poller_time table state
    $finished = db_fetch_cell("SELECT COUNT(*) FROM poller_time 
                             WHERE poller_id=0 AND end_time>'0000-00-00 00:00:00'");
    if ($finished === NULL) {
        exit("FATAL: Cannot re-establish database connection");
    }
}

Modify your include/config.php to enable more robust connection handling:

$database_type = "mysql";
$database_default = "cacti";
$database_hostname = "localhost";
$database_username = "cactiuser";
$database_password = "yourpassword";
$database_port = "3306";
$database_retries = 5; // Custom addition (see the snippet below)
$database_ssl = false; // Custom addition; set true for cloud deployments (stock 0.8.7g ignores it)
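
Stock 0.8.7g won't read those two custom values on its own. To make the modified db_fetch_cell() honor $database_retries, swap its hard-coded limit of 3 for the config value; a sketch under the same assumptions as above:

// In db_fetch_cell(), replace the hard-coded retry limit with the config value
global $database_retries;
$max_retries = isset($database_retries) ? (int) $database_retries : 3;
if ($connection_retries++ < $max_retries) {
    db_reconnect();
}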

After implementing these changes, monitor your cacti.log for these healthy patterns:

POLLER: Poller[0] DB_RECONNECT: Re-established connection (attempt 1)
SYSTEM STATS: Time:1.8732 Method:cmd.php Processes:1 Threads:N/A Hosts:2 RRDsProcessed:6
// RRDsProcessed should match your data source count, not your device count

The poller should now complete within normal execution timeframes without getting stuck in wait loops for non-existent processes.