RabbitMQ Node Not Running: Troubleshooting EPMD “node ‘rabbit’ not running at all” Error



When rabbitmqctl prints "epmd reports: node 'rabbit' not running at all", the CLI reached the Erlang Port Mapper Daemon (EPMD) on the host, but no node named 'rabbit' is registered with it. In practice the RabbitMQ node is either not running or died during startup, which can happen even after a service restart that appeared to succeed.

First, let's verify the actual process status beyond the service commands:

ps aux | grep beam
sudo netstat -tulnp | grep 4369
sudo rabbitmqctl status

The most frequent culprits include:

  • Cookie mismatches between nodes
  • EPMD port conflicts
  • Incomplete previous node shutdown
  • File permission issues in /var/lib/rabbitmq

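If the node holds no data you need to keep, try this full reset. Warning: removing the mnesia directory deletes every queue, exchange, user, and message stored on the node, and removing the cookie causes a new one to be generated on the next start, which will no longer match the other nodes of an existing cluster:

sudo service rabbitmq-server stop
sudo pkill -f rabbitmq
sudo pkill -f epmd
# Destructive: wipes all node data
sudo rm -rf /var/lib/rabbitmq/mnesia/
# Destructive: a fresh cookie is generated on the next start
sudo rm -f /var/lib/rabbitmq/.erlang.cookie
sudo service rabbitmq-server start
# Node names normally use the short hostname
sudo rabbitmqctl wait /var/lib/rabbitmq/mnesia/rabbit@$(hostname -s).pid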

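If the issue persists, start a separate debug node with its own log directory. The variables are passed through env because sudo does not preserve the caller's environment by default, so assignments placed before sudo would be dropped:

sudo -u rabbitmq env \
    RABBITMQ_LOG_BASE=/tmp/rabbitmq_logs \
    RABBITMQ_NODENAME=debug@localhost \
    rabbitmq-server -detached

Then check the generated logs in /tmp/rabbitmq_logs/.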

For clustered environments, ensure consistent cookies across nodes:

# On each cluster node:
sudo cat /var/lib/rabbitmq/.erlang.cookie
# If the values differ, copy one node's cookie to the others, then restart each changed node:
sudo service rabbitmq-server restart
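
Printing the cookie with cat exposes the shared secret on the terminal. One alternative is to compare digests instead. The following is a minimal sketch, assuming GNU sha256sum is available; the helper names cookie_digest and cookies_match are hypothetical and not part of RabbitMQ:

```bash
#!/usr/bin/env bash
# Sketch: compare two Erlang cookie files by SHA-256 digest
# so the secret itself is never printed.

# Print the digest of one cookie file; fail if it cannot be read.
cookie_digest() {
    [ -r "$1" ] || return 1
    sha256sum "$1" | cut -d' ' -f1
}

# Succeed only if both files are readable and have identical contents.
cookies_match() {
    local a b
    a=$(cookie_digest "$1") || return 1
    b=$(cookie_digest "$2") || return 1
    [ "$a" = "$b" ]
}
```

To compare across machines, run sudo sha256sum /var/lib/rabbitmq/.erlang.cookie on each node and compare the printed digests. The comparison is byte for byte, so a stray trailing newline in one file shows up as a mismatch.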

Verify critical directory permissions:

sudo ls -la /var/lib/rabbitmq/
sudo chown -R rabbitmq:rabbitmq /var/lib/rabbitmq
sudo chmod 600 /var/lib/rabbitmq/.erlang.cookie
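
Before changing permissions blindly, it can help to check whether the cookie mode is actually the problem. This is a rough sketch, assuming GNU stat (Linux); cookie_mode_ok is a hypothetical helper that accepts only owner-only modes 400 and 600:

```bash
#!/usr/bin/env bash
# Sketch: succeed if the file's permission bits are 400 or 600,
# i.e. group and others have no access to the cookie.
cookie_mode_ok() {
    local mode
    mode=$(stat -c '%a' "$1" 2>/dev/null) || return 1
    [[ "$mode" =~ ^[46]00$ ]]
}

# Example (run as root so the file can be inspected):
# cookie_mode_ok /var/lib/rabbitmq/.erlang.cookie || echo "cookie mode is too permissive"
```

The check covers only the mode bits; ownership still needs to be confirmed with the ls -la output above.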

To rule out the init system as the cause, start the server directly (the binary path below varies by distribution):

sudo -u rabbitmq /usr/lib/rabbitmq/bin/rabbitmq-server -detached
sudo rabbitmqctl await_startup

For EPMD-related issues specifically:

sudo epmd -kill
sudo service rabbitmq-server start
epmd -names
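
The output of epmd -names can also be checked in a script. Below is a minimal sketch, assuming registered nodes appear as lines of the form "name rabbit at port 25672"; epmd_has_node is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: read `epmd -names` output on stdin and succeed
# if a node with the given short name is registered.
epmd_has_node() {
    local node="$1"
    grep -Eq "^name ${node} at port [0-9]+"
}

# Run the check only where epmd is installed.
if command -v epmd >/dev/null 2>&1; then
    if epmd -names 2>/dev/null | epmd_has_node rabbit; then
        echo "rabbit is registered with epmd"
    else
        echo "rabbit is NOT registered with epmd"
    fi
fi
```

If epmd is up but the node is missing from the list, the node itself failed to start, and the RabbitMQ logs are the next place to look.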

For reference, this is what the full error typically looks like when rabbitmqctl cannot reach the node, even after you have attempted to start the service:

Status of node 'rabbit@hostname' ...
Error: unable to connect to node 'rabbit@hostname': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@hostname']

rabbit@hostname:
* connected to epmd (port 4369) on hostname
* epmd reports: node 'rabbit' not running at all
              no other nodes on hostname
* suggestion: start the node

Before diving deep into troubleshooting, let's verify some basic components:

# Check if epmd is running
ps aux | grep epmd

# Check RabbitMQ service status
sudo systemctl status rabbitmq-server

# Check open ports (4369 should be open for epmd)
sudo netstat -tulnp | grep 4369

Several factors could lead to this situation:

  • Cookie mismatch between nodes
  • Incorrect hostname resolution
  • Permission issues with Mnesia database
  • Leftover files from previous installations
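
Hostname resolution is easy to check directly, since the node name rabbit@hostname only works if hostname resolves. The sketch below inspects a hosts file for an exact hostname match, ignoring comments; hosts_file_has_name is a hypothetical helper, and it deliberately does not consult DNS:

```bash
#!/usr/bin/env bash
# Sketch: succeed if the hosts file maps some address to exactly this name.
# Substring matches and commented-out entries do not count.
hosts_file_has_name() {
    local file="$1" name="$2"
    awk -v name="$name" '
        /^[[:space:]]*#/ { next }          # skip full-line comments
        {
            sub(/#.*/, "")                 # drop trailing comments
            for (i = 2; i <= NF; i++)      # field 1 is the address
                if ($i == name) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Example:
# hosts_file_has_name /etc/hosts "$(hostname -s)" || echo "short hostname missing from /etc/hosts"
```

On glibc systems, getent hosts "$(hostname -s)" is a broader check because it follows the full resolver configuration, including DNS.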

Here's a step-by-step method to resolve the issue:

# 1. Stop RabbitMQ completely
sudo systemctl stop rabbitmq-server

# 2. Kill any remaining Erlang processes
sudo pkill -9 beam
sudo pkill -9 epmd

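# 3. Remove old Mnesia database (destructive: deletes all queues, users, and messages on this node)
sudo rm -rf /var/lib/rabbitmq/mnesia/

# 4. Verify cookie consistency between the server and the CLI user
sudo cat /var/lib/rabbitmq/.erlang.cookie
# Compare with the cookie in the home directory of the user who runs rabbitmqctl, for example:
sudo cat /home/ubuntu/.erlang.cookie

# 5. Clean up any leftover files (destructive: removes all logs and every
#    configuration file in conf.d, so copy anything you still need elsewhere first)
sudo rm -f /var/log/rabbitmq/*
sudo rm -f /etc/rabbitmq/conf.d/*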

# 6. Restart the service
sudo systemctl start rabbitmq-server
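
Step 3 deletes the node's data outright. A gentler variant is to move the directory aside so it can be restored if the reset does not help. This is a sketch only; backup_dir is a hypothetical helper, and the backup location in the example is an assumption:

```bash
#!/usr/bin/env bash
# Sketch: move a directory into a timestamped backup location
# and print the new path. Fails if the source does not exist.
backup_dir() {
    local src="$1" dest_root="$2" stamp target
    if [ ! -d "$src" ]; then
        echo "backup_dir: $src is not a directory" >&2
        return 1
    fi
    stamp=$(date +%Y%m%d-%H%M%S)
    target="${dest_root}/$(basename "$src")-${stamp}"
    mkdir -p "$dest_root" && mv "$src" "$target" && echo "$target"
}

# Example (as root, with RabbitMQ stopped):
# backup_dir /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/backup
```

RabbitMQ recreates the mnesia directory on the next start, so moving it has the same resetting effect as deleting it while keeping the old data available.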

If the basic approach doesn't work, try these advanced methods:

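# Start RabbitMQ in the foreground so startup errors are printed to the console
# (no -detached flag; stop it with Ctrl+C)
sudo -u rabbitmq rabbitmq-server

# Check cluster status
sudo rabbitmqctl cluster_status

# Remove a dead node from the cluster (run this on a cluster node that is still running)
sudo rabbitmqctl forget_cluster_node rabbit@hostname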

Ensure your /etc/rabbitmq/rabbitmq.conf contains proper settings:

# Sample configuration
listeners.tcp.default = 5672
management.tcp.port = 15672
# Lets the default guest user connect from remote hosts; avoid this outside test setups
loopback_users = none
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@hostname

To automate the steps above, consider this recovery script. It wipes the node's Mnesia data, so it suits disposable or easily rebuilt nodes rather than production nodes that hold state:

#!/bin/bash

# RabbitMQ recovery script. Run as root.
RABBIT_LOG="/var/log/rabbitmq/startup.log"

# Append a timestamped message to the recovery log
log() {
    echo "[$(date +"%Y-%m-%d %T")] $*" >> "$RABBIT_LOG"
}

log "Starting recovery procedure"

# Check running status
if systemctl is-active --quiet rabbitmq-server; then
    log "Service is running but not responding"
    systemctl stop rabbitmq-server
fi

# Cleanup procedures (destructive: removes all node data)
pkill -9 -f "beam|epmd"
rm -rf /var/lib/rabbitmq/mnesia/*
rm -f /var/lib/rabbitmq/.erlang.cookie.lock

# Verify hostname resolution (-w avoids matching a substring of another name)
SHORT_HOST=$(hostname -s)
if ! grep -qw "$SHORT_HOST" /etc/hosts; then
    echo "127.0.0.1 $SHORT_HOST" >> /etc/hosts
fi

# Restart service
systemctl start rabbitmq-server >> "$RABBIT_LOG" 2>&1

# Verify status (node_health_check is deprecated in recent RabbitMQ releases)
sleep 5
if rabbitmq-diagnostics -q check_running; then
    log "Recovery successful"
else
    log "Recovery failed"
fi
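
The fixed sleep 5 in the script can report a failure when the node simply needs longer to boot. One option is to poll instead. The sketch below defines a generic retry helper (a hypothetical name, not a RabbitMQ tool) that reruns a command until it succeeds or the attempts run out:

```bash
#!/usr/bin/env bash
# Sketch: retry <attempts> <delay-seconds> <command...>
# Runs the command up to <attempts> times, sleeping between tries.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
    local attempts="$1" delay="$2" i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        # Do not sleep after the final attempt
        [ "$i" -lt "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Example: wait up to about a minute for the node to answer
# retry 12 5 rabbitmq-diagnostics -q ping
```

The attempt count and delay in the example are arbitrary and should be tuned to how long the node normally takes to start.
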

To reduce the chance of this error recurring:

  • Implement proper monitoring for RabbitMQ nodes
  • Regularly backup your configuration and Mnesia data
  • Use configuration management tools for consistent setups
  • Document your cluster architecture thoroughly