When you encounter the "epmd reports: node 'rabbit' not running at all" error in RabbitMQ, it typically indicates a communication breakdown between the RabbitMQ node and the Erlang Port Mapper Daemon (EPMD). This often happens even after an apparently successful service restart.
First, let's verify the actual process status beyond the service commands:
ps aux | grep beam
sudo netstat -tulnp | grep 4369
sudo rabbitmqctl status
The most frequent culprits include:
- Cookie mismatches between nodes
- EPMD port conflicts
- Incomplete previous node shutdown
- File permission issues in /var/lib/rabbitmq
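If you want to narrow things down before resetting anything, a quick check for each of these culprits might look like this (paths assume a standard package install):
# Cookie mismatch: the server cookie and the CLI user's cookie must be identical
sudo md5sum /var/lib/rabbitmq/.erlang.cookie ~/.erlang.cookie 2>/dev/null
# EPMD port conflict: port 4369 should be held by an epmd process and nothing else
sudo netstat -tulnp | grep 4369
# Incomplete shutdown: stray beam.smp processes still alive
ps aux | grep '[b]eam'
# Permissions: everything under /var/lib/rabbitmq should belong to the rabbitmq user
sudo find /var/lib/rabbitmq ! -user rabbitmq -ls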
Try this comprehensive reset sequence. Be aware that removing the Mnesia directory wipes the node's queues, users, and other definitions, and the deleted cookie will be regenerated on the next start:
sudo service rabbitmq-server stop
sudo pkill -f rabbitmq
sudo pkill -f epmd
sudo rm -rf /var/lib/rabbitmq/mnesia/
sudo rm /var/lib/rabbitmq/.erlang.cookie
sudo service rabbitmq-server start
sudo rabbitmqctl wait /var/lib/rabbitmq/mnesia/rabbit@$(hostname -s).pid
If the issue persists, debug the Erlang node startup:
sudo mkdir -p /tmp/rabbitmq_logs && sudo chown rabbitmq:rabbitmq /tmp/rabbitmq_logs
sudo -u rabbitmq env \
  RABBITMQ_LOG_BASE=/tmp/rabbitmq_logs \
  RABBITMQ_NODENAME=debug@localhost \
  rabbitmq-server -detached
Then check the generated logs in /tmp/rabbitmq_logs/
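If the node still refuses to start, these logs usually name the cause directly. One way to scan them for the usual suspects (assuming the log base used above):
# Look for common failure signatures in the freshly generated logs
sudo grep -riE "error|crash|eacces|cookie" /tmp/rabbitmq_logs/ | tail -n 20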
For clustered environments, ensure consistent cookies across nodes:
# On each cluster node:
sudo cat /var/lib/rabbitmq/.erlang.cookie
# Manually sync if different
sudo service rabbitmq-server restart
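How you sync the cookie is up to you; the sketch below assumes node1 holds the authoritative copy, the other members are reachable over SSH as node2 and node3 (hypothetical hostnames), and passwordless sudo is available on them:
# Run from node1: push the cookie to the other nodes, fix ownership and mode, restart
for node in node2 node3; do
  sudo cat /var/lib/rabbitmq/.erlang.cookie | \
    ssh "$node" "sudo tee /var/lib/rabbitmq/.erlang.cookie > /dev/null && \
                 sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie && \
                 sudo chmod 600 /var/lib/rabbitmq/.erlang.cookie && \
                 sudo service rabbitmq-server restart"
done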
Verify critical directory permissions:
sudo ls -la /var/lib/rabbitmq/
sudo chown -R rabbitmq:rabbitmq /var/lib/rabbitmq
sudo chmod 600 /var/lib/rabbitmq/.erlang.cookie
You can also bypass the init system and start the server binary directly:
sudo -u rabbitmq /usr/lib/rabbitmq/bin/rabbitmq-server -detached
sudo rabbitmqctl await_startup
For EPMD-related issues specifically:
sudo epmd -kill
sudo service rabbitmq-server start
epmd -names
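Once the daemon and the broker are back up, a healthy node should appear in epmd's registry under the name rabbit. A quick scripted check (the exact output wording can vary slightly between Erlang versions):
# epmd prints one "name <node> at port <port>" line per registered node
if epmd -names | grep -q "^name rabbit "; then
  echo "rabbit is registered with epmd"
else
  echo "rabbit is NOT registered with epmd" >&2
fi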
In clustered setups this failure is especially common: epmd (the Erlang Port Mapper Daemon) reports that the RabbitMQ node isn't running even though you just attempted to start the service. The full diagnostic output typically looks like this:
Status of node 'rabbit@hostname' ...
Error: unable to connect to node 'rabbit@hostname': nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@hostname']
rabbit@hostname:
* connected to epmd (port 4369) on hostname
* epmd reports: node 'rabbit' not running at all
no other nodes on hostname
* suggestion: start the node
Before diving deep into troubleshooting, let's verify some basic components:
# Check if epmd is running
ps aux | grep epmd
# Check RabbitMQ service status
sudo systemctl status rabbitmq-server
# Check open ports (4369 should be open for epmd)
sudo netstat -tulnp | grep 4369
Several factors could lead to this situation:
- Cookie mismatch between nodes
- Incorrect hostname resolution
- Permission issues with Mnesia database
- Leftover files from previous installations
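Of these, incorrect hostname resolution is the quickest to rule out: the node name defaults to rabbit@<short hostname>, so that short name must resolve locally. A minimal check, assuming standard tooling:
# The short hostname used in the node name must resolve (via /etc/hosts or DNS)
hostname -s
getent hosts "$(hostname -s)" || echo "short hostname does not resolve - check /etc/hosts"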
Here's a step-by-step method to resolve the issue:
# 1. Stop RabbitMQ completely
sudo systemctl stop rabbitmq-server
# 2. Kill any remaining Erlang processes
sudo pkill -9 beam
sudo pkill -9 epmd
# 3. Remove old Mnesia database
sudo rm -rf /var/lib/rabbitmq/mnesia/
# 4. Verify cookie consistency
sudo cat /var/lib/rabbitmq/.erlang.cookie
# Compare with the cookie of the user that runs rabbitmqctl (here, the ubuntu user):
sudo cat /home/ubuntu/.erlang.cookie
# 5. Clean up leftover logs and, only if you are sure they are stale, old config snippets
sudo rm -f /var/log/rabbitmq/*
sudo rm -f /etc/rabbitmq/conf.d/*
# 6. Restart the service
sudo systemctl start rabbitmq-server
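After the restart, it is worth confirming that the node actually came up and registered itself before moving on:
# 7. Confirm the node is running and visible to epmd
sudo rabbitmqctl await_startup
sudo rabbitmqctl status | head -n 20
epmd -names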
If the basic approach doesn't work, try these advanced methods:
# Start RabbitMQ in the foreground so startup output goes straight to the console
sudo -u rabbitmq rabbitmq-server
# Check cluster status
sudo rabbitmqctl cluster_status
# Remove a stale node from the cluster metadata (run this from a healthy node)
sudo rabbitmqctl forget_cluster_node rabbit@hostname
Ensure your /etc/rabbitmq/rabbitmq.conf contains proper settings:
# Sample configuration
listeners.tcp.default = 5672
management.tcp.port = 15672
loopback_users = none
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@hostname
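For a multi-node cluster, the classic peer discovery list simply enumerates every member, and each node needs the same list. A sketch for a two-node cluster, using the hypothetical hostnames node1 and node2:
# /etc/rabbitmq/rabbitmq.conf on every member of the cluster
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@node1
cluster_formation.classic_config.nodes.2 = rabbit@node2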
For production environments, consider a recovery script along these lines (note that it wipes the node's Mnesia data as part of the cleanup):
#!/bin/bash
# RabbitMQ recovery script
RABBIT_LOG="/var/log/rabbitmq/startup.log"
DATE=$(date +"%Y-%m-%d %T")
echo "[$DATE] Starting recovery procedure" >> $RABBIT_LOG
# Check running status
if systemctl is-active --quiet rabbitmq-server; then
echo "[$DATE] Service is running but not responding" >> $RABBIT_LOG
systemctl stop rabbitmq-server
fi
# Cleanup procedures
pkill -9 -f "beam|epmd"
rm -rf /var/lib/rabbitmq/mnesia/*
rm -f /var/lib/rabbitmq/.erlang.cookie.lock
# Verify hostname resolution
HOSTNAME=$(hostname -s)
if ! grep -q "$HOSTNAME" /etc/hosts; then
echo "127.0.0.1 $HOSTNAME" >> /etc/hosts
fi
# Restart service
systemctl start rabbitmq-server >> $RABBIT_LOG 2>&1
# Verify status
sleep 5
if rabbitmqctl node_health_check; then
echo "[$DATE] Recovery successful" >> $RABBIT_LOG
else
echo "[$DATE] Recovery failed" >> $RABBIT_LOG
fi
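To use it, save the script under a name of your choosing (the path below is just an example), make it executable, and run it as root; it can also be wired into a cron job or a monitoring system's remediation hook:
# Install and run the recovery script (example filename)
sudo install -m 755 rabbitmq-recover.sh /usr/local/sbin/rabbitmq-recover.sh
sudo /usr/local/sbin/rabbitmq-recover.sh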
Finally, a few preventive practices:
- Implement proper monitoring for RabbitMQ nodes
- Regularly back up your configuration and Mnesia data (a definitions-export sketch follows this list)
- Use configuration management tools for consistent setups
- Document your cluster architecture thoroughly
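For the backup point in particular, exporting the broker's definitions (users, vhosts, queues, exchanges, bindings, policies) gives you something to re-import after a destructive reset like the ones above. A minimal sketch, assuming RabbitMQ 3.8+ where rabbitmqctl export_definitions is available:
# Export definitions to a timestamped JSON file
sudo rabbitmqctl export_definitions "/var/backups/rabbitmq-definitions-$(date +%F).json"
# After rebuilding a node, re-import the most recent export:
# sudo rabbitmqctl import_definitions /var/backups/rabbitmq-definitions-<date>.json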