How to Diagnose and Fix RabbitMQ Startup Failure After EC2 Instance Migration



When migrating RabbitMQ to a new EC2 instance, the most common failure point is the database initialization phase. The error log typically shows:

starting database ...Erlang has closed
{"init terminating in do_boot",{{nocatch,{error,{cannot_start_application,rabbit,
{bad_return,{{rabbit,start,[normal,[]]},
{'EXIT',{{case_clause,{error,{timeout_waiting_for_tables,[...]}}}}}}}}}}}

First, verify the Mnesia database state:

sudo ls -la /var/lib/rabbitmq/mnesia/rabbit

Check for corrupted files with:

sudo rabbitmqctl eval 'mnesia:info().'

If the database is corrupted from the instance migration, follow these steps:

# Stop RabbitMQ if running
sudo rabbitmqctl stop_app

# Force reset the node
sudo rabbitmqctl force_reset

# Alternatively, for a complete cleanup (this deletes all queues, messages, and users):
sudo rm -rf /var/lib/rabbitmq/mnesia/*
sudo rm /var/lib/rabbitmq/.erlang.cookie
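
The reset sequence above can be scripted for repeatability. A minimal Python sketch, with a dry-run default because the steps are destructive (the reset_commands and run_reset helpers are illustrative, not part of RabbitMQ):

```python
import subprocess

# Ordered reset steps from the section above. Paths assume the
# default Debian/Ubuntu layout for RabbitMQ.
RESET_STEPS = [
    ["rabbitmqctl", "stop_app"],
    ["rabbitmqctl", "force_reset"],
]

def reset_commands(wipe_mnesia=False):
    """Return the commands to run; optionally append the full-wipe variant."""
    steps = [list(cmd) for cmd in RESET_STEPS]
    if wipe_mnesia:
        steps.append(["sh", "-c", "rm -rf /var/lib/rabbitmq/mnesia/*"])
        steps.append(["rm", "/var/lib/rabbitmq/.erlang.cookie"])
    return steps

def run_reset(wipe_mnesia=False, dry_run=True):
    """Execute (or just print) the reset sequence under sudo."""
    for cmd in reset_commands(wipe_mnesia):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(["sudo"] + cmd, check=True)
```

Keep dry_run=True until you have confirmed the command list matches what you intend to destroy.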

Verify your RabbitMQ configuration files:

sudo rabbitmqctl environment
sudo cat /etc/rabbitmq/rabbitmq-env.conf

Try starting RabbitMQ with debug logging:

sudo RABBITMQ_LOG_BASE=/var/log/rabbitmq \
RABBITMQ_LOGS=/var/log/rabbitmq/rabbit.log \
RABBITMQ_SASL_LOGS=/var/log/rabbitmq/rabbit-sasl.log \
rabbitmq-server -detached

After fixing RabbitMQ, configure Celery with proper reconnection logic in your Django settings:

BROKER_URL = 'amqp://guest:guest@localhost:5672//'
BROKER_CONNECTION_TIMEOUT = 30
BROKER_CONNECTION_RETRY = True
BROKER_CONNECTION_MAX_RETRIES = 100
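
Under the hood, Celery (via kombu) ramps the delay between reconnection attempts linearly from interval_start by interval_step up to a cap of interval_max. A rough model of that schedule, not Celery's actual code:

```python
def retry_intervals(max_retries, interval_start=0.0, interval_step=2.0,
                    interval_max=30.0):
    """Delay before each reconnection attempt: a linear ramp with a cap,
    mirroring the interval_start/step/max knobs that kombu exposes."""
    return [min(interval_max, interval_start + n * interval_step)
            for n in range(max_retries)]
```

With BROKER_CONNECTION_MAX_RETRIES = 100 and the defaults modeled above, the worker spends roughly 50 attempts ramping up before every further retry waits the full 30 seconds.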

Verify successful operation with:

sudo rabbitmqctl list_queues
sudo rabbitmqctl list_connections
celery -A proj inspect ping
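
The text output of list_queues can also be post-processed in scripts. A small sketch, assuming the default tab-separated name/message-count format (which can vary across RabbitMQ versions):

```python
def parse_list_queues(output):
    """Parse `rabbitmqctl list_queues` text output into
    {queue_name: message_count}, skipping the informational
    header/footer lines rabbitmqctl prints."""
    queues = {}
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[1].isdigit():
            queues[parts[0]] = int(parts[1])
    return queues
```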


The root cause typically stems from corrupted or incompatible Mnesia database files from the previous instance. Key indicators include:

  • Timeout errors when waiting for tables (rabbit_user, rabbit_vhost, etc.)
  • EPMD daemon running but RabbitMQ failing to start
  • Nodedown status when checking with rabbitmqctl
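
These signatures can be checked for programmatically; a hypothetical helper that scans a log excerpt for them:

```python
# Failure signatures from the indicator list above, mapped to a short diagnosis.
INDICATORS = {
    "timeout_waiting_for_tables": "Mnesia table timeout (likely stale DB from the old instance)",
    "nodedown": "node is down or unreachable by rabbitmqctl",
}

def find_indicators(log_text):
    """Return the subset of known signatures present in the given log text."""
    return {sig: desc for sig, desc in INDICATORS.items() if sig in log_text}
```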

First, completely stop all RabbitMQ/Erlang processes:

sudo service rabbitmq-server stop
sudo pkill -f epmd
sudo pkill -f beam.smp

Then reset the database (warning: this will delete all queues/messages):

sudo rm -rf /var/lib/rabbitmq/mnesia/*
sudo rm /var/lib/rabbitmq/.erlang.cookie

Add these critical settings to /etc/rabbitmq/rabbitmq-env.conf:

NODENAME=rabbit@localhost
RABBITMQ_NODE_IP_ADDRESS=127.0.0.1
RABBITMQ_LOG_BASE=/var/log/rabbitmq
RABBITMQ_MNESIA_BASE=/var/lib/rabbitmq/mnesia
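
When provisioning new instances, it helps to generate this file rather than hand-edit it. An illustrative template helper (render_env_conf is not a RabbitMQ tool, just a sketch of what a provisioning script might do):

```python
# The settings from the section above, as a dict.
REQUIRED = {
    "NODENAME": "rabbit@localhost",
    "RABBITMQ_NODE_IP_ADDRESS": "127.0.0.1",
    "RABBITMQ_LOG_BASE": "/var/log/rabbitmq",
    "RABBITMQ_MNESIA_BASE": "/var/lib/rabbitmq/mnesia",
}

def render_env_conf(settings=REQUIRED):
    """Render key=value lines suitable for /etc/rabbitmq/rabbitmq-env.conf."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"
```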

After successful restart, verify with:

sudo rabbitmqctl status
sudo rabbitmq-plugins list

For Celery, ensure your settings.py contains:

BROKER_URL = 'amqp://guest:guest@localhost:5672//'
CELERY_RESULT_BACKEND = 'amqp'

Consider implementing these practices:

  • Regular database backups using rabbitmqadmin export
  • Configuration management (Chef/Puppet/Ansible)
  • Cluster setup for high availability

If issues persist, gather detailed diagnostics:

sudo tail -n 100 /var/log/rabbitmq/rabbit*.log
sudo rabbitmqctl -n rabbit@localhost environment
sudo rabbitmqctl -n rabbit@localhost report