Troubleshooting DRBD Sync Issues: EC2 Nodes Stuck in StandAlone State with Bind Errors


When examining the DRBD status on both nodes, we observe both instances are in StandAlone state despite configuration attempts. The primary node shows:

m:res  cs          ro               ds                 p       mounted  fstype
0:r0   StandAlone  Primary/Unknown  UpToDate/DUnknown  r----s  ext3

While the secondary reports:

m:res  cs          ro                 ds                     p       mounted  fstype
0:r0   StandAlone  Secondary/Unknown  Inconsistent/DUnknown  r----s

The kernel messages reveal a fundamental network connectivity issue:

[2285173.099330] block drbd0: bind before connect failed, err = -99
[2285173.099346] block drbd0: conn( WFConnection -> Disconnecting )

The error code -99 (EADDRNOTAVAIL) indicates the network stack cannot bind to the specified IP addresses.
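
If you want to confirm locally what -99 maps to, the kernel's errno header spells it out (a quick check; assumes the kernel headers are installed at the usual path):

# EADDRNOTAVAIL is errno 99 in the asm-generic errno tables
grep -w EADDRNOTAVAIL /usr/include/asm-generic/errno.h
# #define EADDRNOTAVAIL 99 /* Cannot assign requested address */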

The root cause is how EC2 implements Elastic IPs: the public address is mapped to the instance through NAT and is never assigned to the instance's network interface, so DRBD's attempt to bind to it fails with EADDRNOTAVAIL. The DRBD configuration specifies the public IPs:

on drbd01 {
    address 23.XX.XX.XX:7788;
}
on drbd02 {
    address 184.XX.XX.XX:7788;
}

The actual interfaces show private IP assignments:

# Primary node
eth0: inet addr:10.28.39.17

# Secondary node
eth0: inet addr:10.160.27.107
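
A quick way to see this mismatch from inside an instance is to compare the address actually present on eth0 with what the EC2 instance metadata service reports (a sketch; assumes IMDSv1-style access is allowed, otherwise a session token is needed first):

# Address the kernel can actually bind to
ifconfig eth0 | grep 'inet addr'

# What EC2 knows about this instance: the public IP exists only as a NAT mapping
curl -s http://169.254.169.254/latest/meta-data/local-ipv4; echo
curl -s http://169.254.169.254/latest/meta-data/public-ipv4; echo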

For DRBD to work on EC2, the resource must be configured with the private IPs, and the security groups must allow replication traffic on port 7788 between the nodes:

resource r0 {
    protocol C;
    startup {
        wfc-timeout  15;
        degr-wfc-timeout 60;
    }
    net {
        cram-hmac-alg sha1;
        shared-secret "test123";
    }
    on drbd01 {
        device /dev/drbd0;
        disk /dev/xvdm;
        address 10.28.39.17:7788;   # private IP of drbd01
        meta-disk internal;
    }
    on drbd02 {
        device /dev/drbd0;
        disk /dev/xvdm;
        address 10.160.27.107:7788; # private IP of drbd02
        meta-disk internal;
    }
}
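
Before reloading anything, it is worth confirming on both nodes that the address DRBD will try to bind to is actually present on the host (a minimal sanity check):

# Addresses as parsed from the configuration
drbdadm dump r0 | grep address

# The node's own configured address must appear among the host's interfaces
ifconfig -a | grep 'inet addr'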

After correcting the IP addresses:

  1. Reload the configuration on both nodes:
    drbdadm adjust r0
  2. On the primary node only, start the initial synchronization (this overwrites the peer's data):
    drbdadm -- --overwrite-data-of-peer primary r0
  3. Verify the connection state:
    cat /proc/drbd

Ensure the security groups on both nodes allow inbound TCP traffic on port 7788 from the other node. A sample inbound rule:

Type: Custom TCP Rule
Protocol: TCP
Port Range: 7788
Source: [other node's security group ID]
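
If you manage the groups with the AWS CLI, the same rule pair can be added as follows (a sketch; sg-aaaaaaaa and sg-bbbbbbbb are placeholders for the two nodes' security group IDs):

# Allow node B's security group to reach node A on TCP 7788
aws ec2 authorize-security-group-ingress \
    --group-id sg-aaaaaaaa --protocol tcp --port 7788 --source-group sg-bbbbbbbb

# And the reverse direction, so both nodes accept the DRBD connection
aws ec2 authorize-security-group-ingress \
    --group-id sg-bbbbbbbb --protocol tcp --port 7788 --source-group sg-aaaaaaaa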

Once connected, monitor sync progress with:

watch -n1 cat /proc/drbd

Expect the connection state to move from WFConnection through SyncSource/SyncTarget to Connected, with the oos (out-of-sync) count decreasing toward zero.
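
If you only care about how much data remains to synchronize, a small loop over /proc/drbd is enough (a convenience sketch):

# Print the remaining out-of-sync amount (KiB) every 5 seconds until it reaches zero
while grep -q 'oos:[1-9]' /proc/drbd; do
    grep -o 'oos:[0-9]*' /proc/drbd
    sleep 5
done
echo "resync complete"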

Implement these best practices:

  • Use EC2's internal DNS names for dynamic IP environments
  • Configure monitoring for the DRBD connection state
  • Set up alerts for split-brain conditions (a minimal check covering both is sketched after this list)
  • Regularly test failover procedures
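
The monitoring and split-brain bullets can start as something very small, for example a cron-driven shell check like the sketch below (the kern.log path is an assumption; adjust it for your distribution):

#!/bin/bash
# Exit non-zero if the resource is not Connected or a split-brain was logged,
# so this can be wired into cron or a monitoring agent.
# Note: during a resync the state is SyncSource/SyncTarget, which this simple
# check also flags.

if ! grep -q 'cs:Connected' /proc/drbd; then
    echo "CRITICAL: DRBD resource is not Connected"
    exit 2
fi

if grep -qi 'split-brain detected' /var/log/kern.log 2>/dev/null; then
    echo "CRITICAL: DRBD split-brain logged in kern.log"
    exit 2
fi

echo "OK: DRBD connected, no split-brain logged"
exit 0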

For a closer look, /proc/drbd on both nodes confirms the StandAlone state; the oos counter (reported in KiB) shows roughly 250 GiB out of sync:

# Primary node
0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown
ns:0 nr:0 dw:4 dr:1073 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:262135964

# Secondary node 
0: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:262135964

The critical error appears when running drbdadm dump all:

# On primary
/etc/drbd.conf:19: in resource r0, on drbd01:
    IP 23.XX.XX.XX not found on this host.

# On secondary  
/etc/drbd.conf:25: in resource r0, on drbd02:
    IP 184.XX.XX.XX not found on this host.

These dump errors confirm what ifconfig already showed: the interfaces carry only the private 10.x.x.x addresses, while drbd.conf references the Elastic IPs. Update both address lines to the private IPs, exactly as in the corrected configuration shown earlier.

After updating the configuration, the recovery sequence boils down to:

# On both nodes (use "drbdadm up r0" instead if the resource was taken down):
drbdadm adjust r0

# On the secondary node (make sure it holds the Secondary role):
drbdadm secondary r0

# On the primary node (forces the initial sync; on 8.4 and later this is the
# equivalent of the --overwrite-data-of-peer form used earlier):
drbdadm primary --force r0
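
Once the nodes reconnect, the per-resource drbdadm queries give a quick confirmation without parsing /proc/drbd:

drbdadm role r0      # expect Primary/Secondary on the primary node
drbdadm cstate r0    # expect Connected, or SyncSource/SyncTarget during the resync
drbdadm dstate r0    # expect UpToDate/UpToDate once the resync finishes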

If issues persist, check:

# Network connectivity
nc -zv 10.160.27.107 7788

# Firewall rules
iptables -L -n | grep 7788

# DRBD connection status
cat /proc/drbd
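
If nc connects but DRBD still refuses to establish the connection, a short packet capture shows whether the replication traffic actually reaches the interface (a sketch; assumes the interface is eth0):

# Watch DRBD traffic on the replication port
tcpdump -ni eth0 tcp port 7788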