Resolving DRBD Connection Issues: Fixing WFConnection and StandAlone States in a 2-Node Cluster


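When one DRBD node reports WFConnection (waiting for connection) and its peer reports StandAlone, the nodes are not communicating. The WFConnection node is waiting for its peer to become visible on the network, while the StandAlone node has no active network configuration: it was never connected, was disconnected administratively, or dropped the connection after a failed authentication or a split brain. The non-zero oos (out-of-sync) counters on both nodes below are consistent with a split brain. The key indicators in /proc/drbd: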

// Node 1 (Primary)
1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
   ns:0 nr:0 dw:0 dr:912 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:20

// Node 2 (Secondary)  
1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----
   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:48
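
When comparing nodes, it can help to pull individual fields out of these status lines rather than reading them by eye. The following is a minimal sketch, assuming the /proc/drbd layout shown above; drbd_field is a hypothetical helper name and is not part of drbd-utils.

```shell
#!/bin/sh
# Print the value of one "key:value" field (for example cs, ro, ds or oos)
# from a /proc/drbd status line read on standard input.
# drbd_field is a hypothetical helper name, not a DRBD command.
drbd_field() {
    key=$1
    # Put each space-separated token on its own line, then keep the first
    # token that starts with "<key>:" and strip that prefix.
    tr ' ' '\n' | sed -n "s/^${key}://p" | head -n 1
}

# Example using the Node 2 line shown above:
line=' 1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----'
echo "$line" | drbd_field cs   # prints: StandAlone
echo "$line" | drbd_field ds   # prints: UpToDate/DUnknown
```

On a live node the same helper could be fed with grep '^ *1:' /proc/drbd, where 1 is the minor number of the resource.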

Before diving deep, verify basic network connectivity:

# On both nodes:
ping <peer-ip>
nc -zv <peer-ip> <drbd-port>
# DRBD's sockets belong to the kernel, so match on the port, not a process name
ss -tan | grep <drbd-port>
iptables -L -n | grep <drbd-port>

Common pitfalls include:

  • Firewall rules blocking the DRBD port (7789 in the example configuration; DRBD setups conventionally use ports from 7788 upward)
  • Network interfaces not being properly initialized
  • Incorrect IP addresses in /etc/drbd.d/r1.res

1. Reset DRBD States

First, completely tear down the resource on both nodes:

drbdadm down r1
rmmod drbd
modprobe drbd
drbdadm up r1

2. Force Primary Designation

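On the node whose data you want to keep, and only if it does not already show ro:Primary. The --force option is needed only when the local disk is not UpToDate; it makes this node's data authoritative:

drbdadm primary --force r1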

3. Establish Connection

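On the secondary node, reconnect and watch the connection state. A StandAlone node does not retry on its own. If the kernel log reports a split brain, a plain connect will be dropped again, and the secondary has to reconnect with drbdadm connect --discard-my-data r1 instead, which throws away its unreplicated changes: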

drbdadm connect r1
watch -n1 cat /proc/drbd

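If it still fails, follow the kernel log, where DRBD reports handshake, authentication, and split-brain events (the log file location varies by distribution):

tail -f /var/log/messages | grep -i drbd
# or, on systems without /var/log/messages:
dmesg --follow | grep -i drbd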

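Common log patterns to watch for (exact wording varies between DRBD versions):

# Protocol handshake failed:
"Handshake unsuccessful"

# Authentication failures (mismatched shared-secret or cram-hmac-alg):
"Packet authentication failed"

# Connection refused by the peer (errno 111 is ECONNREFUSED):
"sendmsg failed with 111"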

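To see at a glance which of these failure classes dominates a long log, the lines can be bucketed by pattern. This is a rough sketch, not an official tool: classify_drbd_log is a hypothetical helper, the first three patterns are the strings listed above, and the split-brain pattern is an added assumption about what your DRBD version logs.

```shell
#!/bin/sh
# Classify DRBD-related log lines read on standard input, one label per line.
# classify_drbd_log is a hypothetical helper name.
classify_drbd_log() {
    while IFS= read -r line; do
        case $line in
            # Assumption: split-brain detection is logged with this phrase.
            *"Split-Brain detected"*)         echo "split-brain" ;;
            *"Handshake unsuccessful"*)       echo "handshake" ;;
            *"Packet authentication failed"*) echo "authentication" ;;
            *"sendmsg failed with 111"*)      echo "connection-refused" ;;
            *)                                echo "other" ;;
        esac
    done
}

# Typical use:
# grep -i drbd /var/log/messages | classify_drbd_log | sort | uniq -c
```

A large "other" bucket is expected, since most DRBD log lines are ordinary state changes rather than errors.
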
Double-check your /etc/drbd.d/r1.res:

resource r1 {
  protocol C;
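  startup {
    # "both" is only valid for dual-primary setups (allow-two-primaries);
    # in a Primary/Secondary pair, name one node or omit this option.
    become-primary-on node1;
  }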
  net {
    cram-hmac-alg "sha1";
    shared-secret "your-secret";
    after-sb-0pri discard-zero-changes;
  }
  on node1 {
    device /dev/drbd1;
    disk /dev/sdb1;
    address 192.168.1.10:7789;
    meta-disk internal;
  }
  on node2 {
    device /dev/drbd1;
    disk /dev/sdb1;
    address 192.168.1.11:7789;
    meta-disk internal;
  }
}
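
One way to double-check the addresses is to probe exactly the endpoints the file declares, instead of retyping them. The sketch below assumes the plain IPv4 "address A.B.C.D:PORT;" form used above; res_addresses is a hypothetical helper name.

```shell
#!/bin/sh
# Print one "host port" pair per address line found in a DRBD resource file.
# res_addresses is a hypothetical helper; it only understands the plain
# IPv4 "address A.B.C.D:PORT;" form and ignores everything else.
res_addresses() {
    sed -n 's/^[[:space:]]*address[[:space:]]\{1,\}\([0-9.]\{1,\}\):\([0-9]\{1,\}\);.*$/\1 \2/p' "$1"
}

# Probe each declared endpoint (run on either node):
# res_addresses /etc/drbd.d/r1.res | while read -r host port; do
#     nc -zv "$host" "$port"
# done
```

For a syntax check of the whole file, drbdadm dump r1 prints the configuration as DRBD parsed it and complains about errors.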

For persistent issues, consider packet capture:

tcpdump -i eth0 port 7789 -w drbd.pcap
# Analyze with Wireshark or:
tcpdump -r drbd.pcap -n

Key things to verify in packet captures:

  • Are DRBD handshake packets being exchanged?
  • Is there any TCP retransmission?
  • Are packets being dropped at either end?
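
The first of these questions can be answered roughly from the text output of tcpdump by counting TCP flags. This is a minimal sketch that assumes tcpdump's usual "Flags [S]" notation; syn_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Summarize TCP connection attempts in the text output of
# "tcpdump -r drbd.pcap -n", read on standard input.
# syn_summary is a hypothetical helper name.
syn_summary() {
    awk '
        /Flags \[S\],/   { syn++ }     # SYN from the connecting node
        /Flags \[S\.\],/ { synack++ }  # SYN-ACK from the listening node
        /Flags \[R/      { rst++ }     # RST: nothing listening, or rejected
        END { printf "syn=%d synack=%d rst=%d\n", syn, synack, rst }
    '
}

# Typical use:
# tcpdump -r drbd.pcap -n | syn_summary
```

As a rule of thumb, SYNs with neither SYN-ACK nor RST point to packets being silently dropped (often a firewall), while SYNs answered by RST suggest that nothing is listening on the peer, which is what a StandAlone node looks like from the network.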

The same symptoms can also be worked through with a shorter, alternative sequence. As a reminder, the relevant /proc/drbd fields are:

Primary node: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown
Secondary node: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown

Before attempting fixes, verify these critical points:

# Check network connectivity between nodes
ping <partner_ip>
nc -zv <partner_ip> <drbd_port>

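# Verify DRBD config consistency (the file must be identical on both nodes;
# assumes SSH access to the peer)
ssh <partner_ip> cat /etc/drbd.d/r1.res | diff /etc/drbd.d/r1.res -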

# Check for kernel module issues
lsmod | grep drbd
dmesg | grep -i drbd

1. Force Disconnect and Reconnect

First attempt a clean restart sequence:

# On both nodes:
drbdadm disconnect r1
drbdadm down r1
modprobe -r drbd
modprobe drbd
drbdadm up r1

2. Manual Connection Establishment

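If the nodes fall back to WFConnection/StandAlone after the restart, resolve the split brain manually. The --discard-my-data option belongs on the node whose changes you are willing to lose, never on the node whose data you want to keep:

# On the secondary node (the split-brain victim; its changes are discarded):
drbdadm secondary r1
drbdadm connect --discard-my-data r1

# On the primary node (the survivor; its data is kept):
drbdadm primary r1
drbdadm connect r1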

When standard recovery fails, watch the kernel log while retrying the connection:

# Follow DRBD messages in the kernel log
dmesg --follow | grep -i drbd

# Monitor connection attempts in real-time
watch -n 1 'cat /proc/drbd; drbd-overview'

Ensure your resource configuration contains these critical elements:

resource r1 {
  protocol C;
  startup {
    wfc-timeout 30;
    outdated-wfc-timeout 20;
  }
  net {
    cram-hmac-alg "sha1";
    shared-secret "your-secret";
    after-sb-0pri discard-zero-changes;
  }
  on node1 {
    address 10.0.0.1:7788;
    device /dev/drbd1;
    disk /dev/sda1;
    meta-disk internal;
  }
  on node2 {
    address 10.0.0.2:7788;
    device /dev/drbd1;
    disk /dev/sda1;
    meta-disk internal;
  }
}

Network-level blocks often cause persistent WFConnection states:

# For firewalld (RHEL/CentOS)
firewall-cmd --add-port=7788/tcp --permanent
firewall-cmd --reload

# For iptables (Debian/Ubuntu; saving to rules.v4 assumes the iptables-persistent package)
iptables -A INPUT -p tcp --dport 7788 -j ACCEPT
iptables-save > /etc/iptables/rules.v4

After successful reconnection, verify proper sync status:

drbdadm status r1
cat /proc/drbd

# Expected healthy output (once any resync has finished; during resync, cs shows SyncSource or SyncTarget):
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
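
For scripted checks, the healthy state can be polled instead of watched. The following is a sketch under the assumption that the resource uses minor 1 and that /proc/drbd has the layout shown above; wait_connected is a hypothetical helper name.

```shell
#!/bin/sh
# Poll a /proc/drbd-style file until the given minor reports cs:Connected,
# giving up after a number of one-second tries.
# wait_connected is a hypothetical helper. The optional file argument lets
# it read a saved copy instead of the live /proc/drbd.
wait_connected() {
    minor=$1
    tries=$2
    file=${3:-/proc/drbd}
    while [ "$tries" -gt 0 ]; do
        if grep -q "^ *${minor}: cs:Connected " "$file"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example: wait up to 60 seconds for minor 1
# wait_connected 1 60 && echo "r1 connected"
```

Because cs stays at SyncSource or SyncTarget while a resync is running, a timeout here does not necessarily mean the connection failed; check /proc/drbd before retrying.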