Resolving DRBD Connection Issues: Fixing WFConnection and StandAlone States in a 2-Node Cluster

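When /proc/drbd shows one node in WFConnection (waiting for connection) and the other in StandAlone, the two nodes are not talking to each other. A node in WFConnection is still waiting for its peer, but a node in StandAlone has dropped its network connection and will not reconnect until told to, typically after a manual disconnect or a detected split brain. The pair therefore stays stuck until you intervene on the StandAlone side. The /proc/drbd output in this situation looks like this: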

// Node 1 (Primary)
1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
   ns:0 nr:0 dw:0 dr:912 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:20

// Node 2 (Secondary)  
1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----
   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:48
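
The fields that matter in these lines are cs (connection state), ro (roles, local/peer) and ds (disk states, local/peer). As a minimal sketch for pulling them out in a script, assuming the DRBD 8.x /proc/drbd layout shown above; drbd_field is a hypothetical helper, not part of drbd-utils:

```shell
#!/bin/sh
# drbd_field LINE KEY: print the value of KEY (cs, ro, ds, ...) taken from
# one /proc/drbd status line. Hypothetical helper for illustration only.
drbd_field() {
    # Split the line into one token per row, then strip the "KEY:" prefix.
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n "s/^$2://p" | head -n 1
}

# Sample line copied from the Node 1 output above.
line='1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----'

echo "connection: $(drbd_field "$line" cs)"
echo "roles:      $(drbd_field "$line" ro)"
echo "disks:      $(drbd_field "$line" ds)"
```

On a live node you could feed it the matching line from /proc/drbd instead of the sample string.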

Before diving deep, verify basic network connectivity:

# On both nodes:
ping <peer-ip>
nc -zv <peer-ip> <drbd-port>
ss -tan | grep <drbd-port>   # DRBD sockets belong to the kernel, so match on the port, not a process name
iptables -L -n | grep <drbd-port>

Common pitfalls include:

Firewall rules blocking the DRBD port (7789 in the configuration below; by convention DRBD uses TCP ports from 7788 upward)
Network interfaces not being properly initialized
Incorrect IP addresses in /etc/drbd.d/r1.res
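
To rule out the address and port pitfalls quickly, it can help to list what the resource file actually declares. The following is a rough sketch, assuming the simple "address ip:port;" form used in this article; res_addresses and res_ports are hypothetical helper names:

```shell
#!/bin/sh
# res_addresses FILE: print every "address" value (ip:port) found in a DRBD
# resource file, one per line. Assumes the plain "address ip:port;" form.
res_addresses() {
    sed -n 's/^[[:space:]]*address[[:space:]]\{1,\}\([^;[:space:]]*\);.*$/\1/p' "$1"
}

# res_ports FILE: print the distinct ports used by those addresses.
res_ports() {
    res_addresses "$1" | sed 's/.*://' | sort -u
}

# Only run against the real file if it exists on this machine.
if [ -r /etc/drbd.d/r1.res ]; then
    res_addresses /etc/drbd.d/r1.res
    res_ports /etc/drbd.d/r1.res
fi
```

Compare the printed addresses with the output of ip addr on each node. In a setup like this one you would usually expect res_ports to print a single port.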

1. Reset DRBD States

First, completely tear down the resource on both nodes:

drbdadm down r1
rmmod drbd
modprobe drbd
drbdadm up r1

2. Force Primary Designation

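On the node that should be primary. Forcing is only needed when the local disk is not UpToDate; in the output above the node is already Primary with an UpToDate disk, so this step can usually be skipped:

# DRBD 8.4 syntax; on 8.3 use: drbdadm -- --overwrite-data-of-peer primary r1
drbdadm primary --force r1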

3. Establish Connection

On the secondary node, which is the one sitting in StandAlone, start the connection and watch it come up:

drbdadm connect r1
watch -n1 cat /proc/drbd

If it still fails, follow the kernel log on both nodes while you retry. DRBD reports connection state changes there without any extra debug switch:

dmesg -w | grep -i drbd
tail -f /var/log/messages | grep -i drbd

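Log patterns to watch for (exact wording varies between DRBD versions):

# Protocol handshake failed, for example mismatched DRBD versions:
"Handshake unsuccessful"

# Authentication failures (cram-hmac-alg or shared-secret differs between nodes):
"Packet authentication failed"

# Connection refused (errno 111 is ECONNREFUSED: nothing is listening on the peer's port):
"sendmsg failed with 111"

# Split brain, after which the node drops to StandAlone:
"Split-Brain detected but unresolved, dropping connection!"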

Double-check your /etc/drbd.d/r1.res:

resource r1 {
  protocol C;
  startup {
    become-primary-on node1;  # "both" is only valid together with allow-two-primaries (dual-primary setups)
  }
  net {
    cram-hmac-alg "sha1";
    shared-secret "your-secret";
    after-sb-0pri discard-zero-changes;
  }
  on node1 {
    device /dev/drbd1;
    disk /dev/sdb1;
    address 192.168.1.10:7789;
    meta-disk internal;
  }
  on node2 {
    device /dev/drbd1;
    disk /dev/sdb1;
    address 192.168.1.11:7789;
    meta-disk internal;
  }
}

For persistent issues, consider packet capture:

tcpdump -i eth0 port 7789 -w drbd.pcap
# Analyze with Wireshark or:
tcpdump -r drbd.pcap -n

Key things to verify in packet captures:

Are DRBD handshake packets being exchanged?
Is there any TCP retransmission?
Are packets being dropped at either end?
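
A quick way to answer the first of these questions is to count TCP flags in the capture. The sketch below reads the text output of tcpdump -r on stdin; summarize_flags is a hypothetical helper, and it assumes the "Flags [S]" notation that current tcpdump versions print:

```shell
#!/bin/sh
# summarize_flags: read `tcpdump -r drbd.pcap -n` text on stdin and count
# SYN, SYN-ACK and RST segments. Hypothetical helper for illustration only.
summarize_flags() {
    awk '
        /Flags \[S\],/    { syn++ }      # connection attempt
        /Flags \[S\.\],/  { synack++ }   # peer accepted the attempt
        /Flags \[R\.?\],/ { rst++ }      # peer refused or reset
        END { printf "syn=%d synack=%d rst=%d\n", syn, synack, rst }
    '
}

# Only run against a capture if one exists in the current directory.
if [ -r drbd.pcap ]; then
    tcpdump -r drbd.pcap -n 2>/dev/null | summarize_flags
fi
```

SYNs with no SYN-ACK and no RST suggest packets are being dropped silently, which usually points at a firewall. RSTs in reply to SYNs mean the peer is reachable but nothing is listening on the port, which is what you would expect from a peer in StandAlone.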

Before attempting fixes, verify these critical points:

# Check network connectivity between nodes
ping <partner_ip>
nc -zv <partner_ip> <drbd_port>

# Verify DRBD config consistency
ssh <partner_ip> cat /etc/drbd.d/r1.res | diff - /etc/drbd.d/r1.res

# Check for kernel module issues
lsmod | grep drbd
dmesg | grep -i drbd

1. Force Disconnect and Reconnect

First attempt a clean restart sequence:

# On both nodes:
drbdadm disconnect r1
drbdadm down r1
modprobe -r drbd
modprobe drbd
drbdadm up r1

2. Manual Connection Establishment

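If the nodes still do not connect on their own, DRBD has most likely detected a split brain. Decide which node's changes to throw away (here the secondary) and reconnect manually:

# On the secondary node (its changes since the split are discarded):
drbdadm secondary r1
drbdadm connect --discard-my-data r1

# On the primary node (only needed if it is also StandAlone):
drbdadm connect r1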

When standard recovery fails, keep the kernel log and the resource state in view while you retry:

# Follow DRBD kernel messages as they arrive
dmesg -w | grep -i drbd

# Monitor connection attempts in real-time
watch -n 1 'cat /proc/drbd; drbd-overview'

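If the peers are slow to find each other at boot, add startup timeouts to the resource configuration. The variant below uses different example addresses and port 7788; whichever port you choose, the file must be identical on both nodes and the port must be open in the firewall: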

resource r1 {
  protocol C;
  startup {
    wfc-timeout 30;
    outdated-wfc-timeout 20;
  }
  net {
    cram-hmac-alg "sha1";
    shared-secret "your-secret";
    after-sb-0pri discard-zero-changes;
  }
  on node1 {
    address 10.0.0.1:7788;
    device /dev/drbd1;
    disk /dev/sda1;
    meta-disk internal;
  }
  on node2 {
    address 10.0.0.2:7788;
    device /dev/drbd1;
    disk /dev/sda1;
    meta-disk internal;
  }
}

Network-level blocks often cause persistent WFConnection states:

# For firewalld (RHEL/CentOS)
firewall-cmd --add-port=7788/tcp --permanent
firewall-cmd --reload

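# For iptables (Debian/Ubuntu)
# -I inserts at the top of the chain; a rule appended with -A has no effect if it lands after an existing DROP/REJECT rule
iptables -I INPUT -p tcp --dport 7788 -j ACCEPT
# Persisting to this path assumes the iptables-persistent package is installed
iptables-save > /etc/iptables/rules.v4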

After successful reconnection, verify proper sync status:

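# On DRBD 8.x; "drbdadm status" is only available with DRBD 9
drbdadm cstate r1   # expect: Connected
drbdadm dstate r1   # expect: UpToDate/UpToDate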
cat /proc/drbd

# Expected healthy output:
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
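
If you want to script this check, for example at the end of a recovery runbook, something like the following could work. It is a sketch under the assumption that /proc/drbd uses the DRBD 8.x format shown above; is_healthy and wait_healthy are hypothetical helper names:

```shell
#!/bin/sh
# is_healthy LINE: succeed if a /proc/drbd status line shows a connected
# resource with both disks up to date.
is_healthy() {
    case "$1" in
        *cs:Connected*ds:UpToDate/UpToDate*) return 0 ;;
        *) return 1 ;;
    esac
}

# wait_healthy MINOR TRIES: poll /proc/drbd once per second until the given
# minor is healthy or the attempts run out.
wait_healthy() {
    minor=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        line=$(grep "^ *$minor: cs:" /proc/drbd 2>/dev/null || true)
        is_healthy "$line" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example: wait up to 60 seconds for minor 1 (device /dev/drbd1).
# wait_healthy 1 60 && echo "r1 is connected and in sync"
```

Note that during a resync the connection state is SyncSource or SyncTarget rather than Connected, so this check only succeeds once the resync has finished.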
