Resolving DRBD Split-Brain and Secondary Node Transition Issues with OCFS2 in Linux


When running DRBD 8.3.13 with OCFS2 on CentOS 5, a node can drop to StandAlone after a split brain and then refuse to be demoted to Secondary. The key error message is:

1: State change failed: (-12) Device is held open by someone
Command 'drbdsetup 1 secondary' terminated with exit code 11

First verify the current DRBD status:

# cat /proc/drbd
version: 8.3.13 (api:88/proto:86-96)
1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
   ns:0 nr:0 dw:112281991 dr:797551 al:99 bm:6401 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:60
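
To script this check, the connection state, roles, and disk states can be pulled out of the status line. The following is a minimal sketch; the function name drbd_state is illustrative, and it assumes the 8.3-style /proc/drbd layout shown above.

```shell
# drbd_state MINOR [FILE]
# Print "cs ro ds" for one DRBD minor, read from FILE (default: /proc/drbd).
drbd_state() {
    minor=$1
    file=${2:-/proc/drbd}
    awk -v m="$minor" '
        # Status lines look like " 1: cs:StandAlone ro:Primary/Unknown ds:..."
        $1 == m ":" {
            for (i = 2; i <= NF; i++) {
                split($i, kv, ":")
                if (kv[1] == "cs" || kv[1] == "ro" || kv[1] == "ds")
                    val[kv[1]] = kv[2]
            }
            print val["cs"], val["ro"], val["ds"]
            exit
        }
    ' "$file"
}
```

Calling drbd_state 1 on the node above would print the three values on one line, which makes it easy to test for StandAlone in a monitoring script.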

Then check for processes holding DRBD resources:

# lsof | grep drbd
# ps aux | grep drbd

OCFS2 keeps the block device open for as long as a volume is mounted, and the O2CB heartbeat can keep it open while its heartbeat region is active. Verify its status:

# service ocfs2 status
# mount | grep ocfs2
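
The mount check can be narrowed to a single device by reading /proc/mounts directly. This is a small sketch; ocfs2_mounts_on is a hypothetical helper name, and the second argument exists only so the function can be tried against a saved copy of the mount table.

```shell
# ocfs2_mounts_on DEVICE [MOUNTS_FILE]
# Print the mount point of every OCFS2 mount backed by DEVICE.
ocfs2_mounts_on() {
    dev=$1
    mounts=${2:-/proc/mounts}
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v d="$dev" '$1 == d && $3 == "ocfs2" { print $2 }' "$mounts"
}
```

For example, ocfs2_mounts_on /dev/drbd1 prints nothing when no OCFS2 volume on that device is mounted.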

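Even if OCFS2 appears unmounted, check for lingering heartbeat threads. The o2hb threads are kernel threads, so they have no executable to inspect; query the O2CB service instead:

# ps aux | grep o2hb
# service o2cb status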

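Square brackets in ps output mark a kernel thread, not a zombie (a zombie shows <defunct> and state Z):

root      7782     1  0 Apr22 ?        00:00:20 [drbd1_worker]

A kernel thread such as drbd1_worker cannot be killed with signals. To see what it is doing, dump the stacks of all tasks and look for it in the kernel log: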

# echo t > /proc/sysrq-trigger
# dmesg | grep -A20 "drbd1_worker"
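
To tell a kernel thread from a zombie without relying on how ps formats its output, you can look at the task's /proc entry. The sketch below is a heuristic, not a definitive test: classify_task is a hypothetical name, and it assumes that an empty cmdline on a non-zombie task means a kernel thread.

```shell
# classify_task PROC_DIR
# PROC_DIR is a /proc/<pid> directory. Prints one of:
#   zombie, kernel-thread, user-process
classify_task() {
    dir=$1
    # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain spaces
    # or parentheses, so cut everything up to the last ") ".
    stat=$(cat "$dir/stat") || return 1
    rest=${stat##*') '}
    state=${rest%% *}
    if [ "$state" = "Z" ]; then
        echo zombie
    # Kernel threads have no user-space command line. /proc files report
    # size 0, so read the content instead of testing the file size.
    elif [ -z "$(tr -d '\0' < "$dir/cmdline")" ]; then
        echo kernel-thread
    else
        echo user-process
    fi
}
```

For the PID shown above, the call would be classify_task /proc/7782.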

Check whether LVM or device-mapper is stacked on top of the DRBD device:

# vgdisplay -v
# lvdisplay -m
# dmsetup ls --tree -o inverted

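1. Stop the OCFS2 and O2CB services. The o2hb threads are kernel threads, so killall -9 has no effect on them; they go away when the heartbeat is stopped:

# service ocfs2 stop
# service o2cb stop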

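2. Demote the resource, which should succeed once nothing holds the device open:

# drbdadm secondary r0

3. On the node whose changes will be thrown away (the split-brain victim), reconnect and discard the local modifications. This does not clean up metadata; it replaces this node's divergent data with the peer's:

# drbdadm -- --discard-my-data connect r0

If the surviving node is also StandAlone, reconnect it there, then verify that the connection state has left StandAlone:

# drbdadm connect r0
# cat /proc/drbd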
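
The manual recovery sequence can be wrapped in a small script so that the commands are reviewed before anything is discarded. This is a sketch under stated assumptions: the function names and the DRY_RUN variable are illustrative, dry-run is the default, and the drbdadm syntax is the 8.3 form used in this article.

```shell
# run_cmd: print the command; execute it only when DRY_RUN=0.
run_cmd() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# sb_recover_victim RESOURCE
# Run on the node whose changes are to be thrown away.
sb_recover_victim() {
    res=$1
    run_cmd drbdadm secondary "$res" || return 1
    # DRBD 8.3 syntax; discards this node's changes since the split.
    run_cmd drbdadm -- --discard-my-data connect "$res" || return 1
}

# sb_recover_survivor RESOURCE
# Run on the node whose data is kept, if it is also StandAlone.
sb_recover_survivor() {
    run_cmd drbdadm connect "$1"
}
```

Running sb_recover_victim r0 only prints the two commands; setting DRY_RUN=0 in the environment executes them. Stopping at the first failure matters here, because reconnecting with --discard-my-data is pointless if the demotion was refused.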

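To have DRBD resolve future split brains automatically, add recovery policies to the net section of your DRBD configuration. Each policy decides whose data is discarded depending on how many nodes were Primary when the split was detected, so review them against your failover design: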

net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
}
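
A quick way to confirm that all three policies are actually set is a plain-text check of the configuration file. This is only a sketch: missing_after_sb is a hypothetical helper, it does not parse the DRBD configuration syntax, and the file path you pass to it depends on where your resource is defined.

```shell
# missing_after_sb FILE
# Print each after-sb-* keyword that FILE does not set; print nothing if all
# three are present. Comment lines are ignored.
missing_after_sb() {
    for key in after-sb-0pri after-sb-1pri after-sb-2pri; do
        grep -v '^[[:space:]]*#' "$1" |
            grep -Eq "(^|[[:space:]])$key[[:space:]]+[^;[:space:]]+;" ||
            echo "$key"
    done
}
```

An empty result means every policy has a value; any keyword printed still needs to be added.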

The following walk-through applies these checks to one affected node, which was stuck in this state:

1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
   ns:0 nr:0 dw:112281991 dr:797551 al:99 bm:6401 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:60

Every attempt to demote the node fails:

# drbdadm secondary r0
1: State change failed: (-12) Device is held open by someone
Command 'drbdsetup 1 secondary' terminated with exit code 11

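Process examination gives the first clue. lsof lists the DRBD worker thread, but only its working directory, not an open handle on the DRBD device: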

# lsof | grep drbd
drbd1_wor  7782      root  cwd       DIR              253,0     4096          2 /

The square brackets in the ps output mark it as a kernel thread, not a zombie:

root      7782     1  0 Apr22 ?        00:00:20 [drbd1_worker]

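The storage topology shows LVM only on 202:2, a Xen virtual disk partition. Nothing is stacked on a DRBD device (block major 147), so LVM is not what holds drbd1 open: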

# dmsetup ls --tree -o inverted
 (202:2)
 ├─VolGroup00-LogVol01 (253:1)
 └─VolGroup00-LogVol00 (253:0)
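
The same check can be automated by scanning the inverted tree for root entries with the DRBD major number. A minimal sketch follows; dm_roots_on_drbd is a hypothetical name, and it assumes the output layout shown above, where each underlying device appears as a root line.

```shell
# dm_roots_on_drbd
# Read `dmsetup ls --tree -o inverted` output on stdin and print the root
# entries whose major number is 147 (DRBD). Exit status 0 if any were found.
dm_roots_on_drbd() {
    # Root lines start at the left margin: optional name, then "(major:minor)".
    # Child lines start with tree-drawing characters and are skipped.
    grep -E '^[[:space:]]*[[:alnum:]_.-]*[[:space:]]*\(147:[0-9]+\)'
}
```

Typical use is dmsetup ls --tree -o inverted | dm_roots_on_drbd. No output and a non-zero exit status mean that no device-mapper volume sits on DRBD.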

The SysRq task dump shows the worker thread in state S (interruptible sleep), which means it is idle rather than blocked on I/O:

kernel: drbd1_worker  S ffff81007ae21820     0  7782      1          7795  7038 (L-TLB)
kernel:  ffff810055d89e00 0000000000000046 000573a8befba2d6 ffffffff8008e82f 
kernel:  [] :drbd:.text.lock.drbd_worker+0x2d/0x43
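
When the task dump is long, the state letter of one task can be extracted with a short filter. This is a sketch that assumes the header layout shown above, with the state letter directly after the task name; task_state is an illustrative name.

```shell
# task_state NAME
# Read SysRq-t output (dmesg or syslog) on stdin and print the state letter
# that follows the task name on its header line.
task_state() {
    awk -v name="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == name && length($(i + 1)) == 1) {
                    print $(i + 1)
                    exit
                }
        }
    '
}
```

For example, dmesg | task_state drbd1_worker. A result of D (uninterruptible sleep) would point to a thread stuck in I/O, while S means it is simply waiting for work.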

Here's how to resolve this without rebooting:

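  1. First, ensure OCFS2 is completely unmounted:
    umount -f /data
  2. Stop the O2CB heartbeat, which can keep the device open after the unmount. Sending kill -9 to a kernel thread such as PID 7782 has no effect:
    service o2cb stop
  3. Confirm that nothing is stacked on the device. DRBD devices have no device/delete node in sysfs (that interface belongs to SCSI devices), but on kernels that provide it the holders directory lists device-mapper or md users:
    ls /sys/block/drbd1/holders/
  4. Finally, switch to secondary:
    drbdadm secondary r0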

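If the above fails, the device is still open somewhere. DRBD 8.3 has no option to force a demotion (the --force option of drbdsetup applies to primary, not secondary), so keep looking for the opener:

# fuser -vm /dev/drbd1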

To reduce the chance of a repeat, place the automatic split-brain policies inside the resource definition:

resource r0 {
    net {
        # Add these parameters
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
}