When implementing STONITH (Shoot The Other Node In The Head) in a 2-node Pacemaker/Corosync cluster, split-brain scenarios become particularly problematic. The fundamental issue arises when quorum is lost due to network partitioning between the two nodes - both nodes might attempt to fence each other simultaneously.
For a robust 2-node PostgreSQL HA setup with DRBD, we need several critical configurations:
# Basic cluster properties
crm configure property stonith-enabled=true
crm configure property stonith-action=poweroff
crm configure property no-quorum-policy=ignore
crm configure rsc_defaults resource-stickiness=100
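As an aside, on Corosync 2.x the usual complement (or alternative) to no-quorum-policy=ignore is votequorum's two-node mode, which lets a single surviving node keep quorum and enables wait_for_all so a freshly booted node will not fence its peer before having seen it at least once. A minimal corosync.conf sketch, assuming corosync 2.x with votequorum:
# /etc/corosync/corosync.conf
quorum {
    provider: corosync_votequorum
    # two_node implies wait_for_all unless explicitly overridden
    two_node: 1
}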
While SSH fencing might work for testing, production environments should use more reliable fencing devices. Here are better alternatives:
# Example using IPMI (recommended for production)
# fence_ipmilan drives a single BMC, so define one device per node
# (ipaddr should be each node's IPMI/BMC address; the original example addresses are kept here)
crm configure primitive stonith-ipmi-node1 stonith:fence_ipmilan \
params pcmk_host_list="node1" ipaddr="10.10.10.251" \
login="admin" passwd="secret" \
op monitor interval="60s"
crm configure primitive stonith-ipmi-node2 stonith:fence_ipmilan \
params pcmk_host_list="node2" ipaddr="10.10.10.252" \
login="admin" passwd="secret" \
op monitor interval="60s"
# For virtualized environments (KVM example)
# ipaddr is the hypervisor running the guests (placeholder shown), not the cluster nodes;
# the guest domain names must match the cluster node names
crm configure primitive stonith-virsh stonith:fence_virsh \
params pcmk_host_list="node1 node2" \
ipaddr="kvm-host.example.com" \
login="root" \
op monitor interval="60s"
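Before handing either agent to the cluster, it is worth checking from the shell that it can actually reach its target; for example, assuming the IPMI credentials above:
# Query the power status of node2's BMC (address/credentials as configured above)
fence_ipmilan -a 10.10.10.252 -l admin -p secret -o status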
Two adjustments help reduce the window for mutual fencing during a network partition:
# 1. Configure sensible fencing timeouts
crm configure property stonith-timeout=30s
# stonith-watchdog-timeout only takes effect when SBD with a hardware watchdog is in use
crm configure property stonith-watchdog-timeout=60s
# 2. Implement priority-based fencing: only allow the fence devices
#    on nodes with a positive node-priority attribute
crm configure location loc-fence-prio-node1 stonith-ipmi-node1 \
rule -inf: not_defined node-priority or node-priority lte 0
crm configure location loc-fence-prio-node2 stonith-ipmi-node2 \
rule -inf: not_defined node-priority or node-priority lte 0
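Be aware that neither timeouts nor location rules by themselves break a fencing race; what does is a static fencing delay on the device protecting the preferred survivor. A sketch, assuming the IPMI devices above and that node1 should win a 50/50 split (pcmk_delay_base needs a reasonably recent Pacemaker; older releases only offer the randomized pcmk_delay_max):
# Anyone fencing node1 must now wait 15s, so node1 gets to shoot node2 first
crm_resource --resource stonith-ipmi-node1 --set-parameter pcmk_delay_base --parameter-value 15s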
For environments where adding a physical third node isn't feasible, consider a quorum device:
# On both cluster nodes:
yum install corosync-qdevice
# On a separate third host (192.168.1.100 in this example):
yum install pcs corosync-qnetd
pcs qdevice setup model net --enable --start
# Back on one cluster node, register the arbitrator
pcs quorum device add model net host=192.168.1.100 algorithm=ffsplit
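Afterwards both nodes should report three expected votes with the quorum device listed as a member:
pcs quorum status
pcs quorum device status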
After implementing these changes, test your fencing configuration thoroughly:
# Test fencing from node1 (should only fence node2)
crm_attribute --type nodes --node node1 --name node-priority --update 100
crm_attribute --type nodes --node node2 --name node-priority --update 1
# Manually induce a fencing scenario to verify
stonith_admin --reboot node2
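If the test succeeds, node2 should reboot and the action should appear in the fencing history:
# Recorded fencing actions against node2 (use '*' for all nodes)
stonith_admin --history node2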
When implementing high availability for PostgreSQL using Pacemaker/Corosync on a two-node cluster, STONITH (Shoot The Other Node In The Head) configuration presents unique challenges. The fundamental issue arises from quorum calculations in split-brain scenarios where communication fails between nodes.
In your setup with dedicated HA interfaces (10.10.10.X) and service interfaces (172.10.10.X), the cluster needs to properly handle network partitions. The current configuration:
eth0 (HA)       eth1 (service)   host
10.10.10.251    172.10.10.1      node1
10.10.10.252    172.10.10.2      node2
Your SSH-based STONITH configuration appears correct at first glance:
crm configure property stonith-enabled=true
crm configure property stonith-action=poweroff
crm configure rsc_defaults resource-stickiness=100
crm configure property no-quorum-policy=ignore
crm configure primitive stonith_postgres stonith:external/ssh \
params hostlist="node1 node2"
crm configure clone fencing_postgres stonith_postgres
However, the critical issue emerges when the eth0 connection drops - both nodes attempt to fence each other due to quorum loss.
Option 1: Delay-Based Fencing
Break the fencing race with a static delay so that one node is always the preferred survivor:
crm configure property stonith-timeout=60s
# stonith-watchdog-timeout only takes effect when SBD with a hardware watchdog is in use
crm configure property stonith-watchdog-timeout=30s
# Delaying the device that fences node1 makes node1 the survivor of a 50/50 split:
# node2 must wait out the delay, while node1 can fence node2 immediately
crm configure primitive st-node1 stonith:external/ssh \
params hostlist="node1" pcmk_delay_base="15s" \
op monitor interval="60s" timeout="30s"
crm configure primitive st-node2 stonith:external/ssh \
params hostlist="node2" \
op monitor interval="60s" timeout="30s"
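Whichever devices you use, it is also common practice to keep each fence device off the node it targets, so a node is never responsible for carrying out its own execution; a sketch with the resources above:
# st-node1 fences node1, so never run it on node1 (and vice versa)
crm configure location l-st-node1-avoid st-node1 -inf: node1
crm configure location l-st-node2-avoid st-node2 -inf: node2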
Option 2: Using a Fencing Topology
Map each node to the device that is able to fence it:
crm configure fencing_topology \
node1: st-node1 \
node2: st-node2
crm configure property concurrent-fencing=true
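You can confirm which device the cluster would use against each node:
# Should list st-node2 only
stonith_admin --list node2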
Option 3: Quorum Tiebreaker
While adding a third node is ideal, you can simulate quorum with a qdevice:
# On both cluster nodes:
yum install corosync-qdevice
# On a separate third host acting as the arbitrator:
yum install corosync-qnetd
systemctl start corosync-qnetd
# In corosync.conf on both cluster nodes; note that two_node: 1 must be
# removed, as it cannot be combined with a quorum device
quorum {
    provider: corosync_votequorum
    expected_votes: 3
    device {
        votes: 1
        model: net
        net {
            host: qdevice-host
            algorithm: ffsplit
            # tls: off avoids certificate setup; otherwise run corosync-qdevice-net-certutil
            tls: off
        }
    }
}
# Then restart corosync and start the qdevice daemon on both cluster nodes
systemctl restart corosync
systemctl start corosync-qdevice
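Once corosync is restarted on both nodes, the arbitrator's vote should be visible:
# Membership and vote summary, including the Qdevice
corosync-quorumtool -s
# Connection state of the local qdevice daemon to the qnetd server
corosync-qdevice-tool -s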
After implementing any of these solutions, test the fencing behavior:
# Simulate network partition
ifdown eth0
# Verify cluster status
crm_mon -1
stonith_admin --list-registered
stonith_admin --reboot node1
Remember to test in a controlled environment and ensure you have out-of-band access to both nodes in case fencing fails.