How to Identify the Active InfiniBand Subnet Manager Switch in a Multi-Switch Network



When working with InfiniBand networks containing multiple switches, one critical piece of infrastructure is the subnet manager (SM). The SM initializes the fabric, assigns local identifiers (LIDs), and maintains the network topology. In many environments this function runs on one of the switches (an "embedded subnet manager"), though running OpenSM on dedicated servers is commonly recommended for better reliability.

The most straightforward way to identify the active SM is using the ibstat and sminfo commands from the InfiniBand diagnostic tools:

# Install required tools (Ubuntu/Debian example)
sudo apt-get install infiniband-diags

# Query subnet manager information
ibstat
sminfo

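ibstat reports the SM LID seen by each local port, and sminfo reports the master SM's LID, GUID, priority, and state. To resolve the SM LID to a node name, query the node description at that LID:

# Resolve the SM LID to a node description (replace 3 with the LID reported by sminfo)
smpquery nodedesc 3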

If you have access to any servers running OpenSM, you can check their logs to see if they're active:

# Check the OpenSM log (common locations)
tail -n 50 /var/log/opensm.log
grep -i opensm /var/log/messages
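
To check a log programmatically, a minimal Python sketch might scan for the most recent state transition. It assumes OpenSM records state changes with lines containing "Entering <STATE> state"; that wording, and the sample lines below, are assumptions for illustration and should be checked against your own log.

```python
import re

# Assumption: state changes are logged as "Entering <STATE> state".
STATE_RE = re.compile(r"Entering (\w+) state", re.IGNORECASE)

def last_sm_state(log_lines):
    """Return the last SM state named in the log lines, or None if none is found."""
    state = None
    for line in log_lines:
        match = STATE_RE.search(line)
        if match:
            state = match.group(1).upper()
    return state

# Hypothetical log excerpt, for illustration only
sample_log = [
    "Jan 01 00:00:01 [hypothetical] Entering DISCOVERING state",
    "Jan 01 00:00:02 [hypothetical] Entering STANDBY state",
    "Jan 01 00:05:00 [hypothetical] Entering MASTER state",
]
print(last_sm_state(sample_log))
```

On a real host the same function can be fed an open file, for example last_sm_state(open("/var/log/opensm.log")), since it only iterates over lines.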

For switch-resident SMs, the switch CLI can also report SM status. Command names vary by vendor and firmware release, so confirm the following against your switch's CLI reference:

Mellanox Switches

show ib sm

Intel (formerly QLogic) Switches

ibswitches -l
smgetguid

For larger environments, you might want to script the discovery. Here's a Python example using subprocess:

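import re
import subprocess

# sminfo prints e.g. "sminfo: sm lid 3 sm guid 0x7cfe..., activity count 1234 priority 15 state 3 SMINFO_MASTER"
SMINFO_RE = re.compile(r"sm lid (\d+) sm guid (0x[0-9a-fA-F]+).*priority (\d+) state (\d+)")

def find_active_sm():
    """Return the master SM's LID, GUID, priority, and state, or None on failure."""
    try:
        output = subprocess.check_output(["sminfo"], stderr=subprocess.STDOUT, text=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    match = SMINFO_RE.search(output)
    if match is None:
        return None
    lid, guid, priority, state = match.groups()
    return {"lid": int(lid), "guid": guid, "priority": int(priority), "state": int(state)}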

Once you've identified the current SM, you can begin transitioning to server-based OpenSM. First, install OpenSM on your target servers:

sudo apt-get install opensm
sudo systemctl enable opensm

Then configure the priorities to ensure proper failover:

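# /etc/opensm/opensm.conf (one sm_priority line per host; range 0-15, highest wins)
sm_priority 10   # on the primary server
sm_priority 5    # on the secondary server
# Embedded SMs on switches: set a lower priority (e.g. 0) or disable them in the switch CLI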
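
The effect of these priorities can be illustrated with a simplified sketch of the election rule: the candidate with the highest priority becomes master, and a tie goes to the lowest GUID. The node names and the two server GUIDs below are hypothetical, and the sketch ignores handover timing and other details of a real fabric.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SmCandidate:
    name: str       # hypothetical label, for illustration only
    priority: int   # 0-15; higher wins
    guid: int       # port GUID; lower wins when priorities are equal

def elect_master(candidates):
    """Pick the master: highest priority first, then lowest GUID."""
    if not candidates:
        raise ValueError("no SM candidates")
    return min(candidates, key=lambda c: (-c.priority, c.guid))

candidates = [
    SmCandidate("switch-embedded", 0, 0x7CFE900300A05060),
    SmCandidate("server-a", 10, 0x0002C90300001111),
    SmCandidate("server-b", 5, 0x0002C90300002222),
]
print(elect_master(candidates).name)
```

With these values server-a wins; if server-a goes away, server-b outranks the embedded SM and takes over.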

When managing an InfiniBand network with multiple switches, one critical operational requirement is identifying which switch currently hosts the active subnet manager (SM). This becomes particularly important when:

  • Planning to migrate from switch-based SM to host-based OpenSM
  • Troubleshooting network partitioning issues
  • Performing maintenance on the SM host switch

The most straightforward approach uses the sminfo tool from the infiniband-diags package (ibnetdiscover output does not include SM state):

sminfo
# Sample output (values illustrative):
# sminfo: sm lid 3 sm guid 0x7cfe900300a05060, activity count 1234 priority 15 state 3 SMINFO_MASTER

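This reveals the port GUID of the node running the master SM. Then cross-reference with:

ibswitches
# Output lists each switch's GUID, port count, node description, and LID, for example:
# Switch  : 0x7cfe900300a05060 ports 36 "<node description>" enhanced port 0 lid 3 lmc 0
# If the GUID is not listed, the SM is running on a host; check ibhosts instead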
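
The cross-reference can also be sketched in Python. The snippet below assumes ibswitches prints one line per switch in the form `Switch : <guid> ports <n> "<description>" ... lid <lid> lmc <lmc>`; the sample lines and switch names are hypothetical. GUIDs are compared as integers because tools differ in whether they print leading zeros.

```python
import re

# Assumed ibswitches line shape (verify against your own output)
SWITCH_RE = re.compile(
    r'^Switch\s*:\s*(0x[0-9a-fA-F]+)\s+ports\s+(\d+)\s+"([^"]*)".*\blid\s+(\d+)'
)

def find_switch(ibswitches_output, sm_guid):
    """Return details of the switch whose GUID equals sm_guid, or None."""
    target = int(sm_guid, 16)
    for line in ibswitches_output.splitlines():
        match = SWITCH_RE.match(line.strip())
        if match and int(match.group(1), 16) == target:
            return {
                "guid": match.group(1),
                "ports": int(match.group(2)),
                "name": match.group(3),
                "lid": int(match.group(4)),
            }
    return None

# Hypothetical sample output, for illustration only
sample = (
    'Switch : 0x0002c902004a2f50 ports 36 "example-spine" base port 0 lid 7 lmc 0\n'
    'Switch : 0x7cfe900300a05060 ports 36 "example-leaf" enhanced port 0 lid 3 lmc 0\n'
)
print(find_switch(sample, "0x7cfe900300a05060"))
```

A None result suggests the SM GUID belongs to a host channel adapter rather than a switch.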

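If OpenSM is already running on a host, check its log rather than launching a second instance (opensm -d takes a numeric debug level, not a file path, and starting another opensm adds a new SM to the fabric):

grep -iE "entering (master|standby) state" /var/log/opensm.log
# The most recent match shows whether this host's OpenSM is currently master or standby
# (log wording may differ between OpenSM versions)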

For automation scenarios, parse the SM state programmatically:

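#!/bin/bash

# Extract the master SM's GUID from sminfo output
# (expected form: "sminfo: sm lid 3 sm guid 0x..., activity count ...")
SM_GUID=$(sminfo | sed -n 's/.*sm guid \(0x[0-9a-fA-F]*\).*/\1/p')

if [ -z "$SM_GUID" ]; then
    echo "Could not determine the SM GUID from sminfo" >&2
    exit 1
fi

# sminfo omits leading zeros while ibswitches pads GUIDs, so match on the hex digits only
SWITCH_INFO=$(ibswitches | grep -i "${SM_GUID#0x}")
echo "Active SM running on switch: ${SWITCH_INFO:-none found (the SM may run on a host; try ibhosts)}"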

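Before transitioning to host-based OpenSM, check the priority of the current master. A host-based SM only takes over if it outranks the current master, so an embedded SM at priority 15 must be lowered or disabled first:

# Check the current master SM's priority
sminfo

# Sample output (values illustrative; priority 15 is the maximum):
# sminfo: sm lid 3 sm guid 0x7cfe900300a05060, activity count 1234 priority 15 state 3 SMINFO_MASTER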

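Partitions (P_Keys) do not affect these queries, because subnet management packets are not subject to partition enforcement. In infiniband-diags the -P flag selects the local CA port and -C the local CA name, which matters when a host is attached to more than one subnet (mlx5_0 is an example device name):

sminfo -C mlx5_0 -P 1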

Remember that every active port learns the SM's LID from the fabric, so you can run these commands from any IB-connected host; it does not need to be directly attached to the switch hosting the SM.