Debugging SSH Command Execution Hangs: Solutions for Nagios check_by_ssh Issues


When executing remote commands via SSH (particularly through Nagios' check_by_ssh module), connections intermittently hang during command execution. The issue manifests with these characteristics:

  • Hangs occur randomly during non-interactive command execution (e.g., 'ls')
  • Authentication succeeds but command execution stalls
  • Affects all clients from the same IP address
  • Works temporarily after killing sshd processes
  • Problem disappears when TTY handling is set explicitly with -t (force PTY allocation) or -T (disable PTY allocation)

# Typical failing command
ssh -l root -p 2222 server.domain.tld 'ls'

# Debug output shows hang point:
debug2: channel 0: request exec confirm 1

The hang appears to stem from how SSH handles pseudo-terminal (PTY) allocation in non-interactive sessions. Likely contributing factors:

  • PTY contention: Server-side PTY resource exhaustion
  • Session cleanup: Zombie processes holding PTY resources
  • Nagios specific: check_by_ssh's default behavior doesn't include terminal flags

Server-Side Configuration

# /etc/ssh/sshd_config modifications:
ClientAliveInterval 60
ClientAliveCountMax 3
MaxSessions 100
MaxStartups 100:30:300

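# Session cleanup cron job (sshd titles non-interactive sessions "sshd: user@notty")
# Caution: this kills every non-interactive root session, including healthy ones
*/5 * * * * /usr/bin/pkill -9 -f 'sshd: root@notty'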
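
Before enabling a cleanup job like the one above, it may be worth checking how many non-interactive sessions actually exist. The following is a minimal sketch; the function name count_notty_sessions is hypothetical, and the pattern assumes the usual "sshd: user@notty" process title (the "sshd-session" variant is an assumption about newer OpenSSH releases).

```shell
#!/bin/bash
# Hypothetical helper: read process command lines on stdin and print how many
# look like non-interactive sshd sessions, e.g. "sshd: root@notty".
# Assumption: some newer OpenSSH releases title these "sshd-session: ..." instead.
count_notty_sessions() {
    # grep -c prints the match count; "|| true" keeps the exit status 0
    # when the count is zero.
    grep -cE '^sshd(-session)?: [^ ]+@notty' || true
}

# Usage (on the server):
#   ps -eo args= | count_notty_sessions
```

A count that keeps growing between checks would support the stale-session theory; a stable count suggests looking elsewhere.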

Client-Side Workarounds

Option 1: Force non-interactive mode (preferred for Nagios)

ssh -o "RequestTTY=no" -p 2222 root@server.domain.tld 'ls'

Option 2: Nagios-specific configuration

# In commands.cfg
define command {
    command_name    check_by_ssh_fixed
    command_line    /usr/lib/nagios/plugins/check_by_ssh \
        -H $HOSTADDRESS$ -C "$ARG1$" \
        -o "RequestTTY=no" -o "StrictHostKeyChecking=no"
}

When standard fixes don't work, examine these areas:

  • Packet capture: tcpdump -i eth0 'port 2222' -w ssh.pcap
  • SSH debug mode: ssh -vvv root@server
  • Server resource limits: Check ulimit -a and /proc/sys/fs/nr_open

To reduce the chance of recurrence:

  1. Implement connection pooling for frequent SSH commands
  2. Use SSH multiplexing with ControlMaster
  3. Schedule regular sshd restarts during maintenance windows
  4. Monitor PTY allocation with ls /dev/pts | wc -l
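
The PTY monitoring step can be turned into a simple Nagios-style check. This is a sketch under stated assumptions: the function names pty_status and count_ptys are hypothetical, and the default thresholds (200 and 400) are arbitrary examples, not recommended values.

```shell
#!/bin/bash
# Hypothetical Nagios-style check: map a PTY count to a status line and the
# conventional plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL).
pty_status() {
    local count=$1 warn=${2:-200} crit=${3:-400}
    if ((count >= crit)); then
        echo "CRITICAL - $count PTYs allocated"
        return 2
    elif ((count >= warn)); then
        echo "WARNING - $count PTYs allocated"
        return 1
    fi
    echo "OK - $count PTYs allocated"
    return 0
}

# Count entries in /dev/pts, excluding the ptmx node if it is listed there.
count_ptys() {
    ls /dev/pts 2>/dev/null | grep -vc '^ptmx$'
}

# Usage:
#   pty_status "$(count_ptys)" 200 400
```

Unlike the plain ls /dev/pts | wc -l count, this variant skips the ptmx entry, so the two numbers can differ by one.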

When remote commands are executed via SSH (particularly through Nagios' check_by_ssh module), random hangs occur during command execution. Authentication succeeds and interactive login works, but simple commands like ls freeze at the execution phase:

ssh -l root -p 2222 server.domain.tld 'ls'

The client debug log shows the session establishes successfully but hangs after sending the command:

debug1: Entering interactive session.
debug2: callback start
debug2: client_session2_setup: id 0
debug1: Sending environment.
debug3: Ignored env ORBIT_SOCKETDIR
*** skipping approx 40 env var ignored
debug1: Sending command: ls
debug2: channel 0: request exec confirm 1
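
When many captured logs need to be compared, it can help to flag the ones that stop at the exec request, as the log above does. The following is a minimal sketch, assuming the client's debug output (which ssh writes to stderr) was saved to a file; the function name stalled_at_exec and the log path in the usage comment are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) if the last debug line in a saved
# ssh -vvv log is the exec request, meaning nothing was logged after the
# command was sent.
stalled_at_exec() {
    local log_file=$1 last_line
    # Keep only ssh debug lines, then inspect the final one.
    last_line=$(grep -E '^debug[123]: ' "$log_file" | tail -n 1)
    [[ $last_line == *"request exec confirm"* ]]
}

# Usage:
#   timeout 30 ssh -vvv -l root -p 2222 server.domain.tld 'ls' 2> /tmp/ssh-debug.log
#   stalled_at_exec /tmp/ssh-debug.log && echo "hung after exec request"
```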

Interestingly, setting pseudo-terminal behavior explicitly, either forcing allocation with -t or disabling it with -T, resolves the issue:

ssh -t -l root -p 2222 server.domain.tld 'ls'

But this isn't always feasible, especially in automated systems like Nagios.

Other possible causes worth ruling out:

  • SSH Server Resource Constraints: Check for process limits or memory exhaustion
  • Network Middlebox Interference: Some network devices may improperly handle SSH traffic
  • TCP Window Scaling Issues: Particularly in high-latency networks
  • DNS Resolution Delays: Even with reverse DNS disabled, some systems still attempt lookups

Server-side Configuration

Add these directives to /etc/ssh/sshd_config:

UseDNS no
ClientAliveInterval 30
ClientAliveCountMax 3
MaxSessions 50
MaxStartups 50:30:100

Client-side Tweaks

Create/modify ~/.ssh/config:

Host *
    ServerAliveInterval 15
    ServerAliveCountMax 3
    TCPKeepAlive yes
    ControlMaster auto
    ControlPath ~/.ssh/control:%h:%p:%r
    ControlPersist 1h

Nagios-specific Fix

For check_by_ssh, modify your command definition:

define command {
    command_name check_by_ssh_fixed
    command_line /usr/lib/nagios/plugins/check_by_ssh -H $HOSTADDRESS$ -C "$ARG1$" -t 30 -o "ServerAliveInterval=15"
}

When the issue occurs:

  1. Check server resource usage: top -c and ss -tulnp | grep sshd
  2. Capture network traffic: tcpdump -i eth0 'port 2222' -w ssh-capture.pcap
  3. Inspect open files: lsof -p "$(pgrep -d, sshd)" (lsof expects a comma-separated PID list, and pgrep prints one PID per line by default)

For critical systems where SSH reliability is paramount:

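#!/bin/bash
# Robust SSH command executor with retries
MAX_RETRIES=3
TIMEOUT=30   # seconds allowed per attempt before timeout(1) kills ssh

execute_remote() {
    local i
    for ((i = 1; i <= MAX_RETRIES; i++)); do
        # BatchMode=yes makes ssh fail instead of prompting, so an
        # unattended run cannot block waiting for input.
        # Note: a non-zero exit from the remote command also triggers a retry.
        if timeout "$TIMEOUT" ssh -o "BatchMode=yes" -o "ConnectTimeout=5" "$@"; then
            return 0
        fi
        # Linear backoff (2s, 4s, ...); no sleep after the final attempt
        if ((i < MAX_RETRIES)); then
            sleep $((i * 2))
        fi
    done
    return 1
}

execute_remote -p 2222 root@server.domain.tld 'ls'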