When executing remote commands via SSH (particularly through the Nagios check_by_ssh plugin), connections intermittently hang during command execution. The issue has these characteristics:
- Hangs occur randomly during non-interactive command execution (e.g., 'ls')
- Authentication succeeds but command execution stalls
- Affects all clients from the same IP address
- Works temporarily after killing sshd processes
- The problem disappears when PTY behavior is set explicitly with -t (force allocation) or -T (disable allocation)
# Typical failing command
ssh -l root -p 2222 server.domain.tld 'ls'
# Debug output shows hang point:
debug2: channel 0: request exec confirm 1
The hanging occurs due to SSH's handling of pseudo-terminal (PTY) allocation in non-interactive sessions. Key factors:
- PTY contention: Server-side PTY resource exhaustion
- Session cleanup: Zombie processes holding PTY resources
- Nagios specific: check_by_ssh's default behavior doesn't include terminal flags
Server-Side Configuration
# /etc/ssh/sshd_config modifications:
ClientAliveInterval 60
ClientAliveCountMax 3
MaxSessions 100
MaxStartups 100:30:300
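# Session cleanup cron job (blunt: kills every root session without a TTY, healthy ones included)
# The process title is "sshd: root@notty", so the pattern must start with "sshd:"
*/5 * * * * /usr/bin/pkill -9 -f 'sshd: root@notty'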
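Because the cron job above kills every matching session regardless of age, a gentler variant is to target only sessions that have been around longer than a normal check would take. The following is a minimal sketch, assuming a procps-ng ps (for the etimes column) and the usual "sshd: user@notty" process title; the function name stale_notty_pids and the 300-second threshold are hypothetical choices for illustration:

```shell
#!/usr/bin/env bash
# Sketch: print PIDs of sshd sessions without a TTY that are older than a
# threshold, instead of killing every match unconditionally.
# Assumption: process titles look like "sshd: root@notty" on this host.

# Reads lines in `ps -eo pid=,etimes=,args=` format on stdin and prints the
# PIDs whose title is "sshd: <user>@notty" and whose age exceeds $1 seconds.
stale_notty_pids() {
    local max_age="$1"
    awk -v max="$max_age" '
        $3 == "sshd:" && $4 ~ /@notty$/ && $2 > max { print $1 }
    '
}

# Example (not executed here; xargs -r is a GNU extension):
#   ps -eo pid=,etimes=,args= | stale_notty_pids 300 | xargs -r kill
```

Sending a plain kill (SIGTERM) first and reserving -9 for processes that ignore it gives sshd a chance to clean up its session.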
Client-Side Workarounds
Option 1: Disable PTY allocation explicitly (preferred for Nagios; RequestTTY needs OpenSSH 5.9 or later, -T is the equivalent flag)
ssh -o "RequestTTY=no" -p 2222 root@server.domain.tld 'ls'
Option 2: Nagios-specific configuration
# In commands.cfg
define command {
    command_name check_by_ssh_fixed
    command_line /usr/lib/nagios/plugins/check_by_ssh \
        -H $HOSTADDRESS$ -C "$ARG1$" \
        -o "RequestTTY=no" -o "StrictHostKeyChecking=no"
}
When standard fixes don't work, examine these areas:
- Packet capture:
tcpdump -i eth0 'port 2222' -w ssh.pcap
- SSH debug mode:
ssh -vvv root@server
- Server resource limits: check ulimit -a and /proc/sys/fs/nr_open
- Implement connection pooling for frequent SSH commands
- Use SSH multiplexing with ControlMaster
- Schedule regular service restarts during maintenance windows
- Monitor PTY allocation with
ls /dev/pts | wc -l
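Counting entries in /dev/pts shows how many PTYs are in use but not how close the system is to its limit. As a rough sketch, assuming a Linux host where the kernel exposes the counters under /proc/sys/kernel/pty/, the two numbers can be compared directly; the function names and the 80 percent default threshold are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: report PTY usage as "allocated/max".
# Assumption: Linux with /proc/sys/kernel/pty/nr and /proc/sys/kernel/pty/max.
# The paths are parameters so the function can be pointed at other files.
pty_usage() {
    local nr_file="${1:-/proc/sys/kernel/pty/nr}"
    local max_file="${2:-/proc/sys/kernel/pty/max}"
    local nr max
    nr=$(cat "$nr_file") || return 1
    max=$(cat "$max_file") || return 1
    echo "${nr}/${max}"
}

# Succeeds when usage ("allocated/max") is at or above a percentage
# threshold (default 80, an arbitrary choice).
pty_usage_high() {
    local usage="$1" threshold="${2:-80}"
    local nr="${usage%/*}" max="${usage#*/}"
    [ "$max" -gt 0 ] || return 2
    [ $(( nr * 100 / max )) -ge "$threshold" ]
}

# Example (not executed here):
#   pty_usage_high "$(pty_usage)" && echo "PTY usage is high: $(pty_usage)"
```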
When executing remote commands via SSH (particularly through the Nagios check_by_ssh plugin), we encounter random hangs during command execution. Authentication succeeds and interactive login works, but simple commands like ls freeze at the execution phase:
ssh -l root -p 2222 server.domain.tld 'ls'
The client debug log shows the session establishes successfully but hangs after sending the command:
debug1: Entering interactive session.
debug2: callback start
debug2: client_session2_setup: id 0
debug1: Sending environment.
debug3: Ignored env ORBIT_SOCKETDIR
*** skipping approx 40 env var ignored
debug1: Sending command: ls
debug2: channel 0: request exec confirm 1
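When hangs are intermittent, it helps to capture the client debug output to a file and check automatically where each failed run stopped. The sketch below classifies a captured ssh -vvv log by its last debug line; the function name ssh_hang_point and the category labels are hypothetical and only cover the messages shown in the log above:

```shell
#!/usr/bin/env bash
# Sketch: classify where an `ssh -vvv` session stopped, based on the last
# client debug line in a captured log file. Labels are illustrative only.
ssh_hang_point() {
    local log="$1" last
    # Keep only OpenSSH client debug lines and remember the final one.
    last=$(awk '/^debug[1-3]: / { l = $0 } END { print l }' "$log")
    case "$last" in
        *"request exec confirm"*)         echo "exec-request" ;;
        *"Sending command:"*)             echo "exec-request" ;;
        *"Entering interactive session"*) echo "session-setup" ;;
        "")                               echo "no-debug-output" ;;
        *)                                echo "other: $last" ;;
    esac
}

# Example capture (not executed here):
#   timeout 30 ssh -vvv -l root -p 2222 server.domain.tld 'ls' 2> ssh-debug.log
#   ssh_hang_point ssh-debug.log
```

A result of exec-request matches the symptom described here: the channel is open and the command was sent, but the server never confirmed the exec request.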
Interestingly, setting the pseudo-terminal behavior explicitly, either forcing allocation with -t or disabling it with -T, resolves the issue:
ssh -t -l root -p 2222 server.domain.tld 'ls'
But this isn't always feasible, especially in automated systems like Nagios.
Possible causes to investigate:
- SSH Server Resource Constraints: Check for process limits or memory exhaustion
- Network Middlebox Interference: Some network devices may improperly handle SSH traffic
- TCP Window Scaling Issues: Particularly in high-latency networks
- DNS Resolution Delays: Even with reverse DNS disabled, some systems still attempt lookups
Server-side Configuration
Add these directives to /etc/ssh/sshd_config:
UseDNS no
ClientAliveInterval 30
ClientAliveCountMax 3
MaxSessions 50
MaxStartups 50:30:100
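After editing sshd_config, it is worth confirming that the daemon actually sees the new values, since a later Match block or an included file can override them. A minimal sketch, assuming OpenSSH's sshd -T (extended test mode, which prints the effective configuration with lowercase keywords); the helper name sshd_setting_is is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: check one keyword in the effective sshd configuration.
# Reads `sshd -T` output on stdin; succeeds if keyword $1 has value $2.
# The keyword is compared case-insensitively because sshd -T prints
# keywords in lowercase.
sshd_setting_is() {
    local key expected
    key=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    expected="$2"
    awk -v k="$key" -v v="$expected" '
        tolower($1) == k && $2 == v { ok = 1 }
        END { exit (ok ? 0 : 1) }
    '
}

# Example (needs root, not executed here):
#   sshd -t && sshd -T | sshd_setting_is UseDNS no && echo "UseDNS is off"
```

Running sshd -t first validates the file syntax, which avoids reloading the daemon with a broken configuration. The settings only take effect once sshd has been reloaded.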
Client-side Tweaks
Create/modify ~/.ssh/config:
Host *
ServerAliveInterval 15
ServerAliveCountMax 3
TCPKeepAlive yes
ControlMaster auto
ControlPath ~/.ssh/control:%h:%p:%r
ControlPersist 1h
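One caveat with ControlMaster and a long ControlPersist is that every later command reuses the same master connection, so a wedged master can make all checks hang at once. The sketch below wraps OpenSSH's control commands (ssh -O check and ssh -O exit) so a monitoring script can test and reset the master; the function names mux_alive and mux_reset are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: helpers around OpenSSH's multiplexing control commands.
# Arguments are passed through to ssh, e.g. -p 2222 root@server.domain.tld.

# Succeeds if a master connection for this destination is running.
mux_alive() {
    ssh -O check "$@" 2>/dev/null
}

# Asks the master to exit; the error is ignored when no master exists.
mux_reset() {
    ssh -O exit "$@" 2>/dev/null || true
}

# Example (not executed here):
#   mux_alive -p 2222 root@server.domain.tld || echo "no master connection"
#   mux_reset -p 2222 root@server.domain.tld
```

The next ordinary ssh invocation after mux_reset creates a fresh master because of ControlMaster auto.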
Nagios-specific Fix
For check_by_ssh, modify your command definition:
define command {
    command_name check_by_ssh_fixed
    command_line /usr/lib/nagios/plugins/check_by_ssh -H $HOSTADDRESS$ -C "$ARG1$" -t 30 -o "ServerAliveInterval=15"
}
When the issue occurs:
- Check server resource usage with top -c and ss -tulnp | grep sshd
- Capture network traffic:
tcpdump -i eth0 'port 2222' -w ssh-capture.pcap
- Inspect open files:
lsof -p "$(pgrep -d, sshd)"
For critical systems where SSH reliability is paramount:
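#!/bin/bash
# Robust SSH command executor with retries
MAX_RETRIES=3
TIMEOUT=30   # seconds allowed per attempt before timeout(1) kills ssh
# Note: any non-zero exit triggers a retry, including a failing remote command.
execute_remote() {
    local i
    for i in $(seq 1 "$MAX_RETRIES"); do
        # BatchMode=yes makes ssh fail instead of waiting on a password prompt
        if timeout "$TIMEOUT" ssh -o "ConnectTimeout=5" -o "BatchMode=yes" "$@"; then
            return 0
        fi
        # Back off a little longer after each failed attempt, except the last
        [ "$i" -lt "$MAX_RETRIES" ] && sleep $((i * 2))
    done
    return 1
}
execute_remote -p 2222 root@server.domain.tld 'ls'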