Troubleshooting Slurm Node Daemon Startup Failure: PID File Creation Error and Systemd Timeout


When attempting to start the Slurm node daemon with systemctl start slurmd.service, the service fails with a timeout error. The key diagnostic messages reveal:

Mar 23 17:13:43 fedora1 systemd[1]: slurmd.service: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
Mar 23 17:15:11 fedora1 systemd[1]: slurmd.service: Start operation timed out.

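The first message means systemd did not find a PID file at the path named in the unit's PIDFile= setting. Either slurmd could not create it there, typically because /var/run/slurm/ is missing, or slurmd writes its PID file to a different path. Start by verifying the directory and its permissions: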

# Check if directory exists
ls -ld /var/run/slurm

# If missing, create with correct permissions
sudo mkdir -p /var/run/slurm
sudo chown slurm:slurm /var/run/slurm
sudo chmod 755 /var/run/slurm
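
On most systemd distributions /var/run is a symlink to /run, which is typically a tmpfs emptied at every boot, so a directory created by hand there may vanish after a reboot. The following is a minimal sketch of one way to make it persistent, assuming systemd-tmpfiles is in use; the function name and the file name under /etc/tmpfiles.d are illustrative choices, not something the Slurm packages necessarily ship.

```shell
#!/bin/sh
# Write a tmpfiles.d entry that recreates /run/slurm on every boot.
# write_slurm_tmpfiles and the default target path are hypothetical names.
write_slurm_tmpfiles() {
    target="${1:-/etc/tmpfiles.d/slurm.conf}"
    # Fields: type, path, mode, user, group, age ("-" disables age-based cleanup)
    printf 'd /run/slurm 0755 slurm slurm -\n' > "$target"
}
```

Run the function as root, then apply the entry without rebooting using sudo systemd-tmpfiles --create /etc/tmpfiles.d/slurm.conf.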

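The Slurm systemd unit file may need adjustment. Rather than editing /usr/lib/systemd/system/slurmd.service in place, where a package update can overwrite it, run sudo systemctl edit --full slurmd.service to create an override copy under /etc/systemd/system/, and make it look like this: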

[Unit]
Description=Slurm node daemon
After=network.target munge.service

[Service]
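# Type=forking expects slurmd to daemonize, so ExecStart must not pass -D
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
# Must match SlurmdPidFile in slurm.conf
PIDFile=/var/run/slurm/slurmd.pid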
TimeoutSec=120
Restart=on-failure

[Install]
WantedBy=multi-user.target
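
One combination worth ruling out in any slurmd unit file is Type=forking together with the -D flag: -D keeps slurmd in the foreground, while Type=forking makes systemd wait for the started process to fork and exit, which ends in a start timeout. The sketch below is a simple text check for that combination; unit_fork_conflict is a hypothetical helper name, and the pattern only covers the straightforward single-line ExecStart form.

```shell
#!/bin/sh
# Report whether a unit file combines Type=forking with a foreground (-D) ExecStart.
# unit_fork_conflict is a hypothetical helper; pass the unit file path as $1.
unit_fork_conflict() {
    unit="$1"
    if grep -Eq '^Type=forking' "$unit" &&
       grep -Eq '^ExecStart=.*[[:space:]]-D([[:space:]]|$)' "$unit"; then
        echo "conflict"
    else
        echo "ok"
    fi
}
```

For example, unit_fork_conflict /etc/systemd/system/slurmd.service prints conflict or ok for the override copy, if one exists at that path.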

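To get more detailed error information, run the daemon in the foreground with verbose logging. slurmd normally runs as root (SlurmUser applies to slurmctld), so do not switch to the slurm user here:

sudo /usr/sbin/slurmd -D -vvvv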

Common issues this reveals include:

  • Missing or incorrect munge key
  • Network connectivity problems between nodes
  • Incorrect slurm.conf parameters

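Note that slurmd -C does not validate slurm.conf. It prints the hardware configuration slurmd detects on this node, in slurm.conf format, so you can compare it with the node's NodeName line: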

slurmd -C

Pay special attention to:

  • SlurmctldHost (ControlMachine in older releases) is set to the controller's hostname
  • SlurmUser exists on every node with the same UID
  • Network addresses are correct

For comprehensive logs, use:

journalctl -u slurmd -b --no-pager

Filter for specific errors with:

journalctl -u slurmd -b -p err
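
Because -p err drops everything logged below err priority, it can hide lines that matter for this particular failure. As an alternative, the sketch below filters the full journal output by text instead; pid_related is an illustrative helper name, and the two patterns are simply taken from the messages quoted at the top of this post.

```shell
#!/bin/sh
# Keep only log lines that mention the PID file or a start timeout.
# Reads log text on stdin; pid_related is a hypothetical helper name.
pid_related() {
    grep -E 'PID file|timed out'
}
```

Typical use: journalctl -u slurmd -b --no-pager | pid_related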

If issues persist, try starting slurmd directly without systemd:

sudo /usr/sbin/slurmd

Then check if the PID file gets created:

ls -l /var/run/slurm/slurmd.pid
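
The daemon may take a moment to write its PID file, so a single ls right after starting it can be misleading. A small polling loop, sketched below, waits up to a given number of seconds; wait_for_pidfile is a hypothetical helper, and the 10-second default is an arbitrary choice.

```shell
#!/bin/sh
# Poll for a non-empty PID file once per second until it appears or time runs out.
# wait_for_pidfile is a hypothetical helper: wait_for_pidfile PATH [SECONDS]
wait_for_pidfile() {
    path="$1"
    remaining="${2:-10}"
    while [ "$remaining" -gt 0 ]; do
        if [ -s "$path" ]; then
            # Print the recorded PID and report success
            cat "$path"
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}
```

For example: wait_for_pidfile /var/run/slurm/slurmd.pid 15 || echo "PID file never appeared"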

In full, the failure reported by systemctl start slurmd.service looks like this:

Job for slurmd.service failed because a timeout was exceeded.
Mar 23 17:13:43 fedora1 systemd[1]: slurmd.service: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory

The failure stems from systemd not finding the daemon's PID file where it expects it. Verify the directory structure:

# Check directory existence and permissions
ls -ld /var/run/slurm
stat /var/run/slurm

# Expected output should show:
# drwxr-xr-x 2 slurm slurm 4096 Mar 23 17:13 /var/run/slurm

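Key parameters in slurm.conf affect this behavior. Note that the SlurmdPidFile value shown here is /var/run/slurmd.pid, while systemd is looking for /var/run/slurm/slurmd.pid; if your files differ like this, systemd will never find the PID file, so make SlurmdPidFile and the unit's PIDFile= identical: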

SlurmUser=slurm
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
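
The PID file path is configured in two places, PIDFile= in the systemd unit and SlurmdPidFile= in slurm.conf, and systemd can only find the file if the two agree. The sketch below compares them; pidfile_paths_match is an illustrative name, and the parsing assumes the exact key spelling shown here and plain Key=Value lines without trailing comments.

```shell
#!/bin/sh
# Compare PIDFile= in a systemd unit with SlurmdPidFile= in slurm.conf.
# pidfile_paths_match is a hypothetical helper: pidfile_paths_match UNIT CONF
pidfile_paths_match() {
    unit_pid=$(sed -n 's/^PIDFile=//p' "$1" | tail -n 1)
    conf_pid=$(sed -n 's/^[[:space:]]*SlurmdPidFile[[:space:]]*=[[:space:]]*//p' "$2" | tail -n 1)
    echo "unit: $unit_pid"
    echo "conf: $conf_pid"
    # Succeed only when both values are present and identical
    [ -n "$unit_pid" ] && [ "$unit_pid" = "$conf_pid" ]
}
```

For example: pidfile_paths_match /usr/lib/systemd/system/slurmd.service /etc/slurm/slurm.conf || echo "PID file paths differ"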

Cross-validate these with actual system paths:

# Verify user existence
getent passwd slurm
id slurm

# Check spool directory
ls -ld /var/spool/slurmd

The complete startup flow should follow this pattern:

# Create necessary directories
sudo mkdir -p /var/run/slurm /var/spool/slurmd
sudo chown slurm:slurm /var/run/slurm /var/spool/slurmd

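# Set the SELinux context if applicable. The type name below is taken as given;
# confirm it exists in your policy first, e.g. with: seinfo -t | grep -i slurm
sudo semanage fcontext -a -t slurm_var_run_t "/var/run/slurm(/.*)?"
sudo restorecon -Rv /var/run/slurm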

# Reload systemd and restart service
sudo systemctl daemon-reload
sudo systemctl start slurmd

For deeper investigation, enable verbose logging:

# Add this line to /etc/sysconfig/slurmd, which the unit reads via EnvironmentFile=
# and passes to slurmd through $SLURMD_OPTIONS:
SLURMD_OPTIONS="-vvvv"

# Alternatively, raise the log level in slurm.conf:
SlurmdDebug=debug3

# Then restart the service and monitor the logs with:
sudo systemctl restart slurmd
journalctl -u slurmd -f

Verify network connectivity between nodes:

# From control node to compute node
ping -c 3 fedora2
nc -zv fedora2 6818

# From compute node to control node
ping -c 3 fedora1
nc -zv fedora1 6817
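
Ports 6817 (slurmctld) and 6818 (slurmd) are the Slurm defaults; if slurm.conf sets SlurmctldPort or SlurmdPort, the checks above need those values instead. The sketch below reads a port from slurm.conf and falls back to a default; conf_port is a hypothetical helper, and it assumes the key is spelled exactly as given and holds a single port number.

```shell
#!/bin/sh
# Print a port setting from slurm.conf, or the supplied default if it is unset.
# conf_port is a hypothetical helper: conf_port FILE KEY DEFAULT
conf_port() {
    value=$(sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$1" | tail -n 1)
    echo "${value:-$3}"
}
```

For example: nc -zv fedora2 "$(conf_port /etc/slurm/slurm.conf SlurmdPort 6818)"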

Always check your Slurm configuration before restarting services:

# Print the hardware slurmd detects, in slurm.conf format,
# and compare it with this node's NodeName line
slurmd -C

# Verify the compute node and the controller run compatible Slurm versions
slurmd -V
slurmctld -V    # run this one on the controller