When attempting to start the Slurm node daemon with systemctl start slurmd.service, the service fails with a timeout error. The key diagnostic messages are:
Mar 23 17:13:43 fedora1 systemd[1]: slurmd.service: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
Mar 23 17:15:11 fedora1 systemd[1]: slurmd.service: Start operation timed out.
The immediate issue appears to be that the slurmd process cannot create its PID file in /var/run/slurm/. Let's verify the directory structure and permissions:
# Check if directory exists
ls -ld /var/run/slurm
# If missing, create with correct permissions
sudo mkdir -p /var/run/slurm
sudo chown slurm:slurm /var/run/slurm
sudo chmod 755 /var/run/slurm
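The check above can be wrapped in a small helper so it is easy to repeat after a reboot. This is only a sketch: check_pid_dir is a hypothetical name, and it assumes GNU stat (as shipped on Fedora).

```shell
# Hypothetical helper: report whether a directory exists and, if so,
# print its owner, group and octal mode (for example "slurm:slurm 755").
check_pid_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing"
        return 1
    fi
    # stat -c is GNU coreutils syntax (assumption: Linux host)
    stat -c '%U:%G %a' "$dir"
}
```

Usage: check_pid_dir /var/run/slurm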
The default Slurm systemd unit file may need adjustment. Package updates overwrite /usr/lib/systemd/system/slurmd.service, so copy it to /etc/systemd/system/slurmd.service and modify it there:
[Unit]
Description=Slurm node daemon
After=network.target munge.service
[Service]
# Type=forking expects slurmd to daemonize, so do not pass -D (foreground) here
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
# Must match SlurmdPidFile in slurm.conf
PIDFile=/var/run/slurm/slurmd.pid
TimeoutSec=120
Restart=on-failure
[Install]
WantedBy=multi-user.target
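On current Fedora releases /var/run is a symlink to /run, which lives on tmpfs, so a directory created by hand with mkdir disappears at the next reboot. One way to handle this is to let systemd create the directory through a drop-in. The sketch below is an assumption-laden example, not the packaged configuration: write_slurmd_dropin and the file name 10-runtime-dir.conf are hypothetical, while RuntimeDirectory= and RuntimeDirectoryMode= are standard systemd directives.

```shell
# Hypothetical helper: write a drop-in that makes systemd create /run/slurm
# each time the service starts. The target directory is a parameter so the
# function can be tried out in a scratch location first.
write_slurmd_dropin() {
    local dropin_dir="${1:-/etc/systemd/system/slurmd.service.d}"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/10-runtime-dir.conf" <<'EOF'
[Service]
RuntimeDirectory=slurm
RuntimeDirectoryMode=0755
EOF
}
```

Run it as root without arguments to write to the default location, then run sudo systemctl daemon-reload. Note that systemd removes a RuntimeDirectory when the service stops, which is fine for a PID file.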
To get more detailed error information, start the daemon manually in the foreground with verbose logging. slurmd normally runs as root rather than as SlurmUser, so run it with plain sudo:
sudo /usr/sbin/slurmd -D -vvvv
Common issues this reveals include:
- Missing or incorrect munge key
- Network connectivity problems between nodes
- Incorrect slurm.conf parameters
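Check your slurm.conf settings against the node. The following prints the hardware that slurmd detects, in slurm.conf format, so you can compare it with the node's NodeName entry:
slurmd -C
Pay special attention to:
- ControlMachine matches the controller's hostname
- SlurmUser exists on the system
- Network addresses are correct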
For comprehensive logs, use:
journalctl -u slurmd -b --no-pager
Filter for specific errors with:
journalctl -u slurmd -b -p err
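To see which PID file path systemd is actually waiting for, the path can be pulled out of the journal messages. A minimal sketch, assuming the message wording matches the log line quoted earlier; pidfile_from_log is a hypothetical helper name.

```shell
# Hypothetical helper: read journal lines on stdin and print the path from
# every "Can't open PID file <path> ..." message.
pidfile_from_log() {
    sed -n "s/.*Can't open PID file \([^ ]*\) .*/\1/p"
}
```

Usage: journalctl -u slurmd -b --no-pager | pidfile_from_log | sort -u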
If issues persist, try starting slurmd directly without systemd (as root, which is how slurmd normally runs):
sudo /usr/sbin/slurmd
Then check if the PID file gets created:
ls -l /var/run/slurm/slurmd.pid
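A PID file can exist while the process it names is already gone, so it may help to check both. The following is a sketch with a hypothetical helper name, pidfile_alive. It uses kill -0, which only tests whether the process can be signalled, so run it as root when checking a root-owned slurmd.

```shell
# Hypothetical helper: succeed only if the PID file is readable, contains a
# number, and that process currently exists (and may be signalled by us).
pidfile_alive() {
    local pidfile="$1" pid
    [ -r "$pidfile" ] || return 1
    pid=$(tr -d '[:space:]' < "$pidfile")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or not purely numeric
    esac
    kill -0 "$pid" 2>/dev/null
}
```

Usage: sudo bash -c "$(declare -f pidfile_alive); pidfile_alive /var/run/slurm/slurmd.pid && echo running"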
When attempting to start the Slurm node daemon using systemctl start slurmd.service, the system reports a critical failure:
Job for slurmd.service failed because a timeout was exceeded.
Mar 23 17:13:43 fedora1 systemd[1]: slurmd.service: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
The immediate issue stems from the daemon's inability to create or access its PID file. Let's verify the directory structure:
# Check directory existence and permissions
ls -ld /var/run/slurm
stat /var/run/slurm
# Expected output should show:
# drwxr-xr-x 2 slurm slurm 4096 Mar 23 17:13 /var/run/slurm
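Key parameters in slurm.conf that affect this behavior (SlurmdPidFile must match the PIDFile= path in the systemd unit, otherwise systemd waits for a file that slurmd never writes):
SlurmUser=slurm
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd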
Cross-validate these with actual system paths:
# Verify user existence
getent passwd slurm
id slurm
# Check spool directory
ls -ld /var/spool/slurmd
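The same cross-validation can be applied to the PID file path, comparing PIDFile= in the unit with SlurmdPidFile in slurm.conf. This is a rough sketch: conf_value and pidfile_matches are hypothetical helpers, and the parser only handles simple one-per-line Key=Value entries, not lines that carry several assignments.

```shell
# Hypothetical helper: print the value of the last KEY=VALUE line in FILE.
# Keys are compared case-insensitively; comment lines are skipped.
conf_value() {
    local file="$1" key="$2"
    awk -v key="$key" '
        /^[ \t]*#/ { next }
        {
            eq = index($0, "=")
            if (eq == 0) next
            k = substr($0, 1, eq - 1)
            gsub(/[ \t]/, "", k)
            if (tolower(k) == tolower(key)) {
                v = substr($0, eq + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                val = v
            }
        }
        END { if (val != "") print val }
    ' "$file"
}

# Succeed only if the unit's PIDFile equals slurm.conf's SlurmdPidFile.
pidfile_matches() {
    local unit="$1" conf="$2"
    [ "$(conf_value "$unit" PIDFile)" = "$(conf_value "$conf" SlurmdPidFile)" ]
}
```

Usage, assuming the paths used elsewhere in this answer: pidfile_matches /usr/lib/systemd/system/slurmd.service /etc/slurm/slurm.conf || echo "PID file paths differ"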
The complete startup flow should follow this pattern:
# Create necessary directories
sudo mkdir -p /var/run/slurm /var/spool/slurmd
sudo chown slurm:slurm /var/run/slurm /var/spool/slurmd
# Set the SELinux context if applicable (assumes your policy defines slurm_var_run_t)
sudo semanage fcontext -a -t slurm_var_run_t "/var/run/slurm(/.*)?"
sudo restorecon -Rv /var/run/slurm
# Reload systemd and restart service
sudo systemctl daemon-reload
sudo systemctl start slurmd
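For deeper investigation, enable verbose logging:
# Temporarily modify systemd unit
sudo systemctl edit --full slurmd.service
# Add these parameters under [Service].
# SLURM_DEBUG is documented as an srun variable, so raise slurmd's verbosity
# through SLURMD_OPTIONS instead (the unit's ExecStart must expand $SLURMD_OPTIONS):
Environment="SLURMD_OPTIONS=-vvvv"
StandardOutput=journal
StandardError=journal
# Then restart the service and monitor logs with:
sudo systemctl restart slurmd
journalctl -u slurmd -f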
Verify network connectivity between nodes:
# From control node to compute node
ping -c 3 fedora2
nc -zv fedora2 6818
# From compute node to control node
ping -c 3 fedora1
nc -zv fedora1 6817
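If nc is not installed, a similar probe can be done with bash alone. The sketch below assumes bash and the coreutils timeout command are available; port_open is a hypothetical helper name, and the ports are the ones used in the commands above.

```shell
# Hypothetical helper: succeed if a TCP connection to HOST:PORT can be opened
# within 3 seconds. Uses bash's /dev/tcp pseudo-device in a child shell.
port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Usage: port_open fedora2 6818 && echo "slurmd reachable"; port_open fedora1 6817 && echo "slurmctld reachable"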
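Always validate your Slurm configuration before restarting services:
# Print this node's detected hardware in slurm.conf format and compare it with the NodeName line
slurmd -C
# Show the configuration the running controller has loaded
scontrol show config
# Verify that the daemon versions match across nodes
slurmd -V
slurmctld -V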