We've all encountered this scenario in Sun Grid Engine (SGE) environments: a job gets stuck in the deletion state (dr), and regular qdel commands fail when executed by normal users. The system responds with that infuriating message:
job 12345 is already in deletion
Yet mysteriously, the same job disappears immediately when the root user attempts deletion. This permission asymmetry creates unnecessary admin overhead and user frustration.
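A job stuck this way typically shows up in qstat roughly like this (the job ID, user, host, and timestamp below are made up for illustration):
$ qstat -u $USER
job-ID  prior    name    user   state  submit/start at      queue          slots
----------------------------------------------------------------------------------
 12345  0.55500  my_job  alice  dr     01/15/2024 10:15:02  all.q@node01       1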
The dr (deletion requested) state indicates SGE has received a deletion request but hasn't completed the cleanup. Common causes include:
- Processes hanging during termination
- Filesystem latency on NFS-mounted directories
- Resource manager communication delays
- Permission issues during cleanup
While root access always works, we need non-privileged solutions. Here are tested approaches:
1. The Nuclear Option: Signal Flooding
First identify all remaining processes:
qstat -j JOBID | grep '^usage'    # confirm SGE still records usage for the job
pgrep -u $USER -f JOBID           # list your processes whose command line mentions the job ID
Then force-kill them:
pkill -9 -u $USER -f JOBID
pkill -9 -u $USER -P PARENT_PID
2. SGE Administrative Commands
If you have limited sudo access:
sudo -u sgeadmin qdel -f JOBID
sudo -u sgeadmin qmod -cj JOBID
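If no such sudo rule exists yet, a narrowly scoped sudoers entry is one way for admins to delegate it. This is only a sketch: the group name, runas user, and qdel path are placeholders that vary per installation.
# /etc/sudoers.d/sge-force-qdel (example only; adjust group, runas user, and path)
%hpcusers ALL=(sgeadmin) NOPASSWD: /opt/sge/bin/lx-amd64/qdel -f *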
3. Cleanup Script Example
Create a user-runnable cleanup script:
#!/bin/bash
# Force cleanup of an SGE job stuck in the "dr" state
JOBID=$1
if [[ -z "$JOBID" ]]; then
    echo "Usage: $0 <jobid>" >&2
    exit 1
fi
# Check the job state first (field 5 of the job's row in the qstat listing)
STATE=$(qstat | awk -v j="$JOBID" '$1 == j {print $5}')
if [[ "$STATE" == "dr" ]]; then
    # Kill our own processes whose command line mentions the job ID
    pkill -9 -u "$USER" -f "$JOBID"
    # Clear any error state, then force SGE cleanup
    qmod -cj "$JOBID" >/dev/null 2>&1
    qdel -f "$JOBID" >/dev/null 2>&1
    echo "Forced cleanup of job $JOBID"
else
    echo "Job $JOBID is not in dr state, use regular qdel"
fi
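Save it as, say, sge_force_cleanup.sh (the name is arbitrary), make it executable, and pass the stuck job's ID:
chmod +x sge_force_cleanup.sh
./sge_force_cleanup.sh 12345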
Modify your job scripts to include cleanup traps:
#!/bin/bash
# SGE job script with forced cleanup on termination
cleanup() {
    # Explicitly terminate child processes of this script
    pkill -P $$
    exit 143   # 128 + 15, the conventional exit code for SIGTERM
}
trap cleanup TERM INT
# Main job commands here
For permanent solutions, admins should consider:
- Adjusting terminate_method in SGE configuration
- Setting appropriate kill_delay values
- Implementing periodic cleanup cron jobs
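As a starting point, the termination-related queue settings can be inspected and edited with qconf; all.q is just an example queue name, and changing the configuration requires SGE manager rights:
# Show the termination-related settings of a queue
qconf -sq all.q | grep -E 'terminate_method|notify'
# Open the queue configuration in an editor to change them
qconf -mq all.q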
When working with Sun Grid Engine (SGE), users frequently encounter situations where their jobs get stuck in the deletion state (shown as dr in qstat output). The frustrating part is that regular users can't terminate these jobs despite seeing the "already in deletion" message, while root can kill them without trouble.
The SGE system handles job deletion through a multi-step process. When a job enters the deletion state (dr), it means:
- The job has been marked for deletion in SGE's database
- The system is cleaning up allocated resources
- Some process is still holding onto the job
Regular users typically can't intervene in this process because:
1. The job may still be owned by system processes
2. SGE's permission model restricts forceful termination
3. Resource cleanup requires elevated privileges
Here are several approaches to handle stuck jobs without root access:
Method 1: Using qdel with Force Option
Try the forceful deletion command:
qdel -f <jobid>
If that still fails, inspect the scheduler's view of the job for hints about what is blocking the cleanup:
qstat -j <jobid>
Method 2: Cleanup Through qmod
Sometimes modifying the job state can help:
qmod -cj <jobid>  # Clear the job's error state
qdel <jobid>      # Then try deletion again
Method 3: Direct Execution Host Intervention
If you know which execution host ran the job:
# Check what SGE still thinks is running on that host
qhost -h <exec_host> -j
# SSH to the execution host, then manually kill any remaining processes you own
ps -ef | grep <jobid>
kill -9 <process_ids>
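One way to narrow the search on the execution host is to locate the sge_shepherd process that supervises the job and then look at your own processes underneath it. The job ID 12345 and <shepherd_pid> below are placeholders, and you can only signal processes you own:
# Find the shepherd process for the job
ps -ef | grep 'sge_shepherd-12345'
# List your own processes whose parent is that shepherd, then kill them
pgrep -u "$USER" -P <shepherd_pid>
pkill -9 -u "$USER" -P <shepherd_pid>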
To minimize stuck deletion states:
- Always include proper cleanup in job scripts
- Use timeout mechanisms
- Implement SIGTERM handlers in your applications
Example job script with cleanup:
#!/bin/bash
#$ -S /bin/bash
#$ -N my_job
#$ -cwd

cleanup() {
    # Your cleanup commands here
    rm -f temporary_files/*
    exit 1
}
trap cleanup SIGTERM

# Main job commands
./my_program
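One caveat: depending on the queue's terminate_method, a deleted job may receive SIGKILL, which cannot be trapped, rather than SIGTERM. Submitting with -notify asks SGE to send SIGUSR2 shortly before the final kill, so also trapping that signal gives the cleanup handler a chance to run:
# In the job script, also catch the pre-kill warning signal
trap cleanup SIGUSR2 SIGTERM
# Submit with notification enabled
qsub -notify my_job_script.sh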
If none of the user-level solutions work, you'll need to request your cluster administrators to:
- Use qdel -f <jobid> as root
- Clear the job from SGE's spool directory
- Restart the qmaster daemon if needed
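For the administrators doing that last step, the relevant locations can be looked up rather than guessed; the commands below assume a standard SGE environment with $SGE_ROOT and $SGE_CELL set:
# Spool directory settings live in the cluster configuration and the bootstrap file
qconf -sconf | grep -i spool
cat "$SGE_ROOT/$SGE_CELL/common/bootstrap"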