How to Force Kill Stuck SGE Jobs in “Deletion (dr)” State as Non-root User

We've all encountered this scenario in Sun Grid Engine (SGE) environments: a job gets stuck in the deletion state (dr) and regular qdel commands fail when executed by normal users. The system responds with that infuriating message:

job 12345 is already in deletion

Yet mysteriously, the same job disappears immediately when the root user attempts deletion. This permission asymmetry creates unnecessary admin overhead and user frustration.

The dr state combines d (deletion) and r (running): SGE has received a deletion request for a running job but hasn't completed the cleanup. Common causes include:

Processes hanging during termination
Filesystem latency on NFS-mounted directories
Resource manager communication delays
Permission issues during cleanup
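
To see which of your jobs are currently in this state, one option is to filter the state column of qstat. The sketch below assumes the default qstat layout, in which the state is the fifth column; list_dr_jobs is a hypothetical helper name, not an SGE command.

```shell
#!/bin/bash
# Print the IDs of jobs whose state contains "d" (deletion), e.g. dr or dt.
# Reads qstat output on stdin so it can be tried on saved output as well.
# Assumes the default column order: job-ID prior name user state ...
list_dr_jobs() {
  awk '$1 ~ /^[0-9]+$/ && $5 ~ /d/ { print $1 }'
}

# Typical use on a submit host:
#   qstat -u "$USER" | list_dr_jobs
```

Array jobs can appear on several rows, so the same ID may be printed more than once; piping the result through sort -u removes the repeats.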

While root access always works, we need non-privileged solutions. Here are tested approaches:

1. The Nuclear Option: Signal Flooding

First, on the job's execution host, identify the processes that are still running:

qstat -j JOBID | grep '^usage'
pgrep -u $USER -f JOBID

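Then force-kill them. Kill the children first, because once the parent exits they are re-parented and -P PARENT_PID no longer matches them:

pkill -9 -u $USER -P PARENT_PID
pkill -9 -u $USER -f JOBID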
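
SIGKILL gives a process no chance to release files or locks, so a gentler variant is to send SIGTERM first and escalate only if the process survives a grace period. This is a minimal sketch: term_then_kill is a hypothetical helper, and the 10-second default grace period is an arbitrary assumption.

```shell
#!/bin/bash
# Send SIGTERM to a PID, wait up to GRACE seconds, then send SIGKILL.
# Usage: term_then_kill PID [GRACE_SECONDS]
term_then_kill() {
  local pid=$1 grace=${2:-10} waited=0
  # If the signal cannot be delivered, the process is gone or not ours.
  kill -TERM "$pid" 2>/dev/null || return 0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example, using the PIDs reported by pgrep on the execution host:
#   for pid in $(pgrep -u "$USER" -f JOBID); do term_then_kill "$pid" 5; done
```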

2. SGE Administrative Commands

If your site grants you sudo access to the SGE admin account (assumed here to be sgeadmin; the name varies by site):

sudo -u sgeadmin qdel -f JOBID
sudo -u sgeadmin qmod -cj JOBID

3. Cleanup Script Example

Create a user-runnable cleanup script:

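#!/bin/bash
# Usage: cleanup_job.sh JOBID   (run on the job's execution host)
JOBID=${1:?usage: $0 JOBID}
# Read the state column (5th field) of the job's row in qstat
STATE=$(qstat -u "$USER" | awk -v id="$JOBID" '$1 == id { print $5; exit }')
if [[ "$STATE" == *d* ]]; then
  # Kill our processes whose environment contains JOB_ID=<jobid>.
  # pkill -f only sees the command line, so read /proc/<pid>/environ (Linux).
  for pid in $(pgrep -u "$USER"); do
    if tr '\0' '\n' 2>/dev/null < "/proc/$pid/environ" | grep -qx "JOB_ID=$JOBID"; then
      kill -9 "$pid"
    fi
  done
  # Clear any error state, then force the deletion
  qmod -cj "$JOBID" >/dev/null 2>&1
  if qdel -f "$JOBID" >/dev/null 2>&1; then
    echo "Forced cleanup of job $JOBID"
  else
    echo "qdel -f was refused for job $JOBID; ask an administrator"
  fi
else
  echo "Job not in deletion state, use regular qdel"
fi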

Modify your job scripts to include cleanup traps:

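#!/bin/bash
# SGE job script with forced cleanup

cleanup() {
  trap - EXIT    # Prevent the handler from running twice
  pkill -P $$    # Kill child processes of this script
}

trap cleanup EXIT       # Runs on every exit path, keeps the exit status
trap 'exit 143' TERM    # 128 + 15; exit then fires the EXIT trap
trap 'exit 130' INT     # 128 + 2

# Main job commands here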

For permanent solutions, admins should consider:

Adjusting terminate_method in SGE configuration
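Setting an appropriate notify interval in the queue configuration (the delay between the warning signal and SIGKILL for jobs submitted with -notify)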
Implementing periodic cleanup cron jobs
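
Before asking for changes, it can help to know what a queue is currently configured to do. The sketch below filters the termination-related lines out of the output of qconf -sq; termination_settings is a hypothetical helper, and the queue name all.q in the usage line is only an assumption about your site.

```shell
#!/bin/bash
# Print the notify and terminate_method settings from a queue configuration.
# Reads "qconf -sq <queue>" output on stdin (one "name value" pair per line).
termination_settings() {
  awk '$1 == "notify" || $1 == "terminate_method" { print $1 "=" $2 }'
}

# Typical use (queue name is an assumption):
#   qconf -sq all.q | termination_settings
```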

When working with Sun Grid Engine (SGE), users frequently encounter situations where their jobs get stuck in the deletion state (shown as dr in qstat output). The frustrating part comes when regular users can't terminate these jobs despite seeing the "already in deletion" message, while root can successfully kill them.

The SGE system handles job deletion through a multi-step process. When a job enters the deletion state (dr), it means:

The job has been marked for deletion in SGE's database
The system is cleaning up allocated resources
Some process is still holding onto the job

Regular users typically can't intervene in this process because:

1. The job may still be owned by system processes
2. SGE reserves qdel -f for managers and operators unless the administrator has set ENABLE_FORCED_QDEL in qmaster_params
3. Resource cleanup requires elevated privileges
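
Whether a forced delete is open to ordinary users depends on the cluster's global configuration: the qmaster_params entry must contain ENABLE_FORCED_QDEL. The sketch below checks for it in the output of qconf -sconf. forced_qdel_enabled is a hypothetical helper, and it assumes the parameter sits on the qmaster_params line itself, either as a bare flag or with a value of 1 or true; continuation lines are not handled.

```shell
#!/bin/bash
# Succeed (exit 0) if qmaster_params enables forced qdel for regular users.
# Reads "qconf -sconf" output on stdin.
forced_qdel_enabled() {
  awk '
    $1 == "qmaster_params" {
      # Parameters are separated by commas and/or whitespace.
      n = split($0, parts, /[ \t,]+/)
      for (i = 2; i <= n; i++) {
        if (parts[i] == "ENABLE_FORCED_QDEL" ||
            parts[i] ~ /^ENABLE_FORCED_QDEL=(1|true|TRUE)$/) found = 1
      }
    }
    END { exit(found ? 0 : 1) }
  '
}

# Typical use:
#   if qconf -sconf | forced_qdel_enabled; then echo "qdel -f is allowed"; fi
```

Matching whole parameters rather than substrings keeps similarly named settings from being mistaken for this flag.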

Here are several approaches to handle stuck jobs without root access:

Method 1: Using qdel with Force Option

Try the forceful deletion command:

qdel -f <jobid>

qdel has no verbose option. If the forced delete is refused, check what SGE still records for the job:

qstat -j <jobid>

Method 2: Cleanup Through qmod

Sometimes clearing the job's error state lets the deletion proceed (note that qmod -d disables queues, not jobs):

qmod -cj <jobid>  # Clear the job's error state
qdel <jobid>      # Then retry the deletion

Method 3: Direct Execution Host Intervention

If you know which execution host ran the job:

# Confirm that the job is still listed on the execution host
qhost -h <exec_host> -j
# Log in to that host and kill any remaining processes
ssh <exec_host>
ps -u $USER -f | grep <jobid>
kill -9 <process_ids>
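
If you do not know the execution host, it can usually be read from the queue column of qstat, which has the form queue@host for running jobs. A minimal sketch, assuming the default qstat layout; job_exec_host is a hypothetical helper name.

```shell
#!/bin/bash
# Print the execution host of a job, taken from the queue@host field.
# Reads qstat output on stdin. Prints nothing for pending jobs.
# Usage: qstat -u "$USER" | job_exec_host JOBID
job_exec_host() {
  local jobid=$1
  awk -v id="$jobid" '
    $1 == id {
      for (i = 1; i <= NF; i++)
        if ($i ~ /@/) { sub(/^.*@/, "", $i); print $i; exit }
    }'
}
```

qstat may truncate long queue instance names in its default output, so treat the result as a hint and confirm it with qhost.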

To minimize stuck deletion states:

Always include proper cleanup in job scripts
Use timeout mechanisms
Implement SIGTERM handlers in your applications
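
For the timeout item, one simple mechanism is the timeout utility from GNU coreutils, assuming it is installed on the execution hosts. The sketch below wraps a command so that it receives SIGTERM at the limit and SIGKILL ten seconds later if it is still alive; run_with_limit and the ten-second escalation are illustrative choices.

```shell
#!/bin/bash
# Run a command with a wall-clock limit in seconds.
# SIGTERM is sent at the limit, SIGKILL 10 seconds later if needed.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
  local limit=$1
  shift
  timeout --signal=TERM --kill-after=10 "$limit" "$@"
}

# Example inside a job script:
#   run_with_limit 3600 ./my_program
```

timeout exits with status 124 when the limit is reached, which lets the job script tell a timeout apart from an ordinary failure.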

Example job script with cleanup:

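#!/bin/bash
#$ -S /bin/bash
#$ -N my_job
#$ -cwd

cleanup() {
  # Your cleanup commands here
  [ -n "$child" ] && kill "$child" 2>/dev/null
  rm -f temporary_files/*
  exit 1
}

trap cleanup SIGTERM

# Main job commands. Run in the background and wait, because bash
# defers a trap until the current foreground command has finished.
./my_program &
child=$!
wait "$child"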

If none of the user-level solutions work, ask your cluster administrators to:

Use qdel -f <jobid> as root
Clear the job from SGE's spool directory
Restart the qmaster daemon if needed
