How to Force Kill Stuck SGE Jobs in “Deletion (dr)” State as Non-root User



We've all encountered this scenario in Sun Grid Engine (SGE) environments: a job gets stuck in the deletion state (dr) and regular qdel commands fail when executed by normal users. The system responds with that infuriating message:

job 12345 is already in deletion

Yet mysteriously, the same job disappears immediately when the root user attempts deletion. This permission asymmetry creates unnecessary admin overhead and user frustration.

The dr (deletion requested) state indicates SGE has received a deletion request but hasn't completed the cleanup. Common causes include:

  • Processes hanging during termination
  • Filesystem latency on NFS-mounted directories
  • Resource manager communication delays
  • Permission issues during cleanup
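
To check which of your jobs are currently in a deletion state, you can filter the state column of qstat. The following is a minimal sketch, assuming the default qstat layout (two header lines, then job-ID, prior, name, user, state, ...) and job names without spaces; filter_deleting_jobs is a hypothetical helper name, not an SGE command.

```shell
#!/bin/sh
# Print the IDs of jobs whose state column contains "d" (e.g. "dr" or "dt").
# Assumption: default `qstat` column order, with the state in field 5.
filter_deleting_jobs() {
  awk 'NR > 2 && $5 ~ /d/ { print $1 }'
}

# Usage (requires an SGE client on PATH):
#   qstat -u "$USER" | filter_deleting_jobs
```

The function only reads standard input, so it can be tried on saved qstat output before pointing it at a live cluster.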

While root access always works, we need non-privileged solutions. Here are tested approaches:

1. The Nuclear Option: Signal Flooding

First identify all remaining processes. Run pgrep on the execution host where the job ran, since it only sees local processes:

qstat -j JOBID | grep '^usage'
pgrep -u $USER -f JOBID

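Then force-kill them. Review the pgrep output first: -f matches anywhere in the command line, so a short job ID can also match unrelated processes of yours: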

pkill -9 -u $USER -f JOBID
pkill -9 -u $USER -P PARENT_PID
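
Going straight to SIGKILL gives processes no chance to flush files or release locks. A gentler variant is sketched below: send SIGTERM, wait a short grace period, and use SIGKILL only if the process is still alive. terminate_then_kill is a hypothetical helper name, and the 5-second default grace period is an arbitrary assumption.

```shell
#!/bin/sh
# Send SIGTERM first; fall back to SIGKILL only if the process survives.
# Usage: terminate_then_kill PID [GRACE_SECONDS]
terminate_then_kill() {
  pid=$1
  grace=${2:-5}
  # If the signal cannot be delivered, the process is gone or not ours.
  kill -TERM "$pid" 2>/dev/null || return 0
  i=0
  while [ "$i" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
    sleep 1
    i=$((i + 1))
  done
  kill -KILL "$pid" 2>/dev/null              # still alive: force it
  return 0
}

# Usage on the execution host (JOBID is a placeholder, as above):
#   for p in $(pgrep -u "$USER" -f JOBID); do terminate_then_kill "$p" 10; done
```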

2. SGE Administrative Commands

If you have limited sudo access (sgeadmin below stands for whatever account administers SGE at your site):

sudo -u sgeadmin qdel -f JOBID
sudo -u sgeadmin qmod -cj JOBID

3. Cleanup Script Example

Create a user-runnable cleanup script:

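#!/bin/bash
# Usage: cleanup_job.sh JOBID   (run on the job's execution host)
JOBID=$1
if [[ -z "$JOBID" ]]; then
  echo "Usage: $0 JOBID" >&2
  exit 1
fi
# Check job state first: read the state column (field 5) of plain qstat output,
# because 'qstat -j' does not print a 'state:' line on all SGE versions
STATE=$(qstat -u "$USER" | awk -v id="$JOBID" '$1 == id { print $5; exit }')
if [[ "$STATE" == *d* ]]; then
  # Kill associated processes. pkill -f matches command lines, not environment
  # variables, so match the spooled job script path instead of JOB_ID=...
  # (assumes the usual .../job_scripts/<jobid> spool layout)
  for pid in $(pgrep -u "$USER" -f "job_scripts/$JOBID"); do
    pkill -9 -P "$pid"   # children first
    kill -9 "$pid"
  done
  # Force SGE cleanup
  qmod -cj "$JOBID" >/dev/null 2>&1
  qdel -f "$JOBID" >/dev/null 2>&1
  echo "Forced cleanup of job $JOBID"
else
  echo "Job not in a deletion state, use regular qdel"
fi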

Modify your job scripts to include cleanup traps:

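#!/bin/bash
# SGE job script with forced cleanup
# Note: SIGKILL cannot be trapped. Submit with "qsub -notify" so SGE sends
# SIGUSR2 ahead of the final SIGKILL and the trap gets a chance to run.

cleanup() {
  trap - EXIT TERM INT USR2  # Reset traps so cleanup runs only once
  # Explicit process termination
  pkill -P $$  # Kill child processes
}

trap cleanup EXIT                        # Normal exit: keep the job's own exit status
trap 'cleanup; exit 143' TERM INT USR2   # Signalled: clean up, then exit non-zero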

# Main job commands here

For permanent solutions, admins should consider:

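  • Adjusting terminate_method in the queue configuration (qconf -mq <queue>)
  • Setting an appropriate grace period before SIGKILL (in SGE, the queue's notify interval, used by jobs submitted with -notify)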
  • Implementing periodic cleanup cron jobs

When working with Sun Grid Engine (SGE), users frequently encounter situations where their jobs get stuck in the deletion state (shown as dr in qstat output). The frustrating part comes when regular users can't terminate these jobs despite seeing the "already in deletion" message, while root can successfully kill them.

The SGE system handles job deletion through a multi-step process. When a job enters the deletion state (dr), it means:

  • The job has been marked for deletion in SGE's database
  • The system is cleaning up allocated resources
  • Some process is still holding onto the job

Regular users typically can't intervene in this process because:

1. The job may still be owned by system processes
2. SGE's permission model restricts forceful termination
3. Resource cleanup requires elevated privileges

Here are several approaches to handle stuck jobs without root access:

Method 1: Using qdel with Force Option

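Try the forceful deletion command. Users without manager or operator status can use -f on their own jobs only if the administrator has set ENABLE_FORCED_QDEL in qmaster_params:

qdel -f <jobid>

If that fails, inspect the job record for error messages that explain why cleanup is stalled (qdel itself has no verbose option):

qstat -j <jobid>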

Method 2: Cleanup Through qmod

Sometimes clearing the job's error state can help (note that qmod -d disables queues, not jobs):

qmod -cj <jobid>  # Clear the job's error state
qdel <jobid>      # Then try deletion

Method 3: Direct Execution Host Intervention

If you know which execution host ran the job:

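# List the jobs SGE still associates with the execution host
qhost -h <exec_host> -j
# SSH to the execution host (if your site allows it)
ssh <exec_host>
# Then manually kill any remaining processes
ps -ef | grep <jobid>
kill -9 <process_ids>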
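
If you do not know the execution host, it can usually be read from the queue column of qstat, which has the form queue@host for running jobs. The sketch below assumes the default qstat layout; exec_host_for_job is a hypothetical helper name. Long host names may be truncated in the plain output, in which case qstat -xml is a safer source.

```shell
#!/bin/sh
# Print the execution host for a job ID, taken from the queue@host column.
# Reads `qstat` output on standard input.
exec_host_for_job() {
  awk -v id="$1" '
    $1 == id {
      # Scan the fields after the state column for the first "queue@host".
      for (i = 6; i <= NF; i++) {
        if ($i ~ /@/) { sub(/^[^@]*@/, "", $i); print $i; exit }
      }
    }'
}

# Usage (requires an SGE client on PATH):
#   qstat -u "$USER" | exec_host_for_job <jobid>
```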

To minimize stuck deletion states:

  • Always include proper cleanup in job scripts
  • Use timeout mechanisms
  • Implement SIGTERM handlers in your applications

Example job script with cleanup:

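#!/bin/bash
#$ -S /bin/bash
#$ -N my_job
#$ -cwd
# -notify asks SGE to send SIGUSR2 shortly before the final SIGKILL
#$ -notify

cleanup() {
  # Your cleanup commands here
  rm -f temporary_files/*
  exit 1
}

# SIGKILL cannot be trapped, so also catch the SIGUSR2 warning enabled by -notify
trap cleanup SIGTERM SIGUSR2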

# Main job commands
./my_program
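
One caveat with a script like this: the shell runs a trap handler only after the current foreground command returns, so a handler can be delayed for as long as ./my_program keeps running. A common workaround is to start the program in the background and wait on it, because wait is interrupted by trapped signals. The sketch below illustrates the idea; run_with_term_forwarding is a hypothetical helper name, and forwarding SIGTERM to the child is one possible policy, not something SGE requires.

```shell
#!/bin/sh
# Run a command in the background and forward SIGTERM to it, so the
# handler fires promptly instead of waiting for the command to finish.
# Usage: run_with_term_forwarding COMMAND [ARGS...]
run_with_term_forwarding() {
  "$@" &
  child=$!
  trap 'kill -TERM "$child" 2>/dev/null' TERM
  wait "$child"            # returns early (status > 128) if SIGTERM arrives
  status=$?
  if [ "$status" -gt 128 ] && kill -0 "$child" 2>/dev/null; then
    wait "$child"          # child still shutting down: collect its real status
    status=$?
  fi
  trap - TERM
  return "$status"
}

# Usage inside a job script:
#   run_with_term_forwarding ./my_program
```

The function returns the program's exit status, so the rest of the job script can still branch on success or failure.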

If none of the user-level solutions work, ask your cluster administrators to:

  1. Use qdel -f <jobid> as root
  2. Clear the job from SGE's spool directory
  3. Restart the qmaster daemon if needed