How to Limit Concurrent SGE Jobs per User: Controlling User-Level Job Throttling in Sun Grid Engine

In Sun Grid Engine (SGE/Univa Grid Engine), job concurrency can be controlled at multiple levels. While most administrators focus on system-wide limits through queue configurations, user-specific throttling is equally important for I/O-intensive workloads.

The simplest approach is the -tc (maximum concurrent tasks) option at submission time, which throttles array jobs:

qsub -t 1-500 -tc 100 massive_job_array.sh

This ensures no more than 100 tasks from the array will run simultaneously. However, this only works for job arrays.
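On newer Grid Engine releases qalter also accepts -tc, so the throttle can be tightened or loosened on a live array; check your version's qalter man page first. A sketch, with an illustrative job ID:

# Lower the concurrency cap on a running array job (12345 is illustrative)
qalter -tc 50 12345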

For a more permanent solution, define a resource quota set (RQS). Grid Engine has no per-user consumable that can be attached to a user object; since SGE 6.1 the supported mechanism is a quota rule that caps the built-in slots complex for a named user:

# qconf -arqs opens an interactive editor, so write the rule to a file
# and load it non-interactively with -Arqs
cat > /tmp/user_slot_limit.rqs <<'EOF'
{
   name         user_slot_limit
   description  "Cap dave at 100 concurrent slots"
   enabled      TRUE
   limit        users dave to slots=100
}
EOF
qconf -Arqs /tmp/user_slot_limit.rqs
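Once the rule is loaded, qquota reports current consumption against it:

qquota -u dave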

Another option is a dedicated parallel environment (PE) capped at 100 slots. Two caveats: user_lists takes an access list (ACL) rather than a raw username, and qconf -ap opens an interactive editor, so derive the definition from an existing PE and load it with -Ap:

# Put dave in an ACL the PE will admit (created on first use)
qconf -au dave throttle_users

# Clone the stock 'make' PE (present on most installs), cap its slots,
# and restrict it to the ACL
qconf -sp make | sed -e 's/^pe_name.*/pe_name throttled_pe/' \
                    -e 's/^slots.*/slots 100/' \
                    -e 's/^user_lists.*/user_lists throttle_users/' > /tmp/throttled_pe.conf
qconf -Ap /tmp/throttled_pe.conf
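The cap only applies to jobs routed through the PE, so attach it to a queue and submit against it:

qconf -aattr queue pe_list throttled_pe all.q   # add the PE to the queue's pe_list
qsub -pe throttled_pe 1 job_script.sh           # only PE jobs are counted against the cap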

For ad-hoc control without touching the cluster configuration, throttle on the submission side (a dependency-based variant follows the script):

#!/bin/bash
MAX_JOBS=100

for i in {1..500}; do
  # Wait while dave already has MAX_JOBS jobs queued or running;
  # the first two qstat lines are headers, so skip them
  while [ "$(qstat -u dave | awk 'NR>2' | wc -l)" -ge "$MAX_JOBS" ]; do
    sleep 30
  done
  qsub "job_script_${i}.sh"
done
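If you prefer real job dependencies over polling, -hold_jid can gate one wave of submissions on the previous one by job name; a sketch with the same hypothetical script names, in waves of 100:

# Jobs named batch2 stay on hold until every batch1 job has finished
for i in $(seq 1 100);   do qsub -N batch1 "job_script_${i}.sh"; done
for i in $(seq 101 200); do qsub -N batch2 -hold_jid batch1 "job_script_${i}.sh"; done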

Check active job counts with:

qstat -u dave -s r | awk 'NR>2' | wc -l   # skip the two header lines

Or use SGE's accounting:

qacct -o dave -d 1   # qacct selects by owner with -o; -d 1 covers the last day
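For a continuously refreshing count while tuning the limits:

watch -n 30 "qstat -u dave -s r | awk 'NR>2' | wc -l"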

When working with Sun Grid Engine (SGE), uncontrolled job submissions can lead to resource contention, especially when dealing with I/O-intensive workloads. A common scenario is when a single user submits hundreds of jobs that simultaneously access shared storage, causing filesystem bottlenecks for all cluster users.

A quick but blunt control is capping slots on the queue instances where the jobs land. Note that qconf -rattr has no per-user option: the cap applies to everyone using that queue instance, and the value goes after the attribute name. For a limit that follows the user, the resource quota set shown earlier is the right tool.

# Syntax: qconf -rattr <object> <attribute> <value> <object_id>
qconf -rattr queue slots 100 all.q@node001
qconf -rattr queue slots 100 all.q@node002
# ... repeat for all nodes ...
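If the policy is simply that no single user may run more than N jobs at once, the scheduler configuration has a dedicated maxujobs parameter; note it applies uniformly to every user, not just dave. A sketch of a non-interactive edit, assuming your qconf supports the file-based -Msconf variant:

# maxujobs caps running jobs per user, cluster-wide
qconf -ssconf | sed 's/^maxujobs.*/maxujobs 100/' > /tmp/sched.conf
qconf -Msconf /tmp/sched.conf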

For temporary limits, consumable complex values offer more flexibility. Bear in mind that a queue-level consumable is a shared pool, not a per-user counter, and it only constrains jobs that request it:

# Create a consumable complex attribute (columns: name shortcut type
# relop requestable consumable default urgency)
qconf -sc > /tmp/complex_attrs
echo "max_user_jobs    max_user_jobs    INT    <=    YES    YES    0    0" >> /tmp/complex_attrs
qconf -Mc /tmp/complex_attrs

# Give all.q a pool of 100 units
qconf -mattr queue complex_values max_user_jobs=100 all.q
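Each job then has to consume one unit for the cap to bite, either explicitly at submission or via the cluster-wide default request file:

qsub -l max_user_jobs=1 job_script.sh
# or append "-l max_user_jobs=1" to $SGE_ROOT/$SGE_CELL/common/sge_request
# so every submission requests it automatically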

For users who want to self-limit their submissions, job arrays with throttling work well:

#!/bin/bash
#$ -t 1-500
# Maximum concurrent tasks; keep comments off the #$ lines themselves
#$ -tc 100
#$ -q all.q
#$ -cwd

# Your actual job commands here
./process_data.sh $SGE_TASK_ID
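With the array running, expand its tasks to confirm no more than 100 are in state r at a time (state is the fifth column of qstat output):

qstat -u dave -g d | awk '$5 == "r"' | wc -l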

After setting limits, verify them with:

qstat -u dave -s r | awk 'NR>2' | wc -l   # Count running jobs (skip headers)
qconf -suser dave                         # Show dave's user object
qstat -f -explain a                       # Queue status with alarm explanations

For more sophisticated control, you can create a cron job that adjusts limits based on system load:

#!/bin/bash
# Tighten dave's quota when shared-storage latency exceeds 50 ms.
# Assumes sdb backs /shared and the user_slot_limit quota defined earlier;
# the await column position ($10 here) varies between sysstat versions,
# so check your iostat -dx output first.
FS_LATENCY=$(iostat -dx sdb 1 2 | awk '/sdb/ {v=$10} END {print v}')

if (( $(echo "$FS_LATENCY > 50" | bc -l) )); then
    qconf -srqs user_slot_limit | sed 's/slots=100/slots=50/' > /tmp/rqs.tmp
    qconf -Mrqs /tmp/rqs.tmp user_slot_limit
    logger "Reduced dave's slot quota due to high I/O latency (${FS_LATENCY} ms)"
fi
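Install it with a crontab entry, for example every five minutes (the script path is illustrative):

*/5 * * * * /usr/local/sbin/sge_io_throttle.sh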