Understanding Wall Clock Time vs User Time vs CPU Time in GridEngine Performance Benchmarking


When analyzing job performance in GridEngine (now known as Univa Grid Engine or Altair Grid Engine), you'll encounter three fundamental time measurements:

# Sample GridEngine output format
=============================================
Wallclock Time: 00:05:23
User Time: 00:03:41
CPU Time: 00:04:12
=============================================

Wall clock time (or elapsed time) represents the actual time taken from job start to completion, as if you were timing it with a physical stopwatch. This includes:

  • All system delays (I/O waits)
  • Network latency
  • Time spent sleeping or blocked while other processes run

Note that GridEngine's wallclock accounting covers the job's execution only; time spent waiting in the queue before the job starts is tracked separately.

Example scenario where wall time differs significantly:

# Python script comparing wall time with CPU time
import time

wall_start = time.time()
cpu_start = time.process_time()
time.sleep(10)  # Artificial delay: wall time advances, CPU time barely does
wall_end = time.time()
cpu_end = time.process_time()
print(f"Wall time: {wall_end - wall_start:.2f} seconds")
print(f"CPU time:  {cpu_end - cpu_start:.2f} seconds")

User time accounts for the CPU time spent executing your application's code in user space. Key characteristics:

  • Excludes system calls and kernel operations
  • Can exceed wall time for multi-threaded programs on multi-core systems
  • Best for single-threaded performance comparison

Example using Linux time command:

$ time -p my_application
real 4.23  # Wall time
user 3.78  # User time
sys 0.32   # System time

CPU time is the total processor time consumed: user time plus system time (time spent in the kernel on the process's behalf):

  • Includes system calls and I/O operations
  • For multi-threaded apps, can exceed wall time
  • Important for analyzing system call overhead

Demonstration in C:

#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

int main(void) {
    struct tms t;
    times(&t);
    long ticks = sysconf(_SC_CLK_TCK);  // clock ticks per second
    printf("user CPU time:   %.2f s\n", (double)t.tms_utime / ticks);
    printf("system CPU time: %.2f s\n", (double)t.tms_stime / ticks);
    return 0;
}

For benchmarking applications:

Metric       Best Use Case                   Limitations
Wall Clock   End-to-end system performance   Affected by external factors
User Time    Algorithm efficiency            Misses system overhead
CPU Time     Total resource usage            Hard to isolate app-specific cost

Practical GridEngine monitoring example:

# qacct output analysis
qacct -j job_id | egrep "wallclock|ru_utime|ru_stime"
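The same ru_utime/ru_stime split that qacct reports is also visible from inside a running process via getrusage; a minimal Python sketch (the workload is a placeholder):

```python
import resource

# Placeholder user-space work so ru_utime is non-zero
total = sum(i * i for i in range(1_000_000))

usage = resource.getrusage(resource.RUSAGE_SELF)
print(f"ru_utime: {usage.ru_utime:.3f}s")  # user CPU time
print(f"ru_stime: {usage.ru_stime:.3f}s")  # system CPU time
```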

When dealing with parallel jobs:

# MPI job with 4 processes might show:
Wallclock: 60s
User Time: 220s (4 cores × 55s each)
CPU Time: 240s (includes system overhead)

For accurate comparisons:

  1. Run multiple iterations
  2. Use dedicated test nodes
  3. Monitor system load averages
  4. Consider using dedicated benchmarking tools like HPL
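Point 1 can be sketched with Python's timeit module, which repeats a measurement so outliers caused by other processes can be discarded (the workload function here is a placeholder):

```python
import timeit

def workload():
    # Placeholder for the code under test
    sum(i * i for i in range(100_000))

# Repeat the measurement; the minimum is least affected by
# interference from other processes on the node
samples = timeit.repeat(workload, number=10, repeat=5)
print(f"best of 5 runs: {min(samples):.4f}s for 10 calls")
```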

Relating these metrics back to GridEngine's own accounting records, qacct reports all three under separate fields:

# Sample GridEngine output format
qacct -j job_id | egrep "wallclock|ru_utime|ru_stime"
wallclock   00:05:23
ru_utime    00:03:45
ru_stime    00:01:12

Wall clock time (also called elapsed time) measures the total duration from job start to completion, including all system overhead and waiting periods. This is what you'd measure with a physical stopwatch.

Example scenario: A Python script that makes API calls:

import time
import requests  # third-party HTTP library

start = time.time()
# API call: wall time includes network latency, but little CPU is used
response = requests.get('https://api.example.com/data')
end = time.time()
print(f"Wall time: {end - start:.2f}s")

User CPU time (ru_utime) accounts for time spent executing application code in user space. Multiple threads/cores can cause this to exceed wall time.

// C program measuring CPU time with clock()
#include <stdio.h>
#include <time.h>

int main() {
    clock_t start = clock();
    // CPU-intensive loop; volatile keeps the compiler from optimizing it away
    for (volatile long i = 0; i < 1000000000; i++);
    clock_t end = clock();
    // clock() reports the process's CPU time, not wall time
    printf("CPU time: %.2f sec\n",
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}

Total CPU time (user + system time) represents all processor cycles consumed. System time (ru_stime) covers kernel operations like I/O handling.

Key differences in database operations:

  • Wall time: Includes connection latency
  • User time: Query processing only
  • System time: Disk I/O operations
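The I/O contribution to system time can be observed with os.times(); a sketch that writes a temporary file (the byte counts are arbitrary):

```python
import os
import tempfile

before = os.times()
# Writing data forces kernel work, which accrues as system time
with tempfile.NamedTemporaryFile(delete=True) as f:
    for _ in range(50):
        f.write(os.urandom(1024 * 1024))  # 1 MiB of random bytes
    f.flush()
after = os.times()

print(f"user:   {after.user - before.user:.2f}s")
print(f"system: {after.system - before.system:.2f}s")
```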

For comparing application performance:

  1. Algorithm efficiency: Use user CPU time (eliminates system variance)
  2. Real-world impact: Wall clock time matters for SLA compliance
  3. Resource usage: Total CPU time reveals optimization opportunities

Advanced analysis example using GNU time (/usr/bin/time):

$ /usr/bin/time -v python script.py
Command being timed: "python script.py"
User time (seconds): 12.45
System time (seconds): 1.87
Percent of CPU this job got: 198%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.21

When submitting jobs with qsub, be aware that:

  • Wall time limits are enforced via -l h_rt=HH:MM:SS
  • CPU time accounting depends on scheduler configuration
  • Multi-core jobs aggregate CPU time across all cores