How to Automatically Kill High CPU Usage Processes After Threshold Time on Linux


As a Linux system administrator managing game servers, I frequently encounter processes that hang and consume 100% CPU indefinitely. While brief spikes to 100% usage are normal during intensive operations, sustained high CPU usage typically indicates a hung process that needs intervention.

The key challenge is distinguishing between temporary high usage (normal operation) and sustained high usage (hung state). We need to watch each process over a time window (such as 30 seconds) and act only if usage stays above the threshold for the whole window.
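
The time-window idea can be sketched on its own, separate from how CPU usage is measured. The following is a minimal illustration, not a definitive implementation: the class name SustainedUsageTracker, its parameters, and PID 4242 are hypothetical names chosen for this example. It remembers when each PID first went above the threshold and resets that timer as soon as usage drops.

```python
import time
from typing import Dict, Hashable, Optional


class SustainedUsageTracker:
    """Tracks how long each key (e.g. a PID) has stayed above a CPU threshold."""

    def __init__(self, cpu_threshold: float = 98.0, max_seconds: float = 30.0):
        self.cpu_threshold = cpu_threshold
        self.max_seconds = max_seconds
        self._first_seen: Dict[Hashable, float] = {}

    def update(self, key: Hashable, cpu_percent: float,
               now: Optional[float] = None) -> bool:
        """Record one sample; return True once usage has been high for max_seconds."""
        now = time.monotonic() if now is None else now
        if cpu_percent < self.cpu_threshold:
            # Usage dropped, so this was only a spike: reset the timer.
            self._first_seen.pop(key, None)
            return False
        # Keep the earliest timestamp at which usage was seen above the threshold.
        start = self._first_seen.setdefault(key, now)
        return now - start >= self.max_seconds


if __name__ == "__main__":
    tracker = SustainedUsageTracker(cpu_threshold=98.0, max_seconds=30.0)
    # (timestamp in seconds, CPU %) samples for a hypothetical PID 4242
    samples = [(0, 100.0), (5, 100.0), (10, 40.0), (15, 100.0), (30, 99.0), (45, 100.0)]
    for ts, cpu in samples:
        print(ts, cpu, tracker.update(4242, cpu, now=ts))
```

In the sample run, the dip to 40% at t=10 resets the timer, so only the last sample (30 seconds after usage went high again at t=15) reports True.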

Here's an enhanced version of the script you found, modified to handle multiple processes and configurable thresholds:

#!/bin/bash
# Process killer script - monitors multiple processes by name

# Configuration
PROCESS_NAMES=("srcds" "mysqld" "java") # Processes to monitor
CPU_THRESHOLD=98                        # Percentage considered "high"
DURATION_THRESHOLD=30                   # Seconds above threshold before kill
CHECK_INTERVAL=5                        # Seconds between checks

declare -A high_cpu_start               # PID -> epoch second when usage first exceeded the threshold

# Main monitoring loop
while true; do
    for proc in "${PROCESS_NAMES[@]}"; do
        # ps -C can match several PIDs, so read one "pid pcpu" pair per line.
        # Note: ps reports pcpu as CPU time divided by process lifetime, so it
        # reacts slowly for processes that have been running for a long time.
        while read -r pid pcpu; do
            [ -n "$pid" ] || continue
            cpu_usage=${pcpu%.*}        # drop the decimal part for integer comparison

            # Check if above threshold
            if [ "$cpu_usage" -ge "$CPU_THRESHOLD" ]; then
                if [ -z "${high_cpu_start[$pid]}" ]; then
                    high_cpu_start[$pid]=$(date +%s)
                    echo "$(date): $proc (PID $pid) exceeded CPU threshold"
                else
                    duration=$(( $(date +%s) - ${high_cpu_start[$pid]} ))
                    if [ "$duration" -ge "$DURATION_THRESHOLD" ]; then
                        echo "$(date): Killing $proc (PID $pid) - over threshold for $duration seconds"
                        kill -9 "$pid"
                        unset "high_cpu_start[$pid]"
                    fi
                fi
            else
                # Usage dropped back below the threshold: reset the timer
                unset "high_cpu_start[$pid]"
            fi
        done < <(ps -C "$proc" -o pid=,pcpu=)
    done
    sleep "$CHECK_INTERVAL"
done
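
One caveat about scripts built on ps: its pcpu column is CPU time divided by the time the process has been running, so a server that hangs after hours of light load will take a long time to show 98%. A possible alternative is to sample the CPU time counters in /proc/<pid>/stat twice and compute usage over just that interval. The sketch below is Linux-only, and the function names are hypothetical ones introduced for this example.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ")", so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(after_comm[11]), int(after_comm[12])
    return utime + stime


def cpu_percent_between(ticks_before: int, ticks_after: int,
                        elapsed_seconds: float, ticks_per_second: int) -> float:
    """CPU usage over the sampling interval, where 100.0 means one full core."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * elapsed_seconds)


def sample_cpu_percent(pid: int, interval: float = 1.0) -> float:
    """Measure a live process's CPU usage over `interval` seconds (Linux only)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = parse_cpu_ticks(f.read())
    start = time.monotonic()
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_ticks(f.read())
    elapsed = time.monotonic() - start
    return cpu_percent_between(before, after, elapsed, ticks_per_second)


if __name__ == "__main__":
    # Sample this script's own process as a quick smoke test.
    print(f"This process: {sample_cpu_percent(os.getpid(), 0.5):.1f}% CPU")
```

The value returned by sample_cpu_percent reflects only the last interval, so it can be compared against the threshold on every pass of the monitoring loop.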

For more sophisticated monitoring, consider these options:

  • Systemd resource controls: Cap CPU time directly in the service unit, for example with CPUQuota=
  • cgroups: Create control groups with CPU limits, so a runaway process is throttled rather than killed
  • Monit: Lightweight monitoring tool with process watching capabilities

When implementing automated process killing:

  • Log all kills for debugging purposes
  • Consider implementing automatic restarts after killing
  • Set up alerts when processes are killed frequently
  • Test thresholds thoroughly for each application type
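
The "alert when kills are frequent" point can be sketched as a sliding-window counter. This is only an illustration under assumed values: the class name KillRateMonitor, the limit of 3 kills, and the one-hour window are hypothetical choices and should be tuned per application.

```python
from collections import deque
from typing import Deque


class KillRateMonitor:
    """Flags when too many kills happen within a sliding time window."""

    def __init__(self, max_kills: int = 3, window_seconds: float = 3600.0):
        self.max_kills = max_kills
        self.window_seconds = window_seconds
        self._kill_times: Deque[float] = deque()

    def record_kill(self, now: float) -> bool:
        """Record a kill at time `now`; return True if an alert should fire."""
        self._kill_times.append(now)
        # Drop kills that have fallen out of the window.
        while self._kill_times and now - self._kill_times[0] > self.window_seconds:
            self._kill_times.popleft()
        return len(self._kill_times) >= self.max_kills


if __name__ == "__main__":
    monitor = KillRateMonitor(max_kills=3, window_seconds=3600.0)
    for kill_time in (0.0, 100.0, 200.0):
        if monitor.record_kill(kill_time):
            print(f"ALERT: {monitor.max_kills} or more kills within the last hour")
```

A monitoring loop would call record_kill with the current time right after each kill and send a notification whenever it returns True. Repeated kills usually point to a bug or a threshold that is too strict, which restarting alone will not fix.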

In Linux server administration, especially when running game servers or other long-running applications, we often encounter processes that occasionally hang and consume 100% CPU indefinitely. These runaway processes (not zombies in the strict sense, since a true zombie process uses no CPU) can:

  • Degrade overall system performance
  • Cause cascading failures in dependent services
  • Lead to unnecessary hosting costs

Basic approaches like killall, pkill, or a one-time ps check don't work, because they cannot tell a legitimate spike from a hang:

# Bad approach - kills every matching process, whether it is hung or healthy
pkill -f "my_game_server"

Game servers legitimately spike to 100% CPU during operations like map changes or player loads. We need duration-based monitoring.

Here's a Python implementation that monitors multiple processes by name:

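#!/usr/bin/env python3
import time
from datetime import datetime

import psutil

TARGET_PROCS = ["srcds_linux", "minecraft_server"]
CPU_THRESHOLD = 99  # percent; 100% is rarely exact
MAX_DURATION = 30  # seconds
CHECK_INTERVAL = 5  # seconds

process_trackers = {}  # PID -> time.monotonic() when high CPU was first seen

while True:
    seen_pids = set()
    # process_iter() reuses Process objects between calls, so cpu_percent is
    # measured since the previous pass (the first pass reports 0.0).
    for proc in psutil.process_iter(['pid', 'name', 'cpu_percent']):
        if proc.info['name'] not in TARGET_PROCS:
            continue
        pid = proc.info['pid']
        cpu = proc.info['cpu_percent']
        seen_pids.add(pid)

        # cpu is None when psutil was denied access to the process
        if cpu is not None and cpu >= CPU_THRESHOLD:
            if pid not in process_trackers:
                process_trackers[pid] = time.monotonic()
                print(f"{datetime.now()} - High CPU detected for PID {pid}")
            else:
                duration = time.monotonic() - process_trackers[pid]
                if duration >= MAX_DURATION:
                    try:
                        proc.kill()
                        print(f"{datetime.now()} - Killed PID {pid} after {duration:.1f}s")
                    except (psutil.NoSuchProcess, psutil.AccessDenied) as exc:
                        print(f"{datetime.now()} - Could not kill PID {pid}: {exc}")
                    del process_trackers[pid]
        else:
            process_trackers.pop(pid, None)

    # Forget PIDs that exited, so a recycled PID does not inherit an old timestamp
    for pid in list(process_trackers):
        if pid not in seen_pids:
            del process_trackers[pid]

    time.sleep(CHECK_INTERVAL)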

For enterprise environments, consider adding:

  1. Logging to syslog or file
  2. Email/SMS alerts before killing
  3. CPU core count awareness (100% on 8 cores ≠ 100% on 1 core)
  4. Process restart automation

For those preferring shell scripts, this Bash version takes the same approach, using marker files to remember when each PID first crossed the threshold:

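#!/bin/bash
PROCESS_NAMES=("java" "hl2_linux")
CPU_THRESHOLD=99      # percent
THRESHOLD_SECONDS=30
INTERVAL=10

# Keep marker files in a private directory; predictable names directly under
# /tmp are unsafe for a script that runs as root.
STATE_DIR=$(mktemp -d) || exit 1
trap 'rm -rf "$STATE_DIR"' EXIT
trap 'exit' INT TERM

while true; do
    for proc_name in "${PROCESS_NAMES[@]}"; do
        # -x matches the exact process name rather than any substring
        for pid in $(pgrep -x "$proc_name"); do
            marker="$STATE_DIR/highcpu_$pid"
            # Note: ps reports pcpu averaged over the process lifetime
            usage=$(ps -p "$pid" -o pcpu= 2>/dev/null)
            usage=${usage// /}    # strip padding
            usage=${usage%.*}     # keep the integer part
            if [[ -n "$usage" ]] && (( usage >= CPU_THRESHOLD )); then
                if [[ -f "$marker" ]]; then
                    start_time=$(cat "$marker")
                    duration=$(( $(date +%s) - start_time ))
                    if (( duration >= THRESHOLD_SECONDS )); then
                        kill -9 "$pid"
                        rm -f "$marker"
                        logger -t cpuwatch "Killed $proc_name (PID:$pid) after ${duration}s"
                    fi
                else
                    date +%s > "$marker"
                fi
            else
                # Usage dropped, or the process exited between pgrep and ps
                rm -f "$marker"
            fi
        done
    done
    sleep "$INTERVAL"
done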

For production servers, configure as a systemd service:

[Unit]
Description=CPU Process Monitor
After=network.target

[Service]
ExecStart=/usr/local/bin/cpu_monitor.py
Restart=always
User=root

[Install]
WantedBy=multi-user.target