As a Linux system administrator managing game servers, I frequently encounter processes that crash and consume 100% CPU indefinitely. While brief spikes to 100% usage are normal during intensive operations, sustained high CPU usage typically indicates a hung process that needs intervention.
The key challenge is distinguishing between temporary high usage (normal operation) and permanent high usage (crashed state). We need to monitor processes over a time window (like 30 seconds) before taking action.
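The time-window idea can be separated from the details of how CPU usage is read. Below is a minimal Python sketch of that logic alone. It is not part of the original script: the class name `SustainedHighTracker` and its parameters are hypothetical, and timestamps are passed in explicitly so the behaviour can be checked without waiting.

```python
class SustainedHighTracker:
    """Tracks how long each PID has stayed at or above a CPU threshold."""

    def __init__(self, threshold=98.0, max_duration=30.0):
        self.threshold = threshold        # percent considered "high"
        self.max_duration = max_duration  # seconds allowed above threshold
        self._since = {}                  # pid -> timestamp of first high sample

    def update(self, pid, cpu_percent, now):
        """Record one sample; return True once the PID has been high too long."""
        if cpu_percent < self.threshold:
            # Any sample below the threshold resets the timer for this PID
            self._since.pop(pid, None)
            return False
        start = self._since.setdefault(pid, now)
        return now - start >= self.max_duration


tracker = SustainedHighTracker(threshold=98.0, max_duration=30.0)
print(tracker.update(1234, 100.0, now=0.0))   # first high sample
print(tracker.update(1234, 100.0, now=15.0))  # still inside the window
print(tracker.update(1234, 100.0, now=30.0))  # window exceeded
```

The three calls print False, False and True. A single sample below the threshold clears the timer, so a brief spike never accumulates toward a kill.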
Here's an enhanced version of the script you found, modified to handle multiple processes and configurable thresholds:
#!/bin/bash
# Process killer script - monitors multiple processes by name

# Configuration
PROCESS_NAMES=("srcds" "mysqld" "java")  # Processes to monitor
CPU_THRESHOLD=98        # Percentage considered "high"
DURATION_THRESHOLD=30   # Seconds above threshold before kill
CHECK_INTERVAL=5        # Seconds between checks

# Maps PID -> epoch second at which it first crossed the threshold
declare -A high_cpu_start

# Main monitoring loop
while true; do
    for proc in "${PROCESS_NAMES[@]}"; do
        # ps -C can match several instances, so handle one PID per line.
        # Note: pcpu is averaged over the process lifetime, not the last interval.
        while read -r pid cpu; do
            [ -z "$pid" ] && continue
            cpu_usage=${cpu%.*}   # Drop the decimals for integer comparison
            # Check if above threshold
            if [ "$cpu_usage" -ge "$CPU_THRESHOLD" ]; then
                if [ -z "${high_cpu_start[$pid]}" ]; then
                    high_cpu_start[$pid]=$(date +%s)
                    echo "$(date): $proc (PID $pid) exceeded CPU threshold"
                else
                    duration=$(( $(date +%s) - ${high_cpu_start[$pid]} ))
                    if [ "$duration" -ge "$DURATION_THRESHOLD" ]; then
                        echo "$(date): Killing $proc (PID $pid) - over threshold for $duration seconds"
                        kill -9 "$pid"
                        unset "high_cpu_start[$pid]"
                    fi
                fi
            else
                unset "high_cpu_start[$pid]"
            fi
        done < <(ps -C "$proc" -o pid=,pcpu=)
    done
    sleep "$CHECK_INTERVAL"
done
For more sophisticated monitoring, consider these options:
- Systemd service configuration: Add CPU usage limits directly in service files
- cgroups: Create control groups with CPU usage limits
- Monit: Lightweight monitoring tool with process watching capabilities
When implementing automated process killing:
- Log all kills for debugging purposes
- Consider implementing automatic restarts after killing
- Set up alerts when processes are killed frequently
- Test thresholds thoroughly for each application type
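The logging and "killed frequently" points above can be combined in one small helper. The following Python sketch is only an illustration under assumed names: `KillRateMonitor`, the `cpuwatch` logger name, and the default limits are hypothetical and should be tuned per application.

```python
import logging
from collections import deque

logger = logging.getLogger("cpuwatch")


class KillRateMonitor:
    """Logs every kill and flags when kills become frequent."""

    def __init__(self, max_kills=3, window=600.0):
        self.max_kills = max_kills  # kills tolerated inside the window
        self.window = window        # window length in seconds
        self._times = deque()       # timestamps of recent kills

    def record_kill(self, name, pid, now):
        """Log one kill; return True if the alert threshold is exceeded."""
        logger.warning("Killed %s (PID %d)", name, pid)
        self._times.append(now)
        # Drop kills that have fallen out of the sliding window
        while self._times and now - self._times[0] > self.window:
            self._times.popleft()
        return len(self._times) > self.max_kills


monitor = KillRateMonitor(max_kills=2, window=60.0)
for timestamp in (0.0, 10.0, 20.0):
    if monitor.record_kill("srcds", 4321, now=timestamp):
        logger.error("srcds is being killed repeatedly; check its configuration")
```

With these example values the third kill inside 60 seconds triggers the alert branch. Where the alert goes (email, chat, pager) is left open here.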
In Linux server administration, especially when running game servers or other long-running applications, we often encounter processes that occasionally hang and consume 100% CPU indefinitely. These runaway processes (not zombies in the strict sense, since a true zombie process uses no CPU) can:
- Degrade overall system performance
- Cause cascading failures in dependent services
- Lead to unnecessary hosting costs
Basic approaches like killall or one-time ps checks don't work because:
# Bad approach - kills valid high-CPU processes
pkill -f "my_game_server"
Game servers legitimately spike to 100% CPU during operations like map changes or player loads. We need duration-based monitoring.
Here's a Python implementation that monitors multiple processes by name:
#!/usr/bin/env python3
import time
from datetime import datetime

import psutil

TARGET_PROCS = ["srcds_linux", "minecraft_server"]
CPU_THRESHOLD = 99  # percent; 100% is rarely exact
MAX_DURATION = 30  # seconds
CHECK_INTERVAL = 5  # seconds

process_trackers = {}  # pid -> time the process first crossed the threshold

while True:
    for proc in psutil.process_iter(['pid', 'name', 'cpu_percent']):
        if proc.info['name'] not in TARGET_PROCS:
            continue
        pid = proc.info['pid']
        # cpu_percent is 0.0 on a process's first sample and may be None
        # if psutil could not read it
        cpu = proc.info['cpu_percent'] or 0.0
        if cpu < CPU_THRESHOLD:
            # Back to normal: forget any earlier high-CPU start time
            process_trackers.pop(pid, None)
        elif pid not in process_trackers:
            process_trackers[pid] = time.time()
            print(f"{datetime.now()} - High CPU detected for PID {pid}")
        else:
            duration = time.time() - process_trackers[pid]
            if duration >= MAX_DURATION:
                try:
                    proc.kill()
                    print(f"{datetime.now()} - Killed PID {pid} after {duration:.1f}s")
                except (psutil.NoSuchProcess, psutil.AccessDenied) as exc:
                    print(f"{datetime.now()} - Could not kill PID {pid}: {exc}")
                del process_trackers[pid]
    time.sleep(CHECK_INTERVAL)
For enterprise environments, consider adding:
- Logging to syslog or file
- Email/SMS alerts before killing
- CPU core count awareness (100% on 8 cores ≠ 100% on 1 core)
- Process restart automation
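On the core-count point: psutil reports per-process CPU relative to a single core, so a multithreaded process can show more than 100%. A minimal sketch of converting that figure to a share of the whole machine follows. The function name `normalize_cpu` is hypothetical, and whether to compare against one core or the whole machine depends on the application.

```python
import os


def normalize_cpu(percent, cores=None):
    """Convert a per-core CPU percentage to a share of the whole machine."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() may return None
    return percent / cores


# A single-threaded hang on an 8-core box uses one eighth of the machine
print(normalize_cpu(100.0, cores=8))
# Eight busy threads saturate the same box
print(normalize_cpu(800.0, cores=8))
```

These print 12.5 and 100.0. For a single-threaded game server, a hang usually shows up as roughly 100% of one core, so the unnormalized figure is often the more useful one to threshold on.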
For those preferring shell scripts, this Bash version works well:
#!/bin/bash
PROCESS_NAMES=("java" "hl2_linux")
THRESHOLD_SECONDS=30
INTERVAL=10

while true; do
    for proc_name in "${PROCESS_NAMES[@]}"; do
        # -x matches the exact process name instead of any substring
        for pid in $(pgrep -x "$proc_name"); do
            # Caveat: ps reports %cpu averaged over the process lifetime
            usage=$(ps -p "$pid" -o %cpu --no-headers)
            [[ -z "$usage" ]] && continue   # Process exited since pgrep ran
            if (( $(echo "$usage >= 99" | bc -l) )); then
                if [[ -f "/tmp/highcpu_$pid" ]]; then
                    start_time=$(cat "/tmp/highcpu_$pid")
                    duration=$(( $(date +%s) - start_time ))
                    if (( duration >= THRESHOLD_SECONDS )); then
                        kill -9 "$pid"
                        rm -f "/tmp/highcpu_$pid"
                        logger -t cpuwatch "Killed $proc_name (PID:$pid) after ${duration}s"
                    fi
                else
                    # First high reading: remember when it started
                    date +%s > "/tmp/highcpu_$pid"
                fi
            else
                rm -f "/tmp/highcpu_$pid"
            fi
        done
    done
    sleep "$INTERVAL"
done
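One limitation of reading `%cpu` from ps is that it reports CPU time divided by the process's total running time, so a long-lived server that hangs may take a long while to climb to 99%. A way around this is to sample the CPU tick counters in /proc/<pid>/stat twice and compute usage over just that interval. The sketch below assumes Linux, and the function names are hypothetical:

```python
import os


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' before reading numeric fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15
    utime, stime = int(fields[11]), int(fields[12])
    return utime + stime


def read_cpu_ticks(pid):
    """Read the current tick total for a live process."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_cpu_ticks(stat_file.read())


def cpu_percent_between(ticks_before, ticks_after, elapsed, hz=None):
    """CPU usage over the interval, as a percentage of one core."""
    if hz is None:
        hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    return 100.0 * (ticks_after - ticks_before) / hz / elapsed
```

A monitor would call `read_cpu_ticks(pid)` once per check, keep the previous value, and pass both to `cpu_percent_between` together with the seconds elapsed between the two reads.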
For production servers, configure as a systemd service:
[Unit]
Description=CPU Process Monitor
After=network.target
[Service]
ExecStart=/usr/local/bin/cpu_monitor.py
Restart=always
User=root
[Install]
WantedBy=multi-user.target