When diagnosing system performance issues like swap exhaustion or CPU saturation, standard monitoring tools often fall short. While solutions like Cacti, Nagios, or Munin provide excellent system-wide metrics, they typically lack granular process-level visibility: exactly what we need when tracking down rogue processes.
Here are three mature solutions that meet all specified requirements:
# Solution 1: NetData (Real-time Visualization)

```bash
sudo apt install netdata
# Access via http://localhost:19999
# Process monitoring enabled by default
```

NetData provides per-process metrics out of the box with beautiful dashboards, though its built-in history is short-term; for persistent logging, see Solution 2.
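NetData's metrics can also be pulled programmatically over its REST API, which is handy for ad-hoc scripting. A sketch, assuming the default `apps.cpu` per-application chart and a JSON response consisting of a `labels` list plus rows of `[timestamp, value, ...]` (newest row first); adjust the chart name to whatever your instance exposes:

```python
import json
from urllib.request import urlopen

# Assumed chart name and endpoint shape; check your instance's
# /api/v1/charts listing for the charts it actually exposes.
NETDATA_URL = "http://localhost:19999/api/v1/data?chart=apps.cpu&points=1&format=json"

def latest_values(payload):
    """Map each labelled dimension to its most recent value."""
    labels = payload["labels"]        # e.g. ["time", "apache", "mysqld"]
    newest = payload["data"][0]       # first row assumed to be the latest sample
    return dict(zip(labels[1:], newest[1:]))

def fetch_latest(url=NETDATA_URL):
    with urlopen(url) as resp:
        return latest_values(json.load(resp))
```

Calling `fetch_latest()` against a running instance returns something like `{"apache": 2.5, "mysqld": 1.0}`, which you can feed into your own alerting or logging.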
# Solution 2: Prometheus + Process Exporter

```bash
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.5/process-exporter-0.7.5.linux-amd64.tar.gz
tar xvfz process-exporter-*.tar.gz
cd process-exporter-*.linux-amd64
./process-exporter --config.path=process.yml
```

Sample process.yml configuration:

```yaml
process_names:
  - name: "{{.Comm}}"
    cmdline:
    - '.+'
```
# Solution 3: Plain Bash Logger

For maximum flexibility, a simple bash solution:

```bash
#!/bin/bash
while true; do
    timestamp=$(date +%s)
    echo "------ ${timestamp} ------" >> /var/log/process_monitor.log
    ps -eo pid,comm,%cpu,%mem,rsz,vsz --sort=-%mem >> /var/log/process_monitor.log
    sleep 30
done
```
For processing the collected data:

```bash
# Per-process CPU averages across the whole log
awk '/^ *[0-9]/ {cpu[$2]+=$3; count[$2]++}
     END {for (p in cpu) print p, cpu[p]/count[p]}' /var/log/process_monitor.log
```
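When an awk one-liner stops being enough (say you want medians or per-day grouping), the same log parses easily in Python. A small sketch assuming the `pid,comm,%cpu,%mem,rsz,vsz` column order written by the bash loop above:

```python
from collections import defaultdict

def cpu_averages(log_lines):
    """Average %CPU per process name from the ps-based monitor log.

    Header lines and "------ <timestamp> ------" separators are skipped
    by requiring the first field to be a numeric PID.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # ps header or timestamp separator
        comm, cpu = fields[1], float(fields[2])
        totals[comm] += cpu
        counts[comm] += 1
    return {p: totals[p] / counts[p] for p in totals}
```

Feed it `open('/var/log/process_monitor.log')` and sort the resulting dict by value to get the heaviest offenders.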
For visualization, consider Grafana with the collected metrics, or GoAccess for log analysis. A few operational details to plan for:

- Log rotation with logrotate to prevent the disk from filling
- Process name normalization (consider using the exe path rather than comm, which the kernel truncates to 15 characters)
- Monitoring the monitoring system itself
If you would rather build the collection yourself, the approaches below scale up from a simple logging script to a queryable database of per-process history:
The Linux /proc filesystem provides real-time process statistics (ps itself just reads it). A simple monitoring script might look like:

```bash
#!/bin/bash
while true; do
    timestamp=$(date +%s)
    echo "--- ${timestamp} ---" >> /var/log/process_monitor.log
    ps -eo pid,user,pcpu,pmem,cmd --sort=-%cpu | head -n 10 >> /var/log/process_monitor.log
    sleep 30
done
```
This captures the top CPU-consuming processes every 30 seconds. For memory-focused monitoring, change `--sort=-%cpu` to `--sort=-%mem`.
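Since ps is just reading /proc, you can also cut out the fork-per-sample overhead by parsing /proc directly. A minimal sketch; the field offsets follow proc(5), and the 4096-byte default page size is an assumption (use `os.sysconf("SC_PAGE_SIZE")` in real code):

```python
import os

def parse_stat(stat_line, page_size=4096):
    """Parse one /proc/[pid]/stat line.

    comm is parenthesised and may itself contain spaces or parens,
    hence the rsplit on ')'. Per proc(5): utime/stime are fields 14/15,
    rss (in pages) is field 24.
    """
    pid, rest = stat_line.split(" ", 1)
    comm, after = rest.rsplit(")", 1)
    f = after.split()
    return {
        "pid": int(pid),
        "comm": comm.lstrip("("),
        "cpu_ticks": int(f[11]) + int(f[12]),   # utime + stime
        "rss_bytes": int(f[21]) * page_size,
    }

def snapshot():
    """One pass over every numeric /proc entry."""
    stats = []
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/stat") as fh:
                    stats.append(parse_stat(fh.read()))
            except OSError:
                pass  # process exited mid-scan
    return stats
```

Note that `cpu_ticks` is cumulative since process start; to get a utilization rate you diff two snapshots and divide by the elapsed ticks.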
For more sophisticated tracking with historical data, a small Python collector using psutil:

```python
import psutil
import time
import sqlite3

conn = sqlite3.connect('process_stats.db')
cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS process_stats
                  (timestamp REAL, pid INTEGER, name TEXT, cpu REAL, mem REAL)''')

while True:
    for proc in psutil.process_iter(['pid', 'name', 'cpu_percent', 'memory_percent']):
        cursor.execute("INSERT INTO process_stats VALUES (?,?,?,?,?)",
                       (time.time(), proc.info['pid'], proc.info['name'],
                        proc.info['cpu_percent'], proc.info['memory_percent']))
    conn.commit()
    time.sleep(30)
```

Note that `cpu_percent` is computed against the previous call, so each process's first sample reads 0.0; the numbers become meaningful from the second iteration on.
If you would rather not maintain your own scripts, several mature tools cover this:

- Netdata: Real-time monitoring with process-level metrics and excellent visualization
- Prometheus + Process Exporter: Collects process metrics and integrates with alerting
- atop: Advanced system and process monitor with historical logging
- Glances: Cross-platform monitoring tool with process tracking
With collected data, you can identify trends using simple SQL:

```sql
-- Top memory-consuming processes (average over the last day)
SELECT name, AVG(mem) AS avg_mem
FROM process_stats
WHERE timestamp > strftime('%s', 'now', '-1 day')
GROUP BY name
ORDER BY avg_mem DESC
LIMIT 10;

-- CPU usage peaks by hour of day
SELECT strftime('%H:00', timestamp, 'unixepoch') AS hour,
       MAX(cpu) AS peak_cpu
FROM process_stats
WHERE name = 'mysqld'
GROUP BY hour;
```
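These queries are easy to sanity-check against an in-memory SQLite database before pointing them at real data; the rows below are made up for illustration:

```python
import sqlite3
import time

# Same schema as the psutil collector, but in memory with fake rows.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE process_stats
                (timestamp REAL, pid INTEGER, name TEXT, cpu REAL, mem REAL)""")
now = time.time()
rows = [(now - 60, 100, "mysqld", 20.0, 12.0),
        (now - 30, 100, "mysqld", 30.0, 14.0),
        (now - 60, 200, "apache2", 5.0, 3.0)]
conn.executemany("INSERT INTO process_stats VALUES (?,?,?,?,?)", rows)

top = conn.execute("""SELECT name, AVG(mem) AS avg_mem
                      FROM process_stats
                      WHERE timestamp > strftime('%s', 'now', '-1 day')
                      GROUP BY name
                      ORDER BY avg_mem DESC
                      LIMIT 10""").fetchall()
print(top)  # [('mysqld', 13.0), ('apache2', 3.0)]
```

One subtlety worth knowing: `strftime('%s', ...)` returns text, but because the `timestamp` column has REAL affinity, SQLite converts the text operand to a number before comparing, so the range filter behaves as expected.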
For services like Apache that might cause swap death, consider this alerting rule for Prometheus (process-exporter publishes its per-group metrics as `namedprocess_namegroup_*` with a `groupname` label):

```yaml
groups:
- name: memory.rules
  rules:
  - alert: HighMemoryProcess
    expr: namedprocess_namegroup_memory_bytes{memtype="resident"} > 1.5e9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Process group {{ $labels.groupname }} using excessive memory"
      description: "{{ $labels.groupname }} is using {{ $value }} bytes of resident memory"
```