Monitoring and Alerting for Process Resource Limits: Open Files, Stack Size, and More


As modern applications become more complex, we're increasingly encountering resource limit issues in production environments. The "too many open files" error is just the tip of the iceberg - processes can also hit limits on stack size, CPU time, memory locks, and other system resources.
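
To see where a specific process stands on those other limits, prlimit (from util-linux) prints them all, and the shell's ulimit builtin covers the current session; $PID here is a placeholder:

# All resource limits for one process (stack size, CPU time, locked memory, ...)
prlimit --pid $PID

# Stack size, CPU time, and locked-memory limits for the current shell
ulimit -s
ulimit -t
ulimit -l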

Linux provides several ways to monitor current resource usage:

# View current limits for a process
cat /proc/$PID/limits

# System-wide file handle usage
cat /proc/sys/fs/file-nr

# Per-process file handle count
ls -1 /proc/$PID/fd | wc -l

For reliable monitoring, consider these approaches:

#!/bin/bash
# Sample monitoring script: alert when a process nears its open-file limit
warning_threshold=80
critical_threshold=90

# Placeholder notifier: swap in your real alerting mechanism (mail, PagerDuty, webhook, ...)
alert() {
  logger -t fd-monitor "$1"
}

get_file_usage() {
  local pid=$1
  local soft_limit current_usage
  soft_limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
  current_usage=$(ls -1 "/proc/$pid/fd" 2>/dev/null | wc -l)
  # Skip processes that have vanished or have an unlimited soft limit
  if [ -z "$soft_limit" ] || [ "$soft_limit" = "unlimited" ]; then
    echo 0
    return
  fi
  echo $((100 * current_usage / soft_limit))
}

for pid in $(pgrep -f "your_service_name"); do
  usage=$(get_file_usage "$pid")
  if [ "$usage" -gt "$critical_threshold" ]; then
    alert "CRITICAL: Process $pid file usage at ${usage}%"
  elif [ "$usage" -gt "$warning_threshold" ]; then
    alert "WARNING: Process $pid file usage at ${usage}%"
  fi
done

To capture when limits are hit, consider these techniques:

  • Configure auditd to watch for ENFILE/EMFILE errors (see the rule sketch after this list)
  • Parse system logs for "too many open files" messages
  • Implement application-level logging when open()/socket() calls fail
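
As a concrete starting point for the auditd approach, the rule below records openat() calls that fail with EMFILE; the key name and the ENFILE variant are adjustable assumptions:

# Audit openat() calls that return EMFILE (per-process fd limit reached)
sudo auditctl -a always,exit -F arch=b64 -S openat -F exit=-EMFILE -k fd-exhaustion
# Add a matching rule with -F exit=-ENFILE for the system-wide table

# Review the recorded events
sudo ausearch -k fd-exhaustion --start recent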

Once you've confirmed that a process genuinely needs more headroom, limits can be raised at several levels:

# Temporary increase for testing (doesn't persist across reboots)
ulimit -n 65536

# Persistent configuration (system-wide; pam_limits applies it at the next login session)
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf

# Application-specific configuration (systemd)
[Service]
LimitNOFILE=65536
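
To apply a unit change like this, reload systemd, restart the service, and confirm the new limit took effect (your_service is a placeholder):

sudo systemctl daemon-reload
sudo systemctl restart your_service
systemctl show your_service -p LimitNOFILE
cat /proc/$(systemctl show your_service -p MainPID --value)/limits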

For deep visibility, eBPF tools can trace resource usage:

# Trace file opens system-wide
sudo opensnoop-bpfcc

# Trace new process execution (useful when fork/process limits are the problem)
sudo execsnoop-bpfcc
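
When the question is specifically which opens are failing, opensnoop can filter on failed calls; the PID filter is optional:

# Show only failed open() calls for one process
sudo opensnoop-bpfcc -x -p $PID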

Popular monitoring solutions can track these metrics:

  • Prometheus node_exporter's file descriptor metrics (a sample query follows this list)
  • Datadog's system checks
  • New Relic's infrastructure monitoring
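
For the node_exporter route mentioned above, a minimal system-wide query (assuming the default filefd collector is enabled) compares allocated file handles against the kernel maximum:

# Fraction of the kernel's file handle table currently in use
node_filefd_allocated / node_filefd_maximum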

Stepping back to why this matters: every sysadmin and developer has faced the scenario where an application crashes mysteriously, and after hours of debugging it turns out to be hitting system-imposed resource limits. The most common culprits are:

  • Maximum open files (file descriptors)
  • Process memory limits
  • User process limits
  • System-wide resource ceilings

The Linux kernel exposes process-level metrics through the /proc filesystem. Here's how to check current file descriptor usage:

# Check file descriptor count for a specific process
ls -1 /proc/$PID/fd | wc -l

# Alternative using lsof (slower, and counts non-FD entries such as memory-mapped files)
lsof -p $PID | wc -l

For system-wide monitoring, these commands help:

# System-wide file handle usage (allocated, unused, maximum)
cat /proc/sys/fs/file-nr

# Show the current shell's per-process limits
ulimit -a

# Check the kernel-level ceiling
sysctl fs.file-max

For production systems, manual checks won't scale. The standard process_open_fds and process_max_fds metrics are exported by any process instrumented with a Prometheus client library (node_exporter exposes them for its own process), so an alert rule like this triggers when such a process approaches its file descriptor limit:

groups:
- name: resource-limits.rules
  rules:
  - alert: ProcessNearFDLimit
    expr: (process_open_fds / process_max_fds) > 0.8
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.job }} on {{ $labels.instance }} is using {{ $value | humanizePercentage }} of its file descriptors"
      description: "process_open_fds / process_max_fds for {{ $labels.job }} ({{ $labels.instance }}) has stayed above 80% for 15 minutes"
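
Before reloading Prometheus, the rule file can be linted with promtool (the path below is an assumption, and the /-/reload endpoint requires --web.enable-lifecycle):

promtool check rules /etc/prometheus/rules/resource-limits.rules.yml
curl -X POST http://localhost:9090/-/reload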

Configure system logging to capture limit-related events by adding these rules to /etc/rsyslog.conf:

# Capture failed resource allocations
:msg, contains, "Too many open files" /var/log/resource_errors.log
:msg, contains, "fork: retry: Resource temporarily unavailable" /var/log/resource_errors.log
:msg, contains, "Out of memory" /var/log/resource_errors.log

For systemd-based services, use journald filters:

[Unit]
Description=Monitor resource limit violations

[Service]
ExecStart=/bin/sh -c 'journalctl -f | grep --line-buffered -E "Too many open files|fork: retry|Out of memory" >> /var/log/resource_errors.log'
Restart=always

[Install]
WantedBy=multi-user.target
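
Saved under a name of your choosing, for example /etc/systemd/system/resource-limit-monitor.service, the unit is then enabled like any other:

sudo systemctl daemon-reload
sudo systemctl enable --now resource-limit-monitor.service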

When processes legitimately need more resources, the system-wide ceilings may also have to be raised; adjust these kernel parameters in /etc/sysctl.conf:

# Increase system-wide file descriptor limit
fs.file-max = 2097152

# Allow more inotify watches (common for file watchers)
fs.inotify.max_user_watches = 524288

# Adjust epoll limits (important for high-performance servers)
fs.epoll.max_user_watches = 1048576
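
Apply the new values without rebooting and verify them:

sudo sysctl -p
sysctl fs.file-max fs.inotify.max_user_watches fs.epoll.max_user_watches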

Sometimes the solution isn't increasing limits, but fixing resource leaks. Here's a Python script to help identify potential leaks:

import sys
import time
import psutil

def monitor_fd_leak(pid, interval=5, iterations=10):
    """Periodically sample a process's open file descriptor count and report growth."""
    process = psutil.Process(pid)
    baseline = process.num_fds()

    print(f"Baseline FD count: {baseline}")

    for i in range(iterations):
        time.sleep(interval)
        current = process.num_fds()
        print(f"Interval {i+1}: {current} FDs ({current - baseline:+d} change)")

        # Simple heuristic: steady growth over the baseline suggests descriptors
        # are being opened without a matching close
        if current - baseline > 10:
            print("WARNING: Possible file descriptor leak detected!")

if __name__ == "__main__":
    monitor_fd_leak(int(sys.argv[1]))