Monitoring and Alerting for Process Resource Limits: Open Files, Stack Size, and More


As modern applications become more complex, we're increasingly encountering resource limit issues in production environments. The "too many open files" error is just the tip of the iceberg - processes can also hit limits on stack size, CPU time, memory locks, and other system resources.
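
All of these are per-process rlimits with a soft and a hard value that a process can inspect at runtime. As a quick illustration, here is a small Python sketch using the standard-library resource module to print a few of the limits mentioned above for the current process (the selection of limits is just an example):

import resource

# A few of the per-process limits discussed above (not an exhaustive list)
LIMITS = {
    "open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
    "stack size (RLIMIT_STACK)": resource.RLIMIT_STACK,
    "CPU time (RLIMIT_CPU)": resource.RLIMIT_CPU,
    "locked memory (RLIMIT_MEMLOCK)": resource.RLIMIT_MEMLOCK,
}

def fmt(value):
    # RLIM_INFINITY means "unlimited" for this resource
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

for name, res in LIMITS.items():
    soft, hard = resource.getrlimit(res)
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")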

Linux provides several ways to monitor current resource usage:

# View current limits for a process
cat /proc/$PID/limits

# System-wide file handle usage
cat /proc/sys/fs/file-nr

# Per-process file handle count
ls -1 /proc/$PID/fd | wc -l
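
If you would rather consume these numbers from code, both checks are easy to reproduce programmatically. The following is a rough Python sketch of the same idea (it assumes a Linux /proc layout; reading another user's /proc/<pid>/fd generally requires root):

import os

def fd_limit(pid):
    """Return the (soft, hard) 'Max open files' values for a PID from /proc."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                soft, hard = line.split()[3:5]  # the same fields awk would call $4 and $5
                return soft, hard
    raise RuntimeError("'Max open files' line not found")

def fd_count(pid):
    """Count entries in /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))

pid = os.getpid()  # replace with the PID you actually want to inspect
soft, hard = fd_limit(pid)
print(f"pid {pid}: {fd_count(pid)} open fds (soft limit {soft}, hard limit {hard})")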

For reliable monitoring, wrap these checks in a small script that raises an alert as usage approaches the limit:

#!/bin/bash
# Sample monitoring script: alert when a service's fd usage nears its soft limit
warning_threshold=80
critical_threshold=90

# Replace with your real notification mechanism (mail, webhook, pager, ...)
alert() {
  logger -t fd-monitor "$1"
}

get_file_usage() {
  local pid=$1
  local soft_limit current_usage
  soft_limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
  current_usage=$(ls -1 "/proc/$pid/fd" 2>/dev/null | wc -l)
  case "$soft_limit" in ''|unlimited) return 1 ;; esac   # process gone or unreadable
  echo $((100 * current_usage / soft_limit))
}

for pid in $(pgrep -f "your_service_name"); do
  usage=$(get_file_usage "$pid") || continue
  if [ "$usage" -gt "$critical_threshold" ]; then
    alert "CRITICAL: Process $pid file usage at ${usage}%"
  elif [ "$usage" -gt "$warning_threshold" ]; then
    alert "WARNING: Process $pid file usage at ${usage}%"
  fi
done
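
A script like this is most useful when it runs on a schedule, for example from cron every minute or from a systemd timer, with alert wired up to whatever notification path you already use (mail, a webhook, a pager).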

To capture when limits are hit, consider these techniques:

  • Configure auditd to watch for ENFILE/EMFILE errors
  • Parse system logs for "too many open files" messages
  • Implement application-level logging when open()/socket() calls fail (see the sketch below)
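
For the last point, here is a minimal Python sketch of what application-level handling might look like; it watches for the EMFILE and ENFILE errno values raised when open() or socket calls fail because descriptor limits are exhausted (the wrapper names are just placeholders):

import errno
import logging
import socket

log = logging.getLogger("fd-limits")

def open_logged(path, mode="r"):
    """open() wrapper that logs descriptor exhaustion instead of failing silently."""
    try:
        return open(path, mode)
    except OSError as exc:
        if exc.errno in (errno.EMFILE, errno.ENFILE):
            log.error("FD limit hit while opening %s: %s", path, exc)
        raise

def connect_logged(host, port):
    """Socket-connect wrapper with the same EMFILE/ENFILE logging."""
    try:
        return socket.create_connection((host, port), timeout=5)
    except OSError as exc:
        if exc.errno in (errno.EMFILE, errno.ENFILE):
            log.error("FD limit hit while connecting to %s:%s: %s", host, port, exc)
        raise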

Once you have confirmed that a process legitimately needs more headroom, raise the limits at the appropriate level:

# Temporary increase for the current shell and its children (doesn't persist across reboots)
ulimit -n 65536

# Persistent configuration (system-wide; run as root, takes effect for new login sessions)
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf

# Application-specific configuration (systemd)
[Service]
LimitNOFILE=65536
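
After adding or changing LimitNOFILE= (for example in a drop-in created with systemctl edit your_service), run systemctl daemon-reload and restart the service so the new limit takes effect.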

For deep visibility, eBPF tools can trace resource usage:

# Trace file opens system-wide
sudo opensnoop-bpfcc

# Trace new process execution (useful for correlating spikes with short-lived processes)
sudo execsnoop-bpfcc
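
If you need something more targeted than the stock tools, the BCC Python bindings let you write a small purpose-built tracer. The sketch below is illustrative only and assumes the python3-bpfcc bindings, a kprobe-capable kernel, and root privileges: it attaches a kretprobe to the openat syscall and counts, per PID, how often the call fails with EMFILE (errno 24) or ENFILE (errno 23):

import time
from bcc import BPF  # assumes the BCC Python bindings are installed

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(fd_failures, u32, u64);

int trace_openat_ret(struct pt_regs *ctx) {
    int ret = PT_REGS_RC(ctx);
    if (ret != -24 && ret != -23)   /* -EMFILE, -ENFILE */
        return 0;
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    fd_failures.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kretprobe(event=b.get_syscall_fnname("openat"), fn_name="trace_openat_ret")

print("Counting openat() failures caused by fd limits... Ctrl-C to stop.")
try:
    while True:
        time.sleep(5)
        for pid, count in b["fd_failures"].items():
            print(f"pid {pid.value}: {count.value} EMFILE/ENFILE failures")
except KeyboardInterrupt:
    pass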

Popular monitoring solutions can track these metrics:

  • Prometheus node_exporter's file descriptor metrics (node_filefd_allocated / node_filefd_maximum)
  • Datadog's system checks
  • New Relic's infrastructure monitoring

Why does this matter so much? Every sysadmin and developer has faced the scenario: your application crashes mysteriously, and after hours of debugging you discover it is hitting system-imposed resource limits. The most common culprits are:

  • Maximum open files (file descriptors)
  • Process memory limits
  • User process limits
  • System-wide resource ceilings

The Linux kernel exposes process-level metrics through the /proc filesystem. Here's how to check current file descriptor usage:

# Check file descriptor count for a specific process
ls -1 /proc/$PID/fd | wc -l

# Alternative using lsof (slower, and also counts memory maps, cwd, and other non-fd entries)
lsof -p $PID | wc -l

For system-wide monitoring, these commands help:

# System-wide file handle usage (allocated, free, maximum)
cat /proc/sys/fs/file-nr

# Show the current shell's limits
ulimit -a

# Check the kernel-level ceiling
sysctl fs.file-max
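
Those system-wide numbers are easy to turn into a single utilization figure. A minimal Python sketch reading the same /proc/sys/fs/file-nr file as above:

def system_fd_usage():
    """Return (allocated, maximum, percent_used) from /proc/sys/fs/file-nr."""
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, maximum = (int(x) for x in f.read().split())
    return allocated, maximum, 100.0 * allocated / maximum

allocated, maximum, pct = system_fd_usage()
print(f"system-wide file handles: {allocated}/{maximum} ({pct:.1f}%)")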

For production systems, manual checks won't scale. If a service exposes the standard process_open_fds and process_max_fds metrics (any process instrumented with an official Prometheus client library does), an alert rule like this fires when it approaches its file descriptor limit:

groups:
- name: resource-limits.rules
  rules:
  - alert: ProcessNearFDLimit
    expr: (process_open_fds / process_max_fds) > 0.8
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.job }} on {{ $labels.instance }} is using {{ $value | humanizePercentage }} of its file descriptors"
      description: "The {{ $labels.job }} process on {{ $labels.instance }} has stayed above 80% of its file descriptor limit for 15 minutes."
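
If your own services are instrumented with a Prometheus client library, you usually get these metrics for free: on Linux the official clients register a process collector that exposes process_open_fds and process_max_fds for the exporting process itself. A minimal Python sketch (the port number is arbitrary):

import time
from prometheus_client import start_http_server

# The default registry already includes a process collector on Linux, so
# /metrics on port 8000 will expose process_open_fds and process_max_fds
# for this process alongside any application metrics you define.
start_http_server(8000)

while True:
    time.sleep(60)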

Configure system logging to capture limit-related events by adding these rules to /etc/rsyslog.conf:

# Capture failed resource allocations
:msg, contains, "Too many open files" /var/log/resource_errors.log
:msg, contains, "fork: retry: Resource temporarily unavailable" /var/log/resource_errors.log
:msg, contains, "Out of memory" /var/log/resource_errors.log

For systemd-based services, use journald filters:

[Unit]
Description=Monitor resource limit violations

[Service]
ExecStart=/bin/sh -c 'journalctl -f | grep --line-buffered -E "Too many open files|fork: retry|Out of memory" >> /var/log/resource_errors.log'
Restart=always

[Install]
WantedBy=multi-user.target

When you identify processes that need higher limits, adjust these kernel parameters in /etc/sysctl.conf:

# Increase system-wide file descriptor limit
fs.file-max = 2097152

# Allow more inotify watches (common for file watchers)
fs.inotify.max_user_watches = 524288

# Adjust epoll limits (important for high-performance servers)
fs.epoll.max_user_watches = 1048576
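
Apply the changes with sysctl -p (or sysctl --system if you keep them in a file under /etc/sysctl.d/) and verify the result with sysctl fs.file-max.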

Sometimes the solution isn't increasing limits, but fixing resource leaks. Here's a Python script to help identify potential leaks:

import sys
import time
import psutil  # third-party: pip install psutil

def monitor_fd_leak(pid, interval=5, iterations=10):
    process = psutil.Process(pid)
    baseline = process.num_fds()

    print(f"Baseline FD count: {baseline}")

    for i in range(iterations):
        time.sleep(interval)
        current = process.num_fds()
        print(f"Interval {i+1}: {current} FDs ({current - baseline:+d} change)")

        if current - baseline > 10:  # simple threshold for leak detection
            print("WARNING: Possible file descriptor leak detected!")

if __name__ == "__main__":
    monitor_fd_leak(int(sys.argv[1]))