As modern applications become more complex, we're increasingly encountering resource limit issues in production environments. The "too many open files" error is just the tip of the iceberg - processes can also hit limits on stack size, CPU time, memory locks, and other system resources.
Linux provides several ways to monitor current resource usage:
# View current limits for a process
cat /proc/$PID/limits
# System-wide file handle usage
cat /proc/sys/fs/file-nr
# Per-process file handle count
ls -1 /proc/$PID/fd | wc -l
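To find the heaviest consumers across the whole system, a quick loop over /proc works; this is a minimal sketch (run as root to see every process's descriptors):
# Rank processes by open file descriptor count (top 10)
for p in /proc/[0-9]*; do
    printf '%s %s %s\n' "$(ls "$p/fd" 2>/dev/null | wc -l)" "${p#/proc/}" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head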
For reliable monitoring, consider these approaches:
#!/bin/bash
# Sample monitoring script
warning_threshold=80
critical_threshold=90

# Replace with your real alerting hook (mail, PagerDuty, etc.)
alert() {
    logger -t fd-monitor "$1"
}

get_file_usage() {
    local pid=$1
    local soft_limit current_usage
    soft_limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    current_usage=$(ls -1 "/proc/$pid/fd" 2>/dev/null | wc -l)
    # Treat a vanished process or unreadable limit as 0% usage
    case $soft_limit in
        ''|*[!0-9]*) echo 0; return ;;
    esac
    echo $((100 * current_usage / soft_limit))
}

for pid in $(pgrep -f "your_service_name"); do
    usage=$(get_file_usage "$pid")
    if [ "$usage" -gt "$critical_threshold" ]; then
        alert "CRITICAL: Process $pid file usage at ${usage}%"
    elif [ "$usage" -gt "$warning_threshold" ]; then
        alert "WARNING: Process $pid file usage at ${usage}%"
    fi
done
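One simple way to run this continuously is a cron entry; the script path here is just a placeholder:
# Run the check every 5 minutes (edit with: crontab -e)
*/5 * * * * /usr/local/bin/check_fd_usage.sh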
To capture when limits are hit, consider these techniques:
- Configure auditd to watch for ENFILE/EMFILE errors (see the sketch after this list)
- Parse system logs for "too many open files" messages
- Implement application-level logging when open()/socket() calls fail
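For the auditd route, rules along these lines should capture syscalls that fail with EMFILE or ENFILE; treat this as a sketch (the syscall list and the fd_limit key are assumptions to adapt to your workload):
# Record open/socket calls that fail because a descriptor limit was hit
auditctl -a always,exit -F arch=b64 -S open,openat,socket -F exit=-EMFILE -k fd_limit
auditctl -a always,exit -F arch=b64 -S open,openat,socket -F exit=-ENFILE -k fd_limit
# Review matching events later
ausearch -k fd_limit --start today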
Once you know which processes are affected, you can raise the relevant limits:
# Temporary increase for testing (applies to the current shell, doesn't persist across reboots)
ulimit -n 65536
# Persistent configuration (system-wide; takes effect at the next login,
# and the "*" wildcard does not apply to the root user)
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf
# Application-specific configuration (systemd)
[Service]
LimitNOFILE=65536
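After adding the override (for example via systemctl edit), reload and restart, then confirm the limit took effect; your_service is a placeholder name:
sudo systemctl daemon-reload
sudo systemctl restart your_service
# Configured value vs. what the running process actually got
systemctl show your_service -p LimitNOFILE
grep "open files" /proc/"$(systemctl show -p MainPID --value your_service)"/limits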
For deep visibility, eBPF tools can trace resource usage:
# Trace file opens system-wide
sudo opensnoop-bpfcc
# Trace new process execution (execve) system-wide
sudo execsnoop-bpfcc
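If bpftrace is available, you can go a step further and catch the exact moment a descriptor limit is hit; a minimal sketch that watches openat() returning EMFILE (errno 24):
sudo bpftrace -e 'tracepoint:syscalls:sys_exit_openat /args->ret == -24/ { printf("%s (pid %d) hit EMFILE\n", comm, pid); }'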
Popular monitoring solutions can track these metrics:
- Prometheus node_exporter's file descriptor metrics (see the quick check after this list)
- Datadog's system checks
- New Relic's infrastructure monitoring
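As a quick sanity check that the file descriptor metrics are actually being exported, scrape node_exporter directly (assuming the default port 9100):
# System-wide and per-exporter-process file descriptor gauges
curl -s http://localhost:9100/metrics | grep -E '^(node_filefd_|process_open_fds|process_max_fds)'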
Every sysadmin and developer has faced this scenario: your application crashes mysteriously, and after hours of debugging, you discover it's hitting system-imposed resource limits. The most common culprits, each of which can be checked from the shell as shown after this list, are:
- Maximum open files (file descriptors)
- Process memory limits
- User process limits
- System-wide resource ceilings
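Each of these can be checked quickly from a shell (ulimit values reflect the current session):
# Per-process limits for the current shell
ulimit -n   # max open files
ulimit -u   # max user processes
ulimit -v   # max virtual memory (KB)
# System-wide ceilings
cat /proc/sys/fs/file-max          # total file handles
cat /proc/sys/kernel/threads-max   # total threads
cat /proc/sys/kernel/pid_max       # highest PID (caps process count)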
The Linux kernel exposes process-level metrics through the /proc filesystem. Here's how to check current file descriptor usage:
# Check file descriptor count for a specific process
ls -1 /proc/$PID/fd | wc -l
# Alternative using lsof (may be slower)
lsof -p $PID | wc -l
For system-wide monitoring, these commands help:
# View current system limits
cat /proc/sys/fs/file-nr
# Show user limits
ulimit -a
# Check kernel-level limits
sysctl -a | grep fs.file-max
For production systems, manual checks won't scale. Here's a Prometheus alert rule that triggers when a process approaches its file descriptor limit:
groups:
  - name: resource-limits.rules
    rules:
      - alert: ProcessNearFDLimit
        expr: (process_open_fds{job="node_exporter"} / process_max_fds{job="node_exporter"}) > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Process {{ $labels.instance }} is using {{ printf \"%.0f\" (100 * $value) }}% of its file descriptors"
          description: "Process {{ $labels.pid }} ({{ $labels.process }}) is using {{ $value | humanizePercentage }} of its available file descriptors"
Configure system logging to capture limit-related events by adding these rules to /etc/rsyslog.conf:
# Capture failed resource allocations
:msg, contains, "Too many open files"                            /var/log/resource_errors.log
:msg, contains, "fork: retry: Resource temporarily unavailable"  /var/log/resource_errors.log
:msg, contains, "Out of memory"                                  /var/log/resource_errors.log
For systemd-based services, use journald filters:
[Unit]
Description=Monitor resource limit violations

[Service]
ExecStart=/bin/sh -c 'journalctl -f | grep --line-buffered -E "Too many open files|fork: retry|Out of memory" >> /var/log/resource_errors.log'
Restart=always
When you identify processes that need higher limits, adjust these kernel parameters in /etc/sysctl.conf:
# Increase system-wide file descriptor limit
fs.file-max = 2097152
# Allow more inotify watches (common for file watchers)
fs.inotify.max_user_watches = 524288
# Adjust epoll limits (important for high-performance servers)
fs.epoll.max_user_watches = 1048576
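After saving, reload and verify; the new values apply immediately without a reboot:
sudo sysctl -p          # reload /etc/sysctl.conf
sysctl fs.file-max      # confirm the new ceiling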
Sometimes the solution isn't increasing limits, but fixing resource leaks. Here's a Python script to help identify potential leaks:
import os
import sys
import time

import psutil


def monitor_fd_leak(pid, interval=5, iterations=10):
    process = psutil.Process(pid)
    baseline = process.num_fds()
    print(f"Baseline FD count: {baseline}")
    for i in range(iterations):
        time.sleep(interval)
        current = process.num_fds()
        print(f"Interval {i+1}: {current} FDs ({current - baseline} change)")
        if current - baseline > 10:  # Threshold for leak detection
            print("WARNING: Possible file descriptor leak detected!")


if __name__ == "__main__":
    # Monitor the PID given on the command line, or this script's own process
    monitor_fd_leak(int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid())
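Saved as, say, fd_leak_monitor.py (the filename is arbitrary) and with psutil installed, point it at a suspect process:
pip install psutil
# Watch PID 1234 for ten 5-second intervals
python3 fd_leak_monitor.py 1234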