Understanding and Troubleshooting “Too Many Open Files” Error in Linux: Process-Level FD Limits Explained


Recently, I encountered a puzzling situation where my application was hitting the "too many open files" error despite having generous user-level limits configured. Here's what I found:

# Limit for my current shell (per-process, not system-wide)
$ ulimit -n
100000

# Open files counted across everything the myapp user runs
$ lsof -n -u myapp | wc -l
2708

Linux actually enforces file descriptor limits at multiple levels:

  1. System-wide maximum: Defined in /proc/sys/fs/file-max
  2. User-level limits: Set via /etc/security/limits.conf or pam_limits
  3. Per-process limits: Inherited from the parent process and can be modified via prlimit
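
A quick way to compare the first two levels from a shell, before drilling into the process itself:

# Level 1: system-wide maximum
cat /proc/sys/fs/file-max

# Level 2: soft and hard limits for the current user/shell
ulimit -Sn
ulimit -Hn

# Level 3: the running process itself -- see the two methods below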

To inspect the actual limit for a running process:

# Method 1: Using /proc
$ cat /proc/$(pidof myapp)/limits | grep "Max open files"

# Method 2: Using prlimit
$ prlimit --pid $(pidof myapp) --nofile
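
Both methods report the same pair of soft and hard limits; the values below are only illustrative:

$ prlimit --pid $(pidof -s myapp) --nofile
RESOURCE DESCRIPTION              SOFT HARD UNITS
NOFILE   max number of open files 1024 4096 files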

Several situations can cause a process's limit to be lower than the user-level limit (a short demonstration follows this list):

  • The process was started before the ulimit changes were applied, so it still carries the old inherited values
  • The application calls setrlimit() to self-impose stricter limits
  • The process runs in a container whose runtime applies its own default ulimits
  • All threads of a process share one FD table, so every thread's usage counts against a single per-process limit
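
A minimal way to reproduce the first two bullets, using sleep as a stand-in for the real application:

# Simulate a parent that lowers the limit before starting the app
$ bash -c 'ulimit -n 1024; exec sleep 300' &
# The child keeps the inherited 1024, regardless of what your shell now reports
$ prlimit --pid $! --nofile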

Here are actionable fixes for different scenarios:

# For systemd services (add to service unit file)
[Service]
LimitNOFILE=100000

# For Docker containers
docker run --ulimit nofile=100000:100000 myapp

# For temporary process adjustment
prlimit --pid $(pidof myapp) --nofile=100000:100000
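
Note that the systemd setting only takes effect after the unit is reloaded and the service restarted (myapp.service stands in for your unit name):

systemctl daemon-reload
systemctl restart myapp.service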

When you suspect FD leaks, monitor changes over time:

#!/bin/bash
# Count FDs once per second (ls -1 avoids the extra "total" line that ls -l prints)
while true; do
  ls -1 /proc/$(pidof -s myapp)/fd | wc -l
  sleep 1
done

For high-performance applications, you might need to adjust:

# Increase system-wide maximum
echo 2000000 > /proc/sys/fs/file-max

# For persistent changes, add to /etc/sysctl.conf:
fs.file-max = 2000000
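
# Then load the new setting without a reboot:
sysctl -p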

When you encounter "too many open files" errors despite having high system limits configured, you're dealing with Linux's multi-layered resource control system. The key constraints operate at three levels:

# Per-user limits (applied by pam_limits at login)
/etc/security/limits.conf

# Kernel-enforced ceilings
/proc/sys/fs/file-max   # system-wide maximum across all processes
/proc/sys/fs/nr_open    # the largest value any single process's nofile limit can reach

# Application-specific limits (check your software docs)

To diagnose exactly where your bottleneck occurs:

# Check current process limits
cat /proc/<PID>/limits

# Verify system-wide FD usage
cat /proc/sys/fs/file-nr

# Compare with your application's actual usage (ls -1 avoids the extra "total" line)
ls -1 /proc/<PID>/fd | wc -l
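
file-nr prints three fields: allocated handles, allocated-but-unused handles (effectively 0 on modern kernels), and the system-wide maximum. The numbers below are only illustrative:

$ cat /proc/sys/fs/file-nr
9344    0       2000000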

Even experienced engineers frequently overlook these culprits (a breakdown sketch follows the list):

  • Thread-per-connection models that never close their sockets
  • Leaked descriptors on deleted files, visible via lsof -n -p <PID> | grep -E 'DEL|deleted'
  • Container runtimes that apply their own, lower default limits on top of the host's
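
One way to see what the descriptors actually point to is to resolve every /proc/<PID>/fd symlink and tally the targets; a steadily growing socket count is the classic signature of a socket leak:

# Tally FD targets (socket, pipe, anon_inode, or a file path)
for fd in /proc/<PID>/fd/*; do
  readlink "$fd"
done | sed 's/:.*//' | sort | uniq -c | sort -rn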

For high-performance applications needing thousands of connections:

# Temporary increase (until reboot)
sysctl -w fs.file-max=2000000
sysctl -w fs.nr_open=3000000

# Persistent configuration
echo "fs.file-max = 2000000" >> /etc/sysctl.conf
echo "fs.nr_open = 3000000" >> /etc/sysctl.conf
sysctl -p
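
Raising fs.nr_open only lifts the kernel ceiling; the user running the service must also be allowed to use it. In /etc/security/limits.conf (myuser is a placeholder for the service account):

myuser  soft  nofile  1000000
myuser  hard  nofile  1000000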

Here's how we fixed a Java application hitting FD limits despite high ulimit:

# 1. Take a thread dump and look for stacks that are creating or reading sockets
jstack <PID> | grep -B 2 -A 10 "java.net.Socket"

# 2. Verify with OS tools
lsof -p <PID> -a -iTCP -nP

# 3. Fix in code (Java example)
try (Socket s = new Socket(host, port);
     InputStream is = s.getInputStream()) {
    // socket auto-closed by try-with-resources
}
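
A quick sanity check after step 2: leaked server-side sockets tend to pile up in CLOSE_WAIT, because the peer closed the connection but the application never called close():

lsof -p <PID> -a -iTCP -nP | grep -c CLOSE_WAIT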

For long-running services, implement proactive monitoring:

#!/bin/bash
# Continuous FD usage monitor
# Usage: ./fd_monitor.sh <PID> [warning_threshold]
PID=$1
WARNING_THRESHOLD=${2:-5000}   # default threshold if none is given
while true; do
  fd_count=$(ls -1 "/proc/$PID/fd" | wc -l)
  echo "$(date) - FD count: $fd_count"
  if [ "$fd_count" -gt "$WARNING_THRESHOLD" ]; then
    alert_ops_team "$PID approaching FD limit"   # replace with your own alerting hook
  fi
  sleep 60
done
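
For example, save the script as fd_monitor.sh (the name is arbitrary) and run it against the application's PID with a threshold of 5000:

chmod +x fd_monitor.sh
./fd_monitor.sh "$(pidof -s myapp)" 5000 &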