How to Diagnose and Troubleshoot OOM Killer Issues on Linux VPS Running Web/Database Services


When your Linux system faces severe memory pressure, the Out-Of-Memory (OOM) killer activates to prevent complete system failure by terminating selected processes. To review recent incidents, start with the kernel logs:


# Check OOM killer events in kernel logs
dmesg | grep -i "oom-killer"
dmesg | grep -i "killed process"

# Alternative log locations
grep -i "oom-killer" /var/log/messages
journalctl -k --grep="oom-killer"
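
If you already know roughly when the incident happened, narrowing the search to that window keeps the output manageable; a small sketch, with the relative time as a placeholder:

# Limit kernel messages to the suspected incident window
journalctl -k --since "1 hour ago" | grep -iE "out of memory|killed process"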

The first process killed isn't necessarily the root cause. Use these tools to investigate memory usage patterns:


# Real-time memory monitoring
vmstat -SM 1 10
free -mh

# Process-level memory analysis
ps aux --sort=-%mem | head -n 15
top -b -o +%MEM -n 1 | head -n 20
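
Heavy swapping often precedes an OOM kill, so it is worth checking which processes have been pushed to swap; a quick sketch, assuming swap is configured at all:

# Per-process swap usage in kB, largest first (the /proc/<pid> path identifies the process)
grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -rn | head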

For production servers, we need deeper diagnostics:


# Install and run smem for detailed reporting
yum install smem -y
smem -t -k -p | grep -E "www|mysql|postfix"

# Check slab memory usage
grep -E "Slab|SReclaimable|SUnreclaim" /proc/meminfo

# Analyze a specific process's memory map, sorted by RSS (replace <PID> with the target process ID)
pmap -x <PID> | sort -n -k3
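
On systemd hosts a per-service view is often more telling than a per-process one; a sketch using systemd-cgtop, assuming memory accounting is enabled for the units:

# One snapshot of control groups ordered by memory; services appear under system.slice
systemd-cgtop -m -n 1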

Implement these proactive measures:


# Configure OOM killer adjustments (replace <PID> with the process to protect)
echo -17 > /proc/<PID>/oom_adj       # -17 disables OOM killing for that process
sysctl -w vm.overcommit_memory=2     # More conservative allocation

# MySQL-specific tuning (example)
[mysqld]
performance_schema=ON
innodb_buffer_pool_size = 256M      # Adjust based on available RAM
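
The echo into /proc above is lost as soon as the process restarts. If the service runs under systemd, a drop-in keeps the adjustment persistent; a minimal sketch, with mysqld.service as an assumed unit name:

# /etc/systemd/system/mysqld.service.d/oom.conf (assumed drop-in path)
[Service]
OOMScoreAdjust=-500

# Reload unit files and restart the service to apply
systemctl daemon-reload
systemctl restart mysqld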

Automate incident documentation with this script:


#!/bin/bash
LOG_FILE="/var/log/oom_analysis_$(date +%Y%m%d).log"
{
  echo "===== OOM Killer Analysis Report ====="
  date
  echo -e "\nMemory Status:"
  free -m
  echo -e "\nRecent OOM Events:"
  dmesg | grep -i "oom-killer" | tail -n 10
  echo -e "\nTop Memory Consumers:"
  ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -n 15
} > "$LOG_FILE"
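
Schedule the collector so a report exists even if you cannot log in during the incident; an example cron entry, with the script path as an assumption:

# Collect an OOM analysis snapshot every 15 minutes
*/15 * * * * /usr/local/bin/oom_report.sh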

Last Tuesday at 3:47 AM, my monitoring system alerted me that both Apache and SSH had become unresponsive on our production VPS. The smoking gun was in /var/log/messages:

Jan 14 03:47:12 vps01 kernel: Out of memory: Kill process 2156 (httpd) score 887 or sacrifice child
Jan 14 03:47:12 vps01 kernel: Killed process 2156, UID 48, (httpd) total-vm:245728kB, anon-rss:142892kB, file-rss:428kB

First, reconstruct the memory state before the OOM killer struck. /var/log/messages is a gold mine:

grep -i 'out of memory' /var/log/messages
grep -i 'killed process' /var/log/messages | awk -F'(' '{print $2}' | awk -F')' '{print $1}' | sort | uniq -c

The second command reveals frequent victims - in my case, httpd and mysqld kept appearing.
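
If the sysstat package is installed, sar can also replay memory utilization in the minutes before the kill; a sketch where sa14 matches the 14th of the month from the log above and the time window is an assumption:

# Memory utilization samples recorded by sysstat around the incident
sar -r -f /var/log/sa/sa14 -s 03:00:00 -e 04:00:00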

Run dmesg -T | grep -i oom to see kernel-level OOM events with human-readable timestamps. For a more detailed post-mortem:

# Install the crash utility (also needs the kernel-debuginfo package matching the running kernel)
yum install crash -y

# Analyze a captured vmcore (only available if kdump was configured before the incident)
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/
crash> kmem -i    # kernel memory usage summary
crash> log        # kernel log buffer captured in the dump

The OOM killer ranks every process with a badness score and kills the highest scorer first. To see which processes are currently the top candidates:

# View OOM score of running processes, highest (most likely to be killed) first
for f in /proc/[0-9]*/oom_score; do
  pid=${f#/proc/}
  pid=${pid%/oom_score}
  echo "$(cat "$f" 2>/dev/null) $(ps -p "$pid" -o comm= 2>/dev/null)"
done | sort -nr | head

In my case, this revealed PHP-FPM processes consuming abnormal memory after a WordPress plugin update.
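
When a PHP-FPM pool is the runaway consumer, capping the pool usually beats trying to shield everything else; a sketch of the relevant pool settings, where the path and values are assumptions to size against your RAM and average worker footprint:

; /etc/php-fpm.d/www.conf (assumed pool file location)
pm = dynamic
pm.max_children = 10                     ; hard ceiling on workers, and therefore on pool memory
php_admin_value[memory_limit] = 128M     ; per-request limit enforced by PHP itself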

For critical services like SSH, lower their OOM score directly and add conservative defaults in /etc/sysctl.conf:

# Protect every sshd process from the OOM killer (-17 disables OOM killing)
for pid in $(pgrep -x sshd); do echo -17 > /proc/$pid/oom_adj; done

# System-wide config
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.panic_on_oom = 0
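
Apply the sysctl changes, and note that current kernels expose oom_score_adj (range -1000 to 1000) instead of the older oom_adj; writing -1000 exempts a process completely:

# Reload settings from /etc/sysctl.conf
sysctl -p

# Modern equivalent of oom_adj -17 for the main sshd process (pgrep -o picks the oldest PID)
echo -1000 > /proc/$(pgrep -o -x sshd)/oom_score_adj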

Now I use this cron job to log memory trends every 5 minutes:

*/5 * * * * echo $(date +\%s) $(free -m | awk '/Mem:/ {print $3,$4,$7}') >> /var/log/mem.log
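
The log makes low points easy to spot later; a quick filter where the 200 MB threshold is arbitrary and the field positions match the free output captured above:

# Print samples where available memory (4th field) dropped below 200 MB
awk '$4 < 200' /var/log/mem.log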

Pair it with an alert on the same data; the rule below uses Prometheus syntax (the expression works equally well in a Grafana alert) and fires when available memory stays below 10% for five minutes:

- alert: LowMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low memory on {{ $labels.instance }}"