When facing an OOM situation where 99% of 64GB RAM is consumed (kmem -i
output shows 62.7GB used), the first step is identifying allocation patterns:
crash> kmem -s
CACHE            NAME         OBJSIZE  ALLOCATED    TOTAL  SLABS  SSIZE
ffff88083fc0e800 dentry           192    1165216  1203200   9400     8k
ffff88083fc0e000 size-4096       4096      32768    32768    256    32k
ffff88083f413800 task_struct     2960      30240    30240    210    16k
ffff88083fc0d800 size-8192       8192      16384    16384    128    32k
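Raw object counts alone can mislead; weighting each cache by its object size shows where the bytes actually went. The one-liner below is a sketch that assumes the column layout shown above (NAME in the second column, OBJSIZE and ALLOCATED in the third and fourth); newer crash releases print NAME last, so adjust the field numbers if needed.
# Rank slab caches by approximate bytes consumed (OBJSIZE * ALLOCATED)
crash> kmem -s | awk 'NR>1 {printf "%-24s %10.1f MB\n", $2, $3*$4/1048576}' | sort -k2 -nr | head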
The system log reveals the OOM killer's victims over 30 seconds. This timeline helps reconstruct memory pressure buildup:
crash> log -m | grep -A5 "Out of memory"
[ 223.556616] Out of memory: Kill process 3189 (portreserve) score 1
[ 223.787234] Out of memory: Kill process 3196 (rsyslogd) score 1
[ 224.237119] Out of memory: Kill process 3728 (dbus-daemon) score 1
...
[ 252.603324] Out of memory: Kill process 4855 (cmfileassistd) score 1
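Since crash pipes command output through the shell, the kill cadence is easy to quantify without leaving the session:
# Count the OOM kills and bracket the first and last victims
crash> log | grep -c "Out of memory"
crash> log | grep "Out of memory" | head -1
crash> log | grep "Out of memory" | tail -1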
Standard ps output shows minimal userland memory usage (about 0.0039 GB of RSS in total), so the focus shifts to kernel threads:
crash> bt -a
PID: 4925 TASK: ffff880828a38ae0 CPU: 5 COMMAND: "kworker/u:3"
#0 [ffff8808279e7c38] schedule at ffffffff814f8a3c
#1 [ffff8808279e7cc0] worker_thread at ffffffff8108d7b6
#2 [ffff8808279e7d60] kthread at ffffffff8108f0b6
#3 [ffff8808279e7ea0] ret_from_fork at ffffffff8140b30c
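Two quick follow-ups narrow things down: summing the RSS column of ps confirms that userspace really is innocent, and a pattern match over the backtraces highlights threads caught inside allocation or reclaim. Both sketches assume this crash version's ps layout (RSS in column 8, in KB) and 2.6.32-era function names; adjust the field number and patterns for your kernel.
# Confirm userland RSS is negligible (strip the ">" that marks active tasks)
crash> ps | sed 's/^>//' | awk 'NR>1 {sum+=$8} END {printf "total userland RSS: %.4f GB\n", sum/1048576}'
# Highlight kernel threads stuck in allocation or reclaim paths
crash> bt -a | grep -B4 -E "alloc_pages|try_to_free_pages|shrink_|out_of_memory"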
For suspected DRBD memory leaks, examine module allocations:
crash> mod -S drbd
MODULE NAME SIZE OBJECTS
ffffffffa01a6000 drbd 217344 3384
crash> sym drbd_alloc_pages
ffffffffa019a3c0 (t) drbd_alloc_pages
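If the debuginfo was picked up correctly, the allocator itself can be examined; a failure here usually means crash loaded the wrong module object. Both commands below rely only on symbols already shown above.
# Sanity-check the symbol and peek at the allocation routine
crash> whatis drbd_alloc_pages
crash> dis -l drbd_alloc_pages | head -20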
Detailed slab analysis reveals potential culprits:
crash> kmem -s | grep -E "NAME|drbd"
CACHE            NAME              OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff88083fc0a000 drbd_request         1032      30240  30240    210    16k
ffff88083fc0a800 drbd_peer_device     1088      32768  32768    256    32k
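To drill from cache-level counts down to individual objects (whose addresses can then be fed to struct), the verbose form of kmem helps; drbd_request is the cache name taken from the listing above.
# Walk the slabs and objects of the drbd_request cache
crash> kmem -S drbd_request | head -30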
For bonding/network module issues, inspect socket buffers:
crash> net -s
Family  Protocol       RX       TX    Total
IPv4    TCP       1048576  2097152  3145728
IPv4    UDP        524288   524288  1048576
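If sk_buff leakage is suspected, the socket-buffer slab caches give a quicker signal than per-socket accounting; the cache names vary slightly between kernel versions, so a substring match is used here.
# Check the sk_buff slab caches directly (header line kept via "NAME")
crash> kmem -s | grep -E "NAME|skbuff"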
Reconstruct the final memory state before panic:
crash> vm -p
PID: 5079 TASK: ffff88082b882ae0 COMMAND: "bash"
      MM                PGD           RSS    TOTAL_VM
ffff88082a8d8000  ffff88082b0e3000   1324k     19348k
      VMA            START      END      FLAGS  FILE
ffff88082a8d8158  00400000  004f1000   8000875  /bin/bash
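crash's foreach can repeat the same view for every task, which makes it easy to verify that no process was quietly holding a large RSS; the output is large, so redirect it to a file for offline grepping.
# Run vm for every task and save the result
crash> foreach vm > vm_all.txt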
The server's demise began with a cascade of 39 OOM kill messages before it finally succumbed to the dreaded "Kernel panic - not syncing: Out of memory". What makes this case particularly interesting is that standard process monitoring tools showed no obvious culprits in userspace: the memory hemorrhage appeared to originate from kernel territory.
Let's start with the essential crash utility commands that reveal the smoking gun:
# First, examine overall memory status
crash> kmem -i
# Then check slab allocations (often reveals kernel module leaks)
crash> kmem -s
# For DRBD/bonding module analysis
crash> mod -S drbd
crash> mod -S bonding
# Show kernel memory zones
crash> kmem -z
# Detailed slab cache inspection
crash> kmem -S
The real goldmine comes from analyzing slab allocations. Here's what to look for:
crash> kmem -s
CACHE             OBJSIZE  ALLOCATED    TOTAL  SLABS  SSIZE  NAME
ffff88083e3d3800       32    1733712  1734400    677     8k  kmalloc-32
ffff88083e3d3400       64     987632   988800   1236     8k  kmalloc-64
ffff88083e3d4000      256     512000   512000    500     8k  kmalloc-256
Compare these numbers against baseline values from a healthy system. Spikes in kmalloc-256 might indicate DRBD buffer issues, while kmalloc-32 could point to network subsystem problems.
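One way to make that comparison concrete is to diff the allocated-object counts against a baseline captured the same way on a healthy peer. The sketch below assumes both files came from kmem -s redirected to a file, with ALLOCATED in the third column and NAME printed last as in the listing above; the file names are illustrative.
# Flag caches whose allocated count more than doubled versus the baseline
awk 'NR>1 {print $NF, $3}' kmem_baseline.txt | sort > base.cnt
awk 'NR>1 {print $NF, $3}' kmem_slab.txt | sort > now.cnt
join base.cnt now.cnt | awk '$3 > 2*$2 {print $1": "$2" -> "$3}'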
Given the network interface manipulation preceding the crash, these commands prove invaluable:
# List registered network devices
crash> net
# Show the sockets held open by task N
crash> net -s N
# List task N's open file descriptors; sockets appear alongside regular files
crash> files N
# Review per-disk I/O request counts (useful when DRBD backs local block devices)
crash> dev -d
# Identify the Mellanox adapters (mlx4/mlx5; Mellanox's PCI vendor ID is 15b3)
crash> pci | grep -iE "mellanox|15b3"
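Once a particular interface looks suspicious, its net_device structure can be dumped directly; the address comes from the net command's device list, and the one below is a placeholder.
# Dump selected members of a specific net_device
crash> struct net_device.name,mtu,num_tx_queues <net_device address>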
For DRBD-related memory analysis, crash has no dedicated subcommand, so the module's structures are examined with the generic facilities (struct names below assume DRBD 8.4 or later; the addresses are placeholders taken from slab listings or global symbols):
# Load DRBD's debug symbols
crash> mod -S drbd
# List the module's symbols to locate globals and entry points
crash> sym -m drbd
# Dump a device object once its address is known
crash> struct drbd_device <address>
# Examine the state of a peer connection
crash> struct drbd_connection <address>
For large memory dumps, automation is key. Here's a bash script that drives crash in batch mode and distills the output into a report:
#!/bin/bash
# Drive crash in batch mode against a vmcore and summarize the results.
# VMLINUX assumes the dump came from the kernel currently running on this host;
# point it at the matching debuginfo vmlinux otherwise.
CRASH_BIN="/usr/bin/crash"
VMLINUX="/usr/lib/debug/lib/modules/$(uname -r)/vmlinux"
COREFILE="$1"

analyze_memory() {
"$CRASH_BIN" "$VMLINUX" "$COREFILE" <<-EOF
set pagination off
kmem -i > kmem_info.txt
kmem -s > kmem_slab.txt
ps -u > user_processes.txt
mod > loaded_modules.txt
bt -a > backtraces.txt
log > kernel_log.txt
exit
EOF
}

generate_report() {
echo "### Memory Analysis Report ###"
# NOTE: kmem -s column order differs between crash versions; the field numbers
# below assume ALLOCATED is the third column (NAME printed last).
echo "Largest general-purpose slab caches (kmalloc-* on SLUB, size-* on SLAB):"
grep -E "kmalloc|size-" kmem_slab.txt | sort -k3 -nr | head -20
echo -e "\nCaches with more than one million allocated objects:"
awk '$3 > 1000000 {print}' kmem_slab.txt
echo -e "\nFinal user processes:"
cat user_processes.txt
echo -e "\nOOM killer activity from the kernel log:"
grep -A10 "Out of memory" kernel_log.txt
}

analyze_memory
generate_report > "oom_analysis_$(date +%Y%m%d).txt"
Watch for these telltale signs in your analysis:
- Unusually large slab caches associated with specific modules (a rough filter is sketched after this list)
- Growing allocations between consecutive OOM events
- Network-related structures (sk_buff) consuming excessive memory
- DRBD buffer counts exceeding normal operational thresholds
- Kernel thread backtraces caught inside page-allocation or reclaim paths
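A rough filter for the first two signs, run against the kmem_slab.txt produced by the script above (column numbers again assume ALLOCATED third and NAME last):
# Flag caches that are both large and nearly full (allocated close to total)
awk 'NR>1 && $3 > 100000 && $4 > 0 && $3/$4 > 0.95 {print $NF": "$3" of "$4}' kmem_slab.txt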
Based on this post-mortem, implement these safeguards:
# Add these to /etc/sysctl.conf
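# vm.panic_on_oom=1 panics the box instead of letting the OOM killer flail
# (so kdump can capture a vmcore), oom_kill_allocating_task targets the task
# that triggered the failing allocation, and kernel.panic=10 reboots the host
# 10 seconds after any panic.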
vm.panic_on_oom=1
vm.oom_kill_allocating_task=1
kernel.panic=10
# Network and DRBD-related tuning
# net.ipv4.tcp_mem takes three page counts: min, pressure, max
echo "<min> <pressure> <max>" > /proc/sys/net/ipv4/tcp_mem
# Cap request size on the DRBD device
echo 256 > /sys/block/drbd0/queue/max_sectors_kb
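After applying the changes (sysctl -p for the sysctl.conf entries), a quick read-back confirms they took effect:
# Verify the new settings
sysctl vm.panic_on_oom vm.oom_kill_allocating_task kernel.panic
cat /proc/sys/net/ipv4/tcp_mem
cat /sys/block/drbd0/queue/max_sectors_kb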