Monitoring and Analyzing Disk Thrashing & Virtual Memory Performance in Linux Systems

When diagnosing memory pressure on Linux systems, several key metrics reveal thrashing behavior:

# Basic vmstat output (1 second intervals)
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  2  24580 102340  45288 654320  120  340  1024   880 1234 5678 12  8 70 10  0

Critical columns to monitor:

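si/so: Memory swapped in from / out to disk per second, in KiB by default (sustained nonzero values in both columns indicate thrashing; occasional blips are normal)
wa: Percentage of CPU time spent waiting for I/O (high values suggest disk contention)
bi/bo: Blocks received from / sent to block devices per second (shows disk activity spikes)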
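
As a minimal sketch of how these columns could be checked automatically, the function below reads vmstat output on standard input and prints only the samples whose swap traffic or I/O wait exceeds a limit. The name flag_swap_samples and the default limits are assumptions made for illustration, not tuned recommendations, and the column positions assume the default vmstat layout shown above.

```shell
#!/bin/bash
# flag_swap_samples: read `vmstat` output on stdin and print the samples whose
# swap traffic (si + so) or I/O wait (wa) exceeds the given limits.
# Hypothetical helper; the default limits are illustrative only.
# Assumes the default vmstat column order: si=$7, so=$8, wa=$16.
flag_swap_samples() {
    local swap_limit="${1:-100}"   # limit for si + so (KiB/s by default)
    local wa_limit="${2:-20}"      # limit for wa (percent of CPU time)
    awk -v swap_limit="$swap_limit" -v wa_limit="$wa_limit" '
        $1 ~ /^[0-9]+$/ {          # data rows start with a number; headers do not
            if ($7 + $8 > swap_limit || $16 > wa_limit)
                printf "si=%d so=%d wa=%d\n", $7, $8, $16
        }'
}

# Usage: vmstat 1 10 | flag_swap_samples 100 20
```

Keep in mind that the first row vmstat prints is an average since boot, so a single flagged row at the start of a run says little about current behavior.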

For deeper investigation, combine these tools:

# Comprehensive memory snapshot (/proc/meminfo is world-readable, so sudo is not needed)
grep -E '^(Swap[A-Za-z]+|MemFree|MemTotal|Buffers|Cached):' /proc/meminfo

# Per-process swap usage
sudo smem -t -k -s swap

# Disk I/O pressure metrics
iostat -xmt 1

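This Bash script logs high swap usage and captures vmstat snapshots for later analysis (run it as root so it can write to /var/log):

#!/bin/bash
THRESHOLD=50  # % of swap usage triggering alert

while true; do
    # Percentage of swap in use; reports 0 when no swap is configured
    swap_used=$(free | awk '/^Swap:/ { if ($2 > 0) printf "%.0f", $3/$2*100; else print 0 }')
    [ "$swap_used" -ge "$THRESHOLD" ] && \
        echo "[$(date)] High swap usage (possible thrashing): $swap_used%" >> /var/log/thrash_monitor.log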
    
    # Capture vmstat snapshot
    vmstat 1 5 >> /var/log/vmstat_snapshots.log
    sleep 30
done
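
Swap occupancy alone does not prove thrashing, since pages can sit in swap without being touched. A more direct signal is how fast pages are moving. The sketch below derives swap-in and swap-out rates from the cumulative pswpin and pswpout counters in /proc/vmstat. The function names read_pswp and swap_rate are hypothetical, and the results are in pages per second, not KiB.

```shell
#!/bin/bash
# read_pswp: print "pswpin pswpout" (cumulative pages since boot) from a
# vmstat-format file. The file argument defaults to /proc/vmstat.
read_pswp() {
    awk '$1 == "pswpin"  { i = $2 }
         $1 == "pswpout" { o = $2 }
         END { print i + 0, o + 0 }' "${1:-/proc/vmstat}"
}

# swap_rate: given two "pswpin pswpout" samples and the number of seconds
# between them, print pages swapped in and out per second (integer division).
swap_rate() {
    local in1=$1 out1=$2 in2=$3 out2=$4 secs=$5
    echo "$(( (in2 - in1) / secs )) $(( (out2 - out1) / secs ))"
}

# Usage (two samples, 5 seconds apart):
#   a=$(read_pswp); sleep 5; b=$(read_pswp)
#   swap_rate $a $b 5
```

Rates that stay high in both directions at once are the pattern to look for, because they mean the same pages are being evicted and faulted back in.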

The sysstat package provides long-term trends:

# Generate memory usage report for today
# (data files live in /var/log/sa/ on RHEL/CentOS; Debian/Ubuntu use /var/log/sysstat/)
sar -r -f /var/log/sa/sa$(date +%d)

# Sample output:
# Linux 5.4.0-135-generic (host) 	01/15/2023 	_x86_64_	(8 CPU)
# 12:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
# 12:10:01 AM    324876   7903124     96.05     92308   2123456   8234567     102.34

Key tuning parameters in /etc/sysctl.conf:

vm.swappiness = 10          # Reduce tendency to swap
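vm.vfs_cache_pressure = 50  # Favor keeping inode/dentry caches (default is 100)
vm.dirty_ratio = 20         # % of available memory that may be dirty before writers are forced to flush
vm.dirty_background_ratio = 10  # % of available memory that may be dirty before background writeback starts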

For production systems, use this PromQL query against node_exporter metrics:

# PromQL for swap activity (pages swapped in plus out per second)
rate(node_vmstat_pswpin[1m]) + rate(node_vmstat_pswpout[1m])

Create dashboards tracking:

Swap usage % over time
Major page fault rate
Disk I/O queue length
OOM killer events
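
Two of the items above, major page faults and OOM killer events, can also be read locally from /proc/vmstat before any dashboard exists. This is a rough sketch, assuming a kernel that exposes the pgmajfault and oom_kill counters (oom_kill is missing on older kernels, in which case only pgmajfault is printed). The function name vm_counters is hypothetical.

```shell
#!/bin/bash
# vm_counters: print the cumulative major-page-fault and OOM-kill counters
# from a vmstat-format file (defaults to /proc/vmstat) as name=value lines.
# Both values count events since boot; sample twice and subtract for a rate.
vm_counters() {
    awk '$1 == "pgmajfault" || $1 == "oom_kill" { print $1 "=" $2 }' \
        "${1:-/proc/vmstat}"
}

# Usage: vm_counters
```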

When your Linux system starts thrashing, you'll typically notice:

- Severe performance degradation
- High disk I/O activity (constantly blinking disk LED)
- System becomes unresponsive to commands
- High CPU wait times (seen in top/htop as %wa)

The Linux ecosystem provides several powerful tools for memory analysis:

vmstat - The Classic Approach

Run vmstat with a sampling interval (in seconds):

vmstat 1 10  # Sample 10 times at 1-second intervals

Key columns to monitor:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

Red flags:

- High 'b' (blocked processes)
- High 'si'/'so' (swap in/swap out)
- High 'wa' (I/O wait percentage)

sar - Historical Perspective

Install sysstat package for comprehensive historical data:

sudo apt install sysstat   # Debian/Ubuntu
sudo yum install sysstat   # RHEL/CentOS

View memory statistics:

sar -r 1 3       # Memory utilization
sar -B 1 3       # Paging statistics
sar -S 1 3       # Swap utilization

htop - Visual Monitoring

For a more interactive view:

sudo apt install htop
htop

Look for:

- Mem and Swp meters showing high usage
- Processes with large RES (resident memory) values; htop labels virtual size VIRT, and VIRT alone says little about real usage
- Red-colored memory indicators

Using pidstat for Process-Level Analysis

Monitor individual process memory behavior:

pidstat -r -p ALL 1  # Memory statistics per process
pidstat -d 1        # Disk I/O per process
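
pidstat does not report how much of each process currently sits in swap. One way to get that, sketched below under the assumption that the kernel exposes a VmSwap field in /proc/<pid>/status, is to read the field for every process and sort by it. The function name swap_by_process is hypothetical, and only the first word of each process name is printed.

```shell
#!/bin/bash
# swap_by_process: list processes by swapped-out memory, largest first.
# Output columns: VmSwap in kB, PID, process name.
# The directory argument defaults to /proc; kernel threads and processes
# with nothing in swap are skipped.
swap_by_process() {
    local procdir="${1:-/proc}" f
    for f in "$procdir"/[0-9]*/status; do
        [ -r "$f" ] || continue    # unmatched glob or unreadable file
        awk '$1 == "Name:"   { name = $2 }
             $1 == "Pid:"    { pid = $2 }
             $1 == "VmSwap:" { swap = $2 }
             END { if (swap > 0) print swap, pid, name }' "$f" 2>/dev/null
    done | sort -rn
}

# Usage: swap_by_process | head -5
```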

Custom Monitoring Script

Create a bash script for periodic checks:

#!/bin/bash
while true; do
    echo "===== $(date) ====="
    free -h
    echo "--- Top 5 memory consumers ---"
    ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -6
    echo "--- Swap activity (cumulative pages since boot) ---"
    grep -E '^pswp(in|out) ' /proc/vmstat
    sleep 5
done

Thresholds indicating potential thrashing:

- Swap usage > 30% of total memory
- si/so values consistently > 1000 per second (vmstat reports these in KiB by default; sar -W reports pages)
- wa (I/O wait) > 20% for extended periods
- More than 10% of processes in 'D' state (uninterruptible sleep)
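
The last threshold is easy to measure by hand. As a small sketch, the function below takes one process state per line, as printed by ps -eo stat=, and reports the share of processes in uninterruptible sleep. The name dstate_percent is an assumption for illustration.

```shell
#!/bin/bash
# dstate_percent: read one process state per line on stdin (the format
# printed by `ps -eo stat=`) and print the integer percentage of processes
# whose state starts with D (uninterruptible sleep). Prints 0 on empty input.
dstate_percent() {
    awk '{ total++ }
         /^D/ { d++ }
         END { if (total) printf "%d\n", d * 100 / total; else print 0 }'
}

# Usage: ps -eo stat= | dstate_percent
```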

Once identified, consider these adjustments:

1. Lower swappiness from the default of 60 (temporary fix, lost at reboot):
   sudo sysctl vm.swappiness=10

2. Identify and kill memory-hog processes

3. Add more physical memory

4. Optimize application memory usage

5. Consider using zswap or zram for compression

Add these to /etc/sysctl.conf so the settings persist across reboots:

vm.swappiness = 10
vm.vfs_cache_pressure = 50
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Apply changes immediately:

sudo sysctl -p
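
To confirm the new values actually took effect, the settings can be compared with what the kernel reports. The sketch below reads key = value lines on standard input and checks each against the matching file under /proc/sys. The name check_sysctl is hypothetical, and the comparison assumes single-valued integer settings like the ones above.

```shell
#!/bin/bash
# check_sysctl: compare "key = value" lines on stdin with the live values
# under a sysctl root directory (default /proc/sys) and print one line per
# mismatch. Comments and lines without "=" are ignored.
check_sysctl() {
    local root="${1:-/proc/sys}" line key want have path
    while IFS= read -r line; do
        line="${line%%#*}"                          # drop trailing comments
        case "$line" in *=*) ;; *) continue ;; esac # skip non-setting lines
        key=$(printf '%s' "${line%%=*}" | tr -d '[:space:]')
        want=$(printf '%s' "${line#*=}" | tr -d '[:space:]')
        path="$root/$(printf '%s' "$key" | tr . /)" # vm.swappiness -> vm/swappiness
        have=""
        [ -r "$path" ] && have=$(tr -d '[:space:]' < "$path")
        [ "$want" = "$have" ] || \
            echo "MISMATCH $key: want $want, have ${have:-unreadable}"
    done
}

# Usage: check_sysctl < /etc/sysctl.conf
```

No output means every listed setting matches the running kernel.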
