Advanced NFS Server Performance Profiling: Client-Specific Metrics & Bottleneck Analysis for Linux Clusters


When analyzing NFS server performance on SUSE Enterprise Linux 10, the standard /proc/net/rpc/nfsd metrics only tell part of the story. To truly understand client-specific bottlenecks, we need deeper instrumentation that reveals:

  • Per-client file access patterns
  • RPC call distribution and latency
  • Disk I/O contention between clients
  • Network stack overhead

The nfsstat utility reports the client-side view of RPC traffic when run with the -c flag (these are per-protocol counters from one client, not a per-client breakdown on the server):

# Client-side NFS operation counts in list format (refresh every 2 seconds)
watch -n 2 "nfsstat -c -n -l"
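
For per-mount detail on the client side, nfsiostat from nfs-utils reports per-operation round-trip times (the mount point below is a placeholder):

# Read/write ops, throughput, and RTT per mount, sampled every 2 seconds
nfsiostat 2 /mnt/share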

For more granular analysis, combine iotop with NFS server debug logging:

# Monitor disk I/O while tracing NFS operations
iotop -oP &
rpcdebug -m nfsd -s all    # stream nfsd debug messages to the kernel log

This SystemTap sketch tracks per-call dispatch latency (it assumes nfsd is built as a module and kernel debuginfo is installed; recovering the client address would additionally require decoding $rqstp->rq_addr):

global start
probe module("nfsd").function("nfsd_dispatch") { start[tid()] = gettimeofday_us() }
probe module("nfsd").function("nfsd_dispatch").return {
    if (tid() in start) {
        printf("nfsd tid=%d latency=%dus\n", tid(), gettimeofday_us() - start[tid()])
        delete start[tid()]
    }
}
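
Run it with stap; module probes require the matching kernel debuginfo packages (the script file name is arbitrary):

sudo stap -v nfsd_latency.stp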

Key metrics to capture:

Metric              Tool                         Interpretation
RPC queue time      stap nfsd_rpc_queue.stp      Network/CPU contention
VFS latency         funclatency-bpfcc vfs_*      Filesystem overhead
Disk service time   iostat -x 1                  Storage bottleneck
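
As a concrete example of the VFS latency row, bcc's funclatency can histogram a single VFS entry point (tool and package names vary by distribution):

# Microsecond latency histogram of vfs_read() over a 10-second trace
sudo funclatency-bpfcc -u -d 10 vfs_read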

When multiple HPC clients access shared datasets, we observed:

# Client A (simulation job)
nfsv4 WRITE 32768 bytes: avg_latency=142ms p95=312ms

# Client B (visualization tool)
nfsv4 READ 131072 bytes: avg_latency=89ms p95=201ms

The solution involved client-specific tuning:

  • More NFS server threads (e.g. USE_KERNEL_NFSD_NUMBER=16 in /etc/sysconfig/nfs on SUSE)
  • Per-export caching policies
  • Network QoS tagging (DSCP class mapping; see the sketch below)
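
A minimal sketch of the DSCP tagging, assuming the latency-sensitive clients sit on 10.0.2.0/24 (subnet and traffic class are illustrative):

# Mark NFS replies to the visualization subnet with DSCP class AF21
sudo iptables -t mangle -A OUTPUT -p tcp --sport 2049 -d 10.0.2.0/24 \
    -j DSCP --set-dscp-class af21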

For continuous monitoring, we developed a Python collector along these lines. Note that /proc/net/rpc/nfsd exposes only server-wide aggregates (its "net" line holds protocol counters, not client addresses), so per-client labels must come from the tracing tools above; this sketch exports the aggregate counters:

import time

from prometheus_client import Gauge, start_http_server

nfs_server_stats = Gauge('nfs_server_stats',
                         'Aggregate NFS server counters',
                         ['counter'])

def collect_metrics():
    with open('/proc/net/rpc/nfsd') as f:
        for line in f:
            parts = line.split()
            if parts[0] == 'rpc':    # rpc <calls> <badcalls> ...
                nfs_server_stats.labels(counter='rpc_calls').set(float(parts[1]))
            elif parts[0] == 'io':   # io <bytes-read> <bytes-written>
                nfs_server_stats.labels(counter='read_bytes').set(float(parts[1]))
                nfs_server_stats.labels(counter='write_bytes').set(float(parts[2]))

if __name__ == '__main__':
    start_http_server(9100)   # exporter port is arbitrary
    while True:
        collect_metrics()
        time.sleep(10)
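
Once it's running, sanity-check the exporter from the shell (using the port assumed above):

curl -s http://localhost:9100/metrics | grep nfs_server_stats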

This integrates with existing monitoring stacks for historical analysis of NFS performance trends.


When troubleshooting NFS performance issues on SUSE Enterprise Linux 10 (or any Linux distribution), standard /proc/net/rpc/nfsd statistics often don't provide enough granularity. Here's how to dig deeper into client-specific behavior and server-side bottlenecks.

Beyond basic NFS stats, these tools provide critical insights:


# 1. nfsstat for RPC breakdown
nfsstat -rc   # Client-side RPC stats (run on a client)
nfsstat -s    # Server RPC stats

# 2. Client connection tracking (kernel 5.3+)
cat /proc/fs/nfsd/clients/*/info

# 3. I/O monitoring by process
iotop -oP     # Requires root

To log which files are being touched on the export (inotify sees only local file events, so it cannot by itself attribute an access to a client; correlate timestamps with the per-client network data below):

# Install inotify-tools
sudo zypper install inotify-tools
inotifywait -m -r --timefmt '%F %T' --format '%T %w%f %e' /exported/nfs/share \
    > nfs_access.log

# Addresses of currently connected NFSv4 clients (kernel 5.3+)
awk '/^address/ {print $2}' /proc/fs/nfsd/clients/*/info | tr -d '"' | cut -d: -f1

To attribute NFS network traffic to individual clients, use a combination of tools:

# Traffic per process (nethogs groups by process, so all nfsd traffic
# shows up as a single entry)
nethogs -t eth0

# Alternative method using iptables accounting (a rule with no -j target
# only counts packets and bytes)
sudo iptables -I INPUT -p tcp --dport 2049
sudo iptables -I OUTPUT -p tcp --sport 2049
watch -n 1 'sudo iptables -nvxL'
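
To split those counters per client, add one accounting rule per client address (the addresses below are placeholders):

for ip in 10.0.0.11 10.0.0.12; do
    sudo iptables -I INPUT  -p tcp --dport 2049 -s "$ip"
    sudo iptables -I OUTPUT -p tcp --sport 2049 -d "$ip"
done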

Break down registered RPC services and trace per-call handling:


# Detailed RPC statistics
rpcinfo -s
rpcinfo -p

# Verbose nfsd debug logging (requires SUNRPC debug support in the kernel)
rpcdebug -m nfsd -s all    # or: echo 1 > /proc/sys/sunrpc/nfsd_debug
dmesg | grep nfsd | awk '/xid/ {print $6,$7,$8}'

When NFS waits on storage:


# Check IO wait at OS level
vmstat 1 5

# Per-disk latency
iostat -x 1

# nfsd thread-pool stats (request queueing pressure, not disk stats)
cat /proc/fs/nfsd/pool_stats
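
A quick check for thread-pool saturation from that file, assuming the column layout given by its header line (a nonzero, growing sockets-enqueued count means requests waited for a free nfsd thread):

awk 'NR > 1 && $3 > 0 { print "pool " $1 ": " $3 " requests queued" }' /proc/fs/nfsd/pool_stats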

Here's a script that combines multiple metrics:


#!/bin/bash
while true; do
    # Connected client addresses (kernel 5.3+); -h drops filename prefixes
    clients=$(grep -hoP '([0-9]{1,3}\.){3}[0-9]{1,3}' /proc/fs/nfsd/clients/*/info \
              2>/dev/null | sort -u | tr '\n' ' ')

    # Total RPC calls, read directly from /proc (nfsstat's header rows
    # make its output awkward to parse)
    rpc_calls=$(awk '/^rpc/ {print $2}' /proc/net/rpc/nfsd)

    # Per-disk latency; the await column position varies with sysstat versions
    disk_lat=$(iostat -x | awk '/^sd/ {print $1, $10}' | tr '\n' ' ')

    # Output timestamped data
    echo "$(date) | Clients: $clients | RPC calls: $rpc_calls | Disk: $disk_lat"
    sleep 5
done
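
Run it in the background with a rolling log (the script name is arbitrary):

nohup ./nfs_monitor.sh >> /var/log/nfs_perf.log 2>&1 &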

For long-term analysis, consider these approaches:


# Send data to Graphite/InfluxDB (Graphite plaintext protocol, port 2003)
echo "nfs.server.$(hostname).rpc_calls $(awk '/^rpc/ {print $2}' /proc/net/rpc/nfsd) $(date +%s)" | \
    nc graphite.example.com 2003

# Or use collectd's nfs plugin, which reads the same /proc counters
# (the stock plugin typically needs no further configuration)
LoadPlugin nfs