When dealing with a high-traffic reverse proxy setup (500+ RPS), the sudden appearance of latency spikes (>1000ms) and occasional packet loss (0.3%) creates a challenging debugging scenario. Let me walk through my systematic troubleshooting approach.
First, let's capture the key symptoms from our monitoring:
# Ping pattern showing the issue
ping -c 1000 loadbalancer | grep -E "time=|timeout"
64 bytes: icmp_seq=0 ttl=56 time=11.624 ms
64 bytes: icmp_seq=1 ttl=56 time=10.494 ms
Request timeout for icmp_seq 2
64 bytes: icmp_seq=2 ttl=56 time=1536.516 ms
64 bytes: icmp_seq=3 ttl=56 time=536.907 ms
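One way to quantify a run like this is to post-process the ping output; a minimal awk sketch over the sample above (in practice, pipe the live ping output straight in):

```shell
# Sample copied from the ping output above; in practice feed
# `ping -c 1000 loadbalancer` directly into the awk script.
sample='64 bytes: icmp_seq=0 ttl=56 time=11.624 ms
64 bytes: icmp_seq=1 ttl=56 time=10.494 ms
Request timeout for icmp_seq 2
64 bytes: icmp_seq=2 ttl=56 time=1536.516 ms
64 bytes: icmp_seq=3 ttl=56 time=536.907 ms'

result=$(printf '%s\n' "$sample" | awk '
  /timeout/ { lost++ }                                  # count dropped probes
  /time=/   { n++
              split($0, a, "time="); split(a[2], b, " ")
              if (b[1] + 0 > max) max = b[1] + 0 }      # track worst RTT
  END { printf "replies=%d lost=%d max_rtt=%.1fms", n, lost, max }')
echo "$result"
```

This makes the spike pattern explicit instead of eyeballing a thousand lines of output.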
Examining our bonded interfaces reveals potential clues:
# Check interface errors
cat /proc/net/dev | grep bond
bond0:527181270 240016223540 1 4 0 1
bond1:430293342 77714410892 1 2 0 1
# Continuous monitoring (run in separate terminal)
watch -n 1 "cat /proc/net/dev | grep bond"
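The /proc/net/dev columns are positional and unlabeled, which makes the truncated dump above hard to read. A small sketch that labels the error and drop fields, using a hypothetical full-width line (the TX byte count here is illustrative):

```shell
# /proc/net/dev fields after the "iface:" prefix:
#   RX: bytes packets errs drop fifo frame compressed multicast
#   TX: bytes packets errs drop fifo colls carrier compressed
# Hypothetical full-width line (TX byte count illustrative):
line='bond0: 240016223540 527181270 1 4 0 1 0 0 99887766 413335045 0 0 0 0 0 0'
set -- $line          # word-split into positional fields
iface=${1%:}          # strip the trailing colon from the interface name
stats="$iface rx_errs=$4 rx_drop=$5 tx_errs=${12} tx_drop=${13}"
echo "$stats"
```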
Standard tools don't always show the full picture. Let's use more advanced diagnostics:
# Install tcptrack if needed
sudo apt-get install tcptrack
# Monitor TCP connections in real-time
sudo tcptrack -i bond0
# Socket summary (high timewait/orphaned counts point at connection churn)
sudo ss -s
Total: 1245 (kernel 0)
TCP: 1437 (estab 824, closed 520, orphaned 2, synrecv 0, timewait 517/0), ports 0
Transport Total IP IPv6
* 0 - -
RAW 0 0 0
UDP 15 11 4
TCP 917 912 5
INET 932 923 9
FRAG 0 0 0
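The 517 sockets in TIME-WAIT are worth watching at this connection rate; a quick sketch extracting the relevant counters from the ss -s summary line above:

```shell
# TCP summary line copied from the ss -s output above
tcp_line='TCP: 1437 (estab 824, closed 520, orphaned 2, synrecv 0, timewait 517/0), ports 0'
estab=$(echo "$tcp_line" | sed -n 's/.*estab \([0-9]*\).*/\1/p')
tw=$(echo "$tcp_line"   | sed -n 's/.*timewait \([0-9]*\).*/\1/p')
echo "estab=$estab timewait=$tw"
```

A TIME-WAIT count approaching the established count, as here, is what motivates the TIME-WAIT related sysctl tuning that follows.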
The existing sysctl.conf looks reasonable, but needs some adjustments:
# Add these to /etc/sysctl.conf
net.core.somaxconn = 32768
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
# Apply changes
sudo sysctl -p
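As a sanity check on the keepalive values above: the worst-case time to notice a dead peer is tcp_keepalive_time plus tcp_keepalive_probes retries spaced tcp_keepalive_intvl apart:

```shell
# tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl
t=300 p=5 i=15
total=$((t + p * i))
echo "dead peer detected after ${total}s of idle"
```

At the defaults (7200 + 9 × 75) that figure is over two hours, which is why tightening these matters on a busy proxy.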
Given the RX/TX errors, let's verify bonding settings:
# Check bond0 mode and slaves
cat /proc/net/bonding/bond0
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
# Verify slave link speed/duplex (per-queue error counters: ethtool -S, below)
ethtool eth0 | grep -E "Speed|Duplex"
Speed: 1000Mb/s
Duplex: Full
With high connection rates, conntrack can become a bottleneck:
# Check conntrack table usage (walking the full table can be slow on busy hosts)
conntrack -L | wc -l
# Optimize timeouts (echo is not persistent across reboots; mirror in sysctl.conf)
echo 300 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
echo 60 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait
echo 30 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_fin_wait
Don't overlook physical layer issues:
# Check for NIC errors
ethtool -S eth0 | grep -i error
rx_errors: 1
tx_errors: 0
# Verify driver settings
ethtool -k eth0 | grep -E "segmentation-offload|receive-offload"
tcp-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
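To summarize which offloads are currently active (large coalesced batches can add latency under bursty load), a small parse of the ethtool -k lines above:

```shell
# Offload states copied from the ethtool -k output above
feat='tcp-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off'
cnt=$(printf '%s\n' "$feat" | awk -F': ' '$2 == "on" { n++ } END { print n + 0 }')
echo "offloads enabled: $cnt of 3"
```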
For ongoing visibility, implement these monitoring tools:
# Install essential monitoring
sudo apt-get install iptraf-ng iftop nethogs
# Run temporary captures:
# Bandwidth usage
iftop -i bond0 -n -P
# Per-process network usage
sudo nethogs bond0
# Detailed interface stats
iptraf-ng -d bond0
To recap before a deeper pass: the latency spikes (>1000ms) and packet loss (0.3%) appear periodically during sustained traffic of ~500 requests/second, and ICMP ping tests reveal a consistent pattern:
64 bytes from loadbalancer: icmp_seq=2 ttl=56 time=1536.516 ms
64 bytes from loadbalancer: icmp_seq=3 ttl=56 time=536.907 ms
64 bytes from loadbalancer: icmp_seq=4 ttl=56 time=9.389 ms
Request timeout for icmp_seq 919
Examining the bonded interfaces shows minor errors but nothing catastrophic:
bond0:
RX packets:527181270 errors:1 dropped:4 overruns:0 frame:1
TX packets:413335045 errors:0 dropped:0 overruns:0 carrier:0
bond1:
RX packets:430293342 errors:1 dropped:2 overruns:0 frame:1
TX packets:466983986 errors:0 dropped:0 overruns:0 carrier:0
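Before chasing the physical layer further, it helps to put those error counters in perspective; against half a billion RX packets, a single error is vanishingly rare:

```shell
# Counters from the bond0 RX line above
rate=$(awk 'BEGIN { pkts = 527181270; errs = 1
                    printf "rx error rate: %.2e per packet", errs / pkts }')
echo "$rate"
```

An error rate this low cannot explain 0.3% packet loss on its own, which points the investigation at buffering and connection tracking instead.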
First, verify whether this is a NIC driver issue by checking ring buffer status. Ring buffers belong to the physical slave NICs, not to the bond itself, so query the slaves:
ethtool -g eth0
ethtool -g eth1
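A typical finding is a driver default far below the hardware maximum; a sketch that compares the two sections of (hypothetical, simplified) ethtool -g output:

```shell
# Hypothetical, simplified ethtool -g output for a slave NIC
ring='Pre-set maximums:
RX: 4096
TX: 4096
Current hardware settings:
RX: 256
TX: 256'
max=$(printf '%s\n' "$ring" | awk '/Pre-set/ {m=1} /Current/ {m=0} m && /^RX:/ {print $2}')
cur=$(printf '%s\n' "$ring" | awk '/Current/ {c=1} c && /^RX:/ {print $2}')
echo "RX ring at $cur of a possible $max"
```

A small ring that overflows during traffic bursts produces exactly this pattern of intermittent drops.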
Check for TCP retransmits and congestion window collapses:
ss -ti | grep -B1 retrans
netstat -s | grep -E 'retrans|sack|dupack'
Given the kernel log entry (this message comes from the TCP stack, not UFW) showing window anomalies:
[517787.732242] Peer unexpectedly shrunk window 1155488866:1155489425
Check conntrack table status:
sysctl net.netfilter.nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
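If nf_conntrack_count approaches nf_conntrack_max, new flows are dropped and "nf_conntrack: table full, dropping packet" appears in dmesg, which presents exactly as this kind of packet loss. A quick fill-percentage check (the counter values here are illustrative, not from this host):

```shell
# Illustrative values; substitute the real sysctl readings from above
count=262144 max=1048576
pct=$((100 * count / max))
echo "conntrack table ${pct}% full"
```

Anything sustained above ~80% is a strong signal to raise nf_conntrack_max or shorten the timeouts.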
1. TCP Stack Tuning:
# Add to /etc/sysctl.conf
net.ipv4.tcp_workaround_signed_windows = 1
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
2. NIC Buffer Adjustment (per slave NIC; the bond itself exposes no rings):
# Repeat for every slave of bond0 and bond1
ethtool -G eth0 rx 4096 tx 4096
ethtool -G eth1 rx 4096 tx 4096
3. Connection Tracking Optimization:
# On older kernels, set via /sys/module/nf_conntrack/parameters/hashsize instead
sysctl -w net.netfilter.nf_conntrack_buckets=65536
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
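As a sanity check on the 16 MiB socket buffer ceiling in the TCP tuning above: buffers should cover the bandwidth-delay product. At 1 Gbit/s with the ~11 ms baseline RTT seen in the ping output, the BDP sits well under the configured maximum:

```shell
# BDP = link bytes/sec * RTT; 1 Gbit/s link, 11 ms baseline RTT
bdp=$(awk 'BEGIN { printf "%.0f", (1e9 / 8) * 0.011 / 1024 }')
echo "BDP ~= ${bdp} KB vs 16384 KB tcp_rmem max"
```

The headroom is deliberate: during the 1.5 s spikes the effective BDP balloons, and a generous ceiling lets the autotuner absorb that without stalling.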
For real-time analysis during spikes, run these simultaneously in different terminals:
# Terminal 1: Packet capture
tcpdump -ni bond0 -s0 -w /tmp/spike.pcap
# Terminal 2: System stats
mpstat -P ALL 1
vmstat 1
# Terminal 3: TCP diagnostics (tcpretrans, from the BCC tools, traces the
# kernel directly and takes no interface flag)
tcpretrans -l
Verify worker connection handling:
nginx -T | grep -E 'worker_connections|multi_accept'
Enable detailed timing logs in nginx.conf:
log_format timing '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'rt=$request_time uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"';
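Once the timing format is active, slow requests can be pulled straight from the access log; a sketch with hypothetical log lines (the field positions follow the format above):

```shell
# Hypothetical lines in the "timing" log format above
logs='10.0.0.1 - - [ts] "GET / HTTP/1.1" 200 512 "-" "curl" rt=0.012 uct="0.001" uht="0.010" urt="0.011"
10.0.0.2 - - [ts] "GET /api HTTP/1.1" 200 2048 "-" "curl" rt=1.537 uct="0.900" uht="1.500" urt="1.530"'
slow=$(printf '%s\n' "$logs" | awk '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^rt=/) { sub("rt=", "", $i); if ($i + 0 > 1.0) n++ }
} END { print n + 0 }')
echo "requests over 1s: $slow"
```

Comparing rt against upstream_response_time in the flagged lines tells you whether the latency lives in the proxy itself or in the upstream.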