Diagnosing and Fixing Network Latency Spikes and Packet Loss in Ubuntu-based Load Balancers



When dealing with a high-traffic reverse proxy setup (500+ RPS), the sudden appearance of latency spikes (>1000ms) and occasional packet loss (0.3%) creates a challenging debugging scenario. Let me walk through my systematic troubleshooting approach.

First, let's capture the key symptoms from our monitoring:

# Ping pattern showing the issue
ping -c 1000 loadbalancer | grep -E "time=|timeout"
64 bytes from loadbalancer: icmp_seq=0 ttl=56 time=11.624 ms
64 bytes from loadbalancer: icmp_seq=1 ttl=56 time=10.494 ms
Request timeout for icmp_seq 2
64 bytes from loadbalancer: icmp_seq=2 ttl=56 time=1536.516 ms
64 bytes from loadbalancer: icmp_seq=3 ttl=56 time=536.907 ms
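
A plain ping only proves the spikes exist; mtr localizes where along the path the loss and jitter begin. A quick sketch, assuming the mtr-tiny package (the Ubuntu package name) and the same loadbalancer hostname as above:

# Per-hop loss and latency report over 200 probes
sudo apt-get install mtr-tiny
mtr -rwbc 200 loadbalancer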

Examining our bonded interfaces reveals potential clues:

# Check interface errors (RX columns shown: bytes, packets, errs, drop, fifo, frame; TX columns truncated)
cat /proc/net/dev | grep bond
bond0: 240016223540 527181270    1    4    0    1 ...
bond1:  77714410892 430293342    1    2    0    1 ...

# Continuous monitoring (run in separate terminal)
watch -n 1 "cat /proc/net/dev | grep bond"
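
The counters above are totals since boot; what matters is whether they are still climbing while a spike is in progress, and which error class is moving. Doubling the stats flag in iproute2 breaks the error counts down further:

# Second -s adds the breakdown: length / crc / frame / fifo / missed errors
ip -s -s link show bond0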

Standard tools don't always show the full picture. Let's use more advanced diagnostics:

# Install tcptrack if needed
sudo apt-get install tcptrack

# Monitor TCP connections in real-time
sudo tcptrack -i bond0

# Socket state summary (retransmission counters come from the MIB stats below)
sudo ss -s
Total: 1245 (kernel 0)
TCP:   1437 (estab 824, closed 520, orphaned 2, synrecv 0, timewait 517/0), ports 0

Transport Total     IP        IPv6
*         0         -         -        
RAW       0         0         0        
UDP       15        11        4        
TCP       917       912       5        
INET      932       923       9        
FRAG      0         0         0
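
The summary alone does not expose retransmissions; the MIB counters do. As a rough health check, compare retransmitted segments against total segments sent (both cumulative since boot); a sustained ratio above roughly one percent on a load balancer at this traffic level usually points at the wire or the buffers:

# Cumulative retransmit vs. sent segment counters
nstat -az TcpRetransSegs TcpOutSegs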

The existing sysctl.conf looks reasonable, but needs some adjustments:

# Add these to /etc/sysctl.conf
net.core.somaxconn = 32768              # deeper accept queue for bursty connection rates
net.ipv4.tcp_max_tw_buckets = 1440000   # cap on TIME_WAIT sockets before the kernel starts clipping them
net.ipv4.tcp_keepalive_time = 300       # start probing idle connections after 5 minutes instead of 2 hours
net.ipv4.tcp_keepalive_probes = 5       # give up after 5 unanswered probes
net.ipv4.tcp_keepalive_intvl = 15       # 15 seconds between probes

# Apply changes
sudo sysctl -p
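
One step that is easy to miss: net.core.somaxconn is only a ceiling. nginx still requests its own listen backlog (511 by default on Linux), so the larger queue only takes effect if the listen directive asks for it, and the ListenOverflows counter tells you whether the accept queue ever actually filled. A sketch, assuming nginx is the proxy in front (port and value are illustrative):

# In the relevant server block:
#   listen 443 ssl backlog=32768;
# Then confirm whether accept-queue overflows are happening at all:
nstat -az TcpExtListenOverflows TcpExtListenDrops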

Given the RX/TX errors, let's verify bonding settings:

# Check bond0 mode and slaves
cat /proc/net/bonding/bond0
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)

# Verify link settings on each slave (per-slave error counters are checked below with ethtool -S)
ethtool eth0 | grep -E "Speed|Duplex"
Speed: 1000Mb/s
Duplex: Full
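
Rather than checking one slave at a time, every slave of the bond can be walked in one pass (a small sketch; it reads the slave names straight out of the bonding status file):

# Report speed, duplex and link state for every slave of bond0
for s in $(grep "Slave Interface" /proc/net/bonding/bond0 | awk '{print $3}'); do
    echo "== $s =="
    ethtool "$s" | grep -E "Speed|Duplex|Link detected"
done

A speed or duplex mismatch on a single slave, or a slave whose link flaps, is a classic cause of intermittent loss on an 802.3ad bond.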

With high connection rates, conntrack can become a bottleneck:

# Check conntrack table usage
conntrack -L | wc -l

# Shorten timeouts so stale entries age out faster (runtime only; mirror in /etc/sysctl.conf to persist)
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=300
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=60
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_fin_wait=30
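
Before blaming conntrack, confirm it is actually saturating: when the table fills, the kernel drops new flows and logs "table full, dropping packet", which matches exactly this kind of intermittent loss:

# Headroom check plus the tell-tale kernel message
echo "conntrack: $(cat /proc/sys/net/netfilter/nf_conntrack_count) / $(cat /proc/sys/net/netfilter/nf_conntrack_max)"
dmesg | grep -i "nf_conntrack: table full"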

Don't overlook physical layer issues:

# Check for NIC errors
ethtool -S eth0 | grep -i error
rx_errors: 1
tx_errors: 0

# Verify driver settings
ethtool -k eth0 | grep -E "gro|lro|tso"
tcp-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
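
If the spikes line up with bursts of large coalesced frames, temporarily disabling GRO on a slave is a cheap experiment; offload bugs in some NIC drivers are a known source of latency outliers. Re-enable it afterward, since running without it costs CPU:

# Toggle GRO off on a slave for testing, then back on
sudo ethtool -K eth0 gro off
sudo ethtool -K eth0 gro on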

For ongoing visibility, implement these monitoring tools:

# Install essential monitoring
sudo apt-get install iptraf-ng iftop nethogs

# Run temporary captures:
# Bandwidth usage
iftop -i bond0 -n -P

# Per-process network usage
sudo nethogs bond0

# Detailed interface stats
iptraf-ng -d bond0
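
Live tools only help while someone is watching. For after-the-fact analysis of a spike, enabling sysstat's background collection is worth the small overhead (the sed edit and file path below are the Ubuntu/Debian defaults):

# Collect system and network stats every 10 minutes in the background
sudo apt-get install sysstat
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat

# Later, pull per-interface error history for a given day (saDD is the day-of-month file)
sar -n EDEV -f /var/log/sysstat/saDD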

Stepping back to consolidate before the deeper dive: the issue manifests as periodic latency spikes (>1000 ms) and roughly 0.3% packet loss during sustained traffic of ~500 requests/second. The ICMP tests show a consistent pattern of one multi-second outlier followed by elevated but recovering round-trip times:

64 bytes from loadbalancer: icmp_seq=2 ttl=56 time=1536.516 ms
64 bytes from loadbalancer: icmp_seq=3 ttl=56 time=536.907 ms
64 bytes from loadbalancer: icmp_seq=4 ttl=56 time=9.389 ms
Request timeout for icmp_seq 919

Examining the bonded interfaces shows minor errors but nothing catastrophic:

bond0:
RX packets:527181270 errors:1 dropped:4 overruns:0 frame:1
TX packets:413335045 errors:0 dropped:0 overruns:0 carrier:0

bond1: 
RX packets:430293342 errors:1 dropped:2 overruns:0 frame:1
TX packets:466983986 errors:0 dropped:0 overruns:0 carrier:0

First, rule out a NIC driver issue by checking ring buffer status. Ring buffers belong to the physical slaves, not the bond device itself (ethtool -g against bond0 typically returns "Operation not supported"), so query each slave:

ethtool -g eth0
ethtool -g eth1   # repeat for every slave of bond0 and bond1

Check for TCP retransmits and congestion window collapses:

ss -ti | grep -B1 retrans
netstat -s | grep -iE 'retrans|sack|dupack'
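
To see whether a specific path is the one collapsing, ss can be filtered to a single upstream; it prints the congestion window, RTT and retransmit counts per connection (10.0.0.10:8080 below is a placeholder for one of your backend addresses):

# cwnd, rtt and retrans fields for connections to one backend
ss -ti dst 10.0.0.10:8080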

Given the kernel log (dmesg/kern.log) entry pointing at TCP window trouble:

[517787.732242] Peer unexpectedly shrunk window 1155488866:1155489425

Check conntrack table status:

sysctl net.netfilter.nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
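
If the count sits within roughly 20% of the maximum at peak, raise the ceiling; each tracked connection costs on the order of a few hundred bytes, so even a large table is cheap next to dropped flows (the value below is an example, not a recommendation):

# Give the tracker more headroom; pair this with the bucket adjustment in step 3 below
sudo sysctl -w net.netfilter.nf_conntrack_max=524288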

1. TCP Stack Tuning:

# Add to /etc/sysctl.conf
net.ipv4.tcp_workaround_signed_windows = 1   # work around peers that advertise broken (signed) windows
net.ipv4.tcp_slow_start_after_idle = 0       # keep the congestion window warm on idle keep-alive connections
net.ipv4.tcp_rmem = 4096 87380 16777216      # receive buffer: min / default / max (bytes)
net.ipv4.tcp_wmem = 4096 65536 16777216      # send buffer: min / default / max (bytes)

2. NIC Buffer Adjustment:

# Ring sizes are set on the physical slaves, not the bond; stay within the
# "Pre-set maximums" that ethtool -g reported for each NIC
ethtool -G eth0 rx 4096 tx 4096
ethtool -G eth1 rx 4096 tx 4096   # repeat for the remaining slaves

3. Connection Tracking Optimization:

# On older kernels nf_conntrack_buckets is read-only; write /sys/module/nf_conntrack/parameters/hashsize instead
sysctl -w net.netfilter.nf_conntrack_buckets=65536
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
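
All of the sysctl -w changes above are runtime-only. To keep them across reboots, drop them into a file under /etc/sysctl.d (the filename below is arbitrary) and reload:

# Persist the TCP and conntrack tuning
cat <<'EOF' | sudo tee /etc/sysctl.d/90-lb-tuning.conf
net.ipv4.tcp_slow_start_after_idle = 0
net.netfilter.nf_conntrack_max = 524288
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
EOF
sudo sysctl --system

Note that the net.netfilter keys only apply once the nf_conntrack module is loaded, so on a box where it loads late they may need to be reapplied or the module listed under /etc/modules-load.d.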

For real-time analysis during spikes, run these simultaneously in different terminals:

# Terminal 1: Packet capture (rotate through 10 x 100 MB files so it can run unattended until a spike is caught)
tcpdump -ni bond0 -s0 -w /tmp/spike.pcap -C 100 -W 10

# Terminal 2: System stats
mpstat -P ALL 1
vmstat 1

# Terminal 3: TCP diagnostics (tcpretrans ships with perf-tools / bcc-tools; it traces kernel retransmit events, no interface flag needed)
sudo tcpretrans -l
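
Once a spike has been captured, the pcap can be mined offline; tshark's analysis filters flag retransmissions and zero-window stalls directly (tshark comes with the wireshark packages; with the rotation flags above, point it at whichever numbered file covers the spike):

# List every retransmission and zero-window event in the capture
tshark -r /tmp/spike.pcap -Y "tcp.analysis.retransmission || tcp.analysis.zero_window"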

Verify worker connection handling:

nginx -T | grep -E 'worker_connections|multi_accept'
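
As a point of reference, each proxied request holds two sockets (one to the client, one to the upstream), so worker_connections has to cover both sides plus idle keep-alive connections. A sketch of a main/events configuration sized with headroom for this traffic level (values are illustrative, not a recommendation):

worker_processes auto;
worker_rlimit_nofile 65535;       # must exceed worker_connections per worker

events {
    worker_connections 16384;     # counts client and upstream sockets together
    multi_accept on;
}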

Enable detailed timing logs in nginx.conf:

log_format timing '$remote_addr - $remote_user [$time_local] '
                  '"$request" $status $body_bytes_sent '
                  '"$http_referer" "$http_user_agent" '
                  'rt=$request_time uct="$upstream_connect_time" '
                  'uht="$upstream_header_time" urt="$upstream_response_time"';
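
To put the format to use, point an access_log at it and mine the result for slow upstream phases (the log path and the one-second threshold below are arbitrary choices):

# In the http or server block:
#   access_log /var/log/nginx/timing.log timing;
# Then pull out requests whose upstream connect alone took longer than a second:
awk -F'uct="' '{ split($2, t, "\""); if (t[1] + 0 > 1) print }' /var/log/nginx/timing.log

If the slow requests cluster on upstream_connect_time, the problem sits between the load balancer and the backends (or in conntrack/SYN handling); if request_time is high while the upstream timings stay low, look at the load balancer host itself.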