Optimizing TCP Throughput on High-Latency Networks: Linux Kernel Tuning and Performance Analysis


When dealing with high-latency networks (100ms RTT in your case), traditional TCP configurations often underutilize available bandwidth. Your current setup shows:

TCP Window Size: 5.2MB (well configured)
Retransmission Rate: 0.29% (9018144/3085179704)
Average Congestion Window: 3.3MB

The 200Mbps throughput suggests the congestion window isn't scaling properly despite using TCP Scalable. Let's examine why.
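
As a rough sanity check (treating the 3.3MB average owin as the effective window), throughput on a long path is bounded by window/RTT, which lines up with what you are seeing:

# Throughput ceiling implied by a 3.3MB effective window over a 100ms RTT
awk 'BEGIN { printf "%.0f Mbit/s\n", (3.3e6 * 8) / 0.1 / 1e6 }'
# => ~264 Mbit/s, in the same ballpark as the observed ~200Mbps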

Your configuration shows proper window scaling capability:

# sysctl values already set
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1

However, the average outstanding window (owin - the amount of unacknowledged data in flight) of 3.3MB sits well below the maximum advertised window (5.2MB). This points to one of the following (a quick way to investigate is sketched after the list):

  • Insufficient buffer space for the congestion algorithm to grow
  • Packet loss triggering unnecessary window reduction
  • Delayed ACKs causing window growth stagnation
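
One quick way to narrow this down is to watch the kernel's cumulative TCP counters while a transfer runs; the names below are standard entries from /proc/net/snmp and /proc/net/netstat as reported by nstat:

# Retransmissions, tail-loss probes, and delayed-ACK activity (cumulative counters)
nstat -az | grep -E 'TcpRetransSegs|TcpExtTCPLossProbes|TcpExtDelayedACK'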

Add these to your existing configuration:

# Enable BBR for better high-latency performance
echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf

# Let more of each socket buffer count toward the advertised TCP window
echo "net.ipv4.tcp_adv_win_scale=2" >> /etc/sysctl.conf
echo "net.ipv4.tcp_app_win=31" >> /etc/sysctl.conf

# Reduce the minimum delayed-ACK time (tcp_delack_min is only exposed by some
# vendor kernels; skip this line if the sysctl is absent on yours)
echo "net.ipv4.tcp_delack_min=10" >> /etc/sysctl.conf

# Keep the congestion window from being reset after idle periods
echo "net.ipv4.tcp_slow_start_after_idle=0" >> /etc/sysctl.conf

# Apply changes
sysctl -p
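
After reloading, it is worth confirming the kernel actually accepted the congestion control change - BBR ships as the tcp_bbr module and is only available on kernels 4.9 and newer:

# Load the module (no-op if built in) and confirm the active algorithm
modprobe tcp_bbr
sysctl net.ipv4.tcp_congestion_control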

Use these tools to verify improvements:

# Real-time monitoring
ss -t -i -n -p state established '( dport = :5201 )'

# TCP diagnostics
tcptrace -l --csv your_capture.pcap > analysis.csv

Key metrics to watch:

  • Congestion window size over time (see the sampling loop after this list)
  • Retransmission patterns
  • RTT variance
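
To capture the first of these over time, a minimal sampling loop over ss output is usually enough (port 5201 assumed from the iperf3 test; cwnd is reported in segments, not bytes):

# Sample cwnd, ssthresh and rtt for the test flow once per second (Ctrl-C to stop)
while sleep 1; do
    ss -tin 'dport = :5201' | grep -oE 'cwnd:[0-9]+|ssthresh:[0-9]+|rtt:[0-9.]+/[0-9.]+'
done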

Consider testing these algorithms:

# Available algorithms
sysctl net.ipv4.tcp_available_congestion_control

# Try BBR (often best for high latency)
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Or try HTCP
modprobe tcp_htcp
sysctl -w net.ipv4.tcp_congestion_control=htcp
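
To compare algorithms back to back under identical conditions, a short loop of iperf3 runs works as a rough A/B test (server_ip and port 5201 are placeholders from the earlier examples; newer iperf3 builds can also pick the algorithm per test with -C/--congestion):

# One 30-second run per algorithm, same flags each time
for cc in scalable htcp bbr; do
    modprobe "tcp_$cc" 2>/dev/null
    sysctl -w net.ipv4.tcp_congestion_control="$cc"
    iperf3 -c server_ip -p 5201 -t 30 -O 3 | tail -n 4
done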

Your buffer sizes are good, but ensure proper allocation:

# Check actual buffer allocation
cat /proc/sys/net/ipv4/tcp_mem
cat /proc/net/sockstat

# For persistent configuration:
echo "net.core.rmem_max=16777216" >> /etc/sysctl.conf
echo "net.core.wmem_max=16777216" >> /etc/sysctl.conf
echo "net.ipv4.tcp_rmem=4096 87380 16777216" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem=4096 65536 16777216" >> /etc/sysctl.conf

Additional tweaks for high-latency scenarios:

# Keep selective ACKs enabled - SACK/DSACK are essential for efficient loss
# recovery on long-RTT paths (both are on by default)
echo "net.ipv4.tcp_sack=1" >> /etc/sysctl.conf
echo "net.ipv4.tcp_dsack=1" >> /etc/sysctl.conf

# Detect spurious retransmission timeouts with F-RTO (the default on modern kernels)
echo "net.ipv4.tcp_frto=2" >> /etc/sysctl.conf

# Reuse TIME-WAIT sockets for outgoing connections (relies on tcp_timestamps)
echo "net.ipv4.tcp_tw_reuse=1" >> /etc/sysctl.conf

Stepping back, the fundamental constraint on a 100ms-RTT path is the bandwidth-delay product (BDP). For 790Mbps at 100ms RTT:

BDP = Bandwidth * RTT = (790 * 10^6 bits/sec) * 0.1 sec = 79 * 10^6 bits ≈ 9.875MB

Your current window settings (5.2MB) are below this theoretical requirement. While you've increased buffer sizes, several other factors need consideration.
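
To quantify the gap: with a window of W bytes, throughput on this path tops out at roughly W/RTT no matter how fast the link is:

# Best case with the current 5.2MB window over a 100ms RTT
awk 'BEGIN { printf "%.0f Mbit/s\n", (5.2e6 * 8) / 0.1 / 1e6 }'
# => ~416 Mbit/s, well short of the 790Mbps link rate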

Your sysctl settings are a good starting point, but note that the 7061504-byte maximums below (~7MB) are themselves under the ~9.9MB BDP, so the window can never grow large enough to fill the pipe:

# Your current TCP buffer settings
echo "8192 7061504 7061504" > /proc/sys/net/ipv4/tcp_rmem
echo "8192 7061504 7061504" > /proc/sys/net/ipv4/tcp_wmem
echo 7061504 > /proc/sys/net/core/rmem_max
echo 7061504 > /proc/sys/net/core/wmem_max

While 'scalable' is a reasonable choice, it is worth testing alternatives, BBR in particular:

# Available congestion controls
cat /proc/sys/net/ipv4/tcp_available_congestion_control

# Try BBR for high-BDP networks
echo "bbr" > /proc/sys/net/ipv4/tcp_congestion_control

BBR often outperforms traditional loss-based algorithms in high-latency scenarios by modeling the network path.
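
One caveat: BBR relies on packet pacing, which on kernels before 4.13 is only provided by the fq qdisc, so pairing the two is a safe default:

# Use fq as the default qdisc so BBR gets proper pacing
echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf
sysctl -w net.core.default_qdisc=fq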

A few baseline settings are also worth confirming (your output indicates window scaling and timestamps are already enabled):

# Increase TCP window scaling
echo 1 > /proc/sys/net/ipv4/tcp_window_scaling

# Enable TCP timestamps for better RTT estimation
echo 1 > /proc/sys/net/ipv4/tcp_timestamps

# Keepalive settings (housekeeping for idle connections, not a throughput factor)
echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl

When benchmarking, use these iperf3 parameters for more accurate results:

# On server
iperf3 -s -p 5201

# On client (with proper window size)
iperf3 -c server_ip -p 5201 -t 120 -w 8M -P 4 -O 3 -R

Key flags:
- -w 8M: Requests an 8MB socket buffer, which bounds the window (see the note after this list)
- -P 4: Uses 4 parallel streams
- -O 3: Omits first 3 seconds for warmup
- -R: Reverse mode (server-to-client)
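
Note that -w is implemented via setsockopt(SO_SNDBUF/SO_RCVBUF), and the kernel silently caps those requests at net.core.rmem_max / net.core.wmem_max, so raise the core limits first and then verify the size that was actually granted:

# While the test runs, inspect the buffer sizes the socket really received
# (rb = receive buffer bytes, tb = transmit buffer bytes in the skmem output)
ss -tmn 'sport = :5201 or dport = :5201'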

For extreme performance needs, consider these kernel parameters (note they overlap with the 16MB values suggested earlier; the last matching line in /etc/sysctl.conf wins, so keep one consistent set):

# Increase socket buffers
echo "net.core.rmem_default=12582912" >> /etc/sysctl.conf
echo "net.core.wmem_default=12582912" >> /etc/sysctl.conf
echo "net.core.rmem_max=12582912" >> /etc/sysctl.conf
echo "net.core.wmem_max=12582912" >> /etc/sysctl.conf

# TCP memory settings (min, default, max)
echo "net.ipv4.tcp_rmem=4096 12582912 25165824" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem=4096 12582912 25165824" >> /etc/sysctl.conf

# Apply changes
sysctl -p

Use these commands to verify your settings during testing:

# Real-time TCP statistics
ss -t -i -n -p

# Detailed socket information
cat /proc/net/tcp

# Network interface statistics
ethtool -S eth0
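
On the NIC side, the counters most relevant to the 0.29% retransmission rate are drops and errors; filtering the ethtool output makes them easier to spot (interface name eth0 assumed, as above):

# Show only drop/discard/error counters
ethtool -S eth0 | grep -iE 'drop|discard|err'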

Remember that optimal settings depend on your specific network characteristics. Always test changes methodically and measure their impact.