Debugging Excessive TCP Dup ACK & Fast Retransmission in MetroEthernet File Transfers


During file transfers across our MetroEthernet link between two sites (connected via a single SonicWall router), Wireshark captures consistently show:

TCP Dup ACK #1
TCP Fast Retransmission

The traceroute shows minimal latency (under 10ms) between endpoints 192.168.2.153 (client) and 192.168.1.101 (server):

traceroute to 192.168.1.101 (192.168.1.101), 30 hops max, 60 byte packets
1  192.168.2.254  0.747 ms
2  192.168.1.101  8.995 ms

We performed multiple hardware swaps with identical results:

  • Replaced SonicWall with Cisco 1800 series router (same behavior)
  • Connected laptops directly to provider equipment (same subnet)
  • Bypassed all customer-premises equipment

The Wireshark analysis reveals these key patterns:

No.     Time        Source             Destination        Protocol Info
1234    1.234567    192.168.2.153      192.168.1.101      TCP [TCP Dup ACK #1]
1235    1.234789    192.168.1.101      192.168.2.153      TCP [TCP Fast Retransmission]

Key metrics to calculate from the capture:

Retransmission rate = (Retransmitted packets / Total packets) × 100
Dup ACK frequency = Dup ACK count / Total ACKs
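
If you'd rather script these counts than read them off Wireshark's Expert Info window, here is a minimal sketch that shells out to tshark using the same display filters Wireshark applies internally (assumes tshark is on the PATH and Python 3.7+):

# Minimal sketch: count packets matching Wireshark analysis filters via tshark
import subprocess

def count(pcap, display_filter):
    out = subprocess.run(['tshark', '-r', pcap, '-Y', display_filter],
                         capture_output=True, text=True).stdout
    return sum(1 for line in out.splitlines() if line.strip())

total    = count('capture.pcap', 'frame')                        # every packet
retrans  = count('capture.pcap', 'tcp.analysis.retransmission')
dup_acks = count('capture.pcap', 'tcp.analysis.duplicate_ack')
acks     = count('capture.pcap', 'tcp.flags.ack == 1')

print(f"Retransmission rate: {retrans * 100 / total:.2f}%")
print(f"Dup ACK frequency:   {dup_acks / acks:.4f}")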

Recommended provider-side tests:

# Continuous ping with timestamps (Linux ping runs until interrupted by default;
# -t would set the TTL here, not "continuous" as it does on Windows)
ping 192.168.1.101 | while read line; do echo "$(date): $line"; done

# Path MTU discovery
ping -M do -s 1472 192.168.1.101

# Jitter measurement (requires an iperf3 server, i.e. iperf3 -s, listening on 192.168.1.101)
sudo apt install iperf3
iperf3 -c 192.168.1.101 -u -b 100M -t 60 -i 1
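
If you want the jitter and loss numbers in machine-readable form for the report, iperf3 can emit JSON with its -J flag; here is a minimal sketch for pulling out the UDP summary (the end.sum field names are my assumption about iperf3's JSON layout, so verify against your version's output):

# Minimal sketch: run the UDP test and extract jitter/loss from iperf3's JSON output
import json
import subprocess

out = subprocess.run(
    ['iperf3', '-c', '192.168.1.101', '-u', '-b', '100M', '-t', '60', '-J'],
    capture_output=True, text=True).stdout
summary = json.loads(out)['end']['sum']  # UDP summary block (assumed field names)
print(f"jitter: {summary['jitter_ms']:.3f} ms, "
      f"loss: {summary['lost_packets']}/{summary['packets']} packets")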

Possible Linux system tweaks (server-side):

# Check current settings
sysctl -a | grep tcp

# Recommended adjustments (most of these are already the defaults on modern kernels)
sudo sysctl -w net.ipv4.tcp_sack=1
sudo sysctl -w net.ipv4.tcp_fack=1   # removed in Linux 4.15+; skip if absent
sudo sysctl -w net.ipv4.tcp_window_scaling=1
sudo sysctl -w net.ipv4.tcp_timestamps=1
sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0
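
Note that sysctl -w changes do not survive a reboot; to make them permanent, put the same key=value pairs in a file under /etc/sysctl.d/ and load them with sudo sysctl --system.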

For file transfer applications, consider implementing:

# Python example using larger socket buffers
# (Linux caps these at net.core.wmem_max / net.core.rmem_max; raise those
#  sysctls too if getsockopt shows a smaller effective size)
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4194304)  # 4 MB send buffer
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4194304)  # 4 MB receive buffer

Or implement application-level retry logic:

// JavaScript example with exponential backoff
async function reliableTransfer(data, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await transferData(data);  // your application's own send routine
    } catch (err) {
      if (attempt === maxRetries - 1) throw err;  // out of retries: surface the real error
      const delay = Math.pow(2, attempt) * 1000;  // 1s, 2s, 4s, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
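
Bear in mind that application-level backoff only masks the symptom: the retransmissions still happen underneath and will still show up in captures, which is actually useful, since it preserves the evidence you need when escalating to the provider.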

When working with circuit providers:

  • Request RFC 2544 testing results
  • Ask for jitter and latency measurements under load
  • Request interface error counters from their switches
  • Demand testing with known-good traffic patterns

When analyzing network performance issues, few things are as frustrating as persistent TCP retransmissions. In your case, we're seeing:

1. Frequent TCP Dup ACK packets
2. TCP Fast Retransmission events
3. Both occurring despite low latency (sub-10ms)
4. Both persisting across different router hardware

A receiver emits duplicate ACKs when segments arrive out of order, i.e. when a lost packet has left a gap; after three duplicate ACKs the sender fast-retransmits without waiting for the retransmission timeout. Seen together, these two symptoms are the classic signature of packet loss on the path.

Your testing methodology clearly points to the MetroEthernet circuit as the culprit. Key observations:

  • Issue persists when bypassing routers entirely
  • Same behavior when connecting laptops directly to provider equipment
  • Service provider insists their tests show no problems

Let's examine what Wireshark captures typically reveal in such scenarios:

# Sample tshark filter to identify retransmission patterns
tshark -r capture.pcap -Y "tcp.analysis.retransmission || tcp.analysis.fast_retransmission" \
       -T fields -e frame.number -e ip.src -e ip.dst -e tcp.seq -e tcp.ack

The output would show patterns like this (tshark prints only the requested fields, so the trailing annotations here are added for clarity):

1234  192.168.2.153  192.168.1.101  12345678  87654321   <- dup ACK from the client
1235  192.168.1.101  192.168.2.153  87654321  12345678   <- fast retransmission from the server

Since providers often claim "everything tests OK," here's a Python script to gather concrete evidence:

import socket
import time
from collections import defaultdict

def monitor_tcp_performance(dest_ip, dest_port, duration, interval=1.0):
    """Repeatedly open a TCP connection and tally outcomes over `duration` seconds."""
    stats = defaultdict(int)
    start_time = time.time()

    while time.time() - start_time < duration:
        try:
            with socket.create_connection((dest_ip, dest_port), timeout=2) as s:
                s.send(b'PING')            # payload is arbitrary; we care about the connection
                data = s.recv(1024)
                if data:
                    stats['ok'] += 1
                else:
                    stats['closed_by_peer'] += 1  # empty read = remote closed, not a timeout
        except socket.timeout:
            stats['timeout'] += 1
        except socket.error as e:
            stats[str(e)] += 1
        time.sleep(interval)               # pace the probes instead of hammering the port

    return dict(stats)

# Example usage (port 445 assumes an SMB server is listening; any open TCP port works):
stats = monitor_tcp_performance('192.168.1.101', 445, 300)
print(f"Connection results observed: {stats}")

MetroEthernet circuits often have hidden MTU constraints. Try this diagnostic:

# Linux MTU path discovery (1472-byte payload + 28 bytes of IP/ICMP headers = 1500)
ping -M do -s 1472 192.168.1.101  # reduce the size until the ping succeeds

# Windows equivalent
ping -f -l 1472 192.168.1.101
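
To automate the sweep on the Linux side, here is a minimal sketch (assuming the iputils ping shown above; the helper name is mine) that steps the payload down until a DF-flagged ping gets through:

# Minimal sketch: find the largest DF-flagged payload the path accepts
# (path MTU = payload + 20 bytes IP header + 8 bytes ICMP header)
import subprocess

def find_path_mtu(dest, high=1472, low=1200, step=8):
    # hypothetical helper: walks payload sizes down in `step`-byte decrements
    for size in range(high, low - 1, -step):
        result = subprocess.run(
            ['ping', '-M', 'do', '-c', '1', '-W', '1', '-s', str(size), dest],
            capture_output=True)
        if result.returncode == 0:
            return size + 28  # first size that survives with DF set
    return None  # nothing got through; suspect a tunnel or misconfigured segment

print(find_path_mtu('192.168.1.101'))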

When dealing with uncooperative providers, include these metrics in your reports:

  1. Retransmission rate percentage
  2. Pattern of lost segments
  3. Proof that local equipment isn't the bottleneck

Here's how to calculate retransmission rate from pcap data:

# Shell snippet; note bash assignments take no spaces around the =
total_packets=$(capinfos -cM capture.pcap | grep "Number of packets" | awk '{print $NF}')
retrans_packets=$(tshark -r capture.pcap -Y "tcp.analysis.retransmission" | wc -l)
retrans_rate=$(echo "scale=2; $retrans_packets * 100 / $total_packets" | bc)
echo "Retransmission rate: $retrans_rate%"

To eliminate TCP stack variables, try lower-level testing:

# Use iperf3 for controlled testing (-Z uses zero-copy sends to rule out sender CPU)
iperf3 -c 192.168.1.101 -t 60 -i 5 -w 256K -Z

# Watch the Retr (retransmission) column in the output:
[ ID] Interval           Transfer     Bitrate         Retr
[  4]   0.00-5.00   sec   112 MBytes   188 Mbits/sec   43
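
A consistently nonzero Retr column at modest bitrates on a sub-10ms path is exactly the kind of evidence to hand back to the provider: the stack is retransmitting because the circuit is dropping segments, not because of latency or endpoint configuration.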