Debugging Excessive TCP Dup ACK & Fast Retransmission in MetroEthernet File Transfers


During file transfers across our MetroEthernet link between two sites (connected via a single SonicWall router), Wireshark captures consistently show:

TCP Dup ACK #1
TCP Fast Retransmission

The traceroute shows minimal latency (under 10ms) between endpoints 192.168.2.153 (client) and 192.168.1.101 (server):

traceroute to 192.168.1.101 (192.168.1.101), 30 hops max, 60 byte packets
1  192.168.2.254  0.747 ms
2  192.168.1.101  8.995 ms

We performed multiple hardware swaps with identical results:

  • Replaced SonicWall with Cisco 1800 series router (same behavior)
  • Connected laptops directly to provider equipment (same subnet)
  • Bypassed all customer-premises equipment

The Wireshark analysis reveals these key patterns:

No.     Time        Source             Destination        Protocol Info
1234    1.234567    192.168.2.153      192.168.1.101      TCP [TCP Dup ACK #1]
1235    1.234789    192.168.1.101      192.168.2.153      TCP [TCP Fast Retransmission]

Key metrics to calculate from the capture:

Retransmission rate = (Retransmitted packets / Total packets) × 100
Dup ACK frequency = Dup ACK count / Total ACKs
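
If you'd rather script these counts than read them off Wireshark's Expert Info window, here is a minimal sketch that shells out to tshark using the same display filters Wireshark applies internally (assumes tshark is on the PATH and Python 3.7+):

# Minimal sketch: count packets matching Wireshark analysis filters via tshark
import subprocess

def count(pcap, display_filter):
    out = subprocess.run(['tshark', '-r', pcap, '-Y', display_filter],
                         capture_output=True, text=True).stdout
    return sum(1 for line in out.splitlines() if line.strip())

total    = count('capture.pcap', 'frame')                        # every packet
retrans  = count('capture.pcap', 'tcp.analysis.retransmission')
dup_acks = count('capture.pcap', 'tcp.analysis.duplicate_ack')
acks     = count('capture.pcap', 'tcp.flags.ack == 1')

print(f"Retransmission rate: {retrans * 100 / total:.2f}%")
print(f"Dup ACK frequency:   {dup_acks / acks:.4f}")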

Recommended provider-side tests:

# Continuous ping with timestamps (Linux ping runs until interrupted by default;
# -t would set the TTL here, not "continuous" as it does on Windows)
ping 192.168.1.101 | while read line; do echo "$(date): $line"; done

# Path MTU discovery
ping -M do -s 1472 192.168.1.101

# Jitter measurement (requires an iperf3 server, i.e. iperf3 -s, listening on 192.168.1.101)
sudo apt install iperf3
iperf3 -c 192.168.1.101 -u -b 100M -t 60 -i 1
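
If you want the jitter and loss numbers in machine-readable form for the report, iperf3 can emit JSON with its -J flag; here is a minimal sketch for pulling out the UDP summary (the end.sum field names are my assumption about iperf3's JSON layout, so verify against your version's output):

# Minimal sketch: run the UDP test and extract jitter/loss from iperf3's JSON output
import json
import subprocess

out = subprocess.run(
    ['iperf3', '-c', '192.168.1.101', '-u', '-b', '100M', '-t', '60', '-J'],
    capture_output=True, text=True).stdout
summary = json.loads(out)['end']['sum']  # UDP summary block (assumed field names)
print(f"jitter: {summary['jitter_ms']:.3f} ms, "
      f"loss: {summary['lost_packets']}/{summary['packets']} packets")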

Possible Linux system tweaks (server-side):

# Check current settings
sysctl -a | grep tcp

# Recommended adjustments (most of these are already the defaults on modern kernels)
sudo sysctl -w net.ipv4.tcp_sack=1
sudo sysctl -w net.ipv4.tcp_fack=1   # removed in Linux 4.15+; skip if absent
sudo sysctl -w net.ipv4.tcp_window_scaling=1
sudo sysctl -w net.ipv4.tcp_timestamps=1
sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0
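
Note that sysctl -w changes do not survive a reboot; to make them permanent, put the same key=value pairs in a file under /etc/sysctl.d/ and load them with sudo sysctl --system.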

For file transfer applications, consider implementing:

# Python example using larger socket buffers
# (Linux caps these at net.core.wmem_max / net.core.rmem_max; raise those
#  sysctls too if getsockopt shows a smaller effective size)
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4194304)  # 4 MB send buffer
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4194304)  # 4 MB receive buffer

Or implement application-level retry logic:

// JavaScript example with exponential backoff
async function reliableTransfer(data, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await transferData(data);  // your application's own send routine
    } catch (err) {
      if (attempt === maxRetries - 1) throw err;  // out of retries: surface the real error
      const delay = Math.pow(2, attempt) * 1000;  // 1s, 2s, 4s, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
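
Bear in mind that application-level backoff only masks the symptom: the retransmissions still happen underneath and will still show up in captures, which is actually useful, since it preserves the evidence you need when escalating to the provider.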

When working with circuit providers:

  • Request RFC 2544 testing results
  • Ask for jitter and latency measurements under load
  • Request interface error counters from their switches
  • Demand testing with known-good traffic patterns

When analyzing network performance issues, few things are as frustrating as persistent TCP retransmissions. In your case, we're seeing:

1. Frequent TCP Dup ACK packets
2. TCP Fast Retransmission events
3. Both occurring despite low latency (sub-10ms)
4. Both persisting across different router hardware

A receiver emits duplicate ACKs when segments arrive out of order, i.e. when a lost packet has left a gap; after three duplicate ACKs the sender fast-retransmits without waiting for the retransmission timeout. Seen together, these two symptoms are the classic signature of packet loss on the path.

Your testing methodology clearly points to the MetroEthernet circuit as the culprit. Key observations:

  • Issue persists when bypassing routers entirely
  • Same behavior when connecting laptops directly to provider equipment
  • Service provider insists their tests show no problems

Let's examine what Wireshark captures typically reveal in such scenarios:

# Sample tshark filter to identify retransmission patterns
tshark -r capture.pcap -Y "tcp.analysis.retransmission || tcp.analysis.fast_retransmission" \
       -T fields -e frame.number -e ip.src -e ip.dst -e tcp.seq -e tcp.ack

The output would show patterns like this (tshark prints only the requested fields, so the trailing annotations here are added for clarity):

1234  192.168.2.153  192.168.1.101  12345678  87654321   <- dup ACK from the client
1235  192.168.1.101  192.168.2.153  87654321  12345678   <- fast retransmission from the server

Since providers often claim "everything tests OK," here's a Python script to gather concrete evidence:

import socket
import time
from collections import defaultdict

def monitor_tcp_performance(dest_ip, dest_port, duration, interval=1.0):
    """Repeatedly open a TCP connection and tally outcomes over `duration` seconds."""
    stats = defaultdict(int)
    start_time = time.time()

    while time.time() - start_time < duration:
        try:
            with socket.create_connection((dest_ip, dest_port), timeout=2) as s:
                s.send(b'PING')            # payload is arbitrary; we care about the connection
                data = s.recv(1024)
                if data:
                    stats['ok'] += 1
                else:
                    stats['closed_by_peer'] += 1  # empty read = remote closed, not a timeout
        except socket.timeout:
            stats['timeout'] += 1
        except socket.error as e:
            stats[str(e)] += 1
        time.sleep(interval)               # pace the probes instead of hammering the port

    return dict(stats)

# Example usage (port 445 assumes an SMB server is listening; any open TCP port works):
stats = monitor_tcp_performance('192.168.1.101', 445, 300)
print(f"Connection results observed: {stats}")

MetroEthernet circuits often have hidden MTU constraints. Try this diagnostic:

# Linux MTU path discovery (1472-byte payload + 28 bytes of IP/ICMP headers = 1500)
ping -M do -s 1472 192.168.1.101  # reduce the size until the ping succeeds

# Windows equivalent
ping -f -l 1472 192.168.1.101
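
To automate the sweep on the Linux side, here is a minimal sketch (assuming the iputils ping shown above; the helper name is mine) that steps the payload down until a DF-flagged ping gets through:

# Minimal sketch: find the largest DF-flagged payload the path accepts
# (path MTU = payload + 20 bytes IP header + 8 bytes ICMP header)
import subprocess

def find_path_mtu(dest, high=1472, low=1200, step=8):
    # hypothetical helper: walks payload sizes down in `step`-byte decrements
    for size in range(high, low - 1, -step):
        result = subprocess.run(
            ['ping', '-M', 'do', '-c', '1', '-W', '1', '-s', str(size), dest],
            capture_output=True)
        if result.returncode == 0:
            return size + 28  # first size that survives with DF set
    return None  # nothing got through; suspect a tunnel or misconfigured segment

print(find_path_mtu('192.168.1.101'))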

When dealing with uncooperative providers, include these metrics in your reports:

  1. Retransmission rate percentage
  2. Pattern of lost segments
  3. Proof that local equipment isn't the bottleneck

Here's how to calculate retransmission rate from pcap data:

# Shell snippet; note bash assignments take no spaces around the =
total_packets=$(capinfos -cM capture.pcap | grep "Number of packets" | awk '{print $NF}')
retrans_packets=$(tshark -r capture.pcap -Y "tcp.analysis.retransmission" | wc -l)
retrans_rate=$(echo "scale=2; $retrans_packets * 100 / $total_packets" | bc)
echo "Retransmission rate: $retrans_rate%"

To eliminate TCP stack variables, try lower-level testing:

# Use iperf3 for controlled testing (-Z uses zero-copy sends to rule out sender CPU)
iperf3 -c 192.168.1.101 -t 60 -i 5 -w 256K -Z

# Watch the Retr (retransmission) column in the output:
[ ID] Interval           Transfer     Bitrate         Retr
[  4]   0.00-5.00   sec   112 MBytes   188 Mbits/sec   43
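
A consistently nonzero Retr column at modest bitrates on a sub-10ms path is exactly the kind of evidence to hand back to the provider: the stack is retransmitting because the circuit is dropping segments, not because of latency or endpoint configuration.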