Diagnosing and Resolving Intermittent TCP Retransmissions in a LAN Environment with VoIP Disruptions



In our 100-node LAN environment with Windows domain servers and VoIP infrastructure, we're experiencing periodic TCP retransmissions that correlate with:

  • VoIP phones spontaneously rebooting (sometimes during active calls)
  • Brief network share access freezes
  • Database connection drops in administration software

Wireshark captures reveal 2-3 clusters of retransmissions per day (5 to 100+ packets each), primarily between the PBX and varying subsets of VoIP phones. Notably, these events do not correlate with peak network traffic periods.

Here's a sample Wireshark display filter I've found useful for identifying problematic patterns:

tcp.analysis.retransmission || tcp.analysis.fast_retransmission || 
tcp.analysis.out_of_order || tcp.analysis.lost_segment

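When examining retransmission patterns, pay special attention to retransmissions that arrive more than one second after the previous captured frame (frame.time_delta is measured against the previous frame in the capture, not the previous packet of the same connection):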

frame.time_delta > 1 && tcp.analysis.retransmission
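
To put numbers on the "2-3 daily clusters" observation, the timestamps of matching packets can be exported (for example with tshark's `-T fields -e frame.time_epoch` output) and grouped by the gap between them. The following is a minimal sketch; the function names and the 5-second `max_gap` default are illustrative assumptions, not values derived from the captures.

```python
def cluster_events(timestamps, max_gap=5.0):
    """Group event timestamps (seconds since epoch) into clusters.

    Consecutive events belong to the same cluster when they are at most
    `max_gap` seconds apart. `max_gap` is an assumed, tunable default.
    """
    clusters = []
    for ts in sorted(timestamps):
        if clusters and ts - clusters[-1][-1] <= max_gap:
            clusters[-1].append(ts)   # close enough: extend the current cluster
        else:
            clusters.append([ts])     # gap too large: start a new cluster
    return clusters


def summarize(clusters):
    """Return (start, duration_seconds, packet_count) for each cluster."""
    return [(c[0], c[-1] - c[0], len(c)) for c in clusters]
```

Comparing the cluster start times across several days shows whether the events recur at fixed times (which would point to a scheduled job or timer) or at random.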

Before diving deep into protocol analysis, perform these basic checks:

# Check for duplex mismatches (Linux example)
ethtool eth0 | grep -E "Speed|Duplex"

# Verify switch port error counters (Cisco IOS example)
show interfaces counters errors
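
When the duplex check has to be repeated on many Linux hosts, the `ethtool` output can be parsed rather than read by eye. This is a rough sketch that assumes the usual `Speed:` and `Duplex:` lines in the output; the helper names and the `expected_speed` default are hypothetical and should be adapted to your links.

```python
import re


def parse_link_settings(ethtool_output):
    """Extract speed and duplex from the text printed by `ethtool <iface>`."""
    speed = re.search(r"^\s*Speed:\s*(\S+)", ethtool_output, re.MULTILINE)
    duplex = re.search(r"^\s*Duplex:\s*(\S+)", ethtool_output, re.MULTILINE)
    return {
        "speed": speed.group(1) if speed else None,
        "duplex": duplex.group(1) if duplex else None,
    }


def looks_suspicious(settings, expected_speed="1000Mb/s"):
    """Flag anything other than full duplex at the expected speed.

    `expected_speed` is an assumption; set it to what the port should negotiate.
    """
    return settings["duplex"] != "Full" or settings["speed"] != expected_speed
```

The text passed to `parse_link_settings` would typically come from `subprocess.run(["ethtool", "eth0"], capture_output=True, text=True).stdout`.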

When basic checks don't reveal the issue, implement these monitoring solutions:

1. Continuous Network Baseline

Create a Python script to monitor key metrics:

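import time
from datetime import datetime

import psutil

def network_metrics():
    """Snapshot the host-wide NIC counters (cumulative since boot)."""
    net_io = psutil.net_io_counters()
    return {
        'timestamp': datetime.now().isoformat(),
        'bytes_sent': net_io.bytes_sent,
        'bytes_recv': net_io.bytes_recv,
        'packets_sent': net_io.packets_sent,
        'packets_recv': net_io.packets_recv,
        'errin': net_io.errin,      # receive errors
        'errout': net_io.errout,    # transmit errors
        'dropin': net_io.dropin,    # inbound packets dropped
        'dropout': net_io.dropout   # outbound packets dropped
    }

if __name__ == '__main__':
    # Print one sample per minute; redirect the output to a file to build the baseline
    while True:
        print(network_metrics())
        time.sleep(60)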
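
Because these counters are cumulative, a single sample says little; what matters is how much the error counters grow between samples. A minimal sketch of that comparison, assuming two dictionaries shaped like the output of `network_metrics()` (the helper names are illustrative):

```python
def counter_deltas(previous, current, keys=("errin", "errout")):
    """Return how much each selected counter grew between two samples."""
    return {key: current[key] - previous[key] for key in keys}


def error_increase(previous, current):
    """Return only the counters that increased since the previous sample."""
    deltas = counter_deltas(previous, current)
    return {key: delta for key, delta in deltas.items() if delta > 0}
```

Logging the result together with the sample timestamp makes it possible to line up error bursts with the retransmission clusters seen in Wireshark.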

2. Switch Port Mirroring

Configure SPAN ports on critical switches to capture traffic during events:

# Cisco example (spaces around the hyphen are required in the interface range)
monitor session 1 source interface Gi1/0/1 - 24
monitor session 1 destination interface Gi1/0/48

For SIP/RTP traffic issues, this Wireshark display filter helps isolate signaling and media (adjust the RTP port range to match your PBX):

sip || rtp || udp.port == 5060 || udp.port == 5061 || 
(udp.port >= 10000 && udp.port <= 20000)
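
Since the RTP port range differs between PBX products, it can be convenient to generate the filter from the configured values. A small sketch under that assumption; `build_voip_filter` is a hypothetical helper and its defaults simply mirror the ports used above.

```python
def build_voip_filter(sip_ports=(5060, 5061), rtp_range=(10000, 20000)):
    """Build a Wireshark display filter for SIP signaling and RTP media."""
    low, high = rtp_range
    if low > high:
        raise ValueError("rtp_range must be given as (low, high)")
    sip_terms = [f"udp.port == {port}" for port in sip_ports]
    rtp_term = f"(udp.port >= {low} && udp.port <= {high})"
    # Protocol names first, then the port-based fallbacks
    return " || ".join(["sip", "rtp", *sip_terms, rtp_term])
```

The returned string can be pasted into the Wireshark filter bar or passed to tshark with `-Y`.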

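Check that QoS is configured consistently across all network devices (the command below is the Catalyst 2960-X form; other platforms use different syntax):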

show mls qos interface statistics

Consider hardware problems when you observe:

  • Retransmissions occurring across multiple switch domains simultaneously
  • Issues persisting during low-traffic periods
  • Problems appearing without any preceding configuration change

Essential hardware checks include:

# Check for CRC errors (Linux)
cat /sys/class/net/eth0/statistics/rx_crc_errors
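
The same sysfs directory exposes several related counters, so all interfaces can be checked in one pass. This is a sketch for Linux hosts only; the list of counters is an illustrative selection and the `base` parameter exists mainly so the function can be pointed at a test directory.

```python
from pathlib import Path

# Counter files to read from <base>/<iface>/statistics (illustrative selection)
COUNTERS = ("rx_crc_errors", "rx_errors", "tx_errors", "rx_dropped")


def read_error_counters(base="/sys/class/net"):
    """Return {interface: {counter: value}} for every interface under `base`."""
    results = {}
    for iface in sorted(Path(base).iterdir()):
        stats_dir = iface / "statistics"
        values = {}
        for name in COUNTERS:
            counter_file = stats_dir / name
            if counter_file.is_file():
                values[name] = int(counter_file.read_text().strip())
        if values:
            results[iface.name] = values
    return results
```

Calling `read_error_counters()` twice, some minutes apart, and comparing the results shows whether CRC errors are still accumulating or are left over from an old fault.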

When updating network device firmware:

  1. Start with core switches
  2. Proceed to edge switches
  3. Update VoIP phones in controlled batches
  4. Document each update with before/after packet captures

Remember to capture baseline statistics before updates:

# Cisco example (writes the output to the switch's flash)
show tech-support | redirect flash:pre-upgrade-tech.txt

To recap the symptoms: in this 100-node LAN with Windows domain servers and VoIP phones, phones sporadically reboot (sometimes mid-call) while workstations temporarily lose access to network shares. Wireshark captures show clusters of TCP retransmissions (5 to 100+ packets) occurring 2-3 times daily, primarily between the PBX and random subsets of VoIP phones.

A time-bounded display filter isolates a single event window for closer inspection:

# Sample Wireshark filter showing retransmission patterns (adjust the timestamps to the event window)
frame.time >= "2023-05-01 14:00:00" && 
frame.time <= "2023-05-01 15:00:00" &&
tcp.analysis.retransmission

The retransmissions exhibit these characteristics:

  • No consistent correlation with network load (occurs during peak and idle periods)
  • Often affects phones on the same switch, but can also span distant network segments
  • Coincident retransmissions in file server traffic
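
The first observation can be checked numerically by counting bytes and retransmissions per time interval and computing the correlation between the two series. A minimal sketch of a Pearson correlation coefficient in plain Python; how the per-interval series are produced (for example from capture statistics) is left open here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(var_x * var_y)
```

A coefficient near zero between traffic volume and retransmission count supports the view that congestion is not the cause; a value close to 1 would point back toward load.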

Our network topology includes:

Network Map:
Core Switch (Cisco 3850) -- Edge Switches (12x Cisco 2960X)
                         |
                         -- VoIP VLAN (PBX + Phones)
                         -- Data VLAN (Servers + Workstations)

Potential switch-related issues to investigate:

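# Cisco IOS commands for diagnostics
show interfaces counters errors
show spanning-tree vlan 100
# Per-queue drop counters; syntax varies by platform and release
# (Catalyst 3850 form on IOS XE 16.x shown)
show platform hardware fed switch active qos queue stats interface gi1/0/1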

Since the issue manifests most visibly with VoIP devices, we should:

  1. Verify QoS configuration matches vendor requirements
  2. Check for buffer overruns on switch ports
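  3. Test with LLDP-MED disabled (known to cause issues with some phones)

If a VoIP client on a Windows workstation records its registration state in the registry, that value can be polled as well. The registry path and value name below are placeholders; substitute whatever your client actually writes:

# Sample PowerShell to monitor SIP registration status
# (placeholder registry path and value name)
$registryPath = "HKLM:\Software\VoIPClient\"
Get-ItemProperty -Path $registryPath -Name "LastRegistrationAttempt"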

On the domain controllers, inspect the relevant TCP and adapter settings:

Windows Network Diagnostics:
netsh interface tcp show global
Get-NetAdapterAdvancedProperty -Name "*" | 
  Where-Object {$_.DisplayName -match "Interrupt Moderation"}

Particular attention should be paid to:

  • TCP Chimney Offload settings (relevant on older Windows Server releases; the feature is deprecated in current ones)
  • Network adapter power management
  • RSS (Receive Side Scaling) configuration

Recommended step-by-step investigation:

1. Baseline Network:
   - Document current configurations
   - Establish performance benchmarks
   - Update all switch firmware (after the baseline is recorded)

2. Targeted Monitoring:
   - Deploy continuous Wireshark captures
   - Implement NetFlow/sFlow monitoring
   - Log switch CPU/memory utilization

3. Controlled Testing:
   - Isolate VoIP traffic on dedicated links
   - Test with different NIC drivers
   - Validate STP timers
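
For the STP timer validation in step 3, the classic IEEE 802.1D relationships between hello time, max age, and forward delay can be checked mechanically. A small sketch, with the function name chosen for illustration and the defaults set to the standard 2/20/15-second values:

```python
def stp_timers_consistent(hello=2, max_age=20, forward_delay=15):
    """Check the 802.1D timer relationships (all values in seconds).

    2 * (forward_delay - 1) >= max_age
    max_age >= 2 * (hello + 1)
    """
    return (2 * (forward_delay - 1) >= max_age) and (max_age >= 2 * (hello + 1))
```

Feeding in the values reported by `show spanning-tree` on each switch quickly reveals a device whose timers were changed by hand and no longer fit together.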

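To flag retransmission spikes in saved captures, consider this Python snippet (it requires pyshark and a local tshark installation):

import pyshark
from collections import Counter

def detect_retransmissions(pcap_file, threshold=10):
    """Return the flows with more than `threshold` retransmissions."""
    # Let tshark apply the filter so only retransmitted segments are parsed
    cap = pyshark.FileCapture(
        pcap_file,
        display_filter='tcp.analysis.retransmission',
        keep_packets=False,  # do not hold every packet in memory
    )
    retrans_counts = Counter()
    try:
        for pkt in cap:
            # Skip packets without an IPv4 layer (e.g. IPv6)
            if not hasattr(pkt, 'ip'):
                continue
            src_dst = f"{pkt.ip.src}:{pkt.tcp.srcport} -> {pkt.ip.dst}:{pkt.tcp.dstport}"
            retrans_counts[src_dst] += 1
    finally:
        cap.close()  # stop the underlying tshark process

    return {k: v for k, v in retrans_counts.items() if v > threshold}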

Based on the observed patterns, we should prioritize:

  1. Switch firmware updates (particularly for spanning-tree implementations)
  2. VoIP VLAN QoS verification and potential reconfiguration
  3. Windows Server TCP stack tuning
  4. Physical layer validation (cable testing, interface error monitoring)