In our 100-node LAN environment with Windows domain servers and VoIP infrastructure, we're experiencing periodic TCP retransmissions that correlate with:
- VoIP phones spontaneously rebooting (sometimes during active calls)
- Brief network share access freezes
- Database connection drops in administration software
Wireshark captures reveal 2-3 daily clusters of retransmissions (5-100+ packets each), primarily between the PBX and various VoIP phone subsets. Interestingly, these events don't correlate with peak network traffic periods.
Here's a sample Wireshark display filter I've found useful for identifying problematic patterns:
tcp.analysis.retransmission || tcp.analysis.fast_retransmission ||
tcp.analysis.out_of_order || tcp.analysis.lost_segment
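That filter can also produce the daily cluster counts described above. A small script exports the matching timestamps with tshark and groups them into bursts. This is a minimal sketch that assumes tshark is on the PATH. The 5-second gap and the 5-packet minimum are illustrative thresholds chosen to match the observed 5-100+ packet clusters, not tuned values.

```python
import subprocess

# Same expression as the display filter above
RETRANS_FILTER = (
    "tcp.analysis.retransmission || tcp.analysis.fast_retransmission || "
    "tcp.analysis.out_of_order || tcp.analysis.lost_segment"
)


def retrans_timestamps(pcap_file):
    """Return epoch timestamps of frames matching RETRANS_FILTER (requires tshark)."""
    out = subprocess.run(
        ["tshark", "-r", pcap_file, "-Y", RETRANS_FILTER,
         "-T", "fields", "-e", "frame.time_epoch"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [float(value) for value in out.split()]


def find_clusters(timestamps, gap=5.0, min_size=5):
    """Group timestamps into bursts; a new burst starts after `gap` seconds of silence.

    Returns (first_ts, last_ts, count) for bursts with at least `min_size` frames.
    """
    clusters, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            if len(current) >= min_size:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(ts)
    if len(current) >= min_size:
        clusters.append((current[0], current[-1], len(current)))
    return clusters
```

In practice, `find_clusters(retrans_timestamps("day.pcapng"))` lists each burst with its start time, end time and size. You can then compare those against phone reboot times in the PBX logs.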
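When examining retransmission patterns, pay special attention to retransmissions that follow more than one second of silence in the same TCP stream. The tcp.time_delta field requires the TCP protocol preference "Calculate conversation timestamps". By contrast, frame.time_delta only measures the gap to the previous captured frame of any stream:
tcp.time_delta > 1 && tcp.analysis.retransmission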
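A gap of more than a second before a retransmission usually points to a retransmission timeout rather than a fast retransmit. You can compute the gap within each TCP stream offline, which helps when comparing many captures. This is a minimal sketch. It assumes rows of (stream_id, epoch_time, is_retransmission), such as tshark could export with `-T fields -e tcp.stream -e frame.time_epoch -e tcp.analysis.retransmission`. The tuple layout and the one-second threshold are assumptions.

```python
def slow_retransmissions(rows, threshold=1.0):
    """Return (stream, time, gap) for retransmissions preceded by more than
    `threshold` seconds of silence in the same TCP stream."""
    last_seen = {}  # stream id -> time of the previous frame in that stream
    flagged = []
    for stream, ts, is_retrans in sorted(rows, key=lambda row: row[1]):
        prev = last_seen.get(stream)
        if is_retrans and prev is not None and ts - prev > threshold:
            flagged.append((stream, ts, ts - prev))
        last_seen[stream] = ts
    return flagged
```

The function tracks the last frame per stream, not per capture. A busy SPAN port therefore does not hide a stalled PBX-to-phone connection.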
Before diving deep into protocol analysis, perform these basic checks:
# Check for duplex mismatches (Linux example)
ethtool eth0 | grep -E "Speed|Duplex"
# Verify switch port error counters (Cisco IOS)
show interfaces counters errors
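On Linux hosts you can script the duplex check across many machines. This minimal sketch parses ethtool's "Speed:" and "Duplex:" lines and flags anything other than full duplex. The function names are hypothetical, and only `check_interface` actually invokes ethtool.

```python
import subprocess


def parse_link(ethtool_output):
    """Extract the Speed and Duplex lines from `ethtool <iface>` output."""
    info = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in ("Speed", "Duplex"):
            info[key] = value.strip()
    return info


def duplex_problem(info):
    """True unless the link reports full duplex (half, unknown, or missing all count)."""
    return info.get("Duplex") != "Full"


def check_interface(iface="eth0"):
    """Run ethtool locally (Linux) and return (info, problem)."""
    out = subprocess.run(["ethtool", iface], capture_output=True, text=True).stdout
    info = parse_link(out)
    return info, duplex_problem(info)
```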
When basic checks don't reveal the issue, implement these monitoring solutions:
1. Continuous Network Baseline
Create a Python script to monitor key metrics:
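import time
from datetime import datetime
import psutil
def network_metrics():
    """Snapshot host-wide NIC counters; values are cumulative since boot."""
    net_io = psutil.net_io_counters()
    return {
        'timestamp': datetime.now().isoformat(),
        'bytes_sent': net_io.bytes_sent,
        'bytes_recv': net_io.bytes_recv,
        'packets_sent': net_io.packets_sent,
        'packets_recv': net_io.packets_recv,
        'errin': net_io.errin,
        'errout': net_io.errout,
        'dropin': net_io.dropin,
        'dropout': net_io.dropout,
    }
if __name__ == '__main__':
    # Log one sample per minute so counter jumps can be lined up with retransmission clusters
    while True:
        print(network_metrics())
        time.sleep(60)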
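The psutil counters are cumulative since boot, so the useful signal is the change between two samples. This minimal sketch diffs two snapshots produced by network_metrics() and reports which error or drop counters increased. The helper name and the watched keys are assumptions. Missing keys default to 0, so older snapshots without drop counters still work.

```python
WATCHED = ("errin", "errout", "dropin", "dropout")


def counter_deltas(prev, curr, watched=WATCHED):
    """Return {counter: increase} for watched counters that grew between two snapshots.

    A negative difference (for example after a reboot resets the counters) is ignored.
    """
    deltas = {}
    for key in watched:
        # .get() tolerates snapshots that do not record every counter
        increase = curr.get(key, 0) - prev.get(key, 0)
        if increase > 0:
            deltas[key] = increase
    return deltas
```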
2. Switch Port Mirroring
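Configure SPAN ports on critical switches to capture traffic during events. Mirroring 24 gigabit access ports into a single gigabit destination can oversubscribe it. If the capture shows drops, narrow the source list to the PBX and the affected phone ports: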
# Cisco example
monitor session 1 source interface Gi1/0/1-24
monitor session 1 destination interface Gi1/0/48
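The events occur only two or three times a day, so the SPAN destination is most useful when a capture host records it continuously into a ring buffer. You then keep only the files around each event. This minimal sketch builds a dumpcap command line using its ring-buffer options: `-b filesize:` rotates after a size in kilobytes, and `-b files:` sets how many files are kept. The interface name, output path and sizes are placeholders.

```python
import shlex


def ring_buffer_command(interface, outfile, filesize_kb=100_000, files=50):
    """Build a dumpcap argv that keeps `files` rotating files of about `filesize_kb` kB each."""
    return [
        "dumpcap",
        "-i", interface,                  # NIC cabled to the SPAN destination port
        "-b", f"filesize:{filesize_kb}",  # switch to a new file after this many kB
        "-b", f"files:{files}",           # keep only the newest N files
        "-w", outfile,                    # base name; dumpcap adds a sequence number and timestamp
    ]


# Print a shell-ready command for the capture host (placeholder interface and path)
print(shlex.join(ring_buffer_command("eth1", "/captures/span.pcapng")))
```

With the defaults this keeps roughly the last 5 GB of traffic. Size it so that a whole cluster plus some lead-up still fits after a user reports a dropped call.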
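For SIP/RTP traffic issues, these Wireshark filters help isolate problems. SIP over TLS on port 5061 runs over TCP, not UDP. Adjust the RTP range to the one your PBX actually uses:
sip || rtp || udp.port == 5060 || tcp.port == 5060 || tcp.port == 5061 ||
udp.port in {10000..20000}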
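Once RTP streams are isolated, you can estimate packet loss from their 16-bit sequence numbers. Wireshark's Telephony > RTP > RTP Streams dialog reports a comparable "Lost" figure. This minimal sketch assumes the sequence numbers of one stream in arrival order. It is simpler than RFC 3550 receiver statistics: a packet that arrives late, after a higher sequence number, is counted as lost.

```python
def rtp_lost_packets(seqs):
    """Count missing sequence numbers in one RTP stream, handling 16-bit wraparound."""
    if not seqs:
        return 0
    lost = 0
    highest = seqs[0]
    for seq in seqs[1:]:
        step = (seq - highest) % 65536  # forward distance modulo 2**16
        if 0 < step < 32768:            # a large "step" means a late or duplicate packet
            lost += step - 1
            highest = seq
    return lost
```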
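Check for QoS consistency across all network devices:
# Catalyst 2960X (MLS QoS); the 3850 runs IOS-XE with MQC-based QoS, so use show policy-map interface there
show mls qos interface statistics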
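A quick way to confirm that marking survives every hop is to compare the DSCP of voice packets captured at different points. Wireshark exposes the value directly as ip.dsfield.dscp. Voice bearer traffic is conventionally marked EF (DSCP 46), which appears as 0xb8 in the IP ToS/Traffic Class byte. Whether your phones and PBX use EF is an assumption to check against the vendor's QoS guide. The helper names and capture-point labels below are hypothetical.

```python
EF = 46  # Expedited Forwarding, the usual DSCP for RTP voice


def dscp_from_tos(tos_byte):
    """DSCP is the upper six bits of the IPv4 ToS / IPv6 Traffic Class byte."""
    return tos_byte >> 2


def remarked_hops(tos_by_hop, expected=EF):
    """Return capture points where the observed DSCP differs from `expected`."""
    return [hop for hop, tos in tos_by_hop.items() if dscp_from_tos(tos) != expected]
```

A phone port showing EF while an uplink shows DSCP 0 suggests a switch is re-marking or not trusting the marking, for example a port without `mls qos trust` on the 2960X.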
Consider hardware problems when you observe:
- Retransmissions occurring across multiple switch domains simultaneously
- Issues persisting during low-traffic periods
- Problems appearing without any preceding configuration change
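You can test the first indicator mechanically once retransmission bursts are tagged with the switch carrying the affected hosts. This minimal sketch assumes a hypothetical list of (switch_name, burst_start_epoch) pairs and a 2-second coincidence window. Both the input format and the window are illustrative.

```python
def coincident_bursts(bursts, window=2.0):
    """Group bursts that start within `window` seconds of the group's first burst,
    keeping only groups that involve more than one switch."""
    groups, current = [], []
    for switch, start in sorted(bursts, key=lambda burst: burst[1]):
        if current and start - current[0][1] > window:
            groups.append(current)
            current = []
        current.append((switch, start))
    if current:
        groups.append(current)
    # Keep only groups that touch more than one switch domain
    return [group for group in groups if len({sw for sw, _ in group}) > 1]
```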
Essential hardware checks include:
# Check for CRC errors (Linux)
cat /sys/class/net/eth0/statistics/rx_crc_errors
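Rather than checking one interface at a time, you can read every interface's CRC counter in one pass. This minimal sketch reads the same sysfs counters. The base path is a parameter only so it can point at a test directory, and interfaces whose driver does not expose rx_crc_errors are skipped.

```python
from pathlib import Path


def crc_errors(base="/sys/class/net"):
    """Return {interface: rx_crc_errors} for interfaces that expose the counter."""
    counts = {}
    for counter in sorted(Path(base).glob("*/statistics/rx_crc_errors")):
        iface = counter.parent.parent.name
        try:
            counts[iface] = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # some virtual devices refuse the read or return non-numeric data
    return counts
```

Running this before and after an event is more informative than a single reading. A counter that climbs only during retransmission clusters points at cabling or a failing port.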
When updating network device firmware:
- Start with core switches
- Proceed to edge switches
- Update VoIP phones in controlled batches
- Document each update with before/after packet captures
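Controlled batches are easy to generate from an inventory export, so each wave stays small enough to roll back. This is a minimal sketch. The phone identifiers and the batch size of 10 are placeholders, not a vendor recommendation.

```python
def batches(items, size=10):
    """Split an inventory list into consecutive upgrade waves of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Example: 25 hypothetical phone hostnames split into waves of 10, 10 and 5
phones = [f"phone-{n:03d}" for n in range(1, 26)]
for wave, group in enumerate(batches(phones), start=1):
    print(f"wave {wave}: {group[0]} .. {group[-1]} ({len(group)} phones)")
```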
Remember to capture baseline statistics before updates:
# Cisco example
show tech-support | redirect flash:pre-upgrade-tech.txt
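If you save the same show output before and after each change, a plain text diff makes regressions easy to spot. This minimal sketch uses Python's difflib on two saved outputs. The file names are placeholders, and volatile lines such as uptime or timestamps will also show up as changes.

```python
import difflib
from pathlib import Path


def config_diff(before_text, after_text, before_name="pre", after_name="post"):
    """Return a unified diff of two saved command outputs as a single string."""
    diff = difflib.unified_diff(
        before_text.splitlines(keepends=True),
        after_text.splitlines(keepends=True),
        fromfile=before_name,
        tofile=after_name,
    )
    return "".join(diff)


def diff_files(before_path, after_path):
    """Diff two files copied off the switch (paths are placeholders)."""
    return config_diff(Path(before_path).read_text(), Path(after_path).read_text(),
                       str(before_path), str(after_path))
```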
In our 100-node LAN with Windows domain servers and VoIP phones, we've observed a persistent issue: phones sporadically reboot (sometimes mid-call) while workstations experience temporary network share access failures. Wireshark captures reveal TCP retransmission clusters (5-100+ packets) occurring 2-3 times daily, primarily between the PBX and random subsets of VoIP phones.
Key observations from packet analysis:
Sample Wireshark filter isolating retransmissions within a one-hour window:
frame.time >= "2023-05-01 14:00:00" &&
frame.time <= "2023-05-01 15:00:00" &&
tcp.analysis.retransmission
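Wireshark compares frame.time against the local time zone of the machine running it. This can shift the window when the capture came from a host in another zone. Filtering on frame.time_epoch (seconds since 1970, UTC) avoids that. This minimal sketch converts a timezone-aware event window into such a filter. Treating the example window as UTC is an assumption for illustration.

```python
from datetime import datetime, timezone


def epoch_window_filter(start, end):
    """Build a display filter for retransmissions with start <= frame time <= end."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("use timezone-aware datetimes to avoid local-time ambiguity")
    return (f"frame.time_epoch >= {start.timestamp():.0f} && "
            f"frame.time_epoch <= {end.timestamp():.0f} && tcp.analysis.retransmission")


print(epoch_window_filter(datetime(2023, 5, 1, 14, tzinfo=timezone.utc),
                          datetime(2023, 5, 1, 15, tzinfo=timezone.utc)))
# frame.time_epoch >= 1682949600 && frame.time_epoch <= 1682953200 && tcp.analysis.retransmission
```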
The retransmissions exhibit these characteristics:
- No consistent correlation with network load (occurs during peak and idle periods)
- Often affects phones on the same switch, but also spans distant network segments
- File server traffic shows retransmissions at the same moments
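To separate "same switch" incidents from ones that span segments, map each affected phone to its access switch, for example from an export of the switches' MAC address tables. This minimal sketch assumes a hypothetical IP-to-switch dictionary built from such an export. Unmapped addresses are reported under None rather than guessed.

```python
from collections import defaultdict


def group_by_switch(affected_ips, ip_to_switch):
    """Return {switch: sorted IPs} for one retransmission cluster; unknown IPs go under None."""
    grouped = defaultdict(set)
    for ip in affected_ips:
        grouped[ip_to_switch.get(ip)].add(ip)
    return {switch: sorted(ips) for switch, ips in grouped.items()}
```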
Our network topology includes:
Network Map:
Core Switch (Cisco 3850) -- Edge Switches (12x Cisco 2960X)
    |
    |-- VoIP VLAN (PBX + Phones)
    |-- Data VLAN (Servers + Workstations)
Potential switch-related issues to investigate:
# Cisco IOS commands for diagnostics
show interface counters errors
show spanning-tree vlan 100
show platform hardware qos queue stats interface gi1/0/1
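With a dozen edge switches, scanning saved output of show interfaces counters errors for non-zero columns is quicker than reading it by eye. This minimal sketch assumes whitespace-separated tables whose header rows start with "Port", as Catalyst switches print them. Column names vary by platform and release, so the parser reads them from each header instead of hard-coding them. The sample in the test is illustrative, not captured output.

```python
def nonzero_error_ports(output):
    """Return {port: {column: value}} for ports with any non-zero counter."""
    flagged, columns = {}, None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Port":
            columns = fields[1:]  # the command prints several tables, each with its own header
            continue
        if columns is None or len(fields) != len(columns) + 1:
            continue  # skip banners and rows that do not match the current header
        try:
            values = [int(v) for v in fields[1:]]
        except ValueError:
            continue
        bad = {col: val for col, val in zip(columns, values) if val}
        if bad:
            # Merge results from all tables for the same port
            flagged.setdefault(fields[0], {}).update(bad)
    return flagged
```

Late collisions in the second table are worth singling out, since they are a classic symptom of a duplex mismatch.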
Since the issue manifests most visibly with VoIP devices, we should:
- Verify QoS configuration matches vendor requirements
- Check for buffer overruns on switch ports
- Test with LLDP-MED disabled (known to cause issues with some phones)
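# Sample PowerShell to read a softphone client's last SIP registration attempt
# (VoIPClient and LastRegistrationAttempt are placeholders; the key and value names depend on the client)
$registryPath = "HKLM:\Software\VoIPClient\"
Get-ItemProperty -Path $registryPath -Name "LastRegistrationAttempt"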
On the domain controllers, review these settings:
Windows Network Diagnostics:
netsh interface tcp show global
Get-NetAdapterAdvancedProperty -Name "*" |
Where-Object {$_.DisplayName -match "Interrupt Moderation"}
Particular attention should be paid to:
- TCP Chimney Offload settings
- Network adapter power management
- RSS (Receive Side Scaling) configuration
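To compare these settings across the domain controllers and file servers, collect the output of netsh interface tcp show global from each server and reduce it to key/value pairs. This minimal sketch assumes the usual "Setting name : value" layout. Setting names differ between Windows releases, so the comparison only covers names present on both hosts.

```python
def parse_netsh_global(output):
    """Turn 'Name : value' lines from `netsh interface tcp show global` into a dict."""
    settings = {}
    for line in output.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


def differing_settings(a, b):
    """Return {name: (value_a, value_b)} for settings present on both hosts but different."""
    return {name: (a[name], b[name]) for name in a.keys() & b.keys() if a[name] != b[name]}
```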
Recommended step-by-step investigation:
1. Baseline Network:
- Document current configurations
- Establish performance benchmarks
- Update all switch firmware once the baseline is recorded
2. Targeted Monitoring:
- Deploy continuous Wireshark captures
- Implement NetFlow/sFlow monitoring
- Log switch CPU/memory utilization
3. Controlled Testing:
- Isolate VoIP traffic on dedicated links
- Test with different NIC drivers
- Validate STP timers
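For the STP timer check, IEEE 802.1D constrains the three bridge timers relative to each other: 2 * (Forward Delay - 1 s) >= Max Age >= 2 * (Hello Time + 1 s). It also sets a permitted range for each timer. This minimal sketch checks values read from show spanning-tree against those rules. The standard defaults of Hello 2 s, Max Age 20 s and Forward Delay 15 s pass.

```python
# Permitted ranges from IEEE 802.1D, in seconds
RANGES = {"hello": (1, 10), "max_age": (6, 40), "forward_delay": (4, 30)}


def stp_timer_problems(hello, max_age, forward_delay):
    """Return a list of problems; an empty list means the timers are consistent."""
    problems = []
    for name, value in (("hello", hello), ("max_age", max_age), ("forward_delay", forward_delay)):
        low, high = RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} outside {low}-{high} s")
    if not 2 * (forward_delay - 1) >= max_age:
        problems.append("max_age exceeds 2 * (forward_delay - 1)")
    if not max_age >= 2 * (hello + 1):
        problems.append("max_age below 2 * (hello + 1)")
    return problems
```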
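For proactive monitoring, consider this Python snippet. It flags flows whose retransmission count in a capture file exceeds a threshold:
import pyshark
from collections import defaultdict
def detect_retransmissions(pcap_file, threshold=10):
    """Return {flow: count} for flows with more than `threshold` retransmissions."""
    # Let tshark apply the filter instead of dissecting every packet in Python
    cap = pyshark.FileCapture(pcap_file, display_filter='tcp.analysis.retransmission')
    retrans_counts = defaultdict(int)
    try:
        for pkt in cap:
            # Skip IPv6 and other packets without an IPv4 layer
            if not hasattr(pkt, 'ip'):
                continue
            src_dst = f"{pkt.ip.src}:{pkt.tcp.srcport} -> {pkt.ip.dst}:{pkt.tcp.dstport}"
            retrans_counts[src_dst] += 1
    finally:
        cap.close()
    return {k: v for k, v in retrans_counts.items() if v > threshold}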
Based on the observed patterns, we should prioritize:
- Switch firmware updates (particularly for spanning-tree implementations)
- VoIP VLAN QoS verification and potential reconfiguration
- Windows Server TCP stack tuning
- Physical layer validation (cable testing, interface error monitoring)