Resolving Persistent TIME_WAIT TCP Connections on Windows Server 2008 in AWS EC2 Environment



When dealing with high-traffic web applications on Windows Server 2008 (especially in cloud environments like AWS EC2), the accumulation of TCP connections in the TIME_WAIT state can become a serious bottleneck. TIME_WAIT is a normal TCP mechanism: the endpoint that actively closes a connection (our server, in this case) keeps the connection record for twice the Maximum Segment Lifetime (2*MSL) after the close, typically 240 seconds by default on Windows.
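
Before changing anything, it is worth checking what the server is currently using. A quick PowerShell check (these are the same values tuned later in this answer; properties that come back empty simply mean the built-in defaults apply):

# Show the current TCP tuning values; empty properties mean the defaults are in effect
Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" |
    Select-Object TcpTimedWaitDelay, MaxUserPort, StrictTimeWaitSeqCheck, MaxHashTableSize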

In your case, several factors contribute to this issue:

  • Ephemeral port exhaustion: Windows Server 2008 has a default dynamic port range of 49152-65535 (16384 ports). With ~85k connections in TIME_WAIT, you're exhausting available ports.
  • Keep-alive settings: Keep-alive improves performance, but when the server times out idle keep-alive connections it is the side that performs the close, so the TIME_WAIT entries accumulate on the server.
  • TCP stack limitations: Older Windows versions have less efficient TIME_WAIT handling compared to modern systems.

Add these registry settings under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
; TcpTimedWaitDelay: 30 seconds (hex 1E)
"TcpTimedWaitDelay"=dword:0000001e
; MaxUserPort: 65534 (increase ephemeral ports)
"MaxUserPort"=dword:0000fffe
; StrictTimeWaitSeqCheck: enable strict checking
"StrictTimeWaitSeqCheck"=dword:00000001
; MaxHashTableSize: increase the TCP hash table (65536)
"MaxHashTableSize"=dword:00010000

A reboot is required for the registry changes to take effect. The ephemeral port range can also be widened immediately with netsh, which on Server 2008 is the supported way to control the dynamic range:

netsh int ipv4 set dynamicport tcp start=1025 num=64510
netsh int ipv4 set dynamicport udp start=1025 num=64510
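
You can confirm the range actually in effect, before or after the change, with the matching show command:

netsh int ipv4 show dynamicport tcp
netsh int ipv4 show dynamicport udp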

Adjust your web server settings to better manage connections:

# In httpd.conf
KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 100
<IfModule mpm_winnt_module>
    ThreadsPerChild 250
    # Apache 2.2 calls this directive MaxRequestsPerChild;
    # MaxConnectionsPerChild is the Apache 2.4 name
    MaxConnectionsPerChild 10000
</IfModule>

# In Tomcat's server.xml
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="500"
           acceptCount="1000"
           enableLookups="false"
           maxKeepAliveRequests="100"
           keepAliveTimeout="5000"/>

Consider implementing TCP connection pooling at the application level so outbound requests reuse a small set of long-lived connections instead of opening, and later closing into TIME_WAIT, a new socket per request. Here's a Java example using Apache HttpClient (4.3+):

import org.apache.http.client.config.RequestConfig;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Pool plain HTTP connections so sockets are reused rather than opened per request
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager(
    RegistryBuilder.<ConnectionSocketFactory>create()
        .register("http", PlainConnectionSocketFactory.getSocketFactory())
        .build());

cm.setMaxTotal(500);            // total connections across all routes
cm.setDefaultMaxPerRoute(100);  // connections per target host

RequestConfig config = RequestConfig.custom()
    .setConnectTimeout(5000)    // ms to establish a connection
    .setSocketTimeout(15000)    // ms of read inactivity before timing out
    .build();

// Closing each response returns its connection to the pool for reuse
CloseableHttpClient httpClient = HttpClients.custom()
    .setConnectionManager(cm)
    .setDefaultRequestConfig(config)
    .build();

Create a PowerShell script to monitor TIME_WAIT states:

# Get TIME_WAIT connections count
$timeWaitCount = (netstat -ano | Select-String "TIME_WAIT").Count
Write-Host "Current TIME_WAIT connections: $timeWaitCount"

# Check ephemeral port usage (approximate: netstat counts every TCP line,
# including inbound connections that don't consume a local ephemeral port)
$usedPorts = (netstat -ano | Select-String "TCP").Count
$availablePorts = 16384   # default dynamic range 49152-65535; adjust if you widened it with netsh
$portUsage = [math]::Round(($usedPorts / $availablePorts) * 100, 1)
Write-Host "Ephemeral port usage (approx): $portUsage%"

# If critical, recycle application pool
if ($portUsage -gt 90) {
    Import-Module WebAdministration
    Restart-WebAppPool -Name "YourAppPool"
    Write-Host "Application pool recycled due to high port usage"
}
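
To run this check on a schedule instead of by hand, it can be registered as a scheduled task. A minimal sketch, assuming the script above is saved as C:\Scripts\Check-TimeWait.ps1 (the path, task name, and interval are placeholders to adjust):

# Run the monitor every 5 minutes as SYSTEM
schtasks /Create /TN "TIME_WAIT Monitor" /SC MINUTE /MO 5 /RU SYSTEM /TR "powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\Check-TimeWait.ps1"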

For AWS environments, consider these additional measures:

  • Use a Network Load Balancer (NLB) instead of direct EC2 connections
  • Implement Auto Scaling to distribute load across multiple instances
  • Consider migrating to a newer Windows Server version with a better TCP stack
  • Use EC2 Launch Templates to ensure consistent TCP/IP configuration across instances
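
On the Launch Template point: the Windows EC2Config/EC2Launch agent executes a <powershell> user data block at first boot, so the TCP tuning can be baked into the template. A sketch reusing the values from the registry section above (adjust as needed):

<powershell>
# Apply the TCP tuning at first boot so every instance comes up identically configured
$path = "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
New-ItemProperty -Path $path -Name TcpTimedWaitDelay -Value 30 -PropertyType DWord -Force | Out-Null
New-ItemProperty -Path $path -Name MaxUserPort -Value 65534 -PropertyType DWord -Force | Out-Null
netsh int ipv4 set dynamicport tcp start=1025 num=64510
</powershell>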

When dealing with high-traffic web servers on Windows Server 2008 (especially on AWS EC2), you might encounter a situation where thousands of TCP connections remain stuck in TIME_WAIT state. This occurs even after stopping your web server (Apache httpd + Tomcat 6.02 in this case), and can eventually lead to connection exhaustion.

Key indicators we're seeing:

  • 69,250+ connections on port 80 in TIME_WAIT
  • 15,000 additional connections on other ports
  • TCPv4 Active Connections: 145K
  • TCPv4 Passive Connections: 475K
  • Connection failures and resets occurring
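
These Active/Passive/failure figures correspond to Windows performance counters in the TCPv4 object, so they can be read directly instead of parsed out of netstat (assumes PowerShell 2.0 or later for Get-Counter):

# Read the same TCPv4 statistics straight from the performance counters
Get-Counter -Counter @(
    "\TCPv4\Connections Active",
    "\TCPv4\Connections Passive",
    "\TCPv4\Connections Established",
    "\TCPv4\Connection Failures",
    "\TCPv4\Connections Reset"
) | Select-Object -ExpandProperty CounterSamples |
    Select-Object Path, CookedValue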

The default TIME_WAIT duration on Windows is 4 minutes (2*MSL), but several factors can prevent proper cleanup:

  • Insufficient ephemeral port range for the connection volume
  • TCP connection recycling not properly configured
  • Kernel resources being exhausted
  • Possible connection leaks in the web stack

First, let's check and modify the TCP/IP parameters:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"TcpTimedWaitDelay"=dword:0000001e
"MaxUserPort"=dword:0000fffe
"TcpNumConnections"=dword:00fffffe

This sets:

  • TcpTimedWaitDelay to 30 seconds (0x1e)
  • MaxUserPort to 65534 (0xfffe)
  • TcpNumConnections to 16777214 (0xfffffe)
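
If you prefer applying these from a script instead of importing a .reg file, the same values can be set with PowerShell; a reboot is still required afterwards:

# Apply the same registry values from PowerShell, then reboot
$path = "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
New-ItemProperty -Path $path -Name TcpTimedWaitDelay -Value 0x1e -PropertyType DWord -Force | Out-Null
New-ItemProperty -Path $path -Name MaxUserPort -Value 0xfffe -PropertyType DWord -Force | Out-Null
New-ItemProperty -Path $path -Name TcpNumConnections -Value 0xfffffe -PropertyType DWord -Force | Out-Null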

For automated management, create a PowerShell script to monitor and alert:

# TIME_WAIT connection monitor
# Register the event source once so Write-EventLog can use it
if (-not [System.Diagnostics.EventLog]::SourceExists("TCP Monitor")) {
    New-EventLog -LogName Application -Source "TCP Monitor"
}

$timeWaitCount = (netstat -ano | Select-String "TIME_WAIT").Count
$threshold = 50000

if ($timeWaitCount -gt $threshold) {
    Write-EventLog -LogName Application -Source "TCP Monitor" -EntryType Warning -EventId 1001 -Message "TIME_WAIT connections exceeded threshold: $timeWaitCount"

    # Optional: widen the dynamic port range (note: this change persists until reverted)
    netsh int ipv4 set dynamicport tcp start=10000 num=55535
}

For Apache httpd, adjust these settings in httpd.conf, along with the matching AJP connector settings in Tomcat's server.xml:

KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 100

# For Tomcat in AJP connector
<Connector port="8009" protocol="AJP/1.3"
    connectionTimeout="20000"
    maxThreads="500"
    tcpNoDelay="true"
    socket.soLingerOn="true"
    socket.soLingerTime="1"
    socket.keepAlive="true" />

There is no supported way to flush individual TIME_WAIT entries on demand. A full stack reset is sometimes suggested as a last resort, but understand what it actually does before trying it:

# WARNING: these rewrite the TCP/IP and Winsock configuration and drop all
# connections, including established ones
netsh int ipv4 reset
netsh int ipv6 reset
netsh winsock reset

Note that a reboot is still required for these resets to fully take effect, and the Tcpip driver itself cannot normally be stopped on a running system (sc stop tcpip fails because other components depend on it). Treat this as an emergency measure only; lowering TcpTimedWaitDelay and rebooting, or recycling the offending application, is both safer and more effective.

Implement these architectural improvements:

  • Upgrade to a newer Windows Server version with a better TCP stack
  • Consider using connection pooling at the application level
  • Implement proper connection termination in your application code
  • Monitor TIME_WAIT connections as part of your regular health checks