TCP Connection Reset Analysis: Diagnosing RST Packets in Load Balancer Architectures


When debugging connection resets in load-balanced environments, packet-level analysis becomes crucial. Your observation shows RST packets appearing with different source IPs depending on capture location - this is characteristic of middlebox interference.

Here's what's happening in your architecture:

Client (1.1.1.1) → Load Balancer (2.2.2.2) → Server (3.3.3.3)

The asymmetric RST visibility occurs because:

  • When the LB initiates the reset, the client sees RST packets sourced from the LB's IP (2.2.2.2)
  • The server sees the reset on its backend-side connection, which appears to originate from the client (1.1.1.1)

Use this Python snippet to detect RST packets:

from scapy.all import sniff, IP, TCP

def detect_rst(pkt):
    # 0x04 is the RST bit in the TCP flags field; also guard against non-IPv4 traffic
    if IP in pkt and TCP in pkt and pkt[TCP].flags & 0x04:
        print(f"RST from {pkt[IP].src}:{pkt[TCP].sport} to {pkt[IP].dst}:{pkt[TCP].dport}")

sniff(filter="tcp", prn=detect_rst, store=False)  # requires root privileges
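
Run this on both the client and a backend server at the same time: if the RST's source IP differs between the two vantage points, the LB is regenerating resets rather than forwarding them.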

Proxy timeouts are a frequent cause of premature connection termination; note how these Nginx values line up with the observed 5-minute drops:

# Nginx example: 5-minute timeouts matching the observed drop interval
proxy_connect_timeout 5m;  # note: Nginx effectively caps this at ~75s
proxy_send_timeout 5m;     # timeout between successive writes to the backend
proxy_read_timeout 5m;     # timeout between successive reads from the backend
keepalive_timeout 5m;      # how long idle keepalive connections stay open
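
To rule the proxy out, raise these values well above your longest expected idle period. The values below are illustrative, not recommendations:

# Nginx example (relaxed timeouts, illustrative values)
proxy_send_timeout 10m;
proxy_read_timeout 10m;
keepalive_timeout 10m;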

To keep idle connections alive, have the server send keepalive probes before the LB's idle timer fires:

# Linux sysctl settings (first probe must precede the 300s idle timeout)
echo 240 > /proc/sys/net/ipv4/tcp_keepalive_time   # first probe after 240s idle
echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl   # then every 60s
echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes   # give up after 5 failed probes
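
If you cannot change system-wide sysctls, the same knobs exist per socket. A minimal sketch, assuming Linux (TCP_KEEPIDLE and friends are Linux-specific options):

import socket

def keepalive_socket():
    # Per-socket equivalents of the sysctl settings above (Linux-only options)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 240)   # first probe after 240s idle
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)   # then every 60s
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)      # give up after 5 failures
    return s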

For comprehensive analysis, apply these Wireshark display filters to the capture from each side:

# Client-side filter
tcp.flags.reset == 1 && ip.src == your_lb_ip

# Server-side filter 
tcp.flags.reset == 1 && tcp.port == your_app_port
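
The same filters work offline with tshark (Wireshark's command-line tool); the capture file names here are placeholders:

# Extract RSTs from each side for comparison
tshark -r client_side.pcap -Y 'tcp.flags.reset == 1' -T fields -e frame.time -e ip.src -e ip.dst -e tcp.seq
tshark -r server_side.pcap -Y 'tcp.flags.reset == 1' -T fields -e frame.time -e ip.src -e ip.dst -e tcp.seq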

For AWS ALB/ELB, raise the idle timeout above your application's longest quiet period:

# Set idle timeout (default 60s)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn your_arn \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
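
To confirm the new value took effect (same placeholder ARN):

aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn your_arn \
  --query "Attributes[?Key=='idle_timeout.timeout_seconds']"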

Stepping back to the underlying mechanics: when dealing with load balancers (LBs) and TCP connections, one common issue is unexpected connection resets (RST packets) during idle periods. To recap, you observed:

  • 3 backend servers behind an LB
  • Connections dropping after 5 minutes of inactivity
  • RST packets appearing in Wireshark captures
  • Conflicting source IPs in client/server packet captures

Most LBs operate in one of these modes:

1. Transparent Proxy (Layer 4):
   - Preserves original client IP
   - Forwards packets unchanged

2. Application Proxy (Layer 7):
   - Terminates TCP connection
   - Creates new connection to backend
   - Modifies packet headers
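
A quick way to tell which mode your LB is in: run a throwaway server on a backend and check the peer address it reports. A minimal sketch (the port is a hypothetical test port):

import socket

# If the printed address is the client's IP, the LB is transparent (Layer 4);
# if it is the LB's IP, the LB terminates connections (Layer 7 proxy).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 8080))  # hypothetical test port
srv.listen(1)
while True:
    conn, addr = srv.accept()
    print(f"Connection from {addr[0]}:{addr[1]}")
    conn.close()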

Your observations reveal an important behavior:

Client-side capture: Shows LB IP as RST source
Server-side capture: Shows client IP as RST source

This suggests your LB is:

  1. Terminating the client-facing connection itself when its idle timer fires, so the RST sent to the client carries the LB's own IP (2.2.2.2)
  2. Resetting the backend connection with the client IP (1.1.1.1) it impersonates toward the server, maintaining the illusion of a direct connection

Here's how to verify LB behavior using Python socket programming:

import socket
from time import sleep

def test_connection(host, port, idle_seconds=350):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.connect((host, port))
        print("Connected - now idling...")
        sleep(idle_seconds)  # idle longer than the suspected 300s timeout
        # The first send after an RST can appear to succeed because the
        # kernel buffers it; a second send reliably surfaces the reset.
        s.send(b"PING")
        s.send(b"PING")
        print("Connection survived the idle period")
    except (ConnectionResetError, BrokenPipeError):
        print("Connection was reset by peer")
    finally:
        s.close()

test_connection("your_lb_ip", 80)  # substitute your LB address and port

To prevent unwanted resets, apply the server-side keepalive settings shown earlier (tcp_keepalive_time and friends), making sure the first probe fires before the LB's idle timeout.

For cloud LBs, note that the ALB target-group deregistration delay governs how long in-flight connections are drained when a target is removed; it is separate from the idle timeout configured earlier:

aws elbv2 modify-target-group-attributes \
    --target-group-arn YOUR_TG_ARN \
    --attributes Key=deregistration_delay.timeout_seconds,Value=600

Use this tcpdump command to monitor RST packets:

tcpdump -i any 'tcp[tcpflags] & (tcp-rst) != 0' -nn -v
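
To support the checks below, capture on both sides at the same time and write to files (names are placeholders, reused in the analysis sketch further down):

# On the client
tcpdump -i any 'tcp[tcpflags] & (tcp-rst) != 0' -nn -w client_side.pcap
# On a backend server
tcpdump -i any 'tcp[tcpflags] & (tcp-rst) != 0' -nn -w server_side.pcap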

Key things to verify:

  • Whether sequence numbers match between client and server captures (matching raw numbers suggest a forwarded RST; disjoint numbers suggest the LB generated its own)
  • Whether RST timestamps correlate with the idle timeout period
  • TTL values, which indicate how many hops away the RST originated
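
A sketch of that correlation with scapy, assuming the two capture files from the tcpdump step above:

from scapy.all import rdpcap, IP, TCP

def rst_records(path):
    # Map each RST's raw sequence number to (src, dst, capture time)
    out = {}
    for p in rdpcap(path):
        if IP in p and TCP in p and p[TCP].flags & 0x04:
            out[p[TCP].seq] = (p[IP].src, p[IP].dst, float(p.time))
    return out

client = rst_records("client_side.pcap")  # placeholder file names
server = rst_records("server_side.pcap")
for seq in client.keys() & server.keys():
    c, s = client[seq], server[seq]
    print(f"seq={seq}: client saw {c[0]} -> {c[1]}, server saw {s[0]} -> {s[1]}")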

Different LBs handle this differently:

LB Type  | RST Behavior
---------|------------------------------------------------------
AWS ALB  | Sends a TCP RST to the client when the backend fails
Nginx    | Configurable via proxy_timeout (stream module)
HAProxy  | Governed by the timeout server setting
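
For HAProxy, a minimal timeout stanza looks like this (values illustrative):

defaults
    mode tcp
    timeout connect 10s
    timeout client  10m
    timeout server  10m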