Debugging and Fixing Indefinite Hanging in Network Read Operations on Linux (Python/S3/SVN Cases)


When dealing with network operations in Linux environments (specifically Debian Wheezy), processes occasionally hang indefinitely during read operations. This manifests in two primary scenarios:

# Common strace output showing the hang
$ strace -p 12089
Process 12089 attached - interrupt to quit
read(5, 

$ strace -p 17527  
Process 17527 attached - interrupt to quit
recvfrom(3,

The issue appears across different protocols and tools:

  • Python scripts downloading from S3 (using urllib/urllib2)
  • SVN operations with externals (svn:// protocol)
  • Both Python 2.5 and 2.7 environments

Several important characteristics of this behavior:

# Network connections remain established
$ sudo lsof -i | grep 12089
python  12089    user    5u  IPv4 809917771      0t0  TCP my.server.net:35427->185-201.amazon.com:https (ESTABLISHED)

# Timeouts don't always help
import socket
socket.setdefaulttimeout(60)  # only affects sockets created after this call; hangs can still occur

First, check fundamental network health:

# Check for packet drops
ifconfig | grep dropped

# Verify TCP keepalive settings
cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_probes
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
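
These sysctl values only apply to sockets that actually opt into keepalive. As a minimal sketch (Linux-specific socket options, hypothetical endpoint), a Python client can enable SO_KEEPALIVE and tighten the per-socket probe timings so a silently dead peer is detected within minutes instead of hours:

import socket

# Enable keepalive on a client socket (the TCP_KEEP* options are Linux-specific)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Start probing after 60s idle, probe every 10s, give up after 5 failed probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
sock.connect(("example.com", 443))  # hypothetical endpoint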

For S3 downloads, implement robust timeout handling:

import urllib2
import socket

class RobustS3Downloader:
    def __init__(self, timeout=30, retries=3):
        self.timeout = timeout
        self.retries = retries
        
    def download(self, url):
        last_error = None
        for attempt in range(self.retries):
            try:
                req = urllib2.Request(url)
                # Set both the global socket timeout and the per-request
                # urlopen timeout (the timeout argument requires Python 2.6+)
                socket.setdefaulttimeout(self.timeout)
                return urllib2.urlopen(req, timeout=self.timeout).read()
            except (urllib2.URLError, socket.timeout) as e:
                last_error = e
                continue
        raise last_error
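
Usage is a one-liner; the bucket URL here is just a placeholder:

downloader = RobustS3Downloader(timeout=30, retries=3)
data = downloader.download("https://my-bucket.s3.amazonaws.com/some/key")  # hypothetical URL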

For hanging SVN operations, consider these approaches (a Python sketch combining them follows the list):

# 1. Use timeout wrapper
timeout 300 svn up --non-interactive

# 2. Alternative protocols
svn checkout http://...   # instead of svn://

# 3. Check externals configuration
svn propget svn:externals .
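
A minimal sketch of driving approaches 1 and 3 from Python (it assumes GNU coreutils timeout is installed, as in the wrapper above):

import subprocess

# Show the externals configuration, then update under a hard 5-minute limit
print(subprocess.check_output(["svn", "propget", "svn:externals", "."]))

rc = subprocess.call(["timeout", "300", "svn", "up", "--non-interactive"])
if rc == 124:  # GNU timeout exits with 124 when the time limit is hit
    print("svn up exceeded the 5-minute limit")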

When processes hang, gather more diagnostic data:

# Check TCP connection state
ss -tnp | grep <pid>

# Network buffer inspection (addresses, ports and queue counters in /proc/net/tcp are hexadecimal)
cat /proc/<pid>/net/tcp | grep -i "<port-in-hex>"

# Dump kernel stack traces of blocked tasks (requires root)
echo w > /proc/sysrq-trigger
dmesg | tail -n 30
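
If you do this often, the /proc/net/tcp lookup is easy to script; a minimal sketch (the helper name is ours, and note that the fields in that file are hexadecimal):

def tcp_entries_for_port(pid, port):
    # Print /proc/<pid>/net/tcp entries whose local or remote port matches
    needle = ":%04X" % port  # ports are stored as 4-digit hex
    with open("/proc/%d/net/tcp" % pid) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local, remote = fields[1], fields[2]
            if local.endswith(needle) or remote.endswith(needle):
                tx_q, rx_q = fields[4].split(":")
                print("%s -> %s  state=%s  tx_queue=%d  rx_queue=%d"
                      % (local, remote, fields[3], int(tx_q, 16), int(rx_q, 16)))

tcp_entries_for_port(12089, 443)  # the hung process talking to S3 over https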

System-wide configuration changes that can help:

# Adjust TCP keepalive settings (add to /etc/sysctl.conf)
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 10

# Apply changes
sysctl -p

# Limit the process's virtual memory to trigger failsafes earlier
ulimit -v 500000  # value is in KB, roughly 500 MB
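
The same limit can be set from inside the Python process with the resource module, which helps when you cannot change the shell that launches the job (a sketch, not part of the original scripts):

import resource

# Cap the address space at roughly 500 MB so a stuck-but-growing process fails fast
limit_bytes = 500 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))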

Consider more robust HTTP clients:

# Using requests with proper timeout
import requests

try:
    response = requests.get(s3_url, timeout=(10, 30))  # (connect, read)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Download failed: {e}")

The same behaviour shows up in several other setups: Python scripts downloading from S3 (both with and without explicit timeouts), SVN operations driven through subprocess.Popen, and different network endpoints (Amazon S3, telecommunity.com). In every case strace shows the process parked in read() or recvfrom(), and lsof reports the TCP connection as still ESTABLISHED, so no data is arriving yet the kernel has never seen the connection close.

For Python network operations, implement comprehensive timeout protection:

import socket
import urllib2

# Global socket timeout
socket.setdefaulttimeout(60)

# Per-request timeout with urllib2 (the timeout argument goes to urlopen, not Request)
req = urllib2.Request(url)
try:
    response = urllib2.urlopen(req, timeout=30)
except socket.timeout:
    # Handle timeout
    pass

For more robust solutions, consider using requests with proper session management:

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
adapter = HTTPAdapter(max_retries=3,
                      pool_connections=10,
                      pool_maxsize=10)
s.mount('http://', adapter)
s.mount('https://', adapter)

try:
    r = s.get(url, timeout=(3.05, 30))
except requests.exceptions.Timeout:
    # Handle timeout
    pass
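
max_retries can also take a urllib3 Retry object, which adds backoff and retry-on-status behaviour; a sketch, assuming a reasonably recent requests/urllib3 pair:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

s = requests.Session()
retry = Retry(total=3, backoff_factor=0.5,
              status_forcelist=[500, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retry))
r = s.get(url, timeout=(3.05, 30))  # keep the connect/read timeouts as well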

Adjust TCP keepalive settings in /etc/sysctl.conf:

# Tune TCP keepalive (only affects sockets that enable SO_KEEPALIVE)
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 10

# Reduce how long orphaned connections linger in FIN-WAIT-2
net.ipv4.tcp_fin_timeout = 30

Apply changes with sysctl -p.

For SVN or other subprocess operations, implement timeouts:

import subprocess
import threading

def run_command(cmd, timeout_sec):
    # Kill the child process if it runs longer than timeout_sec
    proc = subprocess.Popen(cmd)
    timer = threading.Timer(timeout_sec, proc.kill)
    try:
        timer.start()
        proc.communicate()
    finally:
        timer.cancel()
    return proc.returncode
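
On Python 3 this is built in: subprocess.run() (and Popen.communicate()) accept a timeout and kill the child for you; a sketch:

import subprocess

try:
    # Raises TimeoutExpired (and kills the child) if svn takes longer than 5 minutes
    subprocess.run(["svn", "up", "--non-interactive"], timeout=300)
except subprocess.TimeoutExpired:
    print("svn up exceeded the 5-minute limit")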

Implement a watchdog for long-running network operations:

import signal

# Note: SIGALRM is Unix-only and must be set from the main thread.
# TimeoutError is a builtin in Python 3; define your own exception on Python 2.
class Timeout:
    def __init__(self, seconds=1, error_message='Timeout'):
        self.seconds = seconds
        self.error_message = error_message
    
    def handle_timeout(self, signum, frame):
        raise TimeoutError(self.error_message)
    
    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.alarm(self.seconds)
    
    def __exit__(self, type, value, traceback):
        signal.alarm(0)

# Usage:
try:
    with Timeout(seconds=30):
        # Network operation here
        pass
except TimeoutError:
    # Handle timeout
    pass

Consider async solutions for network-bound operations:

import asyncio
import aiohttp

async def fetch(session, url):
    try:
        async with session.get(url, timeout=30) as response:
            return await response.text()
    except asyncio.TimeoutError:
        print(f"Timeout occurred for {url}")
        return None

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://example.com')
        print(html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
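
With aiohttp 3.x (and Python 3.7+ for asyncio.run), the timeout can be expressed as an aiohttp.ClientTimeout, which lets you bound the read phase specifically (exactly where these hangs sit); a sketch:

import asyncio
import aiohttp

async def fetch_with_read_timeout(url):
    # Bound the whole request at 60s and each individual socket read at 30s
    timeout = aiohttp.ClientTimeout(total=60, sock_read=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            return await response.text()

asyncio.run(fetch_with_read_timeout('http://example.com'))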