Debugging and Fixing Indefinite Hanging in Network Read Operations on Linux (Python/S3/SVN Cases)


When dealing with network operations in Linux environments (specifically Debian Wheezy), processes occasionally hang indefinitely during read operations. This manifests in two primary scenarios:

# Common strace output showing the hang
$ strace -p 12089
Process 12089 attached - interrupt to quit
read(5, 

$ strace -p 17527  
Process 17527 attached - interrupt to quit
recvfrom(3,

The issue appears across different protocols and tools:

  • Python scripts downloading from S3 (using urllib/urllib2)
  • SVN operations with externals (svn:// protocol)
  • Both Python 2.5 and 2.7 environments

Several important characteristics of this behavior:

# Network connections remain established
$ sudo lsof -i | grep 12089
python  12089    user    5u  IPv4 809917771      0t0  TCP my.server.net:35427->185-201.amazon.com:https (ESTABLISHED)

# Timeouts don't always help
import socket
socket.setdefaulttimeout(60)  # only affects sockets created after this call; hangs can still occur

First, check fundamental network health:

# Check for packet drops
ifconfig | grep dropped

# Verify TCP keepalive settings
cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_probes
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
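
These sysctl values only apply to sockets that actually opt into keepalive. As a minimal sketch (Linux-specific socket options, hypothetical endpoint), a Python client can enable SO_KEEPALIVE and tighten the per-socket probe timings so a silently dead peer is detected within minutes instead of hours:

import socket

# Enable keepalive on a client socket (the TCP_KEEP* options are Linux-specific)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Start probing after 60s idle, probe every 10s, give up after 5 failed probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
sock.connect(("example.com", 443))  # hypothetical endpoint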

For S3 downloads, implement robust timeout handling:

import urllib2
import socket

class RobustS3Downloader:
    def __init__(self, timeout=30, retries=3):
        self.timeout = timeout
        self.retries = retries
        
    def download(self, url):
        last_error = None
        for attempt in range(self.retries):
            try:
                req = urllib2.Request(url)
                # Set both the global socket timeout and the per-request
                # urlopen timeout (the timeout argument requires Python 2.6+)
                socket.setdefaulttimeout(self.timeout)
                return urllib2.urlopen(req, timeout=self.timeout).read()
            except (urllib2.URLError, socket.timeout) as e:
                last_error = e
                continue
        raise last_error
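
Usage is a one-liner; the bucket URL here is just a placeholder:

downloader = RobustS3Downloader(timeout=30, retries=3)
data = downloader.download("https://my-bucket.s3.amazonaws.com/some/key")  # hypothetical URL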

For hanging SVN operations, consider these approaches (a Python sketch combining them follows the list):

# 1. Use timeout wrapper
timeout 300 svn up --non-interactive

# 2. Alternative protocols
svn checkout http://...   # instead of svn://

# 3. Check externals configuration
svn propget svn:externals .
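
A minimal sketch of driving approaches 1 and 3 from Python (it assumes GNU coreutils timeout is installed, as in the wrapper above):

import subprocess

# Show the externals configuration, then update under a hard 5-minute limit
print(subprocess.check_output(["svn", "propget", "svn:externals", "."]))

rc = subprocess.call(["timeout", "300", "svn", "up", "--non-interactive"])
if rc == 124:  # GNU timeout exits with 124 when the time limit is hit
    print("svn up exceeded the 5-minute limit")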

When processes hang, gather more diagnostic data:

# Check TCP connection state
ss -tnp | grep <pid>

# Network buffer inspection (addresses, ports and queue counters in /proc/net/tcp are hexadecimal)
cat /proc/<pid>/net/tcp | grep -i "<port-in-hex>"

# Dump kernel stack traces of blocked tasks (requires root)
echo w > /proc/sysrq-trigger
dmesg | tail -n 30
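
If you do this often, the /proc/net/tcp lookup is easy to script; a minimal sketch (the helper name is ours, and note that the fields in that file are hexadecimal):

def tcp_entries_for_port(pid, port):
    # Print /proc/<pid>/net/tcp entries whose local or remote port matches
    needle = ":%04X" % port  # ports are stored as 4-digit hex
    with open("/proc/%d/net/tcp" % pid) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local, remote = fields[1], fields[2]
            if local.endswith(needle) or remote.endswith(needle):
                tx_q, rx_q = fields[4].split(":")
                print("%s -> %s  state=%s  tx_queue=%d  rx_queue=%d"
                      % (local, remote, fields[3], int(tx_q, 16), int(rx_q, 16)))

tcp_entries_for_port(12089, 443)  # the hung process talking to S3 over https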

System-wide configuration changes that can help:

# Adjust TCP keepalive settings (add to /etc/sysctl.conf)
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 10

# Apply changes
sysctl -p

# Limit the process's virtual memory to trigger failsafes earlier
ulimit -v 500000  # value is in KB, roughly 500 MB
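
The same limit can be set from inside the Python process with the resource module, which helps when you cannot change the shell that launches the job (a sketch, not part of the original scripts):

import resource

# Cap the address space at roughly 500 MB so a stuck-but-growing process fails fast
limit_bytes = 500 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))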

Consider more robust HTTP clients:

# Using requests with proper timeout
import requests

try:
    response = requests.get(s3_url, timeout=(10, 30))  # (connect, read)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Download failed: {e}")

The same behaviour shows up in several other setups: Python scripts downloading from S3 (both with and without explicit timeouts), SVN operations driven through subprocess.Popen, and different network endpoints (Amazon S3, telecommunity.com). In every case strace shows the process parked in read() or recvfrom(), and lsof reports the TCP connection as still ESTABLISHED, so no data is arriving yet the kernel has never seen the connection close.

For Python network operations, implement comprehensive timeout protection:

import socket
import urllib2

# Global socket timeout
socket.setdefaulttimeout(60)

# Per-request timeout with urllib2 (the timeout argument goes to urlopen, not Request)
req = urllib2.Request(url)
try:
    response = urllib2.urlopen(req, timeout=30)
except socket.timeout:
    # Handle timeout
    pass

For more robust solutions, consider using requests with proper session management:

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
adapter = HTTPAdapter(max_retries=3,
                      pool_connections=10,
                      pool_maxsize=10)
s.mount('http://', adapter)
s.mount('https://', adapter)

try:
    r = s.get(url, timeout=(3.05, 30))
except requests.exceptions.Timeout:
    # Handle timeout
    pass
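
max_retries can also take a urllib3 Retry object, which adds backoff and retry-on-status behaviour; a sketch, assuming a reasonably recent requests/urllib3 pair:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

s = requests.Session()
retry = Retry(total=3, backoff_factor=0.5,
              status_forcelist=[500, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retry))
r = s.get(url, timeout=(3.05, 30))  # keep the connect/read timeouts as well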

Adjust TCP keepalive settings in /etc/sysctl.conf:

# Tune TCP keepalive (only affects sockets that enable SO_KEEPALIVE)
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 10

# Reduce how long orphaned connections linger in FIN-WAIT-2
net.ipv4.tcp_fin_timeout = 30

Apply changes with sysctl -p.

For SVN or other subprocess operations, implement timeouts:

import subprocess
import threading

def run_command(cmd, timeout_sec):
    # Kill the child process if it runs longer than timeout_sec
    proc = subprocess.Popen(cmd)
    timer = threading.Timer(timeout_sec, proc.kill)
    try:
        timer.start()
        proc.communicate()
    finally:
        timer.cancel()
    return proc.returncode
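
On Python 3 this is built in: subprocess.run() (and Popen.communicate()) accept a timeout and kill the child for you; a sketch:

import subprocess

try:
    # Raises TimeoutExpired (and kills the child) if svn takes longer than 5 minutes
    subprocess.run(["svn", "up", "--non-interactive"], timeout=300)
except subprocess.TimeoutExpired:
    print("svn up exceeded the 5-minute limit")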

Implement a watchdog for long-running network operations:

import signal

# Note: SIGALRM is Unix-only and must be set from the main thread.
# TimeoutError is a builtin in Python 3; define your own exception on Python 2.
class Timeout:
    def __init__(self, seconds=1, error_message='Timeout'):
        self.seconds = seconds
        self.error_message = error_message
    
    def handle_timeout(self, signum, frame):
        raise TimeoutError(self.error_message)
    
    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.alarm(self.seconds)
    
    def __exit__(self, type, value, traceback):
        signal.alarm(0)

# Usage:
try:
    with Timeout(seconds=30):
        # Network operation here
        pass
except TimeoutError:
    # Handle timeout
    pass

Consider async solutions for network-bound operations:

import asyncio
import aiohttp

async def fetch(session, url):
    try:
        async with session.get(url, timeout=30) as response:
            return await response.text()
    except asyncio.TimeoutError:
        print(f"Timeout occurred for {url}")
        return None

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://example.com')
        print(html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
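
With aiohttp 3.x (and Python 3.7+ for asyncio.run), the timeout can be expressed as an aiohttp.ClientTimeout, which lets you bound the read phase specifically (exactly where these hangs sit); a sketch:

import asyncio
import aiohttp

async def fetch_with_read_timeout(url):
    # Bound the whole request at 60s and each individual socket read at 30s
    timeout = aiohttp.ClientTimeout(total=60, sock_read=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            return await response.text()

asyncio.run(fetch_with_read_timeout('http://example.com'))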