Optimizing Large File Transfers Over High-Latency WAN Links: Multithreaded TCP Solutions



When dealing with large file transfers (e.g., a 160GB Oracle dump) over high-speed WAN connections (100Mbps+) with non-trivial round-trip latency (10ms+), traditional single-threaded TCP connections often underutilize the available bandwidth (the calculation after this list shows why) due to:

  • TCP's congestion control algorithms
  • ACK delay constraints
  • Window size limitations
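
The ceiling comes from the bandwidth-delay product: a single TCP stream can never have more than one window of data in flight per round trip. A quick back-of-envelope check, assuming a default 64KB window:

# Max single-stream TCP throughput = window size / round-trip time
window_bytes = 64 * 1024        # default 64KB window (assumed)
rtt_s = 0.010                   # 10ms round trip
print(window_bytes * 8 / rtt_s / 1e6)   # ~52.4 Mbit/s ceiling

# Window needed to fill a 100Mbps pipe at 10ms RTT (the BDP)
print(100e6 * 0.010 / 8 / 1024)         # ~122 KB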

Initial tests with default settings showed an abysmal 5Mbps of throughput. Raising the TCP window size improved this to 45Mbps, but the real breakthrough came from parallel connections:

# Single connection test
iperf -c remote_host -t 60 -i 10

# Parallel connection test (4 streams)
iperf -c remote_host -t 60 -i 10 -P 4

The parallel test achieved 25Mbps per stream, saturating the 100Mbps link.

Even with optimized TCP settings (window size=256KB, MTU=1500), single FTP transfers capped at 20Mbps. Parallel FTP transfers hit disk I/O bottlenecks:

  • Multiple concurrent disk seeks
  • No sequential read/write patterns
  • Excessive head movement on HDDs

We need tools that can:

  1. Split files into logical chunks
  2. Transfer chunks simultaneously
  3. Reassemble at destination
  4. Minimize disk thrashing

Several tools fit the bill:

1. aria2 (Cross-platform)

aria2c --split=4 --max-connection-per-server=4 http://example.com/largefile.dmp

2. lftp (Linux-focused)

lftp -e "pget -n 4 -c ftp://example.com/largefile.dmp; quit"

3. Custom Python Implementation

import concurrent.futures
import shutil
import requests

def download_chunk(url, start, end, chunk_num):
    # An empty end means "from start to EOF" (used for the last chunk)
    headers = {'Range': f'bytes={start}-{end}'}
    r = requests.get(url, headers=headers, stream=True)
    r.raise_for_status()
    with open(f'chunk_{chunk_num}', 'wb') as f:
        for chunk in r.iter_content(64 * 1024):
            f.write(chunk)

def parallel_download(url, num_threads=4):
    file_size = int(requests.head(url, allow_redirects=True).headers['Content-Length'])
    chunk_size = file_size // num_threads

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = []
        for i in range(num_threads):
            start = i * chunk_size
            # The last chunk runs to EOF to absorb any remainder
            end = start + chunk_size - 1 if i != num_threads - 1 else ''
            futures.append(executor.submit(download_chunk, url, start, end, i))
        for future in futures:
            future.result()  # propagate any download errors

    # Reassemble the chunks in order at the destination
    with open('largefile.dmp', 'wb') as out:
        for i in range(num_threads):
            with open(f'chunk_{i}', 'rb') as part:
                shutil.copyfileobj(part, out)
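
A minimal invocation against the placeholder URL from the earlier examples (the server must support HTTP Range requests for this to work):

parallel_download('http://example.com/largefile.dmp', num_threads=4)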

When dealing with parallel transfers:

  • Use --file-allocation=prealloc in aria2 to reserve space up front and avoid fragmentation (see the sketch after this list)
  • Consider ZFS or other advanced filesystems with good concurrent I/O
  • For HDDs, increase read-ahead: blockdev --setra 8192 /dev/sdX
  • Use --enable-mmap in aria2 to map downloaded files into memory
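
The preallocation trick is easy to replicate in custom tooling too; a minimal sketch using os.posix_fallocate (Linux-specific; the 160GB size is just our example dump):

import os

def preallocate(path, size_bytes):
    # Reserve contiguous space up front so concurrent chunk writers
    # don't interleave allocations and fragment the file
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)

preallocate('largefile.dmp', 160 * 1024**3)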

For enterprise environments:

  • UDP-based protocols: Aspera, Tsunami UDP
  • Application-layer solutions: BBCP, GridFTP
  • Cloud services: AWS S3 multipart uploads (sketched below)
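
For the S3 route, boto3's transfer manager handles the chunking and parallelism itself; a minimal sketch (bucket and key names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

# 64MB parts uploaded by 8 concurrent threads; boto3 splits the file
# into parts and S3 reassembles them server-side
config = TransferConfig(multipart_threshold=64 * 1024**2,
                        multipart_chunksize=64 * 1024**2,
                        max_concurrency=8)
s3 = boto3.client('s3')
s3.upload_file('largefile.dmp', 'example-bucket', 'dumps/largefile.dmp',
               Config=config)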

Calculate the break-even point using:

def should_ship_physically(file_size_gb, bandwidth_mbps, shipping_hours):
    # GB -> megabits (x8000), divided by the link rate in Mbps, then to hours
    transfer_hours = (file_size_gb * 8000) / bandwidth_mbps / 3600
    # Ship the drive only if it arrives in under half the WAN transfer time
    return shipping_hours < transfer_hours / 2
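
For the scenario above, assuming a 16-hour overnight shipping window:

should_ship_physically(160, 100, shipping_hours=16)  # False: the WAN wins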

For our 160GB/100Mbps case: ~3.5 hours transfer vs. overnight shipping.


When transferring massive files like Oracle dumps (160GB+) over WAN links with 100Mbps of bandwidth and 10ms of latency, traditional single-threaded protocols hit fundamental TCP limitations. Our iperf tests revealed:

# Single-threaded iperf (default window size)
$ iperf -c remote_host
[  3]  0.0-10.0 sec  6.25 MBytes  5.00 Mbits/sec

# With tuned window size
$ iperf -c remote_host -w 512K
[  3]  0.0-10.0 sec  56.2 MBytes  45.0 Mbits/sec

# Multi-threaded (4 connections)
$ iperf -c remote_host -P 4
[SUM]  0.0-10.0 sec   125 MBytes   100 Mbits/sec

Even with optimal TCP settings (window scaling, MTU=1500), single-stream FTP transfers plateau at ~20Mbps due to:

  • ACK clocking delays in high-latency paths
  • TCP's congestion avoidance algorithms
  • Disk seek penalties during concurrent transfers

These tools implement BitTorrent-like chunking for single-source transfers:

1. BBCP (Advanced Multi-Stream Copy)

# 8 parallel streams, 8MB window, progress reports every 2 seconds
bbcp -s 8 -w 8M -P 2 /path/to/dump.dmp user@remote:/dest/

# Key flags:
# -s: number of parallel TCP streams
# -w: TCP window size (tune toward the bandwidth-delay product)
# -P: progress report interval in seconds
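
A reasonable starting point is a -w of roughly the bandwidth-delay product divided by the stream count, so the streams collectively fill the pipe without overbuffering.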

2. UFTP (UDP-based File Transfer)

# Sender (multicasts the file over UDP)
uftp -M 224.1.2.3 -p 10432 /export/path/file.dmp

# Receiver daemon (writes incoming files under /dest)
uftpd -M 224.1.2.3 -p 10432 -D /dest

3. Aspera FASP (Commercial Alternative)

Example ascp invocation capped at the 100Mbps link rate:

# -l sets the target transfer rate
ascp -l 100M /path/to/dump.dmp user@remote:/dest/

When parallelizing transfers, mitigate disk contention with:

# Linux: raise the transfer's I/O scheduling priority (best-effort class)
ionice -c 2 -n 0 bbcp [options]

# Windows: enable SMB Multichannel (parallel TCP streams per session)
Set-SmbClientConfiguration -EnableMultiChannel $true -Force

Essential sysctl adjustments for Linux servers:

# /etc/sysctl.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_slow_start_after_idle = 0
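
Load the new values with sysctl -p, and size the maxima to at least the bandwidth-delay product of your longest path.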

Calculate the break-even point:

# Python bandwidth comparison (prep and shipping hours are assumptions)
file_size_gb = 160
effective_wan_mbps = 100    # measured throughput, not the nominal link rate
drive_prep_hours = 1        # copying onto a drive and packaging it
shipping_hours = 16         # overnight courier
physical_transfer_time = drive_prep_hours + shipping_hours               # hours
wan_transfer_time = (file_size_gb * 8000) / effective_wan_mbps / 3600    # hours
if physical_transfer_time < wan_transfer_time:
    print(f"FedEx wins for {file_size_gb}GB at {effective_wan_mbps}Mbps")
else:
    print("WAN transfer wins")