Robust Large File Transfer Solutions for Unstable Networks: Chunk Upload with Retry Mechanism

When dealing with large file transfers (30+ minutes per file) over unreliable broadband connections, traditional tools like SCP often prove inadequate. The most frustrating scenario isn't outright failure - it's when transfers appear to continue running but actually stall without any error notification. This silent failure mode wastes significant time and resources.

A naive approach might involve wrapping SCP in a retry loop, but this fails to address the core issues:


# Problematic approach (don't use this)
while ! scp largefile.dat user@remote:/path/; do
    echo "Transfer failed, retrying..."
    sleep 5
done

This doesn't handle partial transfers or detect stalls - it only retries after complete failures.

The proper solution combines three key techniques:

1. File Chunking

Split files into manageable pieces (e.g., 100MB chunks):


# Split file into 100MB chunks
split -b 100M largefile.dat largefile_part_

# Reassemble on remote server
cat largefile_part_* > largefile.dat

2. Reliable Transfer Protocol

Consider these alternatives to SCP:

  • rsync: Built-in partial transfer resumption
  • lftp: Advanced retry logic and parallel transfers
  • rclone: Cloud-oriented but works with SFTP (see the sketch below)
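
rsync and lftp are demonstrated below. For rclone, a minimal upload sketch might look like the following, assuming an SFTP remote named remote-sftp has already been created with rclone config:


# Hypothetical rclone upload over SFTP with retries and progress output
# (remote-sftp is a placeholder remote configured via `rclone config`)
rclone copy largefile.dat remote-sftp:/path/ \
       --progress --retries 10 --low-level-retries 20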

3. Stalled Transfer Detection

Implement active monitoring for transfer stalls:


# Example using rsync with progress monitoring
rsync --progress --timeout=300 --partial \
      --checksum largefile.dat user@remote:/path/

# Alternative with lftp
lftp -e "set net:reconnect-interval-base 60; \
          set net:max-retries 10; \
          put largefile.dat -o /remote/path/; \
          exit" sftp://user@remote

Here's a bash script that combines chunking, per-chunk retries with backoff, and transfer timeouts:


#!/bin/bash

# Configuration
CHUNK_SIZE=100M
MAX_RETRIES=5
TIMEOUT=300
REMOTE="user@remote:/path/"

# Split file
echo "Splitting file into chunks..."
split -b $CHUNK_SIZE "$1" "${1}_part_"

# Transfer each chunk with retries
for chunk in "${1}_part_"*; do
    retry=0
    while [ $retry -lt $MAX_RETRIES ]; do
        echo "Transferring $chunk (attempt $((retry+1)))"
        timeout $TIMEOUT rsync --progress --partial \
                               --checksum "$chunk" "$REMOTE"
        
        if [ $? -eq 0 ]; then
            echo "Chunk transferred successfully"
            break
        fi
        
        echo "Transfer failed, retrying..."
        ((retry++))
        sleep $((retry * 10))
    done
    
    if [ $retry -eq $MAX_RETRIES ]; then
        echo "ERROR: Failed to transfer $chunk after $MAX_RETRIES attempts"
        exit 1
    fi
done

echo "All chunks transferred successfully"

For production environments, consider these specialized tools:

  • Aspera: Commercial high-speed transfer protocol
  • BBCP: Multi-stream point-to-point copy tool from SLAC (see the sketch below)
  • UDR: UDP-based data transfer
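
As an illustration, a bbcp invocation might look like the sketch below; the stream count, window size, and paths are assumptions to tune for the actual link:


# Hypothetical bbcp transfer: 8 parallel TCP streams, 8 MB window,
# progress report every 10 seconds
bbcp -s 8 -w 8M -P 10 large_file.dat user@remote:/path/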

Remember to verify file integrity after transfer using checksums:


# Generate checksum
md5sum largefile.dat > largefile.md5

# Verify on the remote host (copy largefile.md5 there alongside the file)
md5sum -c largefile.md5

When transferring multi-gigabyte files across unreliable connections, traditional tools like SCP reveal critical limitations:

scp -P 22 large_file.dat user@remote:/path/

The connection may freeze without proper timeout detection - the process keeps running but makes zero progress. TCP keepalives often fail to catch this "zombie transfer" state.

Splitting files into smaller segments provides multiple advantages:

  • Individual failed chunks can be retried independently
  • Transfer progress can be tracked per chunk (see the sketch after this list)
  • Bandwidth fluctuations affect only the chunk currently in flight, not the whole file
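
As a small illustration of the second point, per-chunk progress can be reported with a simple file count; this sketch assumes finished chunks get moved into a done/ directory:


# Sketch: report how many chunks have been transferred so far
total=$(ls chunk_* 2>/dev/null | wc -l)
sent=$(ls done/chunk_* 2>/dev/null | wc -l)
echo "Progress: ${sent}/${total} chunks"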

The table below compares commonly used tools:

Tool     Protocol         Resume   Chunking
rsync    SSH/RSYNC        Yes      Delta only
lftp     FTP/HTTP         Yes      Manual
aria2    Multi-protocol   Yes      Auto
bbftp    FTP              Yes      Configurable
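
Of these, aria2 pulls rather than pushes, so it suits fetching the file from the remote end. A minimal sketch, where the URL and connection counts are placeholders:

# Hypothetical aria2 pull with resume (-c), 4 parallel connections,
# and automatic retries
aria2c -c -x 4 -s 4 --max-tries=10 --retry-wait=30 \
       "http://remote.example.com/files/large_file.dat"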

This bash script implements robust chunked transfers:

#!/bin/bash
# Split file into 100MB chunks
split -b 100M large_file.dat chunk_

# Upload all chunks with lftp's built-in retry logic, then clean up locally
lftp -u user,pass -e "set net:max-retries 10; mput chunk_*; exit" sftp://server && rm chunk_*

To detect frozen SCP transfers, wrap the command in a timeout and enable SSH keepalive checks:

timeout 3600 scp -o ServerAliveInterval=60 \
                 -o ServerAliveCountMax=5 \
                 large_file.dat user@remote:/path/

# Verify completion
if [ $? -eq 124 ]; then
  echo "Transfer timed out - implement resume logic"
fi
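
One way to implement that resume logic is to fall back to rsync, which can continue from the partial file on the remote side instead of starting over. A minimal sketch; --append-verify re-checksums the data already transferred before appending:

# Resume the interrupted transfer instead of restarting from zero
rsync --partial --append-verify --progress \
      large_file.dat user@remote:/path/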

For mission-critical transfers, consider UDP-based protocols:

# Aspera CLI example
ascp --policy=fair \
     --target-rate=50M \
     --mode=send \
     large_file.dat \
     user@remote:/path/