Optimized Large-Scale File Transfer: Efficient Methods to Move Millions of Small MP3s Between Ubuntu Servers


When transferring approximately 1 million MP3 files (averaging 300KB each) between Ubuntu servers, traditional methods like SCP become painfully slow (around 500KB/s). Single-file HTTP transfers show promising speeds (9-10MB/s), but the challenge lies in efficiently handling the entire dataset.

The primary issues with SCP in this scenario are:

  • Connection overhead per file
  • Encryption computational cost
  • Lack of parallel processing
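
Before picking a method, it is worth confirming the size of the job, since the file count and total bytes determine how long any approach will take. A quick check on the source server (same placeholder path as the examples below):

# How many files, and how many bytes in total?
find /source/path -type f -name "*.mp3" | wc -l
du -sh /source/path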

1. rsync with Parallel Processing

# Install parallel if needed
sudo apt-get install parallel

# Create file list and process in parallel
find /source/path -name "*.mp3" | parallel -j 20 rsync -azR {} user@dest-server:/destination/path
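
Note that this still starts one rsync, and therefore one SSH connection, per file; the 20 parallel jobs just hide the handshake latency. If the handshakes themselves become the limiter, GNU parallel can hand each rsync a batch of files instead. A sketch using its -m (multiple arguments) option:

# Each rsync invocation receives as many file paths as fit on one command line,
# amortizing the SSH handshake over hundreds of files
find /source/path -name "*.mp3" | parallel -j 20 -m rsync -azR {} user@dest-server:/destination/path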

2. HTTP-based Transfer with wget

# On source server (Python3 one-liner to serve current dir):
python3 -m http.server 8000

# On destination server:
wget -r -np -nc -R "index.html*" http://source-server:8000/
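
The recursive wget walks the generated index pages one request at a time, so it rarely sustains the full 9-10MB/s on its own. A hedged way to parallelize it, assuming filenames contain no spaces or characters that would need URL-encoding or shell quoting, is to publish a file list and fan the downloads out with xargs:

# On the source server (run from the directory being served):
find . -type f -name "*.mp3" | sed 's|^\./||' > files.txt
python3 -m http.server 8000

# On the destination server: fetch the list, then run 16 wget processes with 20 URLs each,
# recreating the directory structure (-x) without a hostname directory (-nH)
wget -q http://source-server:8000/files.txt
sed 's|^|http://source-server:8000/|' files.txt | xargs -P 16 -n 20 wget -q -x -nH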

3. Tar + SSH Pipe

MP3 data is already compressed, so the stream is sent as-is; pv in the middle reports live throughput:

tar cf - /source/path | pv | ssh user@dest-server "tar xf - -C /destination"
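
If this pipe turns out to be CPU-bound on encryption rather than network-bound, choosing a cheaper AEAD cipher usually helps; a minimal variant of the same pipeline:

# Same single-stream pipe, with a lightweight cipher to cut encryption overhead
tar cf - /source/path | pv | ssh -c aes128-gcm@openssh.com user@dest-server "tar xf - -C /destination"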

4. High-Performance Alternatives

  • bbcp: Multi-stream point-to-point copy tool
    bbcp -r -s 16 -w 2M /source/path user@dest-server:/destination
  • UDR: UDP-based Data Transfer
    udr rsync -az /source/path user@dest-server:/destination

Rough expectations for the full dataset (1M files ≈ 300GB):

Method                     Transfer rate   ETA for 1M files
SCP                        500KB/s         ~7 days
rsync (serial)             2MB/s           ~42 hours
Parallel rsync (20 jobs)   8MB/s           ~10 hours
Tar + SSH pipe             12MB/s          ~7 hours
UDR                        15MB/s+         ~5 hours
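
The ETAs are simple arithmetic on the ~300GB total, so the same back-of-the-envelope estimate can be redone for any rate you actually measure:

# Hours to move 1,000,000 x 300KB at a sustained 8MB/s (values are illustrative)
echo $(( (1000000 * 300) / (8 * 1024) / 3600 )) hours   # prints "10 hours"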

A few server-side tweaks help whichever method you pick:

  • Increase SSH connection limits in /etc/ssh/sshd_config:
    MaxStartups 100:30:200
    MaxSessions 200
  • Reduce encryption cost inside a trusted network. Stock OpenSSH no longer accepts -c none (that requires HPN-SSH patches), so either pick a cheap AEAD cipher:
    rsync -az --rsh="ssh -T -c aes128-gcm@openssh.com -o Compression=no -x" /source/path/ user@dest-server:/destination/path/
    or skip SSH entirely with an rsync daemon (see the sketch after this list).
  • Use faster compression algorithms:
    tar -I 'zstd -T0' -cf - /source | ssh dest "tar -I zstd -xf - -C /destination"
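
Where encryption genuinely is not needed, the cleanest way to avoid it is to skip SSH entirely and run rsync in daemon mode on the destination. A minimal sketch, with an illustrative module name and the usual placeholder paths; add authentication and host restrictions for anything beyond a trusted LAN:

# /etc/rsyncd.conf on the destination server
[mp3in]
    path = /destination/path
    read only = false
    uid = root          # or a dedicated user that can write to the destination
    use chroot = true

# Start the daemon on the destination, then push from the source without SSH:
sudo rsync --daemon
rsync -a /source/path/ rsync://dest-server/mp3in/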

For truly massive datasets, consider a distributed approach:

# Build a file list relative to /source, then split it into 10 numbered chunks
# (filelist_part_00 .. filelist_part_09)
find /source -type f -name "*.mp3" -printf '%P\n' > filelist.txt
split -d -n l/10 filelist.txt filelist_part_

# Execute the chunks in parallel (using GNU parallel)
parallel -j 10 "rsync -az --files-from=filelist_part_{} /source user@dest:/dest" ::: {00..09}
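
If a single source machine cannot saturate the link, the same chunked file lists can be farmed out to several worker hosts. A rough sketch, assuming the workers (worker1..worker3 are placeholders) can see /source and the filelist_part_* chunks over shared storage such as NFS, and each has SSH access to the destination:

# Launch one chunk per worker and wait for all of them to finish
i=0
for w in worker1 worker2 worker3; do
  part=$(printf 'filelist_part_%02d' "$i")
  ssh "$w" "rsync -az --files-from=/shared/$part /source user@dest:/dest" &
  i=$((i+1))
done
wait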

To expand on the approaches above: moving a million small files is dominated by per-file overhead rather than raw bandwidth. A million MP3s at roughly 300KB each is only about 300GB, yet SCP crawls because it processes files serially and pays connection overhead on every one, while a single-file HTTP transfer at 9-10MB/s shows the link itself is not the problem.

SCP's bottleneck comes from its encryption overhead and serial file processing. Each file transfer requires:

  • SSH connection establishment
  • Cryptographic handshake
  • Individual file metadata transfer

This creates massive overhead when dealing with millions of files, even though each file is small.
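
Connection multiplexing removes most of that per-file handshake cost: with ControlMaster enabled, every ssh/scp/rsync invocation after the first reuses one already-authenticated connection. A sketch of the relevant ~/.ssh/config entry on the source server (host alias as used elsewhere in this post):

# ~/.ssh/config on the source server
Host remote-server
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h-%p
    ControlPersist 10m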

1. rsync with Parallel Processing

The most efficient solution combines rsync with parallel processing:


# Install parallel if needed
sudo apt-get install parallel

# Create file list for parallel processing
find /source/dir -type f -name "*.mp3" > filelist.txt

# Parallel rsync transfer
cat filelist.txt | parallel -j 8 rsync -azR {} user@remote-server:/destination/dir

Key parameters:

  • -j 8: Number of parallel jobs (tune to CPU cores and available bandwidth)
  • -a: Archive mode (preserves permissions, timestamps)
  • -z: Compression during transfer (of limited benefit for already-compressed MP3 data)
  • -R: Use relative paths, so the source directory structure is recreated on the destination

2. Tar Over SSH Pipeline

For maximum throughput with minimal overhead:


# On source server:
tar -cf - /source/dir | pigz -c | ssh user@remote-server "pigz -dc | tar -xf - -C /destination"

This approach:

  • Creates a single tar stream
  • Compresses with pigz (parallel gzip); MP3 data is already compressed, so drop the pigz stages if they become the CPU bottleneck
  • Pipes directly to destination
  • Avoids individual file operations
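
Because this is one long stream, a single pv stage can also show percent done and an ETA if it is told the total size up front. A small variant that omits the pigz stages (MP3 data barely compresses) and uses GNU du to get the byte count:

# Feed pv the total size so it can display progress and ETA
SIZE=$(du -sb /source/dir | cut -f1)
tar -cf - /source/dir | pv -s "$SIZE" | ssh user@remote-server "tar -xf - -C /destination"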

3. High-Speed Alternatives

For extreme cases, consider:


# BBCP (multi-stream point-to-point copy); a glob over ~1M files would overflow the
# shell's argument limit, so copy the directory recursively instead
bbcp -r -s 16 -w 2M -P 5 /source/dir user@remote-server:/destination

# UDR (UDP-based Data Transfer)
udr rsync -avz /source/dir/ user@remote-server:/destination
  • Reuse SSH connections: add -o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h-%p -o ControlPersist=60 so repeated rsync/scp invocations share one authenticated connection
  • Reduce encryption cost: stock OpenSSH rejects -c none (it needs HPN-SSH patches); on an internal network use a cheap cipher such as -e "ssh -c aes128-gcm@openssh.com" with rsync, or an rsync daemon
  • Batch processing: Split transfers into directories of 50,000 files each
  • Filesystem tuning: On the destination, use noatime,nodiratime mount options (see the example after this list)
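
For the filesystem tuning point above, a minimal example of applying the options to an already-mounted destination volume (the mount point is a placeholder; add the options to /etc/fstab to make them permanent):

# Stop the destination filesystem from writing access times for every received file
sudo mount -o remount,noatime,nodiratime /destination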

For long-running transfers, monitor with:


# Progress viewer for rsync (--info=progress2 shows a single aggregate progress line instead)
rsync -avz --progress /source/dir/ user@remote-server:/destination

# Rough overall progress for the parallel approach: pv -l counts file names as they are
# dispatched (GNU parallel's --bar option is another way to get a progress bar)
pv -l filelist.txt | parallel -j 8 rsync -azR {} user@remote-server:/destination
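
After a transfer this long it is worth verifying the result before touching the source data: a simple count comparison, plus an rsync checksum dry run that prints anything that would still need transferring:

# Compare file counts on both sides
find /source/dir -type f -name "*.mp3" | wc -l
ssh user@remote-server 'find /destination -type f -name "*.mp3" | wc -l'

# Checksum-based dry run: prints the name of any file that still differs
rsync -rnc --out-format='%n' /source/dir/ user@remote-server:/destination/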