When moving approximately 1 million MP3 files (averaging 300KB each, roughly 300GB in total) between Ubuntu servers, traditional methods like SCP become painfully slow (around 500KB/s). Single-file HTTP transfers show promising speeds (9-10MB/s), but the challenge lies in efficiently handling the entire dataset.
The primary issues with SCP in this scenario (a quick way to measure the per-file cost is sketched after this list) are:
- Connection overhead per file
- Encryption computational cost
- Lack of parallel processing
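Before committing to a full transfer, it can help to check how much of the slowdown is pure connection overhead rather than bandwidth. This is only a rough sketch using the placeholder paths and host name from the examples below; dividing the elapsed time by the file count gives the approximate per-file setup cost.
# Rough sketch: time 100 single-file scp calls to estimate per-file overhead
time find /source/path -name "*.mp3" | head -n 100 | \
    while IFS= read -r f; do scp -q "$f" user@dest-server:/destination/path/; done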
1. rsync with Parallel Processing
# Install parallel if needed
sudo apt-get install parallel
# Create file list and process in parallel
find /source/path -name "*.mp3" | parallel -j 20 rsync -azR {} user@dest-server:/destination/path
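One detail worth noting: because find emits absolute paths here, rsync -R recreates the full /source/path hierarchy underneath the destination. If only the tree below the source directory should be mirrored, a small variation (same placeholder paths, assuming this layout) is to run find from inside it:
# Variant: paths relative to the source directory, so only its subtree is recreated
cd /source/path && \
    find . -name "*.mp3" | parallel -j 20 rsync -azR {} user@dest-server:/destination/path/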
2. HTTP-based Transfer with wget
# On source server (Python3 one-liner to serve current dir):
python3 -m http.server 8000
# On destination server:
wget -r -np -nc -R "index.html*" http://source-server:8000/
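A recursive wget crawl walks the generated index pages one file at a time, which undersells the HTTP approach. As a sketch (placeholder host and paths; file names containing spaces or special characters would need URL-encoding first), the file-list idea can be reused to fetch many files concurrently with the tools already shown:
# On the source server: publish the tree and build a matching URL list
cd /source/path
python3 -m http.server 8000 &
find . -name "*.mp3" | sed "s|^\./|http://source-server:8000/|" > /tmp/urls.txt
# Copy /tmp/urls.txt to the destination server, then fetch with several workers:
parallel -j 16 wget -q -x -nH -P /destination/path {} :::: /tmp/urls.txt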
3. Tar + SSH Pipe Compression
tar cf - /source/path | pv | ssh user@dest-server "tar xf - -C /destination"
4. High-Performance Alternatives
- bbcp: multi-stream, point-to-point copy tool
bbcp -s 20 -w 2M /source/path/*.mp3 user@dest-server:/destination/path
- UDR: UDP-based Data Transfer
udr rsync -az /source/path user@dest-server:/destination
Rough estimates for the full dataset (~300GB: 1M files x 300KB):

| Method | Transfer Rate | 1M Files ETA |
|---|---|---|
| SCP | 500KB/s | ~7 days |
| rsync (serial) | 2MB/s | ~42 hours |
| Parallel rsync (20 jobs) | 8MB/s | ~10 hours |
| Tar+SSH pipe | 12MB/s | ~7 hours |
| UDR | 15MB/s+ | ~5 hours |
- Increase SSH connection limits in /etc/ssh/sshd_config:
MaxStartups 100:30:200
MaxSessions 200
- Reduce encryption overhead if within a trusted network (the "none" cipher requires an HPN-SSH build; stock OpenSSH can instead use a fast cipher such as aes128-gcm@openssh.com):
rsync -az --rsh="ssh -T -x -o Compression=no -c aes128-gcm@openssh.com" /source/path/ user@dest-server:/destination/path/
- Use faster compression algorithms (zstd across all cores via -T0):
tar -I 'zstd -T0' -cf - /source | ssh dest "tar -I zstd -xf - -C /destination"
For truly massive datasets, consider a distributed approach:
# Build a file list with paths relative to the source root
(cd /source && find . -type f -name "*.mp3") > filelist.txt
# Split it into 10 numbered parts (filelist_part_00 ... filelist_part_09)
split -n l/10 -d filelist.txt filelist_part_
# Execute the parts in parallel (using GNU parallel)
parallel -j 10 "rsync -az --files-from=filelist_part_{} /source/ user@dest:/dest/" ::: {00..09}
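The same part files can also be farmed out to several worker machines with GNU parallel's --sshlogin option. This is only a sketch: worker1 and worker2 are hypothetical hosts assumed to see the same /source tree (for example over NFS) and to have the filelist_part_* files copied to /tmp.
# Run the numbered parts on remote workers instead of locally
parallel --sshlogin worker1,worker2 \
    "rsync -az --files-from=/tmp/filelist_part_{} /source/ user@dest:/dest/" ::: {00..09}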
Transferring a large volume of small files (like 1 million MP3s averaging 300KB each) between Ubuntu servers presents unique challenges. Traditional methods like SCP become painfully slow due to the overhead of establishing new connections for each file. While single-file HTTP transfers show promising speeds (9-10 MB/s), scaling this to bulk operations requires specialized tools.
SCP's bottleneck comes from its encryption overhead and serial file processing. Each file transfer requires:
- SSH connection establishment
- Cryptographic handshake
- Individual file metadata transfer
This creates massive overhead when dealing with millions of files, even though each file is small: at just 100ms of setup cost per file, one million files add roughly 28 hours of pure overhead before any audio data moves.
1. rsync with Parallel Processing
The most efficient solution combines rsync with parallel processing:
# Install parallel if needed
sudo apt-get install parallel
# Create file list for parallel processing
find /source/dir -type f -name "*.mp3" > filelist.txt
# Parallel rsync transfer
cat filelist.txt | parallel -j 8 rsync -azR {} user@remote-server:/destination/dir
Key parameters:
- -j 8: number of parallel jobs (adjust based on CPU cores)
- -a: archive mode (preserves permissions and timestamps)
- -z: compression during transfer
- -R: preserves relative paths
2. Tar Over SSH Pipeline
For maximum throughput with minimal overhead:
# On source server:
tar -cf - /source/dir | pigz -c | ssh user@remote-server "pigz -dc | tar -xf - -C /destination"
This approach:
- Creates a single tar stream
- Compresses with pigz (parallel gzip)
- Pipes directly to destination
- Avoids individual file operations
3. High-Speed Alternatives
For extreme cases, consider:
# BBCP (point-to-point copy)
bbcp -s 16 -w 2M -P 5 /source/dir/*.mp3 user@remote-server:/destination
# UDR (UDP-based Data Transfer)
udr rsync -avz /source/dir/ user@remote-server:/destination
- Increase SSH parallelism: add -o ControlMaster=auto -o ControlPersist=60 to the ssh command line, or set the equivalent options in ~/.ssh/config (a sample entry is sketched after this list)
- Reduce encryption cost: on trusted internal networks, point rsync at a fast cipher, e.g. -e "ssh -c aes128-gcm@openssh.com" (the "none" cipher is only available in HPN-SSH builds)
- Batch processing: split transfers into batches of about 50,000 files each
- Filesystem tuning: on the destination, mount with the noatime,nodiratime options
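For the ControlMaster settings, the command-line -o flags can also be made permanent in the client's ~/.ssh/config. A minimal sketch, assuming a hypothetical host alias "dest" for the destination server (ControlPath is required so the multiplexed connection can actually be found and reused):
Host dest
    HostName remote-server
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 60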
For long-running transfers, monitor with:
# Progress viewer for rsync
rsync -avz --progress /source/dir/ user@remote-server:/destination
# Alternative: count files handed off to parallel as a rough progress indicator
pv -l filelist.txt | parallel -j 8 rsync -azR {} user@remote-server:/destination
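With a million files, --progress prints one line per file and quickly becomes noise. If the installed rsync is version 3.1 or newer, --info=progress2 reports a single whole-transfer figure instead (same placeholder paths as above):
# Overall percentage and throughput for the whole transfer (rsync >= 3.1)
rsync -az --info=progress2 /source/dir/ user@remote-server:/destination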