Optimizing Large File Transfers: Parallel Copy Methods on Linux Systems



When dealing with substantial files (in the 10-100GB range) on Linux systems, traditional sequential copying becomes inefficient, and parallel processing is needed to maximize I/O throughput and reduce overall transfer time. While custom multi-threaded solutions exist, we'll explore native Linux approaches that don't require complex programming.

Linux provides several built-in methods for parallel file operations:

# Using GNU parallel with cp
find /source/dir -type f -print0 | parallel -0 -j 8 cp {} /dest/dir

# Using xargs with parallel processes
find /source/dir -type f -print0 | xargs -0 -P 4 -I % cp % /dest/dir

rsync itself runs as a single process, so it has no built-in parallel mode, but it is a robust workhorse for large files and parallelizes well when you run several instances side by side (shown below). Its basic form:

# file_list.txt: a prepared list, one path per line, relative to /source/
rsync --recursive --links --perms --times \
      --human-readable --stats \
      --files-from=file_list.txt \
      --info=progress2 \
      /source/ /destination/

The --info=progress2 flag reports aggregated, whole-transfer progress rather than per-file output; note that it does not make rsync parallel on its own.
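
To actually parallelize rsync, a common pattern is one instance per top-level subdirectory; a minimal sketch, assuming /source is organized into subdirectories (files sitting directly in /source would need a separate pass):

# launch one rsync per subdirectory, 4 at a time
find /source -mindepth 1 -maxdepth 1 -type d -print0 | parallel -0 -j 4 rsync -a --info=progress2 {} /destination/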

When implementing parallel copies:

  • Monitor disk I/O saturation with iostat -x 1
  • Adjust the parallel process count based on your storage subsystem
  • Consider using ionice for better I/O scheduling (see the sketch below)
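
For instance, a minimal sketch combining ionice with the parallel cp pattern above, assuming the best-effort class suits your workload:

# run each cp at the lowest best-effort priority so interactive I/O stays responsive
find /source/dir -type f -print0 | parallel -0 -j 8 ionice -c 2 -n 7 cp {} /dest/dir/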

For filesystems with reflink support (Btrfs, XFS with reflinks enabled, etc.), consider:

cp --reflink=always source_file destination_file

This creates a copy-on-write clone almost instantly: no data blocks are duplicated until one copy is modified. It only works when source and destination are on the same filesystem.
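
Reflink cloning combines naturally with the parallel patterns above; a sketch, assuming both trees live on one reflink-capable filesystem:

# copy-on-write clones in parallel; cp fails outright if reflinks are unsupported
find /source -type f -print0 | parallel -0 -j 8 cp --reflink=always {} /dest/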

After parallel operations, verify file integrity. Hash relative paths so the two listings are comparable (with absolute paths, every line would differ on the path alone):

(cd /source && find . -type f -exec md5sum {} + | sort) > source.md5
(cd /dest && find . -type f -exec md5sum {} + | sort) > dest.md5
diff source.md5 dest.md5
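
On trees this size the hashing itself can become a bottleneck; the same parallel pattern applies (a sketch):

# hash 8 files at a time; GNU parallel keeps each job's output line intact
(cd /source && find . -type f -print0 | parallel -0 -j 8 md5sum | sort) > source.md5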

Let's look at each of these approaches in more detail. The standard cp command processes files one at a time, leaving your storage bandwidth underutilized; every method below works around that by keeping multiple copy streams in flight.

1. GNU Parallel

The most straightforward solution using common utilities:


find /source/path -type f -name "*.large" -print0 | parallel -0 -j 8 cp {} /destination/path/

This command:

  • Finds all target files and prints them NUL-delimited (-print0/-0), so odd filenames survive the pipe
  • Feeds them to GNU Parallel
  • Executes up to 8 copy operations simultaneously (-j 8)
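
GNU Parallel can also report progress as it runs; a small sketch adding its --eta flag to the same command:

find /source/path -type f -name "*.large" -print0 | parallel -0 --eta -j 8 cp {} /destination/path/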

2. xargs with Parallel Execution

For systems without GNU Parallel:


find /source/path -type f -print0 | xargs -0 -P 4 -I {} cp {} /dest/

Key parameters:

  • -print0/-0: Delimits filenames with NUL bytes, so names containing spaces or newlines are handled safely
  • -P 4: Runs 4 parallel processes
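
A useful variant: dropping -I and using cp's -t option lets xargs batch many files into each cp invocation, cutting process-spawn overhead (a sketch; the batch size of 64 is arbitrary):

find /source/path -type f -print0 | xargs -0 -P 4 -n 64 cp -t /dest/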

For massive file operations, consider these optimizations:

I/O Scheduler Tuning


cat /sys/block/sdX/queue/scheduler                    # list schedulers available for this device
echo mq-deadline > /sys/block/sdX/queue/scheduler     # requires root; use 'deadline' on older non-blk-mq kernels

rsync with Parallel Transfer


parallel -j 8 rsync -a {} /destination/ ::: /source/file*

While parallel operations speed up transfers, beware of:

  • Disk seek times (HDD vs SSD performance)
  • Filesystem journaling overhead
  • Available memory for buffering

For best results on NVMe systems, I've found 8-16 parallel processes optimal, while HDDs typically benefit from 4-8 parallel operations.
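
Your mileage will vary with hardware and file mix, so it's worth timing a representative subset first; a rough sketch, where /source/testset and /scratch are hypothetical test and scratch paths:

# time a trial copy at several job counts; dropping the page cache requires root
for j in 4 8 16; do
  sync && echo 3 > /proc/sys/vm/drop_caches   # avoid measuring cached reads
  start=$SECONDS
  find /source/testset -type f -print0 | parallel -0 -j "$j" cp {} /scratch/
  echo "-j $j took $((SECONDS - start))s"
  rm -rf /scratch/*                           # reset for the next run
done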

When built-in tools aren't enough:


# Using fpart (file partitioner) to split the tree into 8 balanced file lists
fpart -n 8 -o chunk /source/path            # writes chunk.0 ... chunk.7
# run one cp job per chunk, all eight concurrently
ls chunk.* | parallel -j 8 "xargs -a {} -d '\n' cp -t /dest/"

Remember to verify copied files using checksums when dealing with critical data.