When dealing with directory trees containing approximately 80,000 small files, rsync's default behavior becomes inefficient for detecting unchanged content. Each file requires metadata transmission across the network, which becomes painfully slow on connections with just 1.2MB/s bandwidth.
The fundamental issue lies in rsync's file-by-file comparison mechanism. Our benchmark shows:
time ssh remotehost 'cd target_dir && ls -lLR' > filelist
real 0m2.645s
time scp remotehost:/tmp/list local/
real 0m2.821s
These baseline operations finish in under three seconds, so the bottleneck is not listing the tree or shipping the file list itself but rsync's per-file metadata exchange over the slow link.
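Before reaching for workarounds, it helps to confirm where rsync actually spends its time. A dry run with --stats (the paths here are placeholders) separates file-list generation and transfer time from the data that would actually be sent:
# No data is transferred with -n; --stats reports the scan and file-list costs
rsync -an --stats /path/to/directory/ user@remote:/path/ | grep -E 'Number of files|File list (generation|transfer) time|Total transferred'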
1. Using Directory Checksums
A workaround involves creating a composite checksum for entire directory trees:
# Generate a directory fingerprint (sort so the hash is stable across runs)
find /path/to/directory -type f -exec md5sum {} + | sort | md5sum | cut -d' ' -f1 > dir_hash.txt
# A dry run with --checksum and --itemize-changes prints output only when the
# fingerprint differs on the remote, i.e. when something in the tree changed
if [ -n "$(rsync --dry-run --itemize-changes --checksum dir_hash.txt user@remote:/path/)" ]; then
    rsync -avz /path/to/directory/ user@remote:/path/
    rsync -z dir_hash.txt user@remote:/path/  # keep the remote fingerprint current
fi
2. Alternative Tools for Large File Sets
Consider these alternatives when rsync becomes impractical:
- Rdiff-backup: Maintains historical snapshots efficiently
- BorgBackup: Deduplicates data segments automatically
- ZFS send/receive: For filesystems supporting snapshots
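If both ends run ZFS, incremental snapshot replication sidesteps per-file scanning entirely. A minimal sketch, assuming a hypothetical dataset tank/data and an earlier snapshot @prev that already exists on the backup host:
# Snapshot, then send only the blocks changed since the last replicated snapshot
zfs snapshot tank/data@today
zfs send -i tank/data@prev tank/data@today | ssh user@remote zfs receive backup/data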
3. Modern Cloud-Native Approach
As noted in the 2023 update, moving to object storage solutions like AWS S3, Google Cloud Storage, or Azure Blob Storage often provides better scalability:
# AWS CLI sync example (--size-only transfers only files whose size changed; drop it if same-size edits must also be caught)
aws s3 sync local_directory s3://bucket/path/ --size-only
When implementing directory checksums:
- Cache the checksum results to avoid recalculating them on every run (see the sketch after the xxHash example below)
- Handle edge cases like permission changes
- Consider using faster hash algorithms like xxHash
# Example implementation with xxHash
find /path -type f -print0 | sort -z | xargs -0 xxhsum | xxhsum
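Following up on the caching tip above, a minimal sketch (the paths and cache file are hypothetical) that stores the last fingerprint and skips the sync when nothing changed:
# Recompute the fingerprint and compare it with the value cached by the previous run
new_hash=$(find /path/to/directory -type f -print0 | sort -z | xargs -0 xxhsum | xxhsum | cut -d' ' -f1)
old_hash=$(cat /var/cache/dir_hash.cache 2>/dev/null)
if [ "$new_hash" != "$old_hash" ]; then
    rsync -avz /path/to/directory/ user@remote:/path/ \
        && echo "$new_hash" > /var/cache/dir_hash.cache  # update the cache only on success
fi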
While rsync remains a versatile tool, understanding its limitations with large file collections is crucial. The solutions presented here trade off complexity against efficiency; choose based on the constraints of your environment.
To recap the core problem: with directory trees containing 80,000+ small files, a standard rsync run stays painfully slow even when no files have changed. The overhead comes from:
- Metadata exchange for every single file
- Network latency compounding with each file check
- The complete file list being rebuilt and transmitted on every run, even when nothing changed
# Basic rsync (problematic for 80K files)
rsync -avz /source/dir user@backup:/destination/
Better approaches include:
1. Directory Hashing with --checksum
# Build a per-file manifest (sha1sum cannot hash directories, so hash files and sort the output)
find /source/dir -type f -exec sha1sum {} + | sort > dir_hashes.sha1
# --checksum makes rsync skip the manifest when its content is unchanged, despite its fresh timestamp
rsync -avz --checksum dir_hashes.sha1 user@backup:/destination/
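Pushing the manifest moves no file data by itself; the full sync still has to be gated on whether the manifest changed. A minimal sketch of that gate, run before the manifest is pushed, assuming the previous run left its copy at /destination/dir_hashes.sha1 on the backup host:
# Run the full sync only when the fresh manifest differs from last run's remote copy
if ! ssh user@backup "cat /destination/dir_hashes.sha1 2>/dev/null" | cmp -s - dir_hashes.sha1; then
    rsync -avz /source/dir/ user@backup:/destination/
fi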
2. Two-Phase Sync Strategy
# Phase 1: dry run that itemizes what would change (it still walks every file, but transfers nothing)
changes=$(rsync -a --dry-run --itemize-changes /source/ user@backup:/destination/)
# Phase 2: run the real sync only if the dry run reported any changes
if [[ -n "$changes" ]]; then
    rsync -avz /source/ user@backup:/destination/
fi
3. Rclone with --fast-list
# --fast-list cuts listing calls on remotes that support recursive listing (object stores);
# --size-only compares sizes only, skipping the modification-time check
rclone sync --fast-list --size-only /source/ remote:backup/
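To verify a finished run without copying anything, rclone can compare the two sides directly (same remote name as above):
# Report files that are missing or differ in size on the destination
rclone check --one-way --size-only /source/ remote:backup/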
4. Borg Backup (Deduplication Built-in)
# Initial backup
borg init --encryption=repokey /backup/repo
# Subsequent backups (automatic deduplication)
borg create /backup/repo::archive-{now} /source/dir
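Repository maintenance is worth scripting alongside the backup itself; a short sketch with hypothetical retention values:
# Keep a rolling window of archives and reclaim space from expired ones
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /backup/repo
# List the archives currently stored
borg list /backup/repo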
For object-storage targets:
# AWS S3 sync example
aws s3 sync /source/dir s3://bucket/path --size-only --no-follow-symlinks
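Restores run the same command in the opposite direction; a sketch with a hypothetical local restore path:
# Pull the bucket contents back down; only missing or size-changed files are fetched
aws s3 sync s3://bucket/path /restore/dir --size-only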
Approximate timings for an 80,000-file tree, unchanged vs. changed:

| Method | Unchanged Dir Time | Changed Dir Time |
|---|---|---|
| Basic rsync | 5m | 5m+ |
| --checksum + find | 0m12s | 1m45s |
| Rclone | 0m8s | 3m22s |
| S3 sync | 0m4s | 2m18s |