When dealing with directory trees containing approximately 80,000 small files, rsync's default behavior becomes inefficient for detecting unchanged content. Each file requires metadata transmission across the network, which becomes painfully slow on connections with just 1.2MB/s bandwidth.
The fundamental issue lies in rsync's file-by-file comparison mechanism. Our benchmark shows:
time ssh remotehost 'cd target_dir && ls -lLR' > filelist
real 0m2.645s
time scp remotehost:/tmp/list local/
real 0m2.821s
These baseline operations finish in under three seconds, so the bottleneck is not listing the tree or shipping the file list itself but rsync's per-file metadata exchange over the slow link.
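Before reaching for workarounds, it helps to confirm where rsync actually spends its time. A dry run with --stats (the paths here are placeholders) separates file-list generation and transfer time from the data that would actually be sent:
# No data is transferred with -n; --stats reports the scan and file-list costs
rsync -an --stats /path/to/directory/ user@remote:/path/ | grep -E 'Number of files|File list (generation|transfer) time|Total transferred'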
1. Using Directory Checksums
A workaround involves creating a composite checksum for entire directory trees:
# Generate a directory fingerprint (sort so the hash is stable across runs)
find /path/to/directory -type f -exec md5sum {} + | sort | md5sum | cut -d' ' -f1 > dir_hash.txt
# A dry run with --checksum and --itemize-changes prints output only when the
# fingerprint differs on the remote, i.e. when something in the tree changed
if [ -n "$(rsync --dry-run --itemize-changes --checksum dir_hash.txt user@remote:/path/)" ]; then
    rsync -avz /path/to/directory/ user@remote:/path/
    rsync -z dir_hash.txt user@remote:/path/  # keep the remote fingerprint current
fi
2. Alternative Tools for Large File Sets
Consider these alternatives when rsync becomes impractical:
- Rdiff-backup: Maintains historical snapshots efficiently
- BorgBackup: Deduplicates data segments automatically
- ZFS send/receive: For filesystems supporting snapshots
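If both ends run ZFS, incremental snapshot replication sidesteps per-file scanning entirely. A minimal sketch, assuming a hypothetical dataset tank/data and an earlier snapshot @prev that already exists on the backup host:
# Snapshot, then send only the blocks changed since the last replicated snapshot
zfs snapshot tank/data@today
zfs send -i tank/data@prev tank/data@today | ssh user@remote zfs receive backup/data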
3. Modern Cloud-Native Approach
As noted in the 2023 update, moving to object storage solutions like AWS S3, Google Cloud Storage, or Azure Blob Storage often provides better scalability:
# AWS CLI sync example (--size-only transfers only files whose size changed; drop it if same-size edits must also be caught)
aws s3 sync local_directory s3://bucket/path/ --size-only
When implementing directory checksums:
- Cache the checksum results to avoid recalculating them on every run (see the sketch after the xxHash example below)
- Handle edge cases like permission changes
- Consider using faster hash algorithms like xxHash
# Example implementation with xxHash
find /path -type f -print0 | sort -z | xargs -0 xxhsum | xxhsum
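Following up on the caching tip above, a minimal sketch (the paths and cache file are hypothetical) that stores the last fingerprint and skips the sync when nothing changed:
# Recompute the fingerprint and compare it with the value cached by the previous run
new_hash=$(find /path/to/directory -type f -print0 | sort -z | xargs -0 xxhsum | xxhsum | cut -d' ' -f1)
old_hash=$(cat /var/cache/dir_hash.cache 2>/dev/null)
if [ "$new_hash" != "$old_hash" ]; then
    rsync -avz /path/to/directory/ user@remote:/path/ \
        && echo "$new_hash" > /var/cache/dir_hash.cache  # update the cache only on success
fi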
While rsync remains a versatile tool, understanding its limitations with large file collections is crucial. The solutions presented here trade off complexity against efficiency; choose based on the constraints of your environment.
To recap the core problem: with directory trees containing 80,000+ small files, a standard rsync run stays painfully slow even when no files have changed. The overhead comes from:
- Metadata exchange for every single file
- Network latency compounding with each file check
- The complete file list being rebuilt and transmitted on every run, even when nothing changed
# Basic rsync (problematic for 80K files)
rsync -avz /source/dir user@backup:/destination/
Better approaches include:
1. Directory Hashing with --checksum
# Build a per-file manifest (sha1sum cannot hash directories, so hash files and sort the output)
find /source/dir -type f -exec sha1sum {} + | sort > dir_hashes.sha1
# --checksum makes rsync skip the manifest when its content is unchanged, despite its fresh timestamp
rsync -avz --checksum dir_hashes.sha1 user@backup:/destination/
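Pushing the manifest moves no file data by itself; the full sync still has to be gated on whether the manifest changed. A minimal sketch of that gate, run before the manifest is pushed, assuming the previous run left its copy at /destination/dir_hashes.sha1 on the backup host:
# Run the full sync only when the fresh manifest differs from last run's remote copy
if ! ssh user@backup "cat /destination/dir_hashes.sha1 2>/dev/null" | cmp -s - dir_hashes.sha1; then
    rsync -avz /source/dir/ user@backup:/destination/
fi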
2. Two-Phase Sync Strategy
# Phase 1: dry run that itemizes what would change (it still walks every file, but transfers nothing)
changes=$(rsync -a --dry-run --itemize-changes /source/ user@backup:/destination/)
# Phase 2: run the real sync only if the dry run reported any changes
if [[ -n "$changes" ]]; then
    rsync -avz /source/ user@backup:/destination/
fi
3. Rclone with --fast-list
# --fast-list cuts listing calls on remotes that support recursive listing (object stores);
# --size-only compares sizes only, skipping the modification-time check
rclone sync --fast-list --size-only /source/ remote:backup/
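To verify a finished run without copying anything, rclone can compare the two sides directly (same remote name as above):
# Report files that are missing or differ in size on the destination
rclone check --one-way --size-only /source/ remote:backup/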
4. Borg Backup (Deduplication Built-in)
# Initial backup
borg init --encryption=repokey /backup/repo
# Subsequent backups (automatic deduplication)
borg create /backup/repo::archive-{now} /source/dir
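Repository maintenance is worth scripting alongside the backup itself; a short sketch with hypothetical retention values:
# Keep a rolling window of archives and reclaim space from expired ones
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /backup/repo
# List the archives currently stored
borg list /backup/repo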
For object-storage targets:
# AWS S3 sync example
aws s3 sync /source/dir s3://bucket/path --size-only --no-follow-symlinks
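Restores run the same command in the opposite direction; a sketch with a hypothetical local restore path:
# Pull the bucket contents back down; only missing or size-changed files are fetched
aws s3 sync s3://bucket/path /restore/dir --size-only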
Approximate timings for an 80,000-file tree, unchanged vs. changed:

| Method | Unchanged Dir Time | Changed Dir Time |
|---|---|---|
| Basic rsync | 5m | 5m+ |
| --checksum + find | 0m12s | 1m45s |
| Rclone | 0m8s | 3m22s |
| S3 sync | 0m4s | 2m18s |