When dealing with 800,000+ files across 4,000 folders in a shallow (max 2-level) directory structure, traditional sync tools like rsync hit fundamental scalability limitations. The 12-hour file listing phase becomes unacceptable when daily changes represent less than 0.5% of total files (a few thousand updates and 1-2K new files).
The core issue stems from how traditional tools traverse filesystems:
```java
// Classic recursive traversal (problematic at scale)
void traverse(File dir) {
    File[] children = dir.listFiles();   // one readdir per directory
    if (children == null) {
        return;                           // unreadable directory
    }
    for (File f : children) {
        if (f.isDirectory()) {
            traverse(f);
        } else {
            // stat each file individually -- one round trip per file on a network mount
        }
    }
}
```
This O(n) walk stats every file individually; with n = 800,000, the listing phase alone dominates the sync window, especially over network mounts.
For enterprise-scale sync, consider these architectural patterns:
1. Change Journal-Based Synchronization
Leverage filesystem change journals (NTFS USN Journal, inotify, etc.) to track modifications:
```bash
# Linux inotifywait example for monitoring changes
inotifywait -m -r --format '%w%f' --event modify,create,delete /data |
while read file; do
    echo "$(date): $file changed" >> /var/log/sync_changes.log
done
```
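At sync time the accumulated log can be collapsed into a unique set of relative paths and handed to rsync, so the nightly run only touches journaled changes instead of walking all 800K files. A minimal sketch, assuming the log format from the command above (the helper name is illustrative, and deletions would still need separate handling):

```python
import subprocess

def sync_journaled_changes(log_path='/var/log/sync_changes.log',
                           source='/data/', dest='user@remote:/data/'):
    """Deduplicate logged paths and hand only those files to rsync."""
    changed = set()
    with open(log_path) as log:
        for line in log:
            # each log line looks like "<date>: /data/folder/file changed"
            idx = line.find(source)
            if idx == -1:
                continue
            full_path = line[idx:].rsplit(' changed', 1)[0].strip()
            changed.add(full_path[len(source):])  # rsync wants paths relative to source

    with open('filelist.txt', 'w') as out:
        out.writelines(p + '\n' for p in sorted(changed))

    # only the journaled changes are transferred
    subprocess.run(['rsync', '-av', '--files-from=filelist.txt', source, dest],
                   check=True)
```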
2. Database-Backed Indexing
Maintain a persistent file metadata database:
```sql
-- SQL schema example for tracking sync state
CREATE TABLE file_manifest (
    path          VARCHAR(1024) PRIMARY KEY,
    last_modified BIGINT,
    checksum      VARCHAR(64),
    sync_status   ENUM('pending','synced','failed'),
    INDEX idx_last_modified (last_modified)
);

-- Finding changed files becomes an index lookup instead of a filesystem walk
SELECT path FROM file_manifest
WHERE last_modified > [last_sync_time];
```
Consider these specialized tools:
1. PeerSync Enterprise
Uses a combination of:
- Persistent file catalogs with binary tree indexing
- Multi-threaded differential transfers
- Configurable comparison methods (timestamps, checksums, or both)
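PeerSync's internals aren't published, but the configurable-comparison idea is easy to sketch: trust cheap stat metadata first and only pay for a full read when checksum verification is requested. The function and parameter names below are invented for illustration:

```python
import hashlib
import os

def file_needs_sync(path, known_mtime, known_size,
                    known_sha256=None, use_checksum=False):
    """Return True if a file should be re-transferred.

    Timestamp/size comparison costs one stat call; the checksum path
    reads the whole file and is only used when explicitly enabled.
    """
    st = os.stat(path)
    if int(st.st_mtime) != int(known_mtime) or st.st_size != known_size:
        return True
    if not use_checksum:
        return False
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for block in iter(lambda: fh.read(1 << 20), b''):
            digest.update(block)
    return digest.hexdigest() != known_sha256
```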
2. Syncthing with Modifications
The open-source solution can be enhanced for large deployments:
```go
// Custom scanner implementation for large repos
type fastScanner struct {
	db         *bolt.DB            // embedded key-value store for file metadata
	ignoreDirs map[string]struct{} // directories excluded from scanning
}

func (s *fastScanner) Walk() error {
	// Breadth-first traversal: list each directory once, enqueue subdirectories,
	// and persist file metadata to the local DB so later runs only diff
	// against stored state instead of re-walking everything.
	return nil // traversal logic omitted
}
```
A financial institution solved similar challenges with:
- Initial full sync using parallelized robocopy
- Daily diffs via a custom service reading Windows USN Journal
- Compressed batch transfers using zstandard
Their sync window dropped from 9 hours to 23 minutes.
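The exact tooling behind "compressed batch transfers" isn't described, but the batching step can be approximated with Python's tarfile module and the zstandard package; the function name and paths below are placeholders:

```python
import os
import tarfile
import zstandard as zstd

def build_daily_batch(changed_files, source_root, archive_path):
    """Pack the day's changed files into a zstd-compressed tar for transfer."""
    cctx = zstd.ZstdCompressor(level=6)
    with open(archive_path, 'wb') as raw, cctx.stream_writer(raw) as compressed:
        # 'w|' streams the tar sequentially into the compressor
        with tarfile.open(fileobj=compressed, mode='w|') as tar:
            for rel_path in changed_files:
                tar.add(os.path.join(source_root, rel_path), arcname=rel_path)
```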
| Technique | Impact | Implementation |
|---|---|---|
| Breadth-first traversal | 50-70% faster scanning | Use queue instead of recursion |
| File system journals | Eliminates full scans | Read USN/inotify logs |
| Persistent metadata | Instant change detection | SQLite/BoltDB storage |
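The queue-based traversal in the first row can be sketched in Python with a deque and os.scandir, which returns cached stat data so most files never need a separate stat() call (a minimal sketch; the function name is illustrative):

```python
import os
from collections import deque

def scan_breadth_first(root):
    """Yield (path, mtime, size) for every file, one directory listing at a time."""
    queue = deque([root])
    while queue:
        current = queue.popleft()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        queue.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        st = entry.stat(follow_symlinks=False)
                        yield entry.path, st.st_mtime, st.st_size
        except PermissionError:
            continue  # skip unreadable directories rather than aborting the scan
```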
When dealing with file synchronization at enterprise scale (800,000+ files across 4,000 folders), traditional tools like rsync often fail to meet performance requirements. The core issue stems from:
- Excessive metadata processing during the "building file list" phase (8-12 hours in reported cases)
- Inefficient change detection mechanisms for shallow directory structures (2 levels deep)
- Memory limitations when tracking file states
Based on the described environment, we need synchronization that:
1. Processes 3K-5K daily file changes (updates + new files)
2. Completes within 1-hour timeframe
3. Handles 800K initial file inventory
4. Maintains accurate change detection
5. Supports scheduled daily execution
| Tool | Pros | Cons | Performance |
|---|---|---|---|
| Rsync | Widely available, reliable | Slow file listing, no database | 8-12 hours |
| RepliWeb | Faster incremental | False deletions at scale | 45 minutes |
| Unison | Bi-directional sync | Complex setup | 2-3 hours |
For this scale, I recommend a database-backed synchronizer with these components:
```python
# Sample Python pseudocode for custom solution
import sqlite3
from pathlib import Path

class FileSync:
    def __init__(self):
        self.db = sqlite3.connect('sync_state.db')
        self._init_db()

    def _init_db(self):
        self.db.execute('''CREATE TABLE IF NOT EXISTS file_states
                           (path TEXT PRIMARY KEY, mtime REAL, size INT)''')
        self.db.commit()

    def scan_files(self, root_path):
        """Record (path, mtime, size) for every file under root_path."""
        for f in Path(root_path).rglob('*'):
            if f.is_file():
                stat = f.stat()
                self.db.execute('''INSERT OR REPLACE INTO file_states
                                   VALUES (?, ?, ?)''',
                                (str(f), stat.st_mtime, stat.st_size))
        self.db.commit()
```
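Change detection then becomes a comparison between a fresh local scan and the stored rows, so the remote side never has to be enumerated. A sketch of a method that could be added to the class above (the name and the mtime/size heuristic are assumptions):

```python
    def changed_files(self, root_path):
        """Return paths that are new or whose mtime/size differ from stored state."""
        known = {path: (mtime, size)
                 for path, mtime, size in self.db.execute(
                     'SELECT path, mtime, size FROM file_states')}
        changed = []
        for f in Path(root_path).rglob('*'):
            if f.is_file():
                stat = f.stat()
                if known.get(str(f)) != (stat.st_mtime, stat.st_size):
                    changed.append(str(f))
        return changed
```

The returned list can then be fed to rsync via --files-from, leaving the transfer step itself unchanged.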
For immediate implementation consider:
- Lsyncd: Watches the source tree with inotify and invokes rsync to transfer the changed paths
  `lsyncd -rsync /source/ user@remote:/destination/`
- Bvckup 2: Commercial solution with delta copying
- Robocopy (Windows): Multi-threaded with robust retry logic
  `robocopy \\source\share \\dest\share /MIR /ZB /MT:16 /R:1 /W:1`
Key improvements for existing tools:
- Rsync with a pre-built file list (paths in the list must be relative to the source directory):
  `find /source -type f -mtime -1 -printf '%P\n' > filelist.txt`
  `rsync -av --files-from=filelist.txt /source/ user@remote:/dest/`
- Parallel execution (one rsync job per top-level directory):
  `parallel -j 8 rsync -a {} user@remote:/dest/ ::: /source/*`
Essential metrics to track:
```
# Sample Prometheus metrics
sync_duration_seconds{operation="full"} 2387
sync_files_processed{type="new"} 1842
sync_errors_total{type="permission"} 3
sync_transfer_bytes 18542940348
```
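If Prometheus is already in place, a scheduled sync job can push these values through the Python prometheus_client library and a Pushgateway at the end of each run; the gateway address, job name, and sample values below are placeholders:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge('sync_duration_seconds', 'Wall-clock time of the last sync run',
                 ['operation'], registry=registry)
processed = Gauge('sync_files_processed', 'Files handled in the last run',
                  ['type'], registry=registry)
errors = Gauge('sync_errors_total', 'Errors in the last run, by category',
               ['type'], registry=registry)
transferred = Gauge('sync_transfer_bytes', 'Bytes sent in the last run',
                    registry=registry)

# values shown are the sample numbers from above
duration.labels(operation='full').set(2387)
processed.labels(type='new').set(1842)
errors.labels(type='permission').set(3)
transferred.set(18542940348)

# the Pushgateway keeps the metrics available after the scheduled job exits
push_to_gateway('pushgateway.example.com:9091', job='file_sync', registry=registry)
```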