Optimizing Large-Scale File Synchronization: Solutions for 800K+ Files with High Efficiency


When dealing with 800,000+ files across 4,000 folders in a shallow (max 2-level) directory structure, traditional sync tools like rsync hit fundamental scalability limitations. The 12-hour file listing phase becomes unacceptable when daily changes represent less than 0.5% of total files (a few thousand updates and 1-2K new files).

The core issue stems from how traditional tools traverse filesystems:


// Classic recursive traversal (problematic at scale)
void traverse(File dir) {
    File[] children = dir.listFiles();
    if (children == null) return; // I/O error or not a directory
    for (File f : children) {
        if (f.isDirectory()) {
            traverse(f);
        } else {
            // stat() each file individually -- one round trip per file
        }
    }
}

This O(n) walk issues one stat() call per file; at n = 800,000 the per-call latency dominates, especially over network mounts where every stat is a round trip.
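A queue-based alternative using os.scandir, which returns cached stat data for each directory entry in one batched read, avoids most of those per-file round trips. A minimal Python sketch (scan_tree is a hypothetical helper, not part of any tool mentioned here):

```python
import os
from collections import deque

def scan_tree(root):
    """Breadth-first scan: pop a directory, read all of its entries
    in one scandir() batch (with cached stat data), queue subdirs."""
    queue = deque([root])
    files = []
    while queue:
        with os.scandir(queue.popleft()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    queue.append(entry.path)
                else:
                    st = entry.stat(follow_symlinks=False)
                    files.append((entry.path, st.st_mtime, st.st_size))
    return files
```

The explicit queue replaces recursion, so depth never matters and the working set stays small even across 4,000 folders.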

For enterprise-scale sync, consider these architectural patterns:

1. Change Journal-Based Synchronization

Leverage filesystem change journals (NTFS USN Journal, inotify, etc.) to track modifications:


# Linux inotifywait example for monitoring changes
# (inotify needs one watch per directory; raise fs.inotify.max_user_watches
# before watching large trees)
inotifywait -m -r --format '%w%f' --event modify,create,delete /data |
while read -r file; do
    echo "$(date): $file changed" >> /var/log/sync_changes.log
done
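The raw event stream is noisy: a single save can emit several events for the same path. A small batching layer can de-duplicate events before handing them to the sync job. A sketch with hypothetical names (ChangeBatcher is not part of inotify-tools):

```python
import time

class ChangeBatcher:
    """Collects change events (e.g. lines tailed from the inotify log)
    and flushes them as de-duplicated, sorted batches for the sync job."""
    def __init__(self, flush_interval=5.0):
        self.flush_interval = flush_interval
        self.pending = set()              # set gives de-duplication for free
        self.last_flush = time.monotonic()

    def add(self, path):
        self.pending.add(path)

    def flush_due(self):
        return time.monotonic() - self.last_flush >= self.flush_interval

    def flush(self):
        batch = sorted(self.pending)      # stable order for the transfer tool
        self.pending.clear()
        self.last_flush = time.monotonic()
        return batch
```

With a few thousand daily changes, batches stay small enough to hand straight to rsync's --files-from.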

2. Database-Backed Indexing

Maintain a persistent file metadata database:


-- SQL schema example for tracking sync state
CREATE TABLE file_manifest (
    path VARCHAR(1024) PRIMARY KEY,
    last_modified BIGINT,
    checksum VARCHAR(64),
    sync_status ENUM('pending','synced','failed')
);

-- With an index on last_modified, finding changed files is a fast
-- range scan instead of a full filesystem walk
CREATE INDEX idx_last_modified ON file_manifest (last_modified);

SELECT path FROM file_manifest
WHERE last_modified > :last_sync_time;
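A quick way to validate the approach is an in-memory SQLite session. Note two adaptations for this sketch: SQLite has no ENUM type, so sync_status becomes plain TEXT, and an index on last_modified is added so the changed-file lookup stays cheap at 800K rows (the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE file_manifest (
    path TEXT PRIMARY KEY,
    last_modified INTEGER,
    checksum TEXT,
    sync_status TEXT)''')
conn.executemany('INSERT INTO file_manifest VALUES (?, ?, ?, ?)', [
    ('/data/a.txt', 100, 'aa11', 'synced'),
    ('/data/b.txt', 205, 'bb22', 'pending'),
    ('/data/c.txt', 300, 'cc33', 'pending'),
])
# The index is what makes this a range scan instead of a table scan
conn.execute('CREATE INDEX idx_mtime ON file_manifest (last_modified)')

last_sync_time = 200
changed = [row[0] for row in conn.execute(
    'SELECT path FROM file_manifest WHERE last_modified > ? '
    'ORDER BY last_modified', (last_sync_time,))]
```

Only files touched since the last sync come back, which is exactly the 0.5% daily delta described above.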

Consider these specialized tools:

1. PeerSync Enterprise

Uses a combination of:

  • Persistent file catalogs with binary tree indexing
  • Multi-threaded differential transfers
  • Configurable comparison methods (timestamps, checksums, or both)

2. Syncthing with Modifications

The open-source solution can be enhanced for large deployments:


// Custom scanner implementation for large repos
type fastScanner struct {
    db         *bolt.DB // embedded key-value store for file metadata
    ignoreDirs map[string]struct{}
}

func (s *fastScanner) Walk(root string) error {
    queue := []string{root} // breadth-first: explicit queue, no recursion
    for len(queue) > 0 {
        dir := queue[0]
        queue = queue[1:]
        entries, _ := os.ReadDir(dir) // batch read; handle the error in real code
        for _, e := range entries {
            if e.IsDir() {
                queue = append(queue, filepath.Join(dir, e.Name()))
            }
            // persist e.Info() metadata to s.db here
        }
    }
    return nil
}

A financial institution solved similar challenges with:

  1. Initial full sync using parallelized robocopy
  2. Daily diffs via a custom service reading Windows USN Journal
  3. Compressed batch transfers using zstandard

Their sync window dropped from 9 hours to 23 minutes.

Summary of techniques, their impact, and how to implement them:

  • Breadth-first traversal: 50-70% faster scanning; use a queue instead of recursion
  • Filesystem journals: eliminates full scans; read USN/inotify logs
  • Persistent metadata: near-instant change detection; SQLite/BoltDB storage

When dealing with file synchronization at enterprise scale (800,000+ files across 4,000 folders), traditional tools like rsync often fail to meet performance requirements. The core issue stems from:

  • Excessive metadata processing during the "building file list" phase (8-12 hours in reported cases)
  • Inefficient change detection mechanisms for shallow directory structures (2 levels deep)
  • Memory limitations when tracking file states

Based on the described environment, we need synchronization that:

1. Processes 3K-5K daily file changes (updates + new files)
2. Completes within 1-hour timeframe
3. Handles 800K initial file inventory
4. Maintains accurate change detection
5. Supports scheduled daily execution

Tool comparison (pros, cons, reported sync time):

  • Rsync: widely available and reliable, but slow file listing and no state database (8-12 hours)
  • RepliWeb: faster incremental sync, but false deletions at scale (45 minutes)
  • Unison: bi-directional sync, but complex setup (2-3 hours)

For this scale, I recommend a database-backed synchronizer with these components:

# Sample Python pseudocode for custom solution
import sqlite3
from pathlib import Path

class FileSync:
    def __init__(self):
        self.db = sqlite3.connect('sync_state.db')
        self._init_db()
    
    def _init_db(self):
        self.db.execute('''CREATE TABLE IF NOT EXISTS file_states
                         (path TEXT PRIMARY KEY, mtime REAL, size INT)''')

    def scan_files(self, root_path):
        for f in Path(root_path).rglob('*'):
            if f.is_file():
                stat = f.stat()
                self.db.execute('''INSERT OR REPLACE INTO file_states
                                 VALUES (?, ?, ?)''',
                                 (str(f), stat.st_mtime, stat.st_size))
        self.db.commit()  # persist the whole scan as one transaction
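Building on the same file_states table, a change-detection pass can compare current mtimes against the stored state. A sketch (detect_changes is a hypothetical helper, not part of the class above):

```python
import sqlite3
from pathlib import Path

def detect_changes(db, root_path):
    """Diff the current filesystem against the stored manifest.
    Assumes the file_states(path, mtime, size) table shown above.
    Returns (new_or_modified, deleted) as two lists of paths."""
    stored = dict(db.execute('SELECT path, mtime FROM file_states'))
    changed, seen = [], set()
    for f in Path(root_path).rglob('*'):
        if f.is_file():
            key, mtime = str(f), f.stat().st_mtime
            seen.add(key)
            if stored.get(key) != mtime:   # new file or changed mtime
                changed.append(key)
    deleted = [p for p in stored if p not in seen]
    return changed, deleted
```

Only the paths returned here need to be transferred, which is what collapses a 12-hour listing phase into minutes of work on a few thousand files.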

For immediate implementation, consider:

  • Lsyncd: Uses inotify for real-time changes with rsync fallback
    lsyncd -rsync /source/ user@remote:/destination/
  • Bvckup 2: Commercial solution with delta copying
  • Robocopy (Windows): Multi-threaded with robust retry logic
    robocopy \\source\share \\dest\share /MIR /ZB /MT:16 /R:1 /W:1

Key improvements for existing tools:

  1. Rsync with a pre-built file list (--files-from expects paths relative to the source argument, hence GNU find's -printf '%P\n'):
    find /source -type f -mtime -1 -printf '%P\n' > filelist.txt
    rsync -av --files-from=filelist.txt /source/ user@remote:/dest/
  2. Parallel execution:
    parallel -j 8 rsync -a {} user@remote:/dest/ ::: /source/*
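The same fan-out can be scripted without GNU parallel by splitting the top-level folders across workers. A sketch under assumed names (partition and rsync_commands are hypothetical helpers; the folder paths are illustrative):

```python
import shlex

def partition(folders, n_workers):
    """Round-robin split of top-level folders so each parallel
    rsync worker gets a roughly equal share."""
    buckets = [[] for _ in range(n_workers)]
    for i, folder in enumerate(sorted(folders)):
        buckets[i % n_workers].append(folder)
    return buckets

def rsync_commands(folders, dest, n_workers=8):
    """Yield one rsync invocation per non-empty bucket, with
    shell-safe quoting of every path."""
    for bucket in partition(folders, n_workers):
        if bucket:
            srcs = ' '.join(shlex.quote(f) for f in bucket)
            yield f'rsync -a {srcs} {shlex.quote(dest)}'
```

Round-robin keeps worker load roughly even when folders are similar in size; for skewed folder sizes, sorting by file count first would balance better.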

Essential metrics to track:

# Sample Prometheus metrics
sync_duration_seconds{operation="full"} 2387
sync_files_processed{type="new"} 1842
sync_errors_total{type="permission"} 3
sync_transfer_bytes 18542940348
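Emitting counters like these in the Prometheus text exposition format takes only a few lines. A sketch (render_metrics is a hypothetical helper; metric and label names mirror the sample above):

```python
def render_metrics(metrics):
    """Render (name, labels, value) triples in the Prometheus
    text exposition format, one metric per line."""
    lines = []
    for name, labels, value in metrics:
        label_str = ''
        if labels:
            inner = ','.join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            label_str = '{' + inner + '}'
        lines.append(f'{name}{label_str} {value}')
    return '\n'.join(lines)
```

Writing this output to a file scraped by the Prometheus node_exporter textfile collector is a common way to expose batch-job metrics without running an HTTP endpoint.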