Optimal Methods for Directory Structure Comparison: Rsync Differential Analysis with Python Implementation



When working with backup systems using rsync, verifying directory consistency between source and destination is crucial. The requirement extends beyond simple file existence checks to comparing metadata like:

  • File modification timestamps
  • File sizes in bytes
  • Permission attributes
  • Checksums for critical verification

Rsync itself provides several built-in comparison modes:

# Dry run with itemized changes (-n: make no changes, -i: itemize each difference)
rsync -avni --delete source/ destination/

# Checksum comparison (slow but thorough)
rsync -avc --dry-run source/ destination/

However, these outputs require parsing for programmatic use.
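
As a rough illustration of that parsing step, the sketch below splits itemized output into (flags, path) pairs. The helper name split_itemized and the sample text are hypothetical, and the code assumes the 11-character flag field printed by rsync 3.x; other versions or custom --out-format settings would need adjusting.

```python
# Hypothetical helper: extract (flags, path) pairs from the text printed by
# `rsync -n --itemize-changes`. Header and summary lines are skipped.
_UPDATE_TYPES = "<>ch.*"   # first column: kind of update
_FILE_TYPES = "fdLDS"      # second column: kind of item

def split_itemized(output):
    """Return a list of (flags, path) tuples found in rsync itemized output."""
    entries = []
    for line in output.splitlines():
        # Deletions are reported with the literal marker "*deleting".
        if line.startswith("*deleting"):
            entries.append(("*deleting", line[len("*deleting"):].strip()))
            continue
        # Assumption: an 11-character flag field followed by a single space.
        if (len(line) > 12 and line[0] in _UPDATE_TYPES
                and line[1] in _FILE_TYPES and line[11] == " "):
            entries.append((line[:11], line[12:]))
    return entries

# Made-up sample in the shape rsync prints; not captured from a real run.
sample = """sending incremental file list
>f..t...... file.txt
cd+++++++++ new_dir/
*deleting   old.txt

sent 120 bytes  received 19 bytes  278.00 bytes/sec
"""

for flags, path in split_itemized(sample):
    print(flags, path)
```

In practice the text would come from running rsync through subprocess and capturing its standard output; that part is left out here so the parsing logic can be read on its own.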

For more control, here's a Python script using os.walk() and os.stat(). It walks the source tree only, so files that exist only in the destination are not reported:

import os
from datetime import datetime

def compare_dirs(src, dst):
    diff_report = []
    for root, _, files in os.walk(src):
        rel_path = os.path.relpath(root, src)
        dst_path = os.path.join(dst, rel_path)
        
        for file in files:
            src_file = os.path.join(root, file)
            dst_file = os.path.join(dst_path, file)
            
            if not os.path.exists(dst_file):
                diff_report.append(f"{src_file} - MISSING IN DESTINATION")
                continue
                
            src_stat = os.stat(src_file)
            dst_stat = os.stat(dst_file)
            
            if src_stat.st_mtime != dst_stat.st_mtime or src_stat.st_size != dst_stat.st_size:
                diff_report.append(
                    f"{src_file} (mtime: {datetime.fromtimestamp(src_stat.st_mtime)}, size: {src_stat.st_size}) | "
                    f"{dst_file} (mtime: {datetime.fromtimestamp(dst_stat.st_mtime)}, size: {dst_stat.st_size}) | "
                    "MODIFIED"
                )
                
    return diff_report

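To confirm that file contents match, not just metadata, compare checksums. MD5 and SHA-1 are adequate for detecting accidental corruption; prefer SHA-256 if deliberate tampering is a concern: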

import hashlib

def get_file_hash(filepath):
    hasher = hashlib.md5()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Add this inside the per-file loop of compare_dirs(), after the metadata check:
src_hash = get_file_hash(src_file)
dst_hash = get_file_hash(dst_file)
if src_hash != dst_hash:
    diff_report.append(f"CONTENT DIFF: {src_file} != {dst_file}")
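
Hashing every file pair is expensive, so it can help to rule out the cheap case first. The following is a minimal sketch, not part of the original script: the names file_digest and content_differs are hypothetical, and the 1 MiB chunk size is an arbitrary choice.

```python
import hashlib
import os

def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files are never loaded into memory whole."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def content_differs(src_file, dst_file):
    """Return True if the two files differ in content."""
    # Files of different sizes cannot be identical, so hashing is skipped.
    if os.path.getsize(src_file) != os.path.getsize(dst_file):
        return True
    # Same size: fall back to comparing digests of the full contents.
    return file_digest(src_file) != file_digest(dst_file)
```

Inside the loop, `if content_differs(src_file, dst_file):` could then replace the two get_file_hash() calls.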

When dealing with large directories:

  • Use multiprocessing for parallel hash calculations
  • Cache previous comparison results
  • Implement directory snapshotting for incremental checks
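
The snapshotting idea can be sketched roughly as follows: record size and modification time for every file, store the result as JSON, and diff two snapshots later. All function names here are hypothetical, and the sketch assumes the tree contains no broken symlinks (os.stat would raise on them).

```python
import json
import os

def take_snapshot(root):
    """Map each file's path relative to root to [size, mtime in nanoseconds]."""
    snapshot = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)  # assumption: no broken symlinks in the tree
            # Lists rather than tuples, so values survive a JSON round trip.
            snapshot[os.path.relpath(full, root)] = [st.st_size, st.st_mtime_ns]
    return snapshot

def save_snapshot(snapshot, path):
    """Write a snapshot to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f)

def load_snapshot(path):
    """Read a snapshot previously written by save_snapshot()."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def diff_snapshots(old, new):
    """Return (added, removed, changed) lists of relative paths, each sorted."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed
```

The same diff_snapshots() call also works across trees, for example diff_snapshots(take_snapshot(src), take_snapshot(dst)), provided both sides preserve modification times.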
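
Outside of rsync, directory listings can also be compared with tree and diff, or with dedicated tools:

# Tree comparison (run inside each directory so the root path does not show up as a difference)
(cd /source && tree -Dugps) > source_tree.txt
(cd /backup && tree -Dugps) > backup_tree.txt
diff source_tree.txt backup_tree.txt

# Using specialized tools
sudo apt install meld  # GUI diff tool
meld source/ backup/
dirdiff -r source/ backup/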

Rsync is useful for backup validation as well as synchronization: in dry-run mode it reports what differs between two directory trees without changing either of them.

rsync -avn --itemize-changes /source/path/ /destination/path/

This command provides detailed output showing exactly what would change during synchronization. The -n flag makes it a dry run, while --itemize-changes breaks down each comparison result.

Understanding rsync's output codes is crucial for accurate directory comparison:

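>f..t...... file.txt
>f.s....... file2.txt
cd+++++++++ new_dir/

  • > means the item would be transferred to the receiving side; c means a local change or creation, such as a new directory
  • f indicates a regular file and d a directory
  • . means the attribute is unchanged
  • t shows timestamp differences
  • s indicates size differences
  • + marks files/directories that would be created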
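
To make the code layout concrete, here is a small sketch that turns one flag string into readable fields. It follows the YXcstpoguax layout described in the rsync man page for 3.x; the function and table names are hypothetical, and the meaning of the ninth attribute slot varies between rsync versions, so it is labeled loosely.

```python
# Column 1: what kind of update rsync would perform.
UPDATE_TYPES = {
    "<": "sent to remote host",
    ">": "received to local host",
    "c": "local change or creation",
    "h": "hard link",
    ".": "not updated (attributes may change)",
    "*": "message",
}

# Column 2: what kind of item the line refers to.
FILE_TYPES = {
    "f": "file",
    "d": "directory",
    "L": "symlink",
    "D": "device",
    "S": "special file",
}

# Columns 3-11, in order. The "reserved/atime" slot depends on the rsync version.
ATTRIBUTES = ["checksum", "size", "time", "permissions", "owner", "group",
              "reserved/atime", "ACL", "xattr"]

def describe_flags(flags):
    """Decode an 11-character itemize string such as '>f..t......'."""
    attrs = flags[2:]
    if attrs and set(attrs) == {"+"}:
        # All plus signs: the item does not exist yet on the receiving side.
        changed = ["new item"]
    else:
        # Any character other than '.' or ' ' marks a differing attribute.
        changed = [name for ch, name in zip(attrs, ATTRIBUTES) if ch not in ". "]
    return {
        "update": UPDATE_TYPES.get(flags[0], "unknown"),
        "type": FILE_TYPES.get(flags[1], "unknown"),
        "changed": changed,
    }

print(describe_flags(">f..t......"))
print(describe_flags("cd+++++++++"))
```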

For more customized comparison logic, here's a Python solution built on filecmp.dircmp, which by default treats two files as equal when their type, size, and modification time match:

import os
import filecmp
from datetime import datetime

def compare_dirs(dir1, dir2):
    comparison = filecmp.dircmp(dir1, dir2)
    
    print("Files only in", dir1)
    for item in comparison.left_only:
        print(os.path.join(dir1, item))
    
    print("\nFiles only in", dir2)
    for item in comparison.right_only:
        print(os.path.join(dir2, item))
    
    print("\nCommon files with differences:")
    for item in comparison.diff_files:
        path1 = os.path.join(dir1, item)
        path2 = os.path.join(dir2, item)
        size1 = os.path.getsize(path1)
        size2 = os.path.getsize(path2)
        mtime1 = datetime.fromtimestamp(os.path.getmtime(path1))
        mtime2 = datetime.fromtimestamp(os.path.getmtime(path2))
        
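        print(f"{path1} ({mtime1}) ({size1} bytes) | {path2} ({mtime2}) ({size2} bytes)")

    # dircmp only reports on the top level of dir1 and dir2, so recurse into
    # the subdirectories that exist on both sides.
    for sub in comparison.common_dirs:
        compare_dirs(os.path.join(dir1, sub), os.path.join(dir2, sub))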

# Example usage:
compare_dirs('/local/path', '/remote/path')

To catch content differences even when metadata appears identical, consider adding SHA-256 checksum verification:

import hashlib

def get_file_hash(filepath):
    hasher = hashlib.sha256()
    with open(filepath, 'rb') as f:
        while chunk := f.read(4096):  # the := operator requires Python 3.8+
            hasher.update(chunk)
    return hasher.hexdigest()

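# Add this to the comparison function. Loop over comparison.same_files as well
# as diff_files, since same_files holds the files whose metadata matched:
hash1 = get_file_hash(path1)
hash2 = get_file_hash(path2)
if hash1 != hash2:
    print(f"Checksum mismatch: {path1} ({hash1}) vs {path2} ({hash2})")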
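
If hand-rolled hashing is more than is needed, the standard library can do a byte-by-byte comparison directly. The sketch below is an alternative to the checksum approach, not the method used above: it relies on filecmp.cmpfiles with shallow=False, and the wrapper name byte_compare_common is hypothetical. It looks at the top level of the two directories only.

```python
import filecmp

def byte_compare_common(dir1, dir2):
    """Compare files present in both directories by content (top level only).

    Returns (match, mismatch, errors) as sorted lists of file names.
    """
    # common_files lists the names that are regular files on both sides.
    common = filecmp.dircmp(dir1, dir2).common_files
    # shallow=False forces a content comparison instead of trusting os.stat().
    match, mismatch, errors = filecmp.cmpfiles(dir1, dir2, common, shallow=False)
    return sorted(match), sorted(mismatch), sorted(errors)
```

Names in the errors list are files that could not be compared, for instance because of missing read permission.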

For those preferring GUI solutions:

  • Meld: Excellent graphical diff tool with directory comparison
  • Beyond Compare: Powerful commercial option with detailed reporting
  • KDiff3: Open-source alternative with merge capabilities

Command-line enthusiasts might prefer vimdiff for side-by-side file comparison, though it's less suited for entire directory structures.

For ongoing monitoring, consider setting up a cron job that logs differences:

0 2 * * * rsync -avn --itemize-changes /source/ /backup/ > /var/log/backup_diff_$(date +\%Y\%m\%d).log

This runs daily at 2 AM and saves output to dated log files for historical tracking.