Optimal Methods for Directory Structure Comparison: Rsync Differential Analysis with Python Implementation



When backing up with rsync, you need to verify that the destination actually matches the source. Checking that each file exists is not enough; a useful comparison also covers metadata such as:

  • File modification timestamps
  • File sizes in bytes
  • Permission attributes
  • Checksums for critical verification

Rsync itself provides several built-in comparison modes:

# Dry-run with itemized changes (-n = dry run; --delete also lists files that exist only in the destination)
rsync -avn --itemize-changes --delete source/ destination/

# Checksum comparison (slow but thorough)
rsync -avc --dry-run source/ destination/

However, these outputs require parsing for programmatic use.
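As a rough illustration of that parsing step, the sketch below splits itemized dry-run output into (code, path) pairs. It assumes rsync was run with --itemize-changes and emits the 11-character change code used by rsync 3.x; the names ITEM_RE, DELETE_RE, and parse_itemized are hypothetical and not part of rsync.

```python
import re

# Assumed line shape (rsync 3.x): an 11-character change code, a space, the path.
ITEM_RE = re.compile(r"^([<>ch.][fdLDS][.+?a-zA-Z]{9}) (.+)$")
# Deletions are reported as "*deleting" followed by padding and the path.
DELETE_RE = re.compile(r"^\*deleting\s+(.+)$")

def parse_itemized(output):
    """Return (code, path) pairs, skipping banner and summary lines."""
    records = []
    for line in output.splitlines():
        match = DELETE_RE.match(line)
        if match:
            records.append(("*deleting", match.group(1)))
            continue
        match = ITEM_RE.match(line)
        if match:
            records.append((match.group(1), match.group(2)))
    return records
```

The output string could come from subprocess.run([...], capture_output=True, text=True).stdout; lines that do not look like change records, such as the "sending incremental file list" banner, are ignored.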

For more control, here's a Python script using os.walk() and os.stat():

import os
from datetime import datetime

def compare_dirs(src, dst):
    """Report files under src that are missing from dst or differ in mtime/size."""
    diff_report = []
    # Only src is walked, so files that exist solely in dst are not reported.
    for root, _, files in os.walk(src):
        rel_path = os.path.relpath(root, src)
        dst_path = os.path.normpath(os.path.join(dst, rel_path))

        for file in files:
            src_file = os.path.join(root, file)
            dst_file = os.path.join(dst_path, file)

            if not os.path.exists(dst_file):
                diff_report.append(f"{src_file} - MISSING IN DESTINATION")
                continue

            src_stat = os.stat(src_file)
            dst_stat = os.stat(dst_file)

            # Exact mtime comparison: a destination filesystem with coarser
            # timestamp resolution can produce spurious MODIFIED entries.
            if src_stat.st_mtime != dst_stat.st_mtime or src_stat.st_size != dst_stat.st_size:
                diff_report.append(
                    f"{src_file} (mtime: {datetime.fromtimestamp(src_stat.st_mtime)}, size: {src_stat.st_size}) | "
                    f"{dst_file} (mtime: {datetime.fromtimestamp(dst_stat.st_mtime)}, size: {dst_stat.st_size}) | "
                    "MODIFIED"
                )

    return diff_report
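The script above checks timestamps and sizes, but not the permission attributes listed earlier, and it cannot see files that exist only in the destination. One possible way to cover both gaps is sketched below; extra_checks is a hypothetical helper, and the report strings simply imitate the format used above.

```python
import os
import stat

def extra_checks(src, dst):
    """Report permission mismatches and files present only in dst."""
    report = []
    # Compare permission bits for files that exist on both sides.
    for root, _, files in os.walk(src):
        rel = os.path.relpath(root, src)
        for name in files:
            src_file = os.path.join(root, name)
            dst_file = os.path.normpath(os.path.join(dst, rel, name))
            if not os.path.exists(dst_file):
                continue  # missing files are a separate kind of difference
            src_mode = stat.S_IMODE(os.stat(src_file).st_mode)
            dst_mode = stat.S_IMODE(os.stat(dst_file).st_mode)
            if src_mode != dst_mode:
                report.append(
                    f"{src_file} ({oct(src_mode)}) | {dst_file} ({oct(dst_mode)}) | PERMISSIONS"
                )
    # A second walk rooted at dst finds files the first walk never visits.
    for root, _, files in os.walk(dst):
        rel = os.path.relpath(root, dst)
        for name in files:
            if not os.path.exists(os.path.join(src, rel, name)):
                report.append(f"{os.path.join(root, name)} - MISSING IN SOURCE")
    return report
```

Walking both trees doubles the directory traversal, which is usually cheap compared with reading file contents.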

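Matching size and mtime does not prove that two files have the same content. To verify content, compare checksums. MD5 is adequate for detecting accidental corruption; prefer SHA-256 if deliberate tampering is a concern: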

import hashlib

def get_file_hash(filepath):
    hasher = hashlib.md5()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Add this to the comparison logic:
src_hash = get_file_hash(src_file)
dst_hash = get_file_hash(dst_file)
if src_hash != dst_hash:
    diff_report.append(f"CONTENT DIFF: {src_file} != {dst_file}")
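Hashing every file is expensive, so it can help to compare sizes first and hash only when they match. The following is a minimal sketch of that ordering; file_digest and contents_differ are hypothetical names, and SHA-256 and the 1 MiB chunk size are assumptions rather than requirements.

```python
import hashlib
import os

def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never sit fully in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def contents_differ(src_file, dst_file):
    """Return True when the two files cannot have identical contents."""
    # Different sizes imply different contents, so skip the expensive read.
    if os.path.getsize(src_file) != os.path.getsize(dst_file):
        return True
    return file_digest(src_file) != file_digest(dst_file)
```

Inside the per-file loop, the hash check would then become a single call such as: if contents_differ(src_file, dst_file): ...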

When dealing with large directories:

  • Use multiprocessing for parallel hash calculations
  • Cache previous comparison results
  • Implement directory snapshotting for incremental checks
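The first suggestion can be sketched as follows. Note the substitution: this version uses a thread pool from concurrent.futures rather than the multiprocessing module, on the assumption that hashlib releases the GIL while hashing large buffers, so threads can overlap both I/O and hashing. The names sha256_of and hash_tree and the worker count of 4 are illustrative only.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def hash_tree(root, workers=4):
    """Map each file's path relative to root to its SHA-256 digest."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, files in os.walk(root)
        for name in files
    ]
    # pool.map preserves input order, so digests line up with paths.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(sha256_of, paths))
    return {os.path.relpath(p, root): h for p, h in zip(paths, digests)}
```

Comparing hash_tree(src) with hash_tree(dst) key by key then yields content differences; on spinning disks, too many workers may slow things down because of seek contention.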

Outside Python, standard command-line tools can also compare directory trees:

# Tree comparison
tree -Dugps /source > source_tree.txt
tree -Dugps /backup > backup_tree.txt
diff source_tree.txt backup_tree.txt

# Using specialized tools
diff -rq source/ backup/   # recursive; lists only which files differ
sudo apt install meld      # GUI diff tool
meld source/ backup/
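Saving a listing of each tree and diffing the listings is a form of snapshotting, and the same idea supports incremental checks. A minimal Python sketch is shown below; it assumes that size, whole-second mtime, and permission bits are enough to detect a change, and every function name here is hypothetical.

```python
import json
import os
import stat

def snapshot(root):
    """Record [size, whole-second mtime, permission bits] per relative path."""
    snap = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            # Lists (not tuples) so a JSON round trip compares equal.
            snap[os.path.relpath(full, root)] = [
                st.st_size, int(st.st_mtime), stat.S_IMODE(st.st_mode)
            ]
    return snap

def diff_snapshots(old, new):
    """Return (added, removed, changed) lists of relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed

def save_snapshot(snap, path):
    """Persist a snapshot so a later run can compare against it."""
    with open(path, "w") as fh:
        json.dump(snap, fh)

def load_snapshot(path):
    """Load a snapshot written by save_snapshot."""
    with open(path) as fh:
        return json.load(fh)
```

Calling diff_snapshots(snapshot(src), snapshot(dst)) compares two trees directly, while comparing a saved snapshot with a fresh one of the same tree shows what changed since the last check.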

For backup validation, rsync can also report differences without changing anything. A dry run with itemized output lists every file that would be transferred or updated:

rsync -avn --itemize-changes /source/path/ /destination/path/

This command provides detailed output showing exactly what would change during synchronization. The -n flag makes it a dry run, while --itemize-changes breaks down each comparison result.

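Reading this output requires knowing rsync's itemized change codes. Each line begins with an 11-character code, followed by the path:

>f..t...... file.txt
>f.s....... file2.txt
cd+++++++++ new_dir/

  • > in the first position means the file's content would be transferred (received on the local host; < is used when sending to a remote host), while c means a local creation or change, such as making a directory
  • f in the second position marks a regular file, and d marks a directory
  • . means the attribute is unchanged
  • t shows a timestamp difference
  • s indicates a size difference
  • + in every attribute position marks a file or directory that would be created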
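To make these codes usable in a script, the change code can be decoded position by position. The sketch below assumes the 11-character layout documented in the rsync man page (update type, file type, then nine attribute slots); the lookup tables and decode_item are hypothetical names, and the attribute labels are paraphrased.

```python
# First character: how the item is being updated.
UPDATE_TYPES = {
    "<": "sent",
    ">": "received",
    "c": "created or changed locally",
    "h": "hard link",
    ".": "not transferred",
}
# Second character: what kind of item it is.
FILE_TYPES = {"f": "file", "d": "directory", "L": "symlink", "D": "device", "S": "special file"}
# Remaining nine characters, in order.
ATTRIBUTES = ["checksum", "size", "mtime", "permissions", "owner", "group", "reserved", "acl", "xattr"]

def decode_item(code):
    """Return (update type, file type, changed attribute names) for one code."""
    if code.startswith("*deleting"):
        return "deleted", "unknown", []
    update = UPDATE_TYPES.get(code[0], "unknown")
    kind = FILE_TYPES.get(code[1], "unknown")
    flags = code[2:11]
    if set(flags) == {"+"}:
        return update, kind, ["new"]  # item does not exist on the receiving side
    # A dot or space marks an unchanged attribute; anything else is a change.
    changed = [name for name, ch in zip(ATTRIBUTES, flags) if ch not in ". "]
    return update, kind, changed
```

For example, decode_item(">f..t......") reports a received file whose only changed attribute is mtime.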

For those needing more customized comparison logic, here's a Python solution using filecmp.dircmp:

import os
import filecmp
from datetime import datetime

def compare_dirs(dir1, dir2):
    # dircmp inspects one directory level at a time. Files whose type, size,
    # and mtime all match are treated as identical without reading contents.
    comparison = filecmp.dircmp(dir1, dir2)

    print("Files only in", dir1)
    for item in comparison.left_only:
        print(os.path.join(dir1, item))

    print("\nFiles only in", dir2)
    for item in comparison.right_only:
        print(os.path.join(dir2, item))

    print("\nCommon files with differences:")
    for item in comparison.diff_files:
        path1 = os.path.join(dir1, item)
        path2 = os.path.join(dir2, item)
        size1 = os.path.getsize(path1)
        size2 = os.path.getsize(path2)
        mtime1 = datetime.fromtimestamp(os.path.getmtime(path1))
        mtime2 = datetime.fromtimestamp(os.path.getmtime(path2))

        print(f"{path1} ({mtime1}) ({size1} bytes) | {path2} ({mtime2}) ({size2} bytes)")

    # Recurse so that nested directories are compared as well.
    for name in comparison.common_dirs:
        compare_dirs(os.path.join(dir1, name), os.path.join(dir2, name))

# Example usage (both paths must be reachable through the local filesystem):
compare_dirs('/local/path', '/remote/path')
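By default, filecmp.dircmp does not read a file's contents when its type, size, and mtime match on both sides, so it can miss content changes that leave metadata intact. One way to force a byte-by-byte check is filecmp.cmpfiles with shallow=False, sketched here; deep_compare is a hypothetical helper name.

```python
import filecmp
import os

def deep_compare(dir1, dir2):
    """Byte-compare files present in both trees; return mismatching relative paths."""
    mismatches = []

    def walk(cmp, rel):
        # cmpfiles returns (match, mismatch, errors); unreadable files are
        # counted as mismatches here so they are not silently ignored.
        _, differ, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False
        )
        mismatches.extend(os.path.join(rel, name) for name in differ + errors)
        # subdirs maps each common subdirectory name to its own dircmp object.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(rel, name))

    walk(filecmp.dircmp(dir1, dir2), "")
    return sorted(mismatches)
```

This reads every common file in full, so it trades speed for thoroughness in the same way that rsync's checksum mode does.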

Identical metadata does not guarantee identical content. To confirm that two files really match, add SHA-256 checksum verification (the := operator used below requires Python 3.8 or later):

import hashlib

def get_file_hash(filepath):
    hasher = hashlib.sha256()
    with open(filepath, 'rb') as f:
        while chunk := f.read(4096):
            hasher.update(chunk)
    return hasher.hexdigest()

# Add this inside the per-file loop of compare_dirs, where path1 and path2 are defined:
hash1 = get_file_hash(path1)
hash2 = get_file_hash(path2)
print(f"Checksums: {hash1} vs {hash2}")

For those preferring GUI solutions:

  • Meld: Excellent graphical diff tool with directory comparison
  • Beyond Compare: Powerful commercial option with detailed reporting
  • KDiff3: Open-source alternative with merge capabilities

Command-line enthusiasts might prefer vimdiff for side-by-side file comparison, though it's less suited for entire directory structures.

For ongoing monitoring, consider setting up a cron job that logs differences:

0 2 * * * rsync -avn --itemize-changes /source/ /backup/ > /var/log/backup_diff_$(date +\%Y\%m\%d).log

This runs daily at 2 AM and saves output to dated log files for historical tracking.