How to Compare Contents of a Tar Archive with Local Directory in Linux



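When working with system backups, you often need to verify archived content against the current state of the filesystem. The command tar -dvf archive.tar * is a tempting first attempt, but it is unreliable: the shell expands * to names that exist locally (skipping dotfiles) before tar runs, so top-level entries that are missing locally are never checked, and any local name that is absent from the archive produces a "Not found in archive" error.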

Here's a more robust approach that handles directories properly:


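# Step 1: List non-directory entries in the tar archive (drop any leading ./)
tar -tf archive.tar | grep -v '/$' | sed 's|^\./||' | sort > tar_contents.txt

# Step 2: List non-directory entries in the local directory, relative to $HOME
find "$HOME" ! -type d | sed "s|^$HOME/||" | sort > local_contents.txt

# Step 3: Print entries that are in the archive but missing locally
diff -u tar_contents.txt local_contents.txt | grep "^-[^-]" | sed 's/^-//'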

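For a more detailed comparison that also checks file contents, extract the archive to a scratch directory and let rsync compare it against the live tree in dry-run mode:


# -r recurse, -c compare by checksum, -n dry run (nothing is copied)
rsync -rcn --itemize-changes /path/to/extracted/tar/ "$HOME/"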
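
Where rsync is not available, the same content check can be approximated in Python. The sketch below is only an illustration under stated assumptions: the archive has already been extracted to a scratch directory, only regular files matter, and the function name compare_trees is made up for this example.

```python
import filecmp
import os


def compare_trees(extracted_root, local_root):
    """Return (missing, changed): relative paths of files under extracted_root
    that are absent from local_root, or present with different content."""
    missing, changed = [], []
    for dirpath, _dirs, files in os.walk(extracted_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, extracted_root)
            dst = os.path.join(local_root, rel)
            if not os.path.isfile(dst):
                missing.append(rel)
            # shallow=False compares file contents, not just size and mtime
            elif not filecmp.cmp(src, dst, shallow=False):
                changed.append(rel)
    return sorted(missing), sorted(changed)
```

A call such as compare_trees('/path/to/extracted/tar', os.path.expanduser('~')) returns two sorted lists that can be printed or logged.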

For more control, here's a Python script:


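import os
import tarfile

def compare_tar_to_local(tar_path, local_path):
    """Return regular files in the archive that have no local counterpart."""
    with tarfile.open(tar_path, 'r') as tar:
        # getnames() also returns directories and may keep a leading "./",
        # so filter on member type and normalize each name.
        tar_members = {os.path.normpath(m.name) for m in tar.getmembers() if m.isfile()}

    local_files = set()
    for root, dirs, files in os.walk(local_path):
        for f in files:
            rel_path = os.path.relpath(os.path.join(root, f), local_path)
            local_files.add(rel_path)

    return tar_members - local_files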

missing_files = compare_tar_to_local('archive.tar', os.path.expanduser('~'))
print("Files in tar but missing locally:")
print('\n'.join(sorted(missing_files)))

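For large archives, the existence checks can be run in parallel with GNU parallel:


# Test every non-directory archive entry against $HOME, eight jobs at a time
tar -tf archive.tar | grep -v '/$' | parallel --will-cite -j8 'test -e "$HOME"/{} || echo "Missing: {}"'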

A few precautions:

  • Keep --dry-run (-n) on rsync-based comparisons so that nothing is copied
  • Decide whether permissions and timestamps should count as differences
  • Handle symlinks carefully, as they may point to different locations
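
To make the points about permissions, timestamps, and symlinks concrete, here is a rough Python sketch that compares the metadata recorded in the archive with what is on disk. It assumes member names are relative to local_root; the function name metadata_differences and the reason strings are invented for this illustration.

```python
import os
import stat
import tarfile


def metadata_differences(tar_path, local_root):
    """Yield (name, reason) for members whose local counterpart differs."""
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            rel = os.path.normpath(member.name)
            local = os.path.join(local_root, rel)
            try:
                st = os.lstat(local)  # lstat does not follow symlinks
            except FileNotFoundError:
                yield rel, "missing"
                continue
            if member.issym():
                # Compare the link target itself, not the file it points to
                if not stat.S_ISLNK(st.st_mode):
                    yield rel, "not a symlink locally"
                elif os.readlink(local) != member.linkname:
                    yield rel, "symlink target differs"
                continue
            if stat.S_IMODE(st.st_mode) != stat.S_IMODE(member.mode):
                yield rel, "mode differs"
            # Whole seconds only: archives may not store sub-second mtimes
            if member.isfile() and int(st.st_mtime) != int(member.mtime):
                yield rel, "mtime differs"
```

Ownership is left out on purpose, since numeric IDs in a backup often do not match the machine it is restored on.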

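When TAR archives serve as backups, we often need to verify that every file in the archive still exists in the local directory structure. The naive approach of using tar -dvf falls short because:

  • It checks only archive members, and only those matching any names passed on the command line
  • Its output mixes missing files with size, mode, and timestamp differences instead of giving a clean list
  • GNU tar's exit status of 1 only says that something differs, not what or how much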

Method 1: Using tar and diff

First, list contents of both locations:


# List regular files in the archive with relative paths (drop directories and any leading ./)
tar -tf backup.tar | grep -v '/$' | sed 's|^\./||' | sort > archive_files.txt

# List local files (excluding the archive itself; -printf requires GNU find)
find "$HOME" -path "$HOME/backup.tar" -prune -o -type f -printf "%P\n" | sort > local_files.txt

# Show files that are in the archive but missing locally
diff -u archive_files.txt local_files.txt | grep "^-[^-]"

Method 2: Python Script for Detailed Comparison

For more control, use this Python script:


import tarfile
import os
from pathlib import Path

home = str(Path.home())
archive_path = os.path.join(home, 'backup.tar')

def get_archive_files():
    with tarfile.open(archive_path) as tar:
        # normpath strips a leading "./" so names match the os.walk output
        return set(os.path.normpath(m.name) for m in tar.getmembers() if m.isfile())

def get_local_files():
    local_files = set()
    for root, _, files in os.walk(home):
        for file in files:
            full_path = os.path.join(root, file)
            # Skip only the archive itself, not every file in the home directory
            if full_path == archive_path:
                continue
            rel_path = os.path.relpath(full_path, home)
            local_files.add(rel_path)
    return local_files

missing_in_local = get_archive_files() - get_local_files()
print("Files in archive missing locally:")
print("\n".join(sorted(missing_in_local)))

For thorough validation including content checks:


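# Generate checksums for archive members (GNU tar sets $TAR_FILENAME;
# | is the sed delimiter because member names contain slashes)
tar -xf backup.tar --to-command='sha1sum | sed "s|-\$|$TAR_FILENAME|"' > archive_checksums.txt

# Generate checksums for local files, with paths relative to $HOME
find "$HOME" -type f -exec sha1sum {} + | sed "s|$HOME/||" > local_checksums.txt

# Compare: print archive entries that are missing locally or differ in content
# (process substitution requires bash)
comm -23 <(sort archive_checksums.txt) <(sort local_checksums.txt)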
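
The checksum comparison can also be done without temporary files. The following Python sketch streams each archive member and its local counterpart through SHA-1; it assumes member names are relative to local_root, and the helper names sha1_of_stream and content_mismatches are illustrative only.

```python
import hashlib
import os
import tarfile


def sha1_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files are not held in memory."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def content_mismatches(tar_path, local_root):
    """Return {relative_path: 'missing' or 'changed'} for regular files."""
    problems = {}
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            rel = os.path.normpath(member.name)
            local = os.path.join(local_root, rel)
            if not os.path.isfile(local):
                problems[rel] = "missing"
                continue
            with tar.extractfile(member) as archived, open(local, "rb") as current:
                if sha1_of_stream(archived) != sha1_of_stream(current):
                    problems[rel] = "changed"
    return problems
```

Unlike the comm pipeline, this separates missing files from changed ones, which is usually what a backup check needs to report.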

Common pitfalls:

  • Symbolic links may cause false positives
  • Permissions and ownership aren't checked by default
  • Hidden files (dotfiles) are often overlooked
  • Case-insensitive filesystems can merge or hide names that differ only in case
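
Two of these pitfalls can be screened for before comparing anything. The sketch below takes a list of archive member names (for example from tar.getnames()) and flags names that collide when case is ignored, plus entries with a hidden path component. Both function names are hypothetical helpers, not part of any library.

```python
from collections import defaultdict
import posixpath


def case_collisions(names):
    """Group names that differ only by letter case."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


def hidden_entries(names):
    """Return names with at least one path component starting with a dot."""
    hidden = []
    for name in names:
        # Tar member names use "/" as the separator on every platform
        parts = posixpath.normpath(name).split("/")
        if any(part.startswith(".") and part not in (".", "..") for part in parts):
            hidden.append(name)
    return hidden
```

Any group returned by case_collisions would end up as a single file on a case-insensitive filesystem, so a plain name comparison would report one of the names as missing.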

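To verify backups on a schedule, consider setting up a cron job. This entry runs the script daily at 03:00 and appends both output and errors to a log:


0 3 * * * /usr/bin/python3 /path/to/compare_script.py >> /var/log/backup_verify.log 2>&1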
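
A scheduled check is more useful when the script logs timestamps and signals failure through its exit status. As a minimal sketch, assuming the comparison yields a set of missing paths, the reporting step could look like this; the function name report and the logger name backup_verify are made up for the example.

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)


def report(missing, log=logging.getLogger("backup_verify")):
    """Log the outcome and return a process exit status (0 = all present)."""
    if missing:
        for name in sorted(missing):
            log.error("missing locally: %s", name)
        return 1
    log.info("all archived files are present")
    return 0
```

The script would then end with sys.exit(report(missing_in_local)), so that cron wrappers or monitoring tools can react to a non-zero status.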