Why Do Identical Tar Contents Produce Different MD5 Checksums? A Deep Dive into Tar Metadata and Timestamps


Recently, while working on a backup script, I ran into a puzzling situation: two tar archives created from the exact same directory contents produced different MD5 checksums. This immediately raised red flags about data integrity verification in my automation pipeline.

GNU tar writes a header for every archive member, and by default that header records metadata such as permissions, owner, size, and modification time:


$ tar --list --verbose --file=archive1.tar
-rw-r--r-- user/user 1048576 2023-11-25 14:30 file1.txt
-rw-r--r-- user/user 2097152 2023-11-25 14:31 file2.txt

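Notice the timestamps. They are the files' modification times, not the time the archive was created, so they change whenever the files are regenerated, copied, or checked out again, even if the file contents remain identical.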

Let's demonstrate this with a concrete example:


# First archive creation
$ tar -cf archive1.tar my_directory
$ md5sum archive1.tar
<checksum 1>  archive1.tar

# Second archive, after the files were regenerated with the same contents
$ tar -cf archive2.tar my_directory
$ md5sum archive2.tar
<checksum 2>  archive2.tar

The MD5 differences stem from these tar-specific factors:

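  • File modification timestamps in headers
  • UID/GID information of files
  • The order in which directory entries are read and stored
  • Optional fields like owner names and, in the pax (POSIX) format, atime/ctime plus the process ID that GNU tar may embed in extended header names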

For reliable checksums, we need deterministic archives. Here are three approaches:

1. Using --mtime Parameter


$ tar --mtime="2023-01-01 00:00Z" -cf fixed_archive.tar my_directory
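A quick way to check that --mtime really removes the timestamp dependence is to change a file's mtime between two runs and compare the results. This is a sketch under a few assumptions: GNU tar and GNU touch are available, --format=gnu is pinned so the outcome does not depend on how tar was built, and the scratch directory and variable names are illustrative only.

```shell
#!/bin/sh
# Sketch: show that --mtime hides a changed file timestamp (assumes GNU tar).
work=$(mktemp -d)
mkdir "$work/my_directory"
echo "identical content" > "$work/my_directory/file1.txt"

tar -C "$work" --format=gnu --mtime="2023-01-01 00:00Z" \
    -cf "$work/fixed1.tar" my_directory

# Change only the modification time; the contents stay the same.
touch -d "2001-02-03 04:05:06" "$work/my_directory/file1.txt"

tar -C "$work" --format=gnu --mtime="2023-01-01 00:00Z" \
    -cf "$work/fixed2.tar" my_directory

# cmp -s exits 0 only when both archives are byte-for-byte identical.
if cmp -s "$work/fixed1.tar" "$work/fixed2.tar"; then
    mtime_result="identical"
else
    mtime_result="different"
fi
echo "$mtime_result"
```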

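2. Setting POSIX Standard Format

On its own, --format=posix makes archives less reproducible, because pax extended headers also record atime and ctime. Delete those fields explicitly:

$ tar --format=posix --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
      -cf posix_archive.tar my_directory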

3. Checksumming File Contents Only

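For content-only verification, stream every member's contents to stdout and hash the stream. This ignores file names and all other metadata, so a renamed file goes undetected: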


$ tar -xOf archive.tar | md5sum

In CI/CD pipelines, add these parameters to your tar commands (--sort=name requires GNU tar 1.28 or later):


# Example for reproducible builds
tar --sort=name \
    --mtime="@0" \
    --owner=0 --group=0 \
    --numeric-owner \
    -cf build_artifact.tar dist/

MD5 is adequate for spotting accidental changes but not deliberate tampering. For critical systems, consider SHA-256 with content-based checks:


$ find my_directory -type f -exec sha256sum {} + | LC_ALL=C sort -k 2 | sha256sum
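The one-liner above hashes paths exactly as find prints them, so two copies of the same tree under different parent directories give different digests. A minimal sketch that uses relative paths instead, assuming GNU coreutils; the function name tree_digest is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a digest that depends only on relative file
# paths and file contents, not on timestamps, ownership, or tree location.
tree_digest() {
    (
        cd "$1" || exit 1
        # Sort by path (field 2) in the C locale so the order is stable
        # across machines, then hash the resulting manifest.
        find . -type f -exec sha256sum {} + \
            | LC_ALL=C sort -k 2 \
            | sha256sum \
            | cut -d ' ' -f 1
    )
}
```

Comparing the output of tree_digest for a source directory and for the same directory extracted from an archive tells you whether the contents match, regardless of how the archive was built.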

When creating multiple tar archives from identical directory contents, you might encounter different MD5 checksums despite the file contents being unchanged. This occurs because tar includes metadata in its archive headers that can vary between archiving operations. The primary culprit is typically the file modification timestamp that gets embedded in each file's header within the tar archive.

Let's demonstrate this with a simple test case. First, create a test directory:

mkdir test_dir
echo "identical content" > test_dir/file1.txt
echo "more content" > test_dir/file2.txt

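Now create two tar archives from the same directory, updating one file's timestamp in between without changing its contents:

tar -cf archive1.tar test_dir
sleep 1
touch test_dir/file1.txt
tar -cf archive2.tar test_dir
md5sum archive*.tar

You'll notice different MD5 checksums for archive1.tar and archive2.tar, even though the file contents are identical. Without the touch, GNU tar in its default format usually produces byte-identical archives, because none of the recorded metadata has changed.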

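GNU tar can store several metadata fields for each file in the archive. Access and change times are recorded mainly in the pax (POSIX) format; the default gnu format normally keeps only the modification time: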

  • Modification time (mtime)
  • Access time (atime)
  • Change time (ctime)
  • UID/GID of the file owner
  • File permissions

Even minor changes to these metadata fields will result in different tar file contents, and consequently different MD5 checksums.
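Most of these fields sit at fixed offsets in each 512-byte member header, stored as octal ASCII text. The sketch below reads a few of them from the first header of an archive. The offsets are those of the ustar header layout, which GNU tar's default format shares for these fields; the function names are hypothetical, and in a pax archive the first header may belong to an extended-header record rather than to a file.

```shell
#!/bin/sh
# Sketch: print raw fields of the first 512-byte header in a tar archive.
# Offsets follow the ustar layout; values are octal ASCII strings.
tar_header_field() {
    # usage: tar_header_field ARCHIVE OFFSET LENGTH
    # Read LENGTH bytes at OFFSET and drop NUL and space padding.
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | tr -d '\0 '
}

tar_first_header() {
    echo "mode:  $(tar_header_field "$1" 100 8)"
    echo "uid:   $(tar_header_field "$1" 108 8)"
    echo "gid:   $(tar_header_field "$1" 116 8)"
    echo "size:  $(tar_header_field "$1" 124 12)"
    echo "mtime: $(tar_header_field "$1" 136 12)"
}
```

Running tar_first_header on two archives that hash differently shows directly which header bytes changed, for example an mtime that moved by a few seconds.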

Option 1: Using --mtime Flag

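The most straightforward solution is to force a consistent timestamp. Give the date an explicit UTC marker, since a bare date is interpreted in the local time zone:

tar --mtime="2020-01-01 00:00Z" -cf archive_consistent.tar test_dir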

Option 2: Using --sort with --mtime

For more comprehensive consistency, combine sorted member order with a fixed timestamp and fixed ownership:

tar --sort=name --mtime="2020-01-01 00:00Z" --owner=0 --group=0 --numeric-owner \
    -cf archive_fully_consistent.tar test_dir

Option 3: The Canonical Tar Approach

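For absolute consistency, use this comprehensive set of flags. The --pax-option settings apply only to POSIX (pax) archives, so the format has to be posix rather than gnu:

tar --format=posix --sort=name --mtime="2020-01-01 00:00Z" \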
    --owner=0 --group=0 --numeric-owner \
    --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
    -cf archive_canonical.tar test_dir

Create two archives with identical parameters and verify their checksums:

tar --mtime="2020-01-01 00:00Z" -cf archive_a.tar test_dir
tar --mtime="2020-01-01 00:00Z" -cf archive_b.tar test_dir
md5sum archive_{a,b}.tar

Now both archives should produce identical MD5 checksums.
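One caveat with date strings: GNU tar reads a bare date such as 2020-01-01 in the local time zone, so the same command run on machines in different zones stores different mtimes. The sketch below illustrates this, assuming GNU tar; the TZ values are POSIX-style strings, so no time zone database is needed, and the helper name make_archive is hypothetical.

```shell
#!/bin/sh
# Sketch: a bare date passed to --mtime depends on TZ; a date with an
# explicit UTC marker (Z) does not. Assumes GNU tar.
work=$(mktemp -d)
mkdir "$work/test_dir"
echo "identical content" > "$work/test_dir/file1.txt"

make_archive() {   # usage: make_archive TZ MTIME OUTPUT
    TZ="$1" tar -C "$work" --format=gnu --mtime="$2" -cf "$3" test_dir
}

make_archive UTC0  "2020-01-01"        "$work/utc.tar"
make_archive JST-9 "2020-01-01"        "$work/jst.tar"
make_archive UTC0  "2020-01-01 00:00Z" "$work/utc_z.tar"
make_archive JST-9 "2020-01-01 00:00Z" "$work/jst_z.tar"

cmp -s "$work/utc.tar"   "$work/jst.tar"   && bare_date="identical" || bare_date="different"
cmp -s "$work/utc_z.tar" "$work/jst_z.tar" && utc_date="identical"  || utc_date="different"
echo "bare date: $bare_date, explicit UTC: $utc_date"
```

An epoch value such as --mtime="@0" avoids the issue as well, since it carries no time zone at all.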

If you need to verify content rather than archive structure, consider checksumming the extracted files:

tmp=$(mktemp -d)
tar -xf archive.tar -C "$tmp"
(cd "$tmp" && find . -type f -exec md5sum {} + | LC_ALL=C sort -k 2 | md5sum)
rm -rf "$tmp"

Here's a bash function to create deterministic tar archives:

function deterministic_tar() {
    local dir="$1"
    local output="$2"
    tar --format=posix --sort=name --mtime="2020-01-01 00:00Z" \
        --owner=0 --group=0 --numeric-owner \
        --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
        -cf "$output" "$dir"
}
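To confirm that such a function behaves as intended on a given machine, build the same directory twice and compare the results byte for byte. The sketch below repeats the function so that it is self-contained. It uses --format=posix because --pax-option applies to POSIX (pax) archives, assumes GNU tar 1.28 or later, and the helper name check_reproducible is hypothetical.

```shell
#!/bin/sh
# Sketch: build an archive twice and confirm the results match.
deterministic_tar() {
    # usage: deterministic_tar DIR OUTPUT
    # LC_ALL=C keeps the --sort=name ordering independent of the locale.
    LC_ALL=C tar --format=posix --sort=name --mtime="2020-01-01 00:00Z" \
        --owner=0 --group=0 --numeric-owner \
        --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
        -cf "$2" "$1"
}

# Hypothetical helper: exit status 0 means both builds were identical.
check_reproducible() {
    tmp=$(mktemp -d)
    deterministic_tar "$1" "$tmp/first.tar"
    deterministic_tar "$1" "$tmp/second.tar"
    if cmp -s "$tmp/first.tar" "$tmp/second.tar"; then status=0; else status=1; fi
    rm -rf "$tmp"
    return "$status"
}
```

A call such as check_reproducible test_dir && echo "reproducible" fits naturally into a CI job as a guard before checksums are published.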