Recently while working on a backup script, I encountered a puzzling situation: two tar archives created from the exact same directory contents produced different MD5 checksums. This immediately raised red flags about data integrity verification in my automation pipeline.
By default, GNU tar records several metadata fields in the header of each archive member:
$ tar --list --verbose --file=archive1.tar
-rw-r--r-- user/user 1048576 2023-11-25 14:30 file1.txt
-rw-r--r-- user/user 2097152 2023-11-25 14:31 file2.txt
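Notice the timestamps: each entry records the file's modification time. If the files are regenerated or touched between runs (by a build step or a fresh checkout, for example), these values change and so does the archive, even though the file contents remain identical.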
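To check that the recorded modification time alone is enough to change the archive, here is a minimal sketch. It assumes GNU tar and a POSIX shell; `demo_dir`, `before.tar` and `after.tar` are names made up for the illustration.

```shell
# Sketch: same file bytes, different mtime, different archive bytes.
set -eu

work=$(mktemp -d)            # scratch area (hypothetical layout)
cd "$work"
mkdir demo_dir
printf 'identical content\n' > demo_dir/file1.txt

# Pin the mtime and archive, then move the mtime one minute forward and archive again.
touch -t 202311251430.00 demo_dir/file1.txt
tar -cf before.tar demo_dir
touch -t 202311251431.00 demo_dir/file1.txt
tar -cf after.tar demo_dir

# cmp -s is silent and only reports through its exit status.
if cmp -s before.tar after.tar; then
    result=identical
else
    result=different
fi
echo "archives are $result"
```

The file data never changes in this run, so any difference between the two archives comes from the header.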
Let's demonstrate this with a concrete example. The hash values shown are placeholders; the point is only that the two differ:
# First archive creation
$ tar -cf archive1.tar my_directory
$ md5sum archive1.tar
d41d8cd98f00b204e9800998ecf8427e
# Second archive (same contents)
$ tar -cf archive2.tar my_directory
$ md5sum archive2.tar
5d41402abc4b2a76b9719d911017c592
The MD5 differences stem from these tar-specific factors:
- File modification timestamps in headers
- UID/GID information of files
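- In POSIX (pax) format, access and change times, plus extended header names that embed the tar process ID
- The order of members, which follows the directory read order unless sorted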
- Optional fields like owner names
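When two archives disagree, comparing their verbose listings usually shows which of these factors is responsible. The following is a sketch, assuming GNU tar (for `--full-time` and `--numeric-owner`); the directory and file names are invented for the demonstration.

```shell
# Sketch: diff the header metadata of two archives to locate the difference.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir demo_dir
printf 'identical content\n' > demo_dir/file1.txt

touch -t 202311251430.00 demo_dir/file1.txt
tar -cf archive1.tar demo_dir
touch -t 202311251431.00 demo_dir/file1.txt
tar -cf archive2.tar demo_dir

# --full-time prints timestamps to full resolution; --numeric-owner prints UID/GID numbers.
tar --full-time --numeric-owner -tvf archive1.tar > listing1.txt
tar --full-time --numeric-owner -tvf archive2.tar > listing2.txt

# diff exits with status 1 when the listings differ, so keep set -e from aborting.
header_diff=$(diff listing1.txt listing2.txt || true)
printf '%s\n' "$header_diff"
```

Only the entries whose metadata differs show up in the diff, which narrows the search to a specific field (here, the timestamp of `file1.txt`).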
For reliable checksums, we need deterministic archives. Here are three approaches:
1. Using --mtime Parameter
$ tar --mtime="2023-01-01" -cf fixed_archive.tar my_directory
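One caveat: a bare date such as `2023-01-01` is interpreted in the local time zone, so machines in different zones would record different values. A sketch that avoids this by passing an epoch value follows. It borrows the `SOURCE_DATE_EPOCH` variable name from the reproducible-builds convention, which is my assumption and not something the commands above require; it also assumes GNU tar writing its default `gnu` format.

```shell
# Sketch: pin every member's mtime to a fixed epoch, independent of time zone.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir my_directory
printf 'payload\n' > my_directory/file1.txt

# 1672531200 is 2023-01-01 00:00:00 UTC; honour a value supplied by the environment.
: "${SOURCE_DATE_EPOCH:=1672531200}"

tar --mtime="@${SOURCE_DATE_EPOCH}" -cf fixed_1.tar my_directory

# Simulate a rebuild that rewrites the file's timestamp.
touch -t 202406011200.00 my_directory/file1.txt
tar --mtime="@${SOURCE_DATE_EPOCH}" -cf fixed_2.tar my_directory

sum1=$(md5sum < fixed_1.tar)
sum2=$(md5sum < fixed_2.tar)
echo "$sum1"
echo "$sum2"
```

Both checksums should match, because the timestamp written to the headers no longer depends on the files on disk.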
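2. Setting POSIX Standard Format
The POSIX (pax) format is not deterministic on its own: it records atime and ctime and names its extended headers after the tar process ID. Pin the header name and drop those fields:
$ tar --format=posix --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime -cf posix_archive.tar my_directory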
3. Checksumming File Contents Only
For content-only verification:
$ tar -xOf archive.tar | md5sum
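To illustrate what this does and does not cover, the sketch below builds two archives that differ only in header timestamps and compares both kinds of checksum. It assumes GNU tar; `old.tar` and `new.tar` are illustrative names. Note that `-O` streams the file data in archive order, so file names, permissions and member boundaries are not part of the hash.

```shell
# Sketch: archive checksums differ, content-only checksums agree.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir my_directory
printf 'identical content\n' > my_directory/file1.txt
printf 'more content\n' > my_directory/file2.txt

# Same files, but a different timestamp is forced into the headers.
tar --mtime='@0' -cf old.tar my_directory
tar --mtime='@86400' -cf new.tar my_directory

archive_sum_old=$(md5sum < old.tar)
archive_sum_new=$(md5sum < new.tar)

# -x -O writes each member's data to stdout instead of to disk.
content_sum_old=$(tar -xOf old.tar | md5sum)
content_sum_new=$(tar -xOf new.tar | md5sum)

echo "archive: $archive_sum_old vs $archive_sum_new"
echo "content: $content_sum_old vs $content_sum_new"
```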
In CI/CD pipelines, add these parameters to your tar commands:
# Example for reproducible builds
tar --sort=name \
--mtime="@0" \
--owner=0 --group=0 \
--numeric-owner \
-cf build_artifact.tar dist/
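If the pipeline compresses the artifact, the compressor can reintroduce variation: gzip may store a timestamp in its own header. A sketch of a build-twice check follows, assuming GNU tar 1.28 or later (for `--sort=name`) and gzip; the `build_artifact` function and the `dist/` contents are hypothetical.

```shell
# Sketch: build the same artifact twice and fail the job if the bytes differ.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir -p dist/assets
printf 'console.log("hi")\n' > dist/app.js
printf 'body{}\n' > dist/assets/site.css

build_artifact() {
    # $1: output path. gzip -n omits the original name and timestamp from the gzip header.
    tar --sort=name \
        --mtime='@0' \
        --owner=0 --group=0 \
        --numeric-owner \
        -cf - dist/ | gzip -n > "$1"
}

build_artifact build_artifact_1.tar.gz
sleep 1                      # let the wall clock advance between builds
touch dist/app.js            # simulate a rebuild touching the output
build_artifact build_artifact_2.tar.gz

if cmp -s build_artifact_1.tar.gz build_artifact_2.tar.gz; then
    status=reproducible
else
    status=not-reproducible
fi
echo "$status"
```

Running a check like this in CI catches a regression as soon as someone drops one of the flags.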
For critical systems, consider SHA-256 with content-based checks:
$ find my_directory -type f -exec sha256sum {} + | sort | sha256sum
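One thing to watch with the one-liner above: the directory name is part of every path that gets hashed, so the same tree under a different name yields a different digest. A possible variant, sketched here with a hypothetical `tree_digest` helper, hashes paths relative to the directory and fixes the sort locale. It assumes GNU coreutils and file names without newlines.

```shell
# Sketch: a content digest that does not depend on where the tree lives.
set -eu

tree_digest() {
    # $1: directory. Paths are relative to it; LC_ALL=C makes the sort order locale-independent.
    ( cd "$1" && find . -type f -exec sha256sum {} + \
        | LC_ALL=C sort -k 2 \
        | sha256sum \
        | cut -d ' ' -f 1 )
}

work=$(mktemp -d)
cd "$work"
mkdir -p my_directory/sub
printf 'identical content\n' > my_directory/file1.txt
printf 'more content\n' > my_directory/sub/file2.txt
cp -R my_directory copy_of_directory

digest_a=$(tree_digest my_directory)
digest_b=$(tree_digest copy_of_directory)

# Changing one file's content must change the digest.
printf 'changed\n' > copy_of_directory/file1.txt
digest_c=$(tree_digest copy_of_directory)

echo "$digest_a"
echo "$digest_b"
echo "$digest_c"
```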
When creating multiple tar archives from identical directory contents, you might encounter different MD5 checksums despite the file contents being unchanged. This occurs because tar includes metadata in its archive headers that can vary between archiving operations. The primary culprit is typically the file modification timestamp that gets embedded in each file's header within the tar archive.
Let's demonstrate this with a simple test case. First, create a test directory:
mkdir test_dir
echo "identical content" > test_dir/file1.txt
echo "more content" > test_dir/file2.txt
Now create two tar archives from the same directory:
tar -cf archive1.tar test_dir
sleep 1
tar -cf archive2.tar test_dir
md5sum archive*.tar
If anything touched the files between the two runs, or if your tar writes POSIX (pax) archives, archive1.tar and archive2.tar will have different MD5 checksums even though the file contents are identical.
The GNU tar utility stores several metadata fields for each file in the archive:
- Modification time (mtime)
- Access time (atime), recorded in POSIX (pax) format archives
- Change time (ctime), recorded in POSIX (pax) format archives
- UID/GID of the file owner
- File permissions
Even minor changes to these metadata fields will result in different tar file contents, and consequently different MD5 checksums.
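To look at these fields on disk before they ever reach an archive, `stat` can print them directly. The sketch below assumes GNU coreutils `stat` (the `-c` format option) and shows that rewriting a file with the same bytes still moves its modification time.

```shell
# Sketch: inspect the metadata tar will copy into the member header.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir test_dir
printf 'identical content\n' > test_dir/file1.txt

# Pin a known mtime, then rewrite the file with exactly the same bytes.
touch -t 202001010000.00 test_dir/file1.txt
mtime_before=$(stat -c '%Y' test_dir/file1.txt)
printf 'identical content\n' > test_dir/file1.txt
mtime_after=$(stat -c '%Y' test_dir/file1.txt)

# %Y, %X, %Z are mtime, atime, ctime in epoch seconds; %u/%g are UID/GID; %a is the octal mode.
stat -c '%n mtime=%Y atime=%X ctime=%Z uid=%u gid=%g mode=%a' test_dir/file1.txt
echo "mtime before: $mtime_before, after: $mtime_after"
```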
Option 1: Using --mtime Flag
The most straightforward solution is to force a consistent timestamp:
tar --mtime="2020-01-01" -cf archive_consistent.tar test_dir
Option 2: Using --sort with --mtime
For more comprehensive consistency, combine sorting with fixed timestamps:
tar --sort=name --mtime="2020-01-01" --owner=0 --group=0 --numeric-owner \
    -cf archive_fully_consistent.tar test_dir
Option 3: The Canonical Tar Approach
For absolute consistency, use this comprehensive set of flags. Note that --pax-option only applies to POSIX (pax) archives, so the format must be posix rather than gnu:
tar --format=posix --sort=name --mtime="2020-01-01" \
    --owner=0 --group=0 --numeric-owner \
    --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
    -cf archive_canonical.tar test_dir
Create two archives with identical parameters and verify their checksums:
tar --mtime="2020-01-01" -cf archive_a.tar test_dir
tar --mtime="2020-01-01" -cf archive_b.tar test_dir
md5sum archive_{a,b}.tar
Now both archives should produce identical MD5 checksums.
If you need to verify content rather than archive structure, consider checksumming the extracted files:
mkdir temp
tar -xf archive.tar -C temp
find temp -type f -exec md5sum {} + | sort -k 2 | md5sum
rm -rf temp
Here's a bash function to create deterministic tar archives:
function deterministic_tar() {
    local dir="$1"
    local output="$2"
    # --pax-option is only accepted for POSIX (pax) archives, hence --format=posix
    tar --format=posix --sort=name --mtime="2020-01-01" \
        --owner=0 --group=0 --numeric-owner \
        --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
        -cf "$output" "$dir"
}
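Flags such as `--sort` and `--pax-option` are GNU tar features, so a script that may run on other systems could check for GNU tar first. The following sketch wraps the same idea behind such a guard. `require_gnu_tar` and `checked_deterministic_tar` are hypothetical names, and the epoch value `1577836800` (2020-01-01 00:00:00 UTC) is my substitution to avoid time-zone dependence.

```shell
# Sketch: refuse to run without GNU tar, then build the archive twice and compare.
set -eu

require_gnu_tar() {
    # The first line of `tar --version` reads "tar (GNU tar) x.y" on GNU tar.
    tar --version 2>/dev/null | head -n 1 | grep -q 'GNU tar'
}

checked_deterministic_tar() {
    # $1: directory to archive, $2: output file
    require_gnu_tar || { echo 'GNU tar is required' >&2; return 1; }
    tar --format=posix --sort=name --mtime='@1577836800' \
        --owner=0 --group=0 --numeric-owner \
        --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
        -cf "$2" "$1"
}

work=$(mktemp -d)
cd "$work"
mkdir test_dir
printf 'identical content\n' > test_dir/file1.txt
printf 'more content\n' > test_dir/file2.txt

checked_deterministic_tar test_dir run_1.tar
touch test_dir/file1.txt     # changes mtime, atime and ctime on disk
checked_deterministic_tar test_dir run_2.tar

if cmp -s run_1.tar run_2.tar; then same=yes; else same=no; fi
echo "identical archives: $same"
```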