How to Implement Data Integrity Verification for ext4 Filesystems Before Backup

Unlike modern filesystems like btrfs or ZFS, ext4 doesn't natively support file-level checksums for data integrity verification. This creates a challenge when you need to validate data correctness before critical operations like backups. Here are some practical approaches to implement checksum verification on ext4:

The most straightforward method is to generate and store checksums for individual files:

# Generate SHA-256 checksums for all files in a directory
find /path/to/data -type f -exec sha256sum {} \; > checksums.sha256

# Verify checksums later
sha256sum -c checksums.sha256

For more sophisticated needs, consider implementing a database of file checksums:

#!/bin/bash
# Checksum database manager
DB_FILE="/var/lib/checksum_db"

verify_integrity() {
    while IFS= read -r line; do
        file=${line:66}   # digest is 64 hex chars + 2-char separator; safe for filenames with spaces
        if [ ! -f "$file" ]; then
            echo "Missing: $file"
            continue
        fi
        echo "$line" | sha256sum -c --status 2>/dev/null || echo "Corrupt: $file"
    done < "$DB_FILE"
}

update_database() {
    find /path/to/data -type f -exec sha256sum {} \; > "$DB_FILE.tmp"
    mv "$DB_FILE.tmp" "$DB_FILE"
}
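A minimal end-to-end run of the update/verify cycle might look like the following sketch; it substitutes a temporary directory for /path/to/data, and the injected corruption is purely illustrative:

```shell
#!/bin/bash
# Demonstrate the update/verify cycle against a throwaway data set.
set -eu
DATA_DIR=$(mktemp -d)
DB_FILE="$DATA_DIR/checksum_db"

echo "hello" > "$DATA_DIR/a.txt"
echo "world" > "$DATA_DIR/b.txt"

# update: record a checksum for every data file
find "$DATA_DIR" -type f -name '*.txt' -exec sha256sum {} \; > "$DB_FILE"

# Simulate silent corruption of one file
echo "tampered" > "$DATA_DIR/b.txt"

# verify: report files that are missing or whose checksum no longer matches
REPORT=$(while IFS= read -r line; do
    file=${line:66}              # skip 64-char digest + 2-char separator
    if [ ! -f "$file" ]; then
        echo "Missing: $file"
    elif ! echo "$line" | sha256sum -c --status 2>/dev/null; then
        echo "Corrupt: $file"
    fi
done < "$DB_FILE")
echo "$REPORT"

rm -rf "$DATA_DIR"
```

Only the tampered file is reported; unchanged files pass silently.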

The BSD mtree format provides a robust way to track file attributes and checksums:

# Create mtree database
mtree -c -K cksum,sha256 -p /path/to/data > /backup/data.mtree

# Verify against database
mtree -f /backup/data.mtree -p /path/to/data

For mission-critical systems, consider these enterprise-grade solutions:

  • Implement a custom FUSE layer that transparently handles checksums
  • Use dm-verity with device mapper for block-level verification
  • Deploy auditd rules to monitor file modifications

Most modern backup tools support pre-backup verification hooks. Here's an example for BorgBackup:

#!/bin/bash
# Pre-backup verification script for Borg
if ! sha256sum -c /backup/checksums.sha256 >/dev/null 2>&1; then
    logger "Backup aborted: checksum verification failed"
    exit 1
fi

For large datasets, consider parallelizing checksum generation with GNU parallel:

find /data -type f -print0 | parallel -0 -j8 sha256sum > checksums.sha256

ext4 maintains checksums only for filesystem metadata (via the optional metadata_csum feature), leaving user data vulnerable to silent corruption.

Here are three reliable approaches to verify data integrity before backups on ext4:

1. File-level Hashing

# Generate SHA-256 checksums for all files
find /path/to/backup -type f -exec sha256sum {} + > checksums.txt

# Verify later
sha256sum -c checksums.txt

2. Block Device Verification

# Create a checksum of the raw device (unmount first so the image is stable)
sudo umount /dev/sdX
sudo dd if=/dev/sdX bs=1M | sha256sum > device_checksum.sha256

# Verify later: recompute the hash and compare it with the stored one
sudo umount /dev/sdX
sudo dd if=/dev/sdX bs=1M | sha256sum | diff - device_checksum.sha256

3. Continuous Protection with dm-integrity

For continuous protection, consider using Linux's device-mapper integrity target (dm-integrity):

# Set up dm-integrity in standalone mode (destroys existing data on the device;
# hmac-sha256 is also possible but additionally requires a key file)
sudo integritysetup format /dev/sdX --integrity sha256
sudo integritysetup open /dev/sdX int-sdX --integrity sha256

# Create and mount the filesystem on the integrity-protected mapping
sudo mkfs.ext4 /dev/mapper/int-sdX
sudo mount /dev/mapper/int-sdX /mnt

Here's a Python script that implements differential verification:

#!/usr/bin/env python3
import hashlib
import os
from pathlib import Path

def generate_checksums(directory):
    checksums = {}
    for filepath in Path(directory).rglob('*'):
        if filepath.is_file():
            digest = hashlib.sha256()
            with open(filepath, 'rb') as f:
                # Hash in 1 MiB chunks so large files don't exhaust memory
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    digest.update(chunk)
            checksums[str(filepath)] = digest.hexdigest()
    return checksums

def verify_checksums(original, current):
    for path, original_hash in original.items():
        if path not in current:
            print(f"File missing: {path}")
            continue
        if current[path] != original_hash:
            print(f"Checksum mismatch: {path}")

# Usage:
prev_state = generate_checksums('/backup/data')
# ... after some time ...
current_state = generate_checksums('/backup/data')
verify_checksums(prev_state, current_state)

When implementing verification for backup systems:

  • Store checksums separately from the backup data
  • Consider using par2 for redundancy
  • For rsync backups, use the -c flag to force checksum-based comparison
  • Cloud storage users should enable object-level checksums
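The first item above can be sketched with nothing but coreutils; the two temporary directories stand in for the data volume and for physically separate storage:

```shell
#!/bin/bash
# Keep the checksum manifest on storage separate from the data it describes,
# so corruption of the data volume cannot also take out the manifest.
set -eu
DATA=$(mktemp -d)    # stands in for the data volume
VAULT=$(mktemp -d)   # stands in for separate storage (second disk, remote host)

echo "records" > "$DATA/db.dump"

# Store the manifest in the vault, with paths relative to the data root
( cd "$DATA" && sha256sum db.dump ) > "$VAULT/manifest.sha256"

# Later: verify the data volume against the separately stored manifest
RESULT=$(cd "$DATA" && sha256sum -c "$VAULT/manifest.sha256")
echo "$RESULT"

rm -rf "$DATA" "$VAULT"
```

Recording paths relative to the data root keeps the manifest valid even if the volume is later mounted at a different location.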