How to Recompress Existing ZFS Files After Changing Compression Algorithm from LZJB to LZ4


When working with long-lived ZFS pools, we often encounter this scenario: older files remain compressed with outdated algorithms (like LZJB) even after upgrading to superior methods (like LZ4). This happens because ZFS applies the compression property only to newly written blocks; existing blocks keep whatever algorithm they were written with until they are rewritten. The result is suboptimal storage efficiency, since LZ4 generally matches or slightly beats LZJB's compression ratio while being substantially faster.
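
A quick first check is the pool's command history plus a recursive property listing, which show when compression was changed and what each dataset is set to now (substitute your pool name for pool):

zpool history pool | grep compression
zfs get -r compression pool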

Two common approaches exist, both with significant drawbacks:

# Option 1: Full dataset copy via send/receive
zfs snapshot pool/old@migrate
zfs send pool/old@migrate | zfs receive pool/new
# Requires enough free space for a second full copy, plus a cutover window
# Option 2: Selective file recopying
find /pool/old -type f -mtime +3650 -exec cp {} /pool/new \;
# Flattens the directory tree, clobbers same-named files, and is fragile
# with special characters in filenames

While ZFS doesn't directly offer "recompress" functionality, we can leverage these technical approaches:

# Method 1: send/receive into a new dataset (blocks recompress on write)
zfs set compression=lz4 pool/dataset
zfs snapshot pool/dataset@snap
zfs send pool/dataset@snap | zfs receive -o compression=lz4 pool/dataset_new
# Method 2: in-place rewrite (a file's blocks only change when rewritten)
# somefile is a placeholder; see the bulk script further down
sudo zfs set compression=lz4 pool/dataset
sudo cp -p /pool/dataset/somefile /pool/dataset/somefile.tmp
sudo mv /pool/dataset/somefile.tmp /pool/dataset/somefile

To analyze compression distribution across your dataset:

# Tally block pointers per compression algorithm (verbose object dump)
zdb -ddddd pool/dataset | grep -oE 'lzjb|lz4' | sort | uniq -c
# Note: full object dumps are slow; try a small child dataset first
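
To inspect a single file instead of a whole dataset, you can dump just its object; on ZFS, the inode number reported by ls -i is the object id (somefile and the object number are placeholders):

ls -i /pool/dataset/somefile
zdb -ddddd pool/dataset <object-number> | grep -oE 'lzjb|lz4' | sort | uniq -c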

When implementing this:

  • Always create snapshots before major operations (example after this list)
  • Monitor system load during recompression
  • Consider doing this during maintenance windows
  • Test with non-critical datasets first
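
For example, a recursive snapshot taken just before you start gives a cheap rollback point (the snapshot name is arbitrary):

zfs snapshot -r pool/dataset@pre-recompress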

For minimal downtime, seed a copy with a full send, then catch up with a short incremental before cutting over (a bookmark lets you destroy the interim snapshot on the source side):

zfs snapshot pool/dataset@base
zfs bookmark pool/dataset@base pool/dataset#base
zfs send pool/dataset@base | zfs receive -u -o compression=lz4 pool/dataset_new
# Avoid -c/--compressed here: it would carry the old lzjb blocks over as-is
# Later, quiesce writers and send only the delta accumulated since @base:
zfs snapshot pool/dataset@final
zfs send -i '#base' pool/dataset@final | zfs receive pool/dataset_new

Let's restate the problem in more detail. Your pool was created years ago using lzjb compression and was later upgraded to lz4, so you now have a mixed bag of compression formats with varying efficiency.

# Current compression setting (likely shows lz4 now)
zfs get compression poolname

The difference between lzjb and lz4 isn't just academic. In real-world datasets, lz4 typically provides roughly:

  • 10-20% better compression ratios
  • 2-5x faster compression speeds
  • 3-10x faster decompression speeds
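
To sanity-check those figures against your own data, one rough approach is to copy a representative sample into two scratch datasets and compare the resulting ratios (the test dataset names and sample path are made up; results vary heavily with the data):

zfs create -o compression=lzjb pool/test_lzjb
zfs create -o compression=lz4 pool/test_lz4
cp -a /pool/dataset/sample/. /pool/test_lzjb/
cp -a /pool/dataset/sample/. /pool/test_lz4/
zfs get compressratio pool/test_lzjb pool/test_lz4
zfs destroy pool/test_lzjb
zfs destroy pool/test_lz4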

Option 1: ZFS Send/Receive (Recommended)

This creates new blocks with current compression settings:

# Create temporary snapshot
zfs snapshot poolname/dataset@compressfix

# Send WITHOUT -w/--raw or -c/--compressed: raw and compressed streams
# carry the old lzjb blocks verbatim; a plain stream is decompressed in
# flight and rewritten with the target's compression
zfs send poolname/dataset@compressfix | zfs recv -o compression=lz4 poolname/dataset_new

# Verify and replace
zfs rename poolname/dataset poolname/dataset_old
zfs rename poolname/dataset_new poolname/dataset
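
Before the rename, it's worth confirming the copy actually landed with the new algorithm and comparing its footprint against the original:

zfs get compression,compressratio,used poolname/dataset_new
zfs get compressratio,used poolname/dataset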

Option 2: In-place Rewrite

For cases where send/receive isn't feasible:

#!/bin/bash
# Rewrite files untouched for ~10 years (3650 days); adjust -mtime as needed
find /poolname/dataset -type f -mtime +3650 -print0 | while IFS= read -r -d '' file
do
    # The mv is a cheap rename; the cp is what writes new (lz4) blocks
    mv "$file" "$file.temp" && \
    cp -p "$file.temp" "$file" && \
    rm "$file.temp"
done
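
One caveat with this script: files with multiple hard links are split into independent copies by the rewrite. It's worth listing them up front and handling them separately:

find /poolname/dataset -type f -links +1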

Checking Compression Distribution

While ZFS doesn't directly report per-algorithm stats, we can estimate:

# Compare compressed vs. logical size per dataset
zfs list -r -o name,compressratio,used,logicalused poolname
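
For a numeric estimate, parseable output plus a little awk gives the effective ratio per dataset directly (this simply divides logicalused by used, so snapshots and metadata skew it somewhat):

zfs list -Hp -r -o name,used,logicalused poolname | \
awk '$2 > 0 {printf "%s\t%.2fx\n", $1, $3/$2}'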

Automated Dataset Processing

For large environments, consider this Python approach:

import subprocess

def get_datasets(pool):
    # List every dataset in the pool, one name per line
    out = subprocess.check_output(["zfs", "list", "-H", "-o", "name", "-r", pool])
    return out.decode().splitlines()

for ds in get_datasets("poolname"):
    # zfs list returns filesystems by default, but skip snapshots defensively
    if "@" in ds:
        continue
    # The pool root can't be received as a sibling dataset; handle it separately
    if "/" not in ds:
        continue

    # New writes will use lz4; existing blocks keep their old algorithm
    subprocess.run(["zfs", "set", "compression=lz4", ds], check=True)

    # Actually rewriting the data requires a copy: snapshot, then send/receive
    subprocess.run(["zfs", "snapshot", f"{ds}@compressfix"], check=True)
    send = subprocess.Popen(["zfs", "send", f"{ds}@compressfix"], stdout=subprocess.PIPE)
    subprocess.run(["zfs", "receive", "-o", "compression=lz4", f"{ds}_new"],
                   stdin=send.stdout, check=True)
    send.wait()

When running these operations:

  • Monitor ARC hit rate (arcstat; see the sample invocation below)
  • Consider doing this during low-usage periods
  • For large pools, process datasets sequentially
  • Keep an eye on ZIL and transaction group commits
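
For the monitoring mentioned above, sampling pool I/O and ARC activity at a short interval is usually enough to spot trouble (arcstat ships with most OpenZFS distributions):

zpool iostat -v poolname 5
arcstat 5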