Optimizing Mass File Deletion on ZFS: Why Resilvering is Faster Than rm -rf and How to Fix It


When dealing with massive file deletion operations on ZFS, many admins encounter unexpected performance bottlenecks. Your observation about resilvering completing faster than file deletion highlights a fundamental ZFS architectural characteristic.

Resilvering operates at the block level, efficiently copying data ranges from healthy devices to replacement drives. In contrast, file deletion requires:

  • Individual dnode updates (ZFS's equivalent of inodes)
  • Directory entry removal from the ZAP
  • Space map updates for the freed blocks
  • Transaction group (TXG) commit overhead
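The per-file metadata cost is easy to demonstrate even outside ZFS. A rough sketch (scratch paths only, hypothetical scale) contrasting many tiny files against one large sparse file:

```shell
# Scratch demo: deletion cost tracks file count, not bytes stored.
work=$(mktemp -d)
mkdir "$work/many"
i=0
while [ "$i" -lt 5000 ]; do
    : > "$work/many/f$i"        # 5,000 empty files = 5,000 metadata entries
    i=$((i + 1))
done
truncate -s 1G "$work/one_big"  # sparse: 1 GB logical, almost no blocks
rm -rf "$work/many"             # 5,000 unlink operations
rm -f "$work/one_big"           # a single unlink
rmdir "$work"
```

The 1 GB file goes away in one metadata operation; the directory of empty files costs thousands.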

Instead of traditional recursive rm, consider these approaches:

# Destroy the entire filesystem (fastest method)
zfs destroy -r pool/tmp2

# Alternative for mounted filesystems
zfs unmount pool/tmp2
zfs destroy pool/tmp2

If you must preserve the filesystem structure but need to empty it:

# Parallel deletion with GNU parallel (install from ports)
# -X batches many paths into each rm invocation instead of one rm per file
find /tmp2 -type f -print0 | parallel -0 -X rm -f

# Raise the transaction group timeout so more frees batch per commit
# (the default is 5 seconds)
sysctl vfs.zfs.txg.timeout=15
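It is worth rehearsing the pipeline on a throwaway directory before pointing it at real data; `xargs -P` works as a stand-in where GNU parallel is not installed (scratch path below is hypothetical):

```shell
# Rehearse the deletion pipeline on a disposable directory first.
scratch=$(mktemp -d)
i=0
while [ "$i" -lt 200 ]; do : > "$scratch/f$i"; i=$((i + 1)); done

# -P 4: four workers; -n 50: 50 paths per rm to amortize fork cost
find "$scratch" -type f -print0 | xargs -0 -P 4 -n 50 rm -f
rmdir "$scratch"
```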

For temporary filesystems where mass deletion might occur:

# Create with optimal settings
zfs create -o recordsize=8k \
           -o primarycache=metadata \
           -o atime=off \
           -o compression=lz4 \
           pool/tmp

To identify where time is being spent:

# Watch per-vdev bandwidth and IOPS while the deletion runs
zpool iostat -v 1

# Check deletion process stats
procstat -kk $(pgrep rm)

If system responsiveness is critical, take the dataset out of service now and defer the destroy to a quiet window (note: `zfs set quota=0` would not help here, since a quota of 0 means "no quota"):

# Stop new writes immediately
zfs unmount pool/tmp2
zfs set canmount=off pool/tmp2
# Later, during a maintenance window
zfs destroy -r pool/tmp2

Implement monitoring for runaway file creation:

# Cron job to alert on /tmp growth
*/5 * * * * [ $(zfs get -Hp used pool/tmp | awk '{print $3}') -gt 1000000000 ] && alert.sh
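`zfs get -Hp` prints tab-separated name, property, value, and source columns, and the `awk '{print $3}'` in the cron line pulls the value field. The parsing can be checked against canned output, no live pool required:

```shell
# Simulated `zfs get -Hp used pool/tmp` output (tab-separated:
# name, property, value, source)
sample=$(printf 'pool/tmp\tused\t1500000000\t-')
used=$(printf '%s\n' "$sample" | awk '{print $3}')
echo "$used"   # → 1500000000
[ "$used" -gt 1000000000 ] && echo "over threshold"
```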

Remember that ZFS frees blocks asynchronously: space reclaimed by a mass delete or destroy trickles back over the following transaction group commits, and any blocks still referenced by snapshots stay allocated until those snapshots are destroyed.


When dealing with mass file deletion on ZFS (especially 10M+ files), several factors contribute to the performance bottleneck:

# Sample directory structure that might cause issues
/tmp/buggy_program/
├── session_123456
│   ├── file1.tmp
│   ├── file2.tmp
│   └── ...
├── session_789012
│   ├── file1.tmp
│   └── ...
└── ... (millions more)
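A miniature stand-in for that tree (hypothetical names, tiny scale) is useful for trying deletion strategies safely before touching the real one:

```shell
# Build a scaled-down replica of the session-directory layout
root=$(mktemp -d)
for s in 123456 789012; do
    mkdir "$root/session_$s"
    for i in 1 2 3; do : > "$root/session_$s/file$i.tmp"; done
done
count=$(find "$root" -type f | wc -l)
echo "$count"        # 6 files created
rm -rf "$root"
```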

The apparent paradox stems from fundamental differences in operations:

  • Resilvering: Sequential block-level operations with minimal metadata overhead
  • Deletion: Random-access metadata operations requiring:
    • Directory entry removal
    • Dnode updates
    • Free space accounting
    • ZFS transactional overhead

Here are practical approaches with performance benchmarks:

# Method 1: Parallel find with delete (batched, 8 workers)
find /tmp2 -type f -print0 | xargs -0 -P 8 -n 1000 rm -f

# Method 2: Replace the dataset instead of emptying it
# (a clone made from a snapshot of the full filesystem would still
#  contain every file, so rename-and-destroy is the working shortcut)
zfs rename pool/tmp pool/tmp_old
zfs create pool/tmp
zfs destroy -r pool/tmp_old

Tune these parameters before mass deletion:

# Increase transaction group timeout (the default is 5 seconds)
sysctl vfs.zfs.txg.timeout=30

# Temporary ARC cap (the value is in bytes; this is 2 GiB)
sysctl vfs.zfs.arc_max=2147483648

# On Linux, the same knobs are module parameters, e.g.
# echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout
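The ARC-size tunable expects a byte count, so compute the figure rather than relying on suffix parsing:

```shell
# 2 GiB expressed in bytes, suitable for vfs.zfs.arc_max
arc_bytes=$((2 * 1024 * 1024 * 1024))
echo "$arc_bytes"   # → 2147483648
```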

When time is critical and data safety isn't:

# WARNING: freed space reappears gradually as the async destroy runs
zfs destroy -r pool/tmp
zfs create pool/tmp

# On SSD-backed pools, tell the devices about the freed blocks
zpool trim pool
zpool wait -t trim pool
Approximate comparison:

Method            Files/sec   CPU Load   Disk IOPS
Simple rm -rf     50-80       Low        High
Parallel find     800-1200    High       Max
Dataset replace   Instant     Low        Low
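The figures above are ballpark numbers; a toy harness (scaled far down, scratch paths only, approximate timing) can produce comparable relative numbers on your own hardware:

```shell
# Toy benchmark: create N empty files, then time a deletion command.
bench() {
    dir=$(mktemp -d)
    i=0
    while [ "$i" -lt 2000 ]; do : > "$dir/f$i"; i=$((i + 1)); done
    start=$(date +%s)
    eval "$2"                     # run the deletion command under test
    end=$(date +%s)
    rm -rf "$dir"                 # clean up whatever remains
    echo "$1: ~$((2000 / (end - start + 1))) files/sec"
}

bench "simple rm"     'rm -rf "$dir"'
bench "parallel find" 'find "$dir" -type f -print0 | xargs -0 -P 4 -n 200 rm -f'
```

At this toy scale both finish in about a second; the gap only opens up at the millions-of-files scale discussed above.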