Optimizing Mass File Deletion on ZFS: Why Resilvering is Faster Than rm -rf and How to Fix It


When dealing with massive file deletion operations on ZFS, many admins encounter unexpected performance bottlenecks. Your observation about resilvering completing faster than file deletion highlights a fundamental ZFS architectural characteristic.

Resilvering operates at the block level, efficiently copying data ranges from healthy devices to replacement drives. In contrast, file deletion requires:

  • Per-file dnode (ZFS's inode equivalent) updates
  • Directory entry removal
  • Space map (free-space accounting) updates
  • Transaction group commits for each batch of changes
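That per-file metadata cost is easy to observe with a small micro-benchmark on a scratch directory (a sketch using ordinary files and `mktemp`, not a real 10M-file tree; scale the count up to see the effect grow):

```shell
#!/bin/sh
# Create a scratch directory full of small files, then time their removal.
# Each unlink is its own metadata transaction, which is why rates stay low.
dir=$(mktemp -d)
i=0
while [ "$i" -lt 1000 ]; do
    : > "$dir/file$i.tmp"
    i=$((i + 1))
done
echo "created $(find "$dir" -type f | wc -l) files"
start=$(date +%s)
rm -rf "$dir"
end=$(date +%s)
echo "deleted in $((end - start))s"
```

On ZFS the same loop at 10M files stretches into hours because every unlink touches the dnode, its directory entry, and the space maps.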

Instead of traditional recursive rm, consider these approaches:

# Destroy the entire filesystem (fastest method)
zfs destroy -r pool/tmp2

# Alternative for mounted filesystems
zfs unmount pool/tmp2
zfs destroy pool/tmp2

If you must preserve the filesystem structure but need to empty it:

# Parallel deletion with GNU parallel (install from ports)
find /tmp2 -type f -print0 | parallel -0 rm

# Shorten the transaction group timeout so deletions commit in smaller batches
sysctl vfs.zfs.txg.timeout=5

For temporary filesystems where mass deletion might occur:

# Create with optimal settings
zfs create -o recordsize=8k \
           -o primarycache=metadata \
           -o atime=off \
           -o compression=lz4 \
           pool/tmp

To identify where time is being spent:

# Watch per-device I/O while the deletion runs
zpool iostat -v 1

# Check deletion process stats
procstat -kk $(pgrep rm)

If system responsiveness is critical and you can afford temporary space loss:

# Unmount the filesystem so nothing else can touch it
zfs unmount pool/tmp2
# Later, during a reboot or maintenance window
zfs destroy -r pool/tmp2

Implement monitoring for runaway file creation:

# Cron job to alert on /tmp growth
*/5 * * * * [ $(zfs get -Hp used pool/tmp | awk '{print $3}') -gt 1000000000 ] && alert.sh
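The threshold test in that cron entry can be factored into a small helper so it is testable without a pool; `alert.sh` is hypothetical, and the `zfs get -Hp` output (name, property, value, source, tab-separated) is mocked below:

```shell
#!/bin/sh
# check_usage THRESHOLD: reads `zfs get -Hp used <dataset>` output on stdin
# and succeeds (exit 0) if the used-bytes value exceeds THRESHOLD.
check_usage() {
    threshold=$1
    used=$(awk '{print $3}')
    [ "$used" -gt "$threshold" ]
}

# Mocked output of: zfs get -Hp used pool/tmp  (2 GB used, 1 GB threshold)
printf 'pool/tmp\tused\t2000000000\t-\n' | check_usage 1000000000 \
    && echo "ALERT: pool/tmp over threshold"
```

Wiring the real `zfs get` into the pipe in the cron job recovers the original one-liner.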

Remember that ZFS's copy-on-write design means a deletion only drops references: the physical space is not reclaimed until no snapshot or clone still points at those blocks and the asynchronous free has completed. A scrub verifies data integrity but frees nothing.


When dealing with mass file deletion on ZFS (especially 10M+ files), several factors contribute to the performance bottleneck:

# Sample directory structure that might cause issues
/tmp/buggy_program/
├── session_123456
│   ├── file1.tmp
│   ├── file2.tmp
│   └── ...
├── session_789012
│   ├── file1.tmp
│   └── ...
└── ... (millions more)
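To experiment with deletion strategies safely, a scaled-down copy of that layout can be generated under `mktemp` (session names and counts here are made up for illustration):

```shell
#!/bin/sh
# Build a miniature version of the buggy program's tree:
# N session directories, each holding M small .tmp files.
root=$(mktemp -d)
sessions=20
files_per_session=50
s=0
while [ "$s" -lt "$sessions" ]; do
    d="$root/session_$s"
    mkdir "$d"
    f=0
    while [ "$f" -lt "$files_per_session" ]; do
        : > "$d/file$f.tmp"
        f=$((f + 1))
    done
    s=$((s + 1))
done
echo "total files: $(find "$root" -type f | wc -l)"
```

Cranking `sessions` and `files_per_session` up lets you benchmark each method below before touching production data.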

The apparent paradox stems from fundamental differences in operations:

  • Resilvering: Sequential block-level operations with minimal metadata overhead
  • Deletion: Random-access metadata operations requiring:
    • Directory entry removal
    • Dnode (inode-equivalent) updates
    • Free space accounting
    • ZFS transactional overhead

Here are practical approaches with performance benchmarks:

# Method 1: Parallel find with delete (most efficient)
find /tmp2 -type f -print0 | xargs -0 -P 8 rm -f
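Method 1 can be exercised end-to-end on a throwaway directory to confirm it removes everything (a sketch over ordinary files under `mktemp`, not a real pool):

```shell
#!/bin/sh
# Populate a scratch directory, then delete with the parallel pipeline.
dir=$(mktemp -d)
i=0
while [ "$i" -lt 200 ]; do
    : > "$dir/f$i"
    i=$((i + 1))
done
# -print0 / -0 handle odd filenames; -P 8 runs eight rm processes at once
find "$dir" -type f -print0 | xargs -0 -P 8 rm -f
echo "remaining: $(find "$dir" -type f | wc -l)"
rmdir "$dir"
```

The parallelism helps on ZFS because independent unlinks can be batched into the same transaction group instead of serializing behind one process.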

# Method 2: ZFS rollback alternative (requires a snapshot taken while empty)
zfs snapshot pool/tmp@empty   # take this BEFORE the files accumulate
# ... later, to empty the filesystem almost instantly:
zfs rollback -r pool/tmp@empty

Tune these parameters before mass deletion:

# Increase transaction group timeout
sysctl vfs.zfs.txg.timeout=30

# Temporary ARC size cap (value is in bytes; 2 GiB shown)
sysctl vfs.zfs.arc_max=2147483648

# Disable synchronous writes on the dataset (RISKY: recent data lost on crash)
zfs set sync=disabled pool/tmp

When time is critical and data safety isn't:

# WARNING: freed space is reclaimed asynchronously and may take a while to appear
zfs destroy -r pool/tmp
zfs create pool/tmp

# Follow up with space reclamation
zpool trim pool
zpool wait -t trim pool

Method          Files/sec   CPU Load   Disk IOPS
Simple rm -rf   50-80       Low        High
Parallel find   800-1200    High       Max
ZFS snapshot    Instant     Low        Low
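Plugging the table's rates into shell arithmetic gives a rough wall-clock estimate for a 10M-file tree (midpoints of the ranges above, not measurements from your pool):

```shell
#!/bin/sh
# Estimated deletion time = file count / sustained deletion rate.
files=10000000
rate_rm=65          # midpoint of 50-80 files/sec for plain rm -rf
rate_parallel=1000  # midpoint of 800-1200 files/sec for parallel find
echo "rm -rf:        $((files / rate_rm / 3600)) hours"
echo "parallel find: $((files / rate_parallel / 60)) minutes"
```

At these rates a plain `rm -rf` runs for roughly two days, while the parallel pipeline finishes in under three hours, which is why the snapshot/rollback approach wins whenever it can be arranged in advance.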