When expanding a ZFS striped mirror (RAID10 equivalent) from 2 disks to 4 disks by adding a second mirror vdev, the existing data stays concentrated on the original mirror pair. This leaves an unbalanced workload in which the new disks sit largely idle. Unlike traditional RAID systems, ZFS does not redistribute existing data when a new vdev is added to a pool; only newly written blocks are spread across all vdevs.
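For context, the expansion itself is a single command, and the resulting imbalance shows up in per-vdev allocation (device names here are examples):
zpool add tank mirror disk3 disk4
# ALLOC stays concentrated on the original mirror until data is rewritten
zpool list -v tank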
Here are three practical approaches to redistribute data evenly across all mirrors:
# Method 1: ZFS Send/Receive to a new pool
# Create a new pool with the desired layout (requires a separate set of disks; names are examples)
zpool create newpool mirror disk3 disk4 mirror disk5 disk6
# Snapshot the source, then send the data to the new pool
zfs snapshot -r tank/data@snapshot
zfs send -R tank/data@snapshot | zfs receive newpool/data
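# Optional sanity checks before destroying anything (cheap insurance):
zfs list -r -o name,used,referenced newpool
zpool scrub newpool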
# Destroy old pool and rename new pool
zpool destroy tank
zpool export newpool
zpool import newpool tank
ZFS channel programs cannot move existing blocks (no block-rewrite primitive is exposed to them), but they are useful for scripting the administrative side: a program executes atomically with respect to other ZFS administrative operations, so it can snapshot a dataset and all of its direct children at a single consistent point before the send/receive pass. A minimal sketch:
-- balance_snap.lua: atomically snapshot a dataset and its direct children
-- before the send/receive pass; argv[1] is the parent dataset, e.g. tank/data
args = ...
argv = args["argv"]
parent = argv[1]
zfs.sync.snapshot(parent .. "@balance_start")
for child in zfs.list.children(parent) do
    zfs.sync.snapshot(child .. "@balance_start")
end
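The script runs against the pool with zfs program; the script path and dataset argument below are examples:
zfs program tank ./balance_snap.lua tank/data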
Use these commands to verify data distribution:
# Check disk utilization
zpool iostat -v 5
# View how much data each mirror vdev holds
zpool list -v poolname
Rebalancing large datasets can impact performance. Consider these best practices:
- Schedule during low-usage periods
- Set an appropriate zfs_dirty_data_max (example after this list)
- Use compression to reduce transfer size
- Monitor ARC hit ratio during process
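For the zfs_dirty_data_max item above, on Linux the OpenZFS module parameter can be adjusted at runtime; the 4 GiB value below is purely illustrative:
# Cap outstanding dirty data for the duration of the migration (Linux)
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max
# Confirm the current value
cat /sys/module/zfs/parameters/zfs_dirty_data_max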
As noted above, once the additional mirrored vdev is added (effectively growing the 2-disk RAID-10 equivalent to 4 disks), new writes distribute across all mirrors automatically, but existing data stays concentrated on the original mirror pair until it is rewritten, so reads of that data cannot take advantage of the new disks.
The most effective method involves sending the entire dataset to a new location and restoring it:
# Create a recursive snapshot of the dataset
zfs snapshot -r tank/data@preredistribute
# Send/receive to new location (could be same pool)
zfs send -R tank/data@preredistribute | zfs receive -F tank/newdata
# Verify the copy arrived (the send stream is checksummed end to end by ZFS)
zfs list -t snapshot -r tank/newdata
zfs get -r used,referenced tank/data tank/newdata
# Swap datasets
zfs rename tank/data tank/olddata
zfs rename tank/newdata tank/data
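# Check that mountpoints ended up where you expect after the renames
zfs get -r mountpoint,mounted tank/data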
# Cleanup
zfs destroy -r tank/olddata
zfs destroy -r tank/data@preredistribute
If a second pool isn't available, an online, file-level copy with rsync works as well (it still needs enough free space in the pool for a second copy of the data):
# Create new filesystem with desired properties
zfs create -o recordsize=1M -o compression=lz4 tank/temp
# Use rsync for live migration (preserves permissions)
rsync -avxHAX --progress /tank/data/ /tank/temp/
# Verify data integrity
diff -r /tank/data/ /tank/temp/
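# If writers stayed active during the copy, stop them and run a final catch-up
# pass before swapping (--delete removes files that vanished from the source)
rsync -avxHAX --delete /tank/data/ /tank/temp/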
# Swap filesystems with zfs rename (mv between datasets would copy the data again)
zfs rename tank/data tank/olddata
zfs rename tank/temp tank/data
# Old dataset cleanup
zfs destroy -r tank/olddata
During redistribution operations, monitor system performance with:
zpool iostat -v 5
arcstat 1    # shipped as arcstat.py on older releases
iostat -xm 5
Key parameters to tune during large redistributions:
- zfs set primarycache=metadata tank/data (reduces ARC churn while the bulk copy reads old data)
- zfs set sync=disabled tank/temp (only for temporary datasets; unsynced writes are lost on a crash)
- zfs_dirty_data_max, as noted above, caps how much dirty data accumulates before ZFS throttles writers
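These property overrides persist until reverted and follow a dataset through zfs rename, so once the migration finishes, list where they are still set locally and clear them (dataset names are examples):
zfs get -r -s local primarycache,sync tank
zfs inherit primarycache tank/data
zfs inherit sync tank/data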
For recurring rebalancing, the procedure can be scripted; the sketch below uses Python and assumes mbuffer is installed:
import logging
import subprocess

def rebalance_dataset(dataset, temp_location="tank/rebalance"):
    """Rewrite a dataset via send/receive so its blocks spread across all vdevs.

    temp_location must be a ZFS dataset name with enough free space, not a filesystem path.
    """
    snap_name = f"{dataset.replace('/', '_')}_rebalance"
    snapshot = f"{dataset}@{snap_name}"
    try:
        # Create an atomic, recursive snapshot of the source
        subprocess.run(["zfs", "snapshot", "-r", snapshot], check=True)
        # Stream through mbuffer to smooth out bursts; -u keeps the copy unmounted
        subprocess.run(
            f"zfs send -R {snapshot} | mbuffer -q -s 128k -m 1G | "
            f"zfs receive -F -u {temp_location}",
            shell=True, check=True,
        )
        # Sanity check: print the logical size of source and copy for comparison
        subprocess.run(
            ["zfs", "get", "-o", "name,value", "referenced", dataset, temp_location],
            check=True,
        )
        # Swap: move the old dataset aside and promote the rebalanced copy;
        # the @..._rebalance snapshots are left in place until you are satisfied
        subprocess.run(["zfs", "rename", dataset, f"{dataset}_old"], check=True)
        subprocess.run(["zfs", "rename", temp_location, dataset], check=True)
    except subprocess.CalledProcessError as e:
        logging.error(f"Rebalance failed: {e}")
        # Best-effort cleanup of the partial copy and the snapshot
        subprocess.run(["zfs", "destroy", "-r", temp_location], check=False)
        subprocess.run(["zfs", "destroy", "-r", snapshot], check=False)