Best Practices for ZFS on Large-Scale SANs: Direct Disk Access vs. Hardware RAID


When implementing ZFS in enterprise SAN environments, administrators face a fundamental architectural decision: whether to bypass hardware RAID controllers and let ZFS manage disks directly. While this approach is well-documented for local storage configurations (typically 2-16 disks), its applicability to large SAN deployments (400+ disks) remains less discussed.

ZFS was designed with direct disk access in mind for several reasons:

  • End-to-end checksumming can only self-heal corrupted blocks when ZFS manages the redundancy itself rather than a RAID controller
  • ZFS's copy-on-write transaction model relies on cache flushes being honored, which hardware RAID write caches can obscure
  • Predictive failure analysis works best with direct access to each disk's SMART data

Example of checking disk health in ZFS:

# zpool status -v
# smartctl -a /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2K3F

For large SAN deployments, consider these approaches:

1. LUN Presentation Strategy

Instead of presenting hundreds of individual disks, create smaller logical groups:

# Example multipath configuration for SAN LUNs
/etc/multipath.conf:
multipaths {
    multipath {
        wwid  3600508b4000abcdef000000000000123
        alias san-disk-01
        path_grouping_policy multibus
        path_selector "round-robin 0"
    }
}
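
After editing the file, reload the multipath maps and confirm the alias resolves before handing it to ZFS (this assumes multipath-tools is installed and multipathd is running):

# Reload device maps and verify the alias
multipath -r
multipath -ll san-disk-01
ls -l /dev/mapper/san-disk-01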

2. ZFS Pool Topology Options

For 400+ disk environments, build the pool from many smaller vdev groups rather than a few enormous ones:

# Creating a single pool from smaller vdev groups
# Note: mixing mirror and raidz2 vdevs in one pool triggers a
# mismatched-replication-level warning and requires -f
zpool create sanpool \
    mirror san-disk-01 san-disk-02 \
    mirror san-disk-03 san-disk-04 \
    raidz2 san-disk-05 san-disk-06 san-disk-07 san-disk-08 \
    ...
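
Once created, the pool's vdev layout and per-vdev capacity can be verified before any data lands on it:

# Verify the vdev tree and per-vdev capacity
zpool status sanpool
zpool list -v sanpool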

Benchmark different configurations using:

# Basic ZFS performance tests (point fio at a directory on the pool under test;
# /sanpool is the default mountpoint for the pool created above)
zpool iostat -v 1
fio --name=zfs-test --directory=/sanpool --ioengine=libaio --rw=randwrite --bs=4k \
    --numjobs=16 --size=10G --runtime=60 --time_based --end_fsync=1
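
While the fio job runs, per-vdev latency can be watched alongside throughput; the latency flags below assume a reasonably recent OpenZFS (0.7 or later):

# Per-vdev latency, refreshed every 5 seconds
zpool iostat -vl sanpool 5
# Latency histograms for deeper analysis
zpool iostat -w sanpool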

For large-scale deployments, implement automation scripts:

#!/bin/bash
# Automated disk replacement sketch; find_available_san_lun is a site-specific helper
FAILED_DISK=$(zpool status tank | awk 'NF >= 5 && $2 == "FAULTED" {print $1; exit}')
if [ -n "$FAILED_DISK" ]; then
    NEW_DISK=$(find_available_san_lun)
    zpool replace tank "$FAILED_DISK" "$NEW_DISK"
    logger "Replaced $FAILED_DISK with $NEW_DISK"
fi
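
A check like this is typically run on a timer; a minimal cron entry, assuming the script is saved to a hypothetical path such as /usr/local/sbin/zfs-auto-replace.sh:

# /etc/cron.d/zfs-auto-replace
*/15 * * * * root /usr/local/sbin/zfs-auto-replace.sh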

Implement comprehensive monitoring:

# Prometheus exporter for ZFS
# https://github.com/pdf/zfs_exporter

# Grafana dashboard example
{
  "panels": [{
    "title": "ZFS Pool Health",
    "type": "stat",
    "targets": [{
      "expr": "zfs_health_status{pool=\"sanpool\"}"
    }]
  }]
}
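
Before wiring the exporter into Prometheus and Grafana, confirm it is actually serving pool metrics; a quick spot check, assuming the exporter listens on 9134 (the port conventionally used for ZFS exporters; adjust to your configuration):

# Confirm the exporter is exposing pool metrics
curl -s http://localhost:9134/metrics | grep -i zfs | head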

Returning to the core architectural decision: should ZFS manage the disks directly, or should it sit on top of the SAN's hardware RAID controllers? Traditional ZFS wisdom favors direct disk access for data integrity, but this approach presents unique challenges at scale.

In our 400-spindle SAN environment, we tested both approaches:

# ZFS pool creation with direct disk access
zpool create tank \
  mirror san-disk1 san-disk2 \
  mirror san-disk3 san-disk4 \
  ...

# Alternative using SAN-presented RAID volumes
zpool create tank raidz2 san-volume1 san-volume2 san-volume3
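
Capacity efficiency and pool health for the two layouts can be compared directly from ZFS:

# Compare usable capacity, fragmentation, and health between layouts
zpool list -o name,size,alloc,free,cap,frag,health tank
zpool status -v tank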

The direct-access method showed 18-22% better random IOPS but required significantly more management overhead.

Key considerations for large deployments:

  • LUN provisioning becomes complex with hundreds of raw disks
  • ZFS device replacement requires SAN-level coordination (see the sketch after this list)
  • Monitoring thousands of disks impacts management systems
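
The replacement workflow therefore has to touch both layers. A rough sketch of the order of operations, where san-disk-07 and san-disk-07-new are placeholder multipath aliases and rescan-scsi-bus.sh comes from sg3_utils:

# 1. Take the failing LUN out of service in ZFS
zpool offline tank san-disk-07
# 2. Unmap the old LUN and present its replacement on the SAN, then rescan the host
rescan-scsi-bus.sh
# 3. Hand the new LUN back to ZFS and let it resilver
zpool replace tank san-disk-07 san-disk-07-new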

For our production environment, we developed this hybrid approach:

# Using medium-grained RAID groups (6-12 disks) as a compromise
# This bash script automates pool creation across SAN shelves

#!/bin/bash
# The first shelf creates the pool; each remaining shelf adds another raidz2 vdev
for shelf in {1..20}; do
  disks=$(ls -1 /dev/disk/by-path/shelf${shelf}-*)
  if [ "$shelf" -eq 1 ]; then
    zpool create tank raidz2 $disks
  else
    zpool add tank raidz2 $disks
  fi
done
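
After the loop finishes, a quick count of raidz2 vdevs confirms that every shelf made it into the pool:

# Expect one raidz2 vdev per shelf (20 in total)
zpool status tank | grep -c 'raidz2-'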

We integrated these tools into our ZFS-on-SAN deployment:

  1. Custom Python scripts for disk health tracking (a minimal shell equivalent is sketched after this list)
  2. Ansible playbooks for configuration management
  3. Prometheus exporters for performance metrics
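
The health-tracking piece does not need to be elaborate; a minimal shell stand-in for the Python scripts in item 1, flagging any pool member that is not ONLINE:

#!/bin/bash
# Report any pool member in a non-ONLINE state across all imported pools
zpool list -H -o name | while read -r pool; do
    zpool status "$pool" | awk -v p="$pool" \
        'NF >= 5 && $2 ~ /DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED/ {print p ": " $1 " is " $2}'
done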

After managing 3PB+ of ZFS storage on SAN, we recommend:

  • Limit direct disk pools to <200 disks per server
  • Implement strict naming conventions for SAN disks
  • Develop comprehensive documentation of disk-SAN mappings
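
That last recommendation is easy to automate; a minimal sketch that snapshots the vdev layout next to the multipath alias/WWID table (the pool name tank and the output path are placeholders):

#!/bin/bash
# Capture the ZFS vdev paths and the SAN multipath table in one dated file
{
    date
    zpool status -P tank
    multipath -ll
} > /root/tank-disk-map-$(date +%F).txt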