Scalable Storage Architecture for High-Speed Write Performance (1.1GB/s Sustained Throughput with ZFS/NFS)

When dealing with sustained write throughput of 1.1GB/s (50×75GB/hour peak), we need to consider both the storage medium and filesystem architecture. The NFS requirement adds another layer of complexity to the solution.
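
A quick sanity check on the numbers, reading the peak as roughly 50 files of 75GB each per hour (my interpretation of the figure above):

# 50 × 75GB  = 3750GB/hour ≈ 1.04GB/s sustained write
# Dual 10GbE = 2 × 1.25GB/s = 2.5GB/s raw, so the links have headroom
#              provided the traffic actually spreads across both of them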

A hybrid approach combining SSDs and HDDs is practical for this scale:

# Example ZFS pool configuration for tiered storage:
# two RAIDZ2 data vdevs plus a mirrored NVMe SLOG and NVMe L2ARC cache devices
zpool create datapool \
  raidz2 hdd0 hdd1 hdd2 hdd3 \
  raidz2 hdd4 hdd5 hdd6 hdd7 \
  log mirror nvme0n1 nvme1n1 \
  cache nvme2n1 nvme3n1
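
On top of the pool, the dataset that receives the writes can be tuned for large sequential files; the dataset name below is illustrative:

# Dataset-level tuning for large sequential writes (dataset name is illustrative)
zfs create datapool/ingest
zfs set recordsize=1M datapool/ingest
zfs set compression=lz4 datapool/ingest
zfs set atime=off datapool/ingest

The 1M recordsize matches the large-file write pattern, and lz4 is cheap enough that it rarely hurts write throughput.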

Key ZFS parameters to adjust:

# Recommended settings for high-write workloads
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max   # cap the ARC at 32GiB
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
echo "options zfs zfs_vdev_async_write_max_active=32" >> /etc/modprobe.d/zfs.conf

For the dual 10GbE links, consider LACP bonding:

# /etc/network/interfaces example
auto bond0
iface bond0 inet manual
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate 1
    bond-slaves enp1s0f0 enp1s0f1

auto bond0.100
iface bond0.100 inet static
    address 192.168.100.10
    netmask 255.255.255.0
    mtu 9000
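
Once the bond is up, LACP negotiation and the state of both slaves can be verified with:

# Check LACP aggregator and slave status
cat /proc/net/bonding/bond0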

Essential NFS server optimizations:

# /etc/exports configuration
/datapool 192.168.100.0/24(rw,async,no_wdelay,no_root_squash,no_subtree_check)

# Kernel parameters
echo 4096 > /proc/sys/net/core/netdev_max_backlog
echo 32768 > /proc/sys/net/core/somaxconn
echo "sunrpc.tcp_max_slot_table_entries=128" >> /etc/modprobe.d/sunrpc.conf

Implement proactive monitoring with these metrics:

# Basic monitoring commands
zpool iostat -v datapool 1
cat /proc/spl/kstat/zfs/datapool/io
nfsstat -o net -s
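
Beyond ad-hoc commands, a minimal cron-able watchdog can flag the two conditions that most often precede write stalls: a filling pool and a degraded vdev. The threshold and alert address are placeholders:

#!/bin/sh
# Hypothetical watchdog: warn when datapool is over 80% full or not ONLINE
POOL=datapool
CAP=$(zpool list -H -o capacity "$POOL" | tr -d '%')
HEALTH=$(zpool list -H -o health "$POOL")
if [ "$CAP" -ge 80 ] || [ "$HEALTH" != "ONLINE" ]; then
    echo "ZFS pool $POOL: capacity ${CAP}%, health $HEALTH" \
        | mail -s "ZFS alert on $(hostname)" admin@example.com
fi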

Sample server configuration:

  • 2× Intel Xeon Silver 4310 (24 cores total)
  • 256GB DDR4 ECC RAM
  • 4× 1TB NVMe (ZIL/SLOG and L2ARC)
  • 12× 16TB HDD (RAIDZ2 vdevs)
  • Dual-port 10GbE NIC
  • HBA controller (not RAID)

When dealing with peak write throughput of 1.1GB/s (50×75GB/hour), we need to consider both the storage media and the filesystem architecture. To recap the key parameters:

  • Peak throughput: 1100MB/s (~1.1GB/s)
  • Connection: Dual 10GbE (20Gbps theoretical)
  • Protocol: NFSv4
  • Data characteristics: Large sequential files
  • Retention: 50-75TB total

ZFS can handle these speeds with proper configuration. Critical ZFS parameters for high-throughput writes:

# Example zpool creation for high-speed writes
# (pool -o/-O options must come before the pool name and vdevs):
zpool create -o ashift=12 \
    -O recordsize=1M \
    -O compression=lz4 \
    -O atime=off \
    -O xattr=sa \
    -O logbias=throughput \
    tank \
    mirror nvme0n1 nvme1n1 \
    mirror nvme2n1 nvme3n1

Key considerations:

  • Use an NVMe SLOG device for the ZIL (ZFS Intent Log) to absorb sync writes (see the sketch after this list)
  • Large record sizes (1M) match the big-file write pattern
  • Disabling atime and enabling lz4 compression reduces IOPS load
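
If the export is kept synchronous (or clients mount with sync), a dedicated SLOG can also be attached after pool creation; the device names below are illustrative:

# Attach a mirrored NVMe SLOG to the existing pool
zpool add tank log mirror nvme4n1 nvme5n1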

A tiered approach balances cost and performance:

# Example Linux device-mapper (dm-cache) setup for tiering.
# Note: the cache pool and the cached LV must live in the same VG,
# so both tiers share one VG and LVs are pinned to specific PVs.
pvcreate /dev/nvme0n1 /dev/sd[abcdef]
vgcreate tiered /dev/nvme0n1 /dev/sd[abcdef]

# Cache data and metadata LVs on the fast (NVMe) PV
lvcreate -L 500G -n cache_pool tiered /dev/nvme0n1
lvcreate -L 50G  -n meta_pool  tiered /dev/nvme0n1

# Bulk data LV on the capacity (HDD) PVs
lvcreate -l 100%PVS -n data_volume tiered /dev/sd[abcdef]

# Combine cache data + metadata into a cache pool, then attach it to the data LV
lvconvert --type cache-pool --poolmetadata tiered/meta_pool tiered/cache_pool
lvconvert --type cache --cachepool tiered/cache_pool \
          --cachemode writethrough tiered/data_volume
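
Once the cache is attached, hit rates and dirty blocks can be checked with standard LVM/device-mapper tooling (VG/LV names as used above):

# Show the cached LV together with its hidden cache-pool components
lvs -a -o +devices tiered
# Raw dm-cache statistics: used blocks, read/write hits and misses, dirty blocks
dmsetup status tiered-data_volume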

Essential NFS server configurations:

# /etc/exports configuration:
/storage 10.0.0.0/24(rw,async,no_wdelay,no_root_squash,no_subtree_check,insecure_locks,sec=sys,fsid=0)

# Recommended sysctl tweaks:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
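
These values can live in a drop-in file under /etc/sysctl.d/ and be applied without a reboot (the file name is illustrative):

# Apply the buffer settings above
sysctl -p /etc/sysctl.d/90-nfs-tuning.conf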

Essential performance tools and their usage:

# FIO benchmark example for validation:
[global]
ioengine=libaio
direct=1
runtime=300
time_based

[write-test]
rw=write
bs=1M
size=100G
numjobs=4
iodepth=32
directory=/storage/test
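# (save the two sections above as write-test.fio and run: fio write-test.fio)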

# Monitoring commands:
zpool iostat -v tank 1
nfsiostat 1
iftop -nN -i eth0

If ZFS proves problematic at scale, consider:

  • Lustre filesystem with NVMe OSTs
  • Ceph with BlueStore and NVMe WAL/DB devices
  • Pure SSD array with hardware RAID (consider Dell ME4 series)

For budget-conscious implementations, used enterprise NVMe drives (like Intel P4610/P4510) in ZFS mirrors provide excellent price/performance.