Troubleshooting Extremely Slow mkfs Performance on Linux RAID5 Arrays (4x2TB Disks, 64k Stripe)


Creating a filesystem on a large RAID5 array shouldn't normally take 30+ minutes. Here's what we've observed with a 4-disk (2TB each) array using a 64k chunk size:

# Initial array creation command
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[a-d] --chunk=64

Key symptoms that point to an underlying issue:

  • Inode table writes show irregular patterns (fast then slow)
  • Process termination hangs for ~30 seconds
  • Individual disk performance is excellent (95-110MB/s via bonnie++)

First, let's check the current RAID parameters:

cat /proc/mdstat
mdadm --detail /dev/md0

For our specific case (a 2.6.35 kernel), several factors could contribute, each easy to check as shown below:

  1. Stripe cache size: Default may be too small for initial mkfs operations
  2. Memory pressure: Check with free -m during mkfs
  3. Write intent bitmap: Missing or misconfigured
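
Each of these is quick to inspect from the shell. A minimal sketch, assuming the array is /dev/md0:

# Current stripe cache size (in pages)
cat /sys/block/md0/md/stripe_cache_size

# Memory headroom while mkfs runs
watch -n 2 free -m

# Write-intent bitmap details (no output means no bitmap is configured)
mdadm --detail /dev/md0 | grep -i bitmap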

Try these adjustments before the next mkfs attempt:

# Increase stripe cache (units are in pages, typically 4KB each)
echo 8192 > /sys/block/md0/md/stripe_cache_size
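# Note: the cache is allocated per member device, so RAM use is roughly
# stripe_cache_size x 4KB x number of disks (8192 x 4KB x 4 = ~128MB here)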

# Adjust readahead (--setra counts 512-byte sectors, so 1024 = 512KB)
blockdev --setra 1024 /dev/md0
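
Neither setting survives a reboot or an array stop/start; if they help, reapply them at boot (for example from /etc/rc.local on an init-script-era system like this one):

# Example /etc/rc.local additions (adjust values and device names to your setup)
echo 8192 > /sys/block/md0/md/stripe_cache_size
blockdev --setra 1024 /dev/md0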

For XFS specifically, align the filesystem with the array geometry: su is the chunk size and sw is the number of data disks (a 4-disk RAID5 has 3 data disks):

mkfs.xfs -f -d su=64k,sw=3 /dev/md0
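
Before committing, mkfs.xfs can print the geometry it would use without writing anything:

# -N does a dry run: print the calculated parameters (including sunit/swidth) and exit
mkfs.xfs -N -f -d su=64k,sw=3 /dev/md0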

For ext4 (a modern alternative to ext3), the matching values with 4KB filesystem blocks are stride = 64k / 4k = 16 and stripe-width = 16 x 3 data disks = 48:

mkfs.ext4 -E stride=16,stripe-width=48 /dev/md0
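
After creation, you can confirm the geometry was recorded in the superblock (tune2fs reports it as "RAID stride" and "RAID stripe width"):

tune2fs -l /dev/md0 | grep -iE 'stride|stripe'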

When standard approaches fail, deeper investigation is needed:

# Monitor kernel messages during operation
dmesg -w &

# Check IO wait statistics
iostat -x 1

Particularly watch for the following; a quick way to monitor all three is sketched after the list:

  • High await values in iostat
  • Kernel messages about MD layer timeouts
  • High system CPU usage during the operation
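
A rough way to keep all three in view while mkfs runs (a sketch, assuming the sysstat tools are installed and the member disks are sda-sdd):

# Per-device await and %util, refreshed every 2 seconds
iostat -x /dev/md0 /dev/sd[a-d] 2

# Recent MD/RAID messages only
dmesg | grep -iE 'md0|raid5' | tail

# Per-interval CPU breakdown; watch the %sys and %iowait columns
mpstat 2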

To recap the problem in more detail: when initializing filesystems on our newly created 4-disk RAID5 array with a 64k chunk size, we encounter unusually slow mkfs operations:

  • XFS creation takes ~30 minutes (versus expected 2-3 minutes)
  • ext3 shows erratic inode table writes - fast bursts followed by 2-second pauses
  • Process termination (Ctrl+C) exhibits 30-second latency

Before blaming the RAID stack, let's confirm disk performance with Bonnie++. Note that bonnie++ benchmarks a directory on a mounted filesystem rather than a raw device, so put a scratch filesystem on each disk and mount it first (e.g. at /mnt/test):

# Individual disk test (scratch filesystem mounted at /mnt/test)
bonnie++ -d /mnt/test -s 8G -n 0 -m HOSTNAME

# Parallel test (all disks, each mounted at /mnt/test_<letter>)
for i in {a..d}; do
  bonnie++ -d /mnt/test_$i -s 4G -n 0 -m HOSTNAME_$i &
done
wait

Results showed consistent 95MB/s write and 110MB/s read speeds across all disks, even under parallel load.
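
If you would rather not create scratch filesystems for bonnie++, a raw sequential read timing per disk is a quick cross-check (a sketch; these numbers are not part of the results above):

# Buffered sequential read timing, one disk at a time
for i in a b c d; do hdparm -t /dev/sd$i; done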

The array was created with standard parameters:

mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd{a,b,c,d} --chunk=64

Current /proc/mdstat shows:

Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
      5860531200 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

We experimented with these MD and block-layer tunables without improvement:

echo 4096 > /sys/block/md0/md/stripe_cache_size
blockdev --setra 65536 /dev/md0

Let's gather concrete performance metrics during mkfs operations:

# Monitor disk I/O during operation
iostat -xmd 2 /dev/sd{a,b,c,d} /dev/md0

# Check MD layer events
dmesg -TwH

# Trace system calls
strace -o mkfs.trace -Tttt mkfs.ext3 /dev/md0

The strace output reveals frequent fdatasync() calls taking 1-2 seconds each, correlating with the observed pauses. This suggests metadata writeback synchronization overhead.
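
Since -T appends the time spent in each call in angle brackets, the slow calls can be pulled straight out of the trace; for example, to list every fdatasync() that took longer than one second:

# The field after the last '<' is seconds spent in the syscall
awk -F'<' '/fdatasync/ && $NF+0 > 1' mkfs.trace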

Workaround: skip or defer the expensive initialization work during filesystem creation:

# For XFS (-K skips the block discard pass at mkfs time)
mkfs.xfs -f -K /dev/md0

# For ext4 (lazy init defers inode table and journal zeroing to first mount)
mkfs.ext4 -E lazy_itable_init=1,lazy_journal_init=1 /dev/md0

Add these kernel parameters to /etc/sysctl.conf:

vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000

Then reload with sysctl -p. This reduces aggressive writeback behavior during large metadata operations.
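
For a one-off mkfs run, the same values can be applied immediately without editing the file (they revert at reboot):

sysctl -w vm.dirty_ratio=20 vm.dirty_background_ratio=10 vm.dirty_expire_centisecs=3000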

For production systems where you would rather not defer any initialization, pre-writing the start of the array and then running a normal mkfs is another option:

dd if=/dev/zero of=/dev/md0 bs=1M count=100000 status=progress
mkfs.xfs /dev/md0  # Now runs in normal time