Diagnosing and Resolving Sudden RAID Performance Degradation on MegaRAID SAS 9280 with Bonnie++ and MegaCLI


When our production database queries started taking 3x longer than usual, we immediately suspected the storage subsystem. Running bonnie++ revealed shocking read speeds of 22-82 MB/s on what should be a high-performance RAID 10 array of 15k SAS drives. The erratic behavior became even clearer when dd tests fluctuated wildly between 15.8 MB/s and 225 MB/s.

Before digging deeper, we confirmed several baseline conditions:

# Check filesystem integrity
xfs_repair -n /dev/mapper/raid_device

# Verify RAID status
megacli -LDInfo -Lall -aAll

The MegaRAID 9280's cache behavior became our primary suspect. We examined both battery status and cache policies:

# Check BBU status
megacli -AdpBbuCmd -GetBbuStatus -aAll

# Verify current cache policy
megacli -LDGetProp -Cache -LAll -aAll
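
A quick way to spot a fallback across every logical drive is to count policy lines reporting write-through; the exact output wording varies between MegaCLI versions, so treat the match string below as an assumption to verify against your own output:

# Count logical drives whose cache policy currently reports WriteThrough
megacli -LDGetProp -Cache -LAll -aAll | grep -ci "writethrough"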

We discovered the controller had unexpectedly fallen back to write-through mode. That explained the performance drop, but not the root cause. The controller's event log revealed some interesting patterns:

# Fetch controller event log
megacli -AdpEventLog -GetEvents -f events.log -aAll

# Filter for recent warnings
grep -i "warning\|error" events.log
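
When the full log is too noisy, MegaCLI can also dump just the most recent entries; a sketch, with the count chosen arbitrarily:

# Pull only the newest 50 controller events
megacli -AdpEventLog -GetLatest 50 -f recent_events.log -aAll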

We isolated individual drives to test baseline performance:

# Test single drive throughput
hdparm -tT /dev/sdX

# Check for command timeouts
smartctl -l error /dev/sdX
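
One caveat: drives sitting behind the 9280 are usually not reachable for SMART queries as plain /dev/sdX; smartctl needs the megaraid passthrough, with the device ID taken from megacli -PDList (the ID 4 below is a placeholder):

# Read the SMART error log of the drive with device ID 4 behind the controller
smartctl -d megaraid,4 -l error /dev/sdX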

Using blktrace revealed surprising I/O patterns:

# Capture block layer traces
blktrace -d /dev/mapper/raid_device -o tracefile

# Merge the per-CPU traces and analyze request timing and seek patterns
blkparse -i tracefile -d tracefile.bin > /dev/null
btt -i tracefile.bin
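
To keep traces manageable on a busy array, the capture can also be limited to a fixed window; a sketch with an arbitrary 60-second run:

# Trace the device for 60 seconds, then stop automatically
blktrace -d /dev/mapper/raid_device -o tracefile -w 60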

After weeks of investigation, we discovered that a firmware update had quietly reset the controller's cache configuration. The fix involved restoring the caching policies and confirming the stripe size still matched the array's original layout:

# Restore controller caching on all logical drives
megacli -LDSetProp -Cached -LAll -aAll
megacli -LDSetProp -WB -LAll -aAll

# Verify the stripe size (it can only be set when a logical drive is created)
megacli -LDInfo -LAll -aAll | grep -i strip
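
As a follow-up, the bad-BBU behavior can be made explicit, so the controller only drops to write-through when the battery genuinely cannot protect the cache; a sketch, worth checking against your MegaCLI version:

# Fall back to write-through automatically while the BBU is bad
megacli -LDSetProp -NoCachedBadBBU -LAll -aAll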

We implemented these monitoring checks:

#!/bin/bash
# Weekly RAID cache health check, intended to run from cron.
# alert_team is a site-specific notification helper (not shown here).
BBU_STATUS=$(megacli -AdpBbuCmd -GetBbuStatus -aAll | grep "Charger Status")
CACHE_MODE=$(megacli -LDGetProp -Cache -LAll -aAll | grep -i "cache policy")

# Alert if the BBU charger reports anything other than charging or fully charged
echo "$BBU_STATUS" | grep -qE "Charging|Complete" || alert_team "BBU: $BBU_STATUS"

# Alert if any logical drive has fallen back to write-through
echo "$CACHE_MODE" | grep -qi "writethrough" && alert_team "Cache: $CACHE_MODE"
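
Hooked into cron, with the script path below being an assumption for illustration:

# Run the weekly check every Monday at 06:00
0 6 * * 1 /usr/local/sbin/raid_weekly_check.sh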

For reference, this is the bonnie++ invocation we standardized on for the benchmark runs (16 GB working set, file-creation tests skipped, per-character I/O skipped, and write buffering disabled):

# Sample bonnie++ command used for testing
bonnie++ -d /mnt/raid_array -s 16G -n 0 -m TEST -f -b -u root

Alongside the checks above, we monitored live I/O activity and ran a raw sequential read against the mapped device:

# Monitoring real-time I/O usage
iotop -oP

# Raw device performance test
dd if=/dev/mapper/raid_device of=/dev/null bs=1M count=1024 status=progress
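
Because reads served from the page cache can make dd numbers swing between runs, a direct-I/O variant gives a steadier baseline (a sketch using GNU dd's iflag=direct):

# Sequential read test that bypasses the page cache
dd if=/dev/mapper/raid_device of=/dev/null bs=1M count=1024 iflag=direct status=progress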

The MegaRAID SAS 9280's summary status output had shown no obvious issues, but several parameters deserved closer examination:

# Checking cache policy and BBU status
megacli -LDInfo -Lall -aAll | grep -E 'Current|Policy'

# Monitoring controller events
megacli -AdpEventLog -GetEvents -f events.log -aAll

# Checking physical disk health
megacli -PDList -aAll | grep -iE 'state|error|smart'
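
For a closer look at a single suspect drive, the enclosure and slot IDs reported by -PDList can be fed to -PDInfo (E and S below are placeholders):

# Detailed state, error counters and SMART flag for one physical drive
megacli -PDInfo -PhysDrv '[E:S]' -aAll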

Several less-obvious factors emerged during investigation:

  • Stripe size mismatch: Default 256KB stripe size might not align with the database's dominant I/O size
  • Write-back cache behavior: Flush intervals and cache policies affecting performance consistency
  • Drive timeout thresholds: Aggressive error recovery stalling responses while a single drive retries (see the check sketched after this list)
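
For the timeout question in particular, the SCSI command timeout the kernel applies to the array device is easy to read back (sdX is a placeholder; 30 seconds is the usual Linux default):

# Kernel SCSI command timeout for the device, in seconds
cat /sys/block/sdX/device/timeout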

We implemented several low-level tests to isolate the issue:

# Measuring raw device latency
hdparm -tT /dev/sdX

# Checking block layer queue stats
cat /sys/block/sdX/queue/nr_requests

# Monitoring interrupt distribution
cat /proc/interrupts | grep -i megasas
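
If the counts show every megasas interrupt landing on a single CPU, the IRQ's affinity mask is worth a look; a sketch, where the IRQ number 24 is a placeholder read from /proc/interrupts:

# CPU affinity mask currently assigned to the controller's IRQ
cat /proc/irq/24/smp_affinity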

After comprehensive testing, these adjustments yielded significant improvements:

# Adjusting the I/O scheduler and request queue depth
echo deadline > /sys/block/sdX/queue/scheduler
echo 256 > /sys/block/sdX/queue/nr_requests

# Controller cache tuning
megacli -LDSetProp -Cached -LAll -aAll
megacli -LDSetProp -WB -LAll -aAll

# Filesystem mount options
mount -o remount,noatime,nodiratime,logbsize=256k /mnt/raid_array
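
To make the mount options stick from a clean mount at boot rather than relying on the manual remount, the corresponding /etc/fstab entry is one option (the device path and mount point mirror the examples above; adjust them to your layout):

# /etc/fstab entry for the array, as a single line
/dev/mapper/raid_device  /mnt/raid_array  xfs  noatime,nodiratime,logbsize=256k  0  0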

We implemented these ongoing checks:

#!/bin/bash
# Custom monitoring script: snapshot RAID state, cache usage and device latency into syslog
RAID_STATUS=$(megacli -LDInfo -LAll -aAll | grep -i state)
CACHE_USAGE=$(megacli -AdpCacheInfo -aAll | grep -i dirty)
# Extended device statistics for an array member; sdX is a placeholder for the relevant device
DISK_LATENCY=$(iostat -dx 1 3 | grep -w "sdX" | tail -1)

logger -t RAID_MONITOR "Status: $RAID_STATUS | Cache: $CACHE_USAGE | Latency: $DISK_LATENCY"