How to Diagnose and Monitor Hard Disk Failures on CentOS Servers Using SMART Tools and fsck



Monitoring hard disk health on CentOS servers starts with knowing which disks are installed. In the output below, the ROTA column shows 1 for rotational drives (HDDs) and 0 for SSDs:


# Check basic disk information
lsblk -o NAME,MODEL,SIZE,ROTA

The most comprehensive approach involves SMART (Self-Monitoring, Analysis and Reporting Technology) tools:


# Install smartmontools
sudo yum install smartmontools -y

# Show device identity and whether SMART is available and enabled
sudo smartctl -i /dev/sda

# Run short test
sudo smartctl -t short /dev/sda

# Check test results
sudo smartctl -l selftest /dev/sda

# View all SMART attributes
sudo smartctl -A /dev/sda

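Pay special attention to these attributes, named as they appear in smartctl -A output. For the first three, a non-zero or rising RAW_VALUE indicates a problem. The last two change during normal operation and mainly provide context:

  • Reallocated_Sector_Ct
  • Current_Pending_Sector
  • Offline_Uncorrectable (some drives also report Reported_Uncorrect)
  • Temperature_Celsius
  • Power_On_Hours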
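
Checking these attributes by eye gets tedious across many disks. The following is a minimal sketch of a filter that reads smartctl -A output and prints only the error counters with a non-zero raw value. The function name flag_nonzero_attrs is hypothetical, and the sketch assumes the usual ATA attribute table layout, where the name is the second column and the raw value is the tenth.

```shell
#!/usr/bin/env bash
# Sketch only: flag_nonzero_attrs is a hypothetical helper, not part of smartmontools.
# Reads `smartctl -A` output on stdin and prints NAME=RAW for each watched
# error counter whose raw value is greater than zero.
flag_nonzero_attrs() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect)$/ {
            raw = $10 + 0              # column 10 is RAW_VALUE in the ATA table
            if (raw > 0) printf "%s=%d\n", $2, raw
        }
    '
}
```

A typical call would be sudo smartctl -A /dev/sda | flag_nonzero_attrs, where empty output means none of the watched counters has moved off zero. NVMe and SAS drives print a different report, so this filter would not apply to them as written.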

Regular filesystem checks can reveal developing problems:


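# Unmount the filesystem first (never repair a mounted filesystem)
sudo umount /dev/sda1

# Force filesystem check on next boot (deprecated on systemd-based releases, which prefer the fsck.mode=force kernel parameter)
sudo touch /forcefsck

# Or run manually (ext4 example; -y accepts every proposed repair, and XFS volumes need xfs_repair instead)
sudo fsck -y /dev/sda1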
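
Because the right checker depends on the filesystem type, it can help to pick the command programmatically and start with a read-only pass. The sketch below assumes only ext2/3/4 and XFS are in use. The function name readonly_check_cmd is hypothetical, and it only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch only: readonly_check_cmd is a hypothetical helper.
# Prints a read-only check command for the given filesystem type and device.
readonly_check_cmd() {
    local fstype=$1 dev=$2
    case "$fstype" in
        ext2|ext3|ext4) echo "e2fsck -fn $dev" ;;      # -f force a check, -n make no changes
        xfs)            echo "xfs_repair -n $dev" ;;   # -n no-modify mode
        *)              echo "unsupported filesystem type: $fstype" >&2
                        return 1 ;;
    esac
}
```

The type can be looked up with lsblk -no FSTYPE /dev/sda1 and passed in as the first argument. The device should still be unmounted before the printed command is run.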

Set up regular checks with this cron job:


# Add to root's crontab (sudo crontab -e); runs daily at 03:00
0 3 * * * /usr/sbin/smartctl -H /dev/sda | grep -q "PASSED" || echo "SMART health check did not pass on /dev/sda" | mail -s "SMART Alert" admin@example.com
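
A one-line cron entry only matches the ATA wording of the health verdict. SCSI/SAS drives report "SMART Health Status: OK" instead, so a small script may be more robust. This is a sketch under assumptions: smart_health_ok, alert_if_unhealthy and the ALERT_TO variable are hypothetical names, and mail is assumed to be installed and configured.

```shell
#!/usr/bin/env bash
# Sketch only: function and variable names below are hypothetical.

# Reads `smartctl -H` output on stdin; succeeds only if a healthy verdict is present.
smart_health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Checks one device and mails an alert if it does not report healthy.
alert_if_unhealthy() {
    local dev=$1
    if ! /usr/sbin/smartctl -H "$dev" | smart_health_ok; then
        echo "SMART health check did not pass on $dev" |
            mail -s "SMART Alert: $dev" "${ALERT_TO:-admin@example.com}"
    fi
}
```

Saved as a script that ends with a loop such as for dev in /dev/sda /dev/sdb; do alert_if_unhealthy "$dev"; done, it can be called from root's crontab in place of the one-liner.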

For deeper analysis, consider these techniques:


# Check disk performance
sudo hdparm -tT /dev/sda

# Monitor disk errors in kernel log (case-insensitive match on both terms)
sudo dmesg | grep -i sda | grep -i error

# Check for bad blocks (read-only scan, slow on large disks; only the -w write mode is destructive)
sudo badblocks -v /dev/sda

Watch for these symptoms of impending failure:

  • Increasing read/write errors in syslog
  • Unusual disk noise (clicking sounds)
  • Frequent system hangs during disk operations
  • Sudden increases in Reallocated_Sector_Ct
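
Spotting a sudden increase requires remembering the previous reading. One possible approach, sketched here, stores the last raw value in a state file and reports when the counter has grown. The function name counter_increased and the state file location are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: counter_increased is a hypothetical helper.
# Usage: counter_increased STATE_FILE CURRENT_VALUE
# Succeeds (and prints a message) if CURRENT_VALUE is greater than the value
# saved by the previous run. Always saves CURRENT_VALUE for next time.
counter_increased() {
    local state_file=$1 current=$2 previous=0
    if [ -r "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    printf '%s\n' "$current" > "$state_file"
    if [ "$current" -gt "$previous" ]; then
        echo "counter increased from $previous to $current"
        return 0
    fi
    return 1
}
```

The current reading could come from sudo smartctl -A /dev/sda | awk '$2 == "Reallocated_Sector_Ct" {print $10}', with the state kept somewhere persistent such as a file under /var/lib. On the first run the previous value defaults to 0, so a disk that already has reallocated sectors is reported once.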


Monitoring disk health is crucial for server administrators. CentOS provides several powerful tools to check for bad sectors, SMART status, and performance degradation.

Here are the most effective utilities for HDD health checks:


# Install essential tools (badblocks ships in the e2fsprogs package)
sudo yum install smartmontools e2fsprogs -y

SMART (Self-Monitoring, Analysis and Reporting Technology) provides the most comprehensive disk health data:


# Check SMART overall health
sudo smartctl -H /dev/sda

# Get detailed SMART attributes
sudo smartctl -A /dev/sda

# Run short test
sudo smartctl -t short /dev/sda

# Run extended test (takes hours)
sudo smartctl -t long /dev/sda

The badblocks utility helps identify physical disk defects:


# Read-only scan (safe)
sudo badblocks -v /dev/sda

# Non-destructive read-write test (the device must be unmounted)
sudo badblocks -nsv /dev/sda

Performance metrics can indicate early failure signs:


# Install iotop for disk I/O monitoring
sudo yum install iotop -y

# Check disk I/O in real-time
sudo iotop -o

Create a cron job for regular monitoring:


# Add to crontab (runs weekly)
0 0 * * 0 /usr/sbin/smartctl -H /dev/sda | mail -s "Disk Health Report" admin@example.com

Key attributes to watch:

  • Reallocated_Sector_Ct: Bad sectors already remapped to spare sectors
  • Current_Pending_Sector: Sectors waiting to be remapped
  • UDMA_CRC_Error_Count: Cable/connection issues
  • Temperature_Celsius: Overheating risks

For enterprise environments, consider these additional measures:


# Check RAID array health
sudo mdadm --detail /dev/md0

# Monitor disk errors in syslog
sudo grep -i error /var/log/messages | grep sda
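
For software RAID, a degraded array can also be detected from /proc/mdstat, where a missing member shows up as an underscore in the status brackets (for example [U_]). The following sketch lists degraded arrays. The function name degraded_arrays is hypothetical, and the parsing assumes the usual mdstat layout, with the status brackets at the end of the line following each array name.

```shell
#!/usr/bin/env bash
# Sketch only: degraded_arrays is a hypothetical helper.
# Reads /proc/mdstat-style text on stdin and prints the name of every
# md array whose member status contains "_" (a missing or failed member).
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1; next }          # e.g. "md0 : active raid1 ..."
        name != "" && /\[[U_]+\]$/ {              # e.g. "... [2/1] [U_]"
            if ($NF ~ /_/) print name
            name = ""
        }
    '
}
```

Running degraded_arrays < /proc/mdstat prints nothing when all arrays are complete, which makes it easy to wrap in the same kind of cron alert used for SMART checks.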