In distributed storage systems like GlusterFS 3.2, failures can occur silently without triggering immediate alerts. Unlike a traditional filesystem, GlusterFS can absorb a single brick failure without immediately disrupting operations, which turns that failure into a hidden point of failure.
Here are critical metrics to monitor:
- Volume status (all peers online)
- Brick connectivity
- Replication consistency
- Self-heal status
- Disk space thresholds
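All of these can be spot-checked from the CLI before any automation is in place; VOLNAME and the brick path below are placeholders:
# Every peer should report "Peer in Cluster (Connected)"
gluster peer status
# Brick connectivity: check the Online column
gluster volume status VOLNAME
# Self-heal backlog on replicated volumes
gluster volume heal VOLNAME info
# Disk usage on a brick's local path
df -h /export/brick1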
This Bash script checks basic GlusterFS health and sends email alerts:
#!/bin/bash
VOLUME_NAME="your_volume"
ADMIN_EMAIL="admin@example.com"
THRESHOLD=90  # Disk usage percentage

# Check volume status
if ! gluster volume status "$VOLUME_NAME" > /dev/null 2>&1; then
    echo "CRITICAL: Volume $VOLUME_NAME not available" | mail -s "GlusterFS Alert" "$ADMIN_EMAIL"
    exit 1
fi

# Check brick status: offline bricks report N/A for port and PID
OFFLINE_BRICKS=$(gluster volume status "$VOLUME_NAME" | grep -c "^Brick.*N/A")
if [ "$OFFLINE_BRICKS" -gt 0 ]; then
    echo "WARNING: $OFFLINE_BRICKS brick(s) offline in $VOLUME_NAME" | mail -s "GlusterFS Alert" "$ADMIN_EMAIL"
fi

# Check disk space: volume info lists bricks as host:/path, so strip the
# host prefix and skip bricks that are not mounted on this node
gluster volume info "$VOLUME_NAME" | awk -F': ' '/^Brick[0-9]/ {print $2}' | while read -r BRICK; do
    BRICK_PATH="${BRICK#*:}"
    [ -d "$BRICK_PATH" ] || continue
    USAGE=$(df -P "$BRICK_PATH" | tail -1 | awk '{print $5}' | tr -d '%')
    if [ "$USAGE" -ge "$THRESHOLD" ]; then
        echo "WARNING: Brick $BRICK_PATH usage at ${USAGE}%" | mail -s "GlusterFS Alert" "$ADMIN_EMAIL"
    fi
done
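One way to run this on a schedule is to save it somewhere like /usr/local/bin/gluster_health.sh (the path is my choice, not prescribed) and add a crontab entry:
# Make the script executable, then run it every five minutes
chmod +x /usr/local/bin/gluster_health.sh
*/5 * * * * /usr/local/bin/gluster_health.sh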
For production environments, consider setting up the GlusterFS Prometheus exporter:
# Install the exporter
wget https://github.com/gluster/gluster-prometheus/releases/download/v1.0.0/gluster-exporter-1.0.0.linux-amd64.tar.gz
tar -xzf gluster-exporter-*.tar.gz
./gluster-exporter --gluster-executable=/usr/sbin/gluster

# Sample Prometheus config
scrape_configs:
  - job_name: 'gluster_exporter'
    static_configs:
      - targets: ['localhost:9719']
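Once the exporter is running, a quick check confirms it is serving metrics on the configured port (9719 here, matching the scrape config above):
# Should print gluster-prefixed metric series
curl -s http://localhost:9719/metrics | grep -i '^gluster' | head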
For replicated volumes, monitor self-heal status with this cron job:
0 * * * * /usr/sbin/gluster volume heal your_volume info | grep "Number of entries:" | grep -qv "entries: 0" && echo "Unhealed entries found in your_volume" | mail -s "GlusterFS Heal Alert" admin@example.com
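When this alert fires, the heal status can be inspected and a repair kicked off by hand; the volume name matches the cron entry above:
# List entries pending heal, per brick
gluster volume heal your_volume info
# Check specifically for split-brain entries (newer releases)
gluster volume heal your_volume info split-brain
# Trigger a full heal if the backlog is not draining
gluster volume heal your_volume full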
Configure your GlusterFS nodes to forward their logs to an ELK stack:
# Sample Filebeat configuration for GlusterFS
filebeat.inputs:
  - type: log
    paths:
      - /var/log/glusterfs/*.log
    fields:
      service: glusterfs
output.logstash:
  hosts: ["logstash.example.com:5044"]
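Until the ELK pipeline is in place, the same logs can be scanned directly; GlusterFS log lines carry a severity letter (E for error, W for warning), so a grep makes a crude interim check:
# Show the most recent error-level log lines
grep -h " E " /var/log/glusterfs/*.log | tail -20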
Combine these methods for comprehensive coverage:
- Script-based checks for immediate failures
- Prometheus for metrics collection
- ELK for log analysis
- Regular manual verification
GlusterFS's distributed nature makes it resilient, but its lack of built-in monitoring can hide critical failures. I recently discovered, purely through manual inspection, that a brick had silently dropped out of a volume. That experience prompted me to build a more robust monitoring setup.
These commands form the foundation of any monitoring system:
# Check volume status
gluster volume status all
# Verify brick consistency
gluster volume heal VOLNAME info
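The gluster CLI also accepts an --xml flag on most commands, which gives scripts structured output to parse instead of text to grep; worth using where your version supports it:
# Machine-readable output for scripting
gluster volume status all --xml
gluster volume info --xml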
Here's a Python script that checks for common failure scenarios and sends alerts:
#!/usr/bin/env python3
import subprocess
import smtplib
from email.mime.text import MIMEText

VOLUME = "VOLNAME"  # replace with your volume name

def check_gluster_health():
    try:
        # An offline brick shows "N" in the Online column of volume status
        # (the word "Offline" does not appear in the output)
        status = subprocess.check_output(
            ["gluster", "volume", "status", "all"]).decode()
        for line in status.splitlines():
            if line.startswith("Brick") and line.split()[-2] == "N":
                return False
        # Every brick must report "Number of entries: 0" in heal info
        heal = subprocess.check_output(
            ["gluster", "volume", "heal", VOLUME, "info"]).decode()
        for line in heal.splitlines():
            if line.startswith("Number of entries:") and not line.endswith(": 0"):
                return False
        return True
    except subprocess.CalledProcessError:
        return False

if not check_gluster_health():
    msg = MIMEText("GlusterFS health check failed!")
    msg['Subject'] = 'GlusterFS Alert'
    msg['From'] = 'monitor@example.com'
    msg['To'] = 'admin@example.com'
    s = smtplib.SMTP('localhost')
    s.send_message(msg)
    s.quit()
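To wire the script into cron, save it under the path the crontab references and make it executable; the location below is taken from the cron example at the end of this post:
# Copy to the path referenced by the cron entry and mark executable
sudo install -m 755 gluster_monitor.py /usr/local/bin/gluster_monitor.py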
For metrics collection, the Prometheus exporter set up earlier exposes the key series to alert on:
- Brick online/offline status
- Self-heal backlog count
- Volume free space percentage
- Inode consumption
For basic monitoring, set up a cron job:
# Every 15 minutes
*/15 * * * * /usr/local/bin/gluster_monitor.py
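The Python script sends its own alert mail, but setting MAILTO in the crontab adds a safety net: cron then mails any unexpected output, such as a traceback, to the same address:
# In the crontab, before the job line
MAILTO=admin@example.com
*/15 * * * * /usr/local/bin/gluster_monitor.py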