Mitigating SSD Corruption from Power Loss: Filesystem Protection Techniques for Linux Systems



The symptoms here go beyond typical filesystem corruption. When a power loss changes the type of an inode (files becoming directories or vice versa) and alters permissions on files that were not being written, the filesystem metadata itself has been damaged. On traditional HDDs, corruption typically affects only recently written sectors; SSDs exhibit additional failure modes due to:

  • Flash translation layer (FTL) inconsistencies
  • Partial page programming effects
  • Write amplification: garbage collection and wear leveling rewrite existing data in the background, so an unexpected power cycle can interrupt writes to blocks the host never touched

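For ext4 filesystems, apply these safeguards in /etc/fstab:

UUID=your-uuid / ext4 defaults,data=journal,commit=5,barrier=1,noatime 0 1

What the critical mount options do:

  • data=journal: Journals file contents as well as metadata. This is the safest mode and the slowest, since all data is written twice. On the root filesystem the data mode cannot be changed by a remount, so it must also be set at boot, for example with rootflags=data=journal on the kernel command line or with tune2fs -o journal_data.
  • commit=5: Flushes the journal every 5 seconds, which is the ext4 default. Raising the value (for example to 30) improves throughput but widens the window of data that can be lost on power failure, so keep it low here.
  • barrier=1: Enforces write ordering between the journal and the data it describes. This is already the ext4 default; stating it explicitly guards against it being disabled elsewhere. Barriers only help if the drive honors cache-flush commands.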
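To confirm that an option from /etc/fstab is actually in effect, the live mount table can be inspected. The following is a minimal sketch; the function name has_mount_opt is made up for illustration, and it assumes the standard /proc/mounts column layout. The kernel may omit options that match the filesystem's defaults, so a missing entry is not proof on its own.

```shell
#!/bin/sh
# has_mount_opt MOUNTS_FILE MOUNT_POINT OPTION   (hypothetical helper)
# Succeeds if OPTION appears in the option list of MOUNT_POINT.
# MOUNTS_FILE uses the /proc/mounts layout:
#   device mountpoint fstype options dump pass
has_mount_opt() {
  awk -v mp="$2" -v opt="$3" '
    $2 == mp {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Example:
#   has_mount_opt /proc/mounts / data=journal && echo "full data journaling is active"
```

The comparison is on whole options, so asking for "journal" does not match "data=journal".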

Disable write caching (temporary solution until proper UPS implementation):

# Check current cache status
hdparm -W /dev/sda

# Disable write cache
hdparm -W0 /dev/sda

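# Make persistent: the drive forgets this setting on a power cycle.
# /etc/rc.local only runs if it exists and is executable, and the line
# must come before any final 'exit 0' in that file.
echo 'hdparm -W0 /dev/sda' >> /etc/rc.local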
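Because the cache setting does not survive a power cycle, it is worth verifying it after boot. A small sketch, assuming hdparm reports the state in its usual "write-caching =  1 (on)" form; the function name write_cache_state is illustrative only.

```shell
#!/bin/sh
# write_cache_state   (hypothetical helper)
# Reads the output of `hdparm -W DEVICE` on stdin and prints
# "on", "off", or "unknown" if no write-caching line is found.
write_cache_state() {
  awk '
    /write-caching/ {
      if ($0 ~ /\(on\)/)  state = "on"
      if ($0 ~ /\(off\)/) state = "off"
    }
    END { print (state == "" ? "unknown" : state) }
  '
}

# Example (needs root):
#   hdparm -W /dev/sda | write_cache_state
```

Reading from stdin keeps the parsing separate from the privileged hdparm call, so it can be checked against captured output.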

For critical systems, consider implementing a read-only root with overlayfs:

mount -t overlay overlay -o lowerdir=/ro,upperdir=/rw,workdir=/work /merged
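overlayfs requires workdir to be an empty directory on the same filesystem as upperdir, otherwise the mount fails. A rough pre-flight check, assuming GNU coreutils stat; same_filesystem is a made-up name, and comparing device numbers can give a false negative across Btrfs subvolumes.

```shell
#!/bin/sh
# same_filesystem PATH_A PATH_B   (hypothetical helper)
# Succeeds when both paths report the same device number.
# Returns 2 if either path cannot be inspected.
same_filesystem() {
  a=$(stat -c %d "$1") || return 2
  b=$(stat -c %d "$2") || return 2
  [ "$a" = "$b" ]
}

# Example with the directories from the mount command above:
#   same_filesystem /rw /work || echo "upperdir and workdir must share a filesystem" >&2
```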

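For PostgreSQL, enforce strict durability settings in postgresql.conf. The first three govern crash safety; wal_level = replica is the default and matters for replication and archiving rather than durability: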

fsync = on
full_page_writes = on
synchronous_commit = on
wal_level = replica
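A configuration file can drift from what was intended, so a quick audit of these keys is useful. The sketch below reads a postgresql.conf-style file and only handles the "key = value" form. It ignores include directives and values set through ALTER SYSTEM, so running SHOW fsync; in psql remains the authoritative check. The function name pg_conf_value is an assumption for illustration.

```shell
#!/bin/sh
# pg_conf_value FILE KEY   (hypothetical helper)
# Prints the last uncommented value assigned to KEY (the last assignment
# wins), with surrounding whitespace, single quotes, and trailing comments
# removed. Prints nothing if KEY is not set in FILE.
pg_conf_value() {
  awk -v key="$2" -v q="'" '
    {
      line = $0
      sub(/#.*/, "", line)              # drop comments
      n = index(line, "=")
      if (n == 0) next
      k = substr(line, 1, n - 1)
      v = substr(line, n + 1)
      gsub(/[ \t]/, "", k)
      gsub(/^[ \t]+|[ \t]+$/, "", v)
      gsub(q, "", v)                    # strip single quotes
      if (k == key) val = v
    }
    END { if (val != "") print val }
  ' "$1"
}

# Example (the path is distribution-specific and shown only as an illustration):
#   for k in fsync full_page_writes synchronous_commit; do
#     echo "$k = $(pg_conf_value /etc/postgresql/postgresql.conf "$k")"
#   done
```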

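Create a pre-shutdown hook unit (/etc/systemd/system/postgresql-powerfail.service). It runs only during an orderly shutdown, so it helps when a UPS or battery monitor initiates the shutdown, not on an abrupt power cut. Adjust postgresql.service to the unit name your distribution uses:

[Unit]
Description=PostgreSQL emergency flush
# Ordered after PostgreSQL at boot, so at shutdown this unit stops first
# and ExecStop runs while the database is still accepting connections.
After=postgresql.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
ExecStop=/usr/bin/runuser -l postgres -c '/usr/bin/psql -c "CHECKPOINT;"'

[Install]
WantedBy=multi-user.target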

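When a hardware UPS isn't available but the machine has a battery exposed under /sys/class/power_supply (a laptop, for example), implement basic power monitoring:

#!/bin/bash
# Monitor battery status and trigger a safe shutdown when nearly empty
BAT=/sys/class/power_supply/BAT0
while true; do
  if [[ $(cat "$BAT/status") == "Discharging" ]] &&
     (( $(cat "$BAT/capacity") < 10 )); then
    sync
    # Power off cleanly so filesystems are unmounted before the battery dies
    /sbin/shutdown -h now
    exit 0
  fi
  sleep 60
done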
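The shutdown decision in the loop above can be pulled into a function so it can be exercised without draining a real battery. This is a sketch only: should_shutdown is a made-up name and the 10 percent default simply mirrors the script.

```shell
#!/bin/sh
# should_shutdown STATUS CAPACITY [THRESHOLD]   (hypothetical helper)
# Succeeds when the battery is discharging and below THRESHOLD percent
# (default 10). Missing or non-numeric readings never trigger a shutdown.
should_shutdown() {
  status=$1
  capacity=$2
  threshold=${3:-10}
  case "$capacity" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$status" = "Discharging" ] && [ "$capacity" -lt "$threshold" ]
}

# Example:
#   bat=/sys/class/power_supply/BAT0
#   should_shutdown "$(cat "$bat/status")" "$(cat "$bat/capacity")" && /sbin/shutdown -h now
```

Refusing to act on an unreadable capacity value avoids powering the machine off because of a transient sysfs read error.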

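Implement an automated fsck on boot by creating /etc/initramfs-tools/scripts/init-premount/fsck_force (this applies to Debian-style initramfs-tools; for a one-off forced check, booting with fsck.mode=force on the kernel command line is a simpler alternative):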

#!/bin/sh
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in
  prereqs) prereqs; exit 0;;
esac

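# lsblk and blkid must be present in the initramfs (add them with a hook if they are not)
for DEVICE in $(lsblk -o KNAME -lpn); do
  # Only check ext filesystems; skips whole disks, swap, and unidentified devices
  case "$(blkid -s TYPE -o value "$DEVICE")" in
    ext2|ext3|ext4) ;;
    *) continue ;;
  esac
  # -p fixes safe problems automatically; exit codes of 4 and above mean errors remain
  fsck -p "$DEVICE"
  if [ $? -ge 4 ]; then fsck -y "$DEVICE"; fi
done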

Make executable and update initramfs:

chmod +x /etc/initramfs-tools/scripts/init-premount/fsck_force
update-initramfs -u
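fsck reports its result as a bitmask rather than a simple pass or fail, so a nonzero exit status does not necessarily mean the filesystem is still damaged. The helper below decodes the documented bits; the name describe_fsck_status and the output words are illustrative choices.

```shell
#!/bin/sh
# describe_fsck_status CODE   (hypothetical helper)
# Translates an fsck exit status into space-separated words.
# Bits as documented in fsck(8): 1 errors corrected, 2 reboot needed,
# 4 errors left uncorrected, 8 operational error, 16 usage error,
# 32 cancelled by user, 128 shared-library error.
describe_fsck_status() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean"
    return 0
  fi
  out=""
  [ $((code & 1)) -ne 0 ]   && out="$out corrected"
  [ $((code & 2)) -ne 0 ]   && out="$out reboot-needed"
  [ $((code & 4)) -ne 0 ]   && out="$out uncorrected"
  [ $((code & 8)) -ne 0 ]   && out="$out operational-error"
  [ $((code & 16)) -ne 0 ]  && out="$out usage-error"
  [ $((code & 32)) -ne 0 ]  && out="$out cancelled"
  [ $((code & 128)) -ne 0 ] && out="$out library-error"
  echo "${out# }"
}

# Example:
#   fsck -p /dev/sda1; describe_fsck_status $?
```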

With post-power-loss corruption on SSDs, the anomalies go beyond typical filesystem issues. Rather than damage limited to recently modified files, directory entries change type:

# Example of corrupted inode structure (hypothetical debug output)
$ ls -lai /var/www/html
12345 drwxr-xr-x 2 root root 4096 Jan 1 00:00 index.php  # File became directory
67890 -rw-r--r-- 1 root root    0 Jan 1 00:00 assets/    # Directory became file
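One way to detect this kind of type or permission change on files that should not have changed is to keep a manifest of entry types and modes and compare it after an unclean shutdown. A minimal sketch, assuming GNU find (for -printf); snapshot_meta and the manifest path in the example are hypothetical.

```shell
#!/bin/sh
# snapshot_meta DIR   (hypothetical helper)
# Prints one "type mode path" line per entry under DIR, sorted by path.
# GNU find -printf: %y = type letter (f, d, l, ...), %m = octal mode,
# %P = path relative to DIR.
snapshot_meta() {
  find "$1" -mindepth 1 -printf '%y %m %P\n' | LC_ALL=C sort -k3
}

# Example:
#   snapshot_meta /var/www/html > /var/backups/html.meta      # while healthy
#   snapshot_meta /var/www/html | diff /var/backups/html.meta -   # after a power loss
```

Any line in the diff where only the first or second column differs points at an entry whose type or mode changed without its path changing.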

Consumer-grade SSDs often exhibit three critical vulnerabilities during power loss:

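  • Volatile write caches that are not flushed before power is lost
  • Corruption of the FTL (Flash Translation Layer) mapping tables
  • Partial page programming in NAND cells

This explains why even unchanged files get corrupted: if the FTL's mapping tables are damaged, a logical block address can resolve to the wrong physical page, so the filesystem reads stale or unrelated data for metadata it never rewrote.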

1. Filesystem Mount Options

Add these to /etc/fstab for critical partitions (ext4 disables delayed allocation in data=journal mode, so noauto_da_alloc and nodelalloc are not needed alongside it):

/dev/sda1  /  ext4  defaults,data=journal,barrier=1  0  1
/dev/sda2  /var/lib/postgresql  ext4  defaults,data=journal  0  2

2. SSD Hardware Configuration

Disable volatile cache (caution: impacts performance):

# Disable the volatile write cache on the drive
sudo hdparm -W0 /dev/sda

# Make persistent via udev rule (matches Kingston models; adjust the pattern
# for your drive). DEVTYPE=="disk" keeps the rule from firing for partitions.
echo 'ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ATTRS{model}=="KINGSTON*", RUN+="/sbin/hdparm -W0 /dev/%k"' | sudo tee /etc/udev/rules.d/99-ssd-safety.rules

3. PostgreSQL-Specific Protections

# postgresql.conf critical settings:
wal_level = replica
synchronous_commit = on
full_page_writes = on
fsync = on

Create a failsafe initramfs script to verify critical partitions:

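#!/bin/sh
# /etc/initramfs-tools/scripts/init-premount/fsck_ssd

case "$1" in
  prereqs)
    echo ""
    exit 0
    ;;
esac

# Provides log_failure_msg and panic in initramfs-tools
. /scripts/functions

# The real root is not mounted yet at this stage, so it is safe to repair
fsck -y -f /dev/disk/by-label/rootfs
rc=$?

# Exit codes 0-3 mean clean or repaired; 4 and above mean errors remain
if [ "$rc" -ge 4 ]; then
  log_failure_msg "fsck_ssd: critical filesystem errors detected (exit code $rc)"
  panic "Root filesystem needs manual repair"
fi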

Implement proactive monitoring with smartmontools:

# smartd configuration (/etc/smartd.conf)
/dev/sda -a -o on -S on -n standby -s (S/../.././02|L/../../7/03) -W 4,35,40 \
-m admin@example.com -M exec /usr/local/bin/ssd_alert

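The accompanying alert script. smartd passes the device in the SMARTD_DEVICE environment variable rather than as an argument. The attribute name varies by vendor, and emergency-readonly.service is a site-specific unit that must be provided separately:

#!/bin/bash
# /usr/local/bin/ssd_alert

DEVICE="${SMARTD_DEVICE:?SMARTD_DEVICE is not set}"

# Column 4 is the normalized value, which counts down as the drive wears
MEDIA_WEAROUT_INDICATOR=$(smartctl -A "$DEVICE" | awk '/Media_Wearout_Indicator/ {print $4 + 0}')
if [ -n "$MEDIA_WEAROUT_INDICATOR" ] && [ "$MEDIA_WEAROUT_INDICATOR" -lt 20 ]; then
  systemctl start emergency-readonly.service
fi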
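Since wear attributes are named differently across vendors, it helps to isolate the lookup so it can be checked against captured smartctl -A output before relying on it in an alert. A sketch under the assumption that the output uses the standard attribute table, with the name in column 2 and the normalized value in column 4; smart_attr_value is a made-up name.

```shell
#!/bin/sh
# smart_attr_value ATTRIBUTE_NAME   (hypothetical helper)
# Reads `smartctl -A DEVICE` output on stdin and prints the normalized
# VALUE column for the named attribute, without leading zeros.
# Fails (exit status 1) if the attribute is not present.
smart_attr_value() {
  awk -v name="$1" '
    $2 == name { print $4 + 0; found = 1; exit }
    END { exit (found ? 0 : 1) }
  '
}

# Example (needs root):
#   smartctl -A /dev/sda | smart_attr_value Media_Wearout_Indicator \
#     || echo "attribute not reported by this drive" >&2
```

The explicit failure status distinguishes "drive does not report this attribute" from a low value, which an empty string in a numeric comparison would not.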

For mission-critical deployments, consider:

  • Power-loss-protected (PLP) enterprise SSDs (e.g., Intel DC series)
  • Hardware RAID controllers with battery-backed cache
  • Checksumming copy-on-write filesystems (ZFS, Btrfs)