Debugging Mystery: Sudden Disk Space Fluctuations on Linux Root Partition (220G to 0B and Back)


When your root partition (/dev/sda1) suddenly reports 100% usage (220G/220G) but then magically recovers to 5% (9.3G/220G) without manual intervention, you're dealing with one of Linux's more puzzling storage mysteries. Let's break down the diagnostic approach.

When space disappears then reappears, start with these commands in sequence:

# 1. Check mounted filesystems
df -hT --exclude-type=tmpfs --exclude-type=devtmpfs

# 2. Find large directories on the root filesystem (run as root; -x stays on one filesystem)
sudo du -xh --max-depth=1 / 2>/dev/null | sort -h -r

# 3. Check for deleted-but-open files
sudo lsof +L1 | grep deleted

# 4. Monitor changes in real-time
watch -n 5 "df -h /; echo; du -xh --max-depth=1 / 2>/dev/null | sort -h -r"

Based on your output, these are the likely culprits (quick checks for each follow the list):

  • Log files explosion: Check /var/log with journalctl --disk-usage
  • Docker/container storage: Verify with docker system df
  • Temporary mounts: The /var/lib/ureadahead/debugfs mirroring root suggests a mounting artifact
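
A quick, non-destructive check for each of these (the paths are the usual Ubuntu/Debian defaults; adjust them to your system):

# Journal and general log volume
sudo journalctl --disk-usage
sudo du -sh /var/log

# Container storage, only if Docker is installed
command -v docker >/dev/null && docker system df

# Stray debugfs mounts such as /var/lib/ureadahead/debugfs
mount | grep debugfs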

Create this emergency cleanup script (disk_emergency.sh):

#!/bin/bash
# Rotate and compress large logs
sudo logrotate -f /etc/logrotate.conf
sudo journalctl --vacuum-size=200M

# Clear package manager cache
sudo apt-get clean
sudo apt-get autoclean

# Remove old kernels (Ubuntu); keeps the currently running kernel
sudo apt-get purge $(dpkg -l | awk '$1=="ii" && $2 ~ /^linux-image-[0-9]/{print $2}' | grep -v "$(uname -r)")

# Docker cleanup (removes ALL unused images, networks and build cache)
command -v docker >/dev/null && docker system prune -af

# Find and list large files
echo "Top 10 space consumers:"
sudo find / -type f -size +100M -exec ls -lh {} + 2>/dev/null | sort -k5 -h -r | head -10
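
Assuming you saved it as disk_emergency.sh in the current directory, make it executable and run it (each command already elevates itself with sudo):

chmod +x disk_emergency.sh
./disk_emergency.sh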

Set up inotify to track changes:

# Install inotify-tools
sudo apt install inotify-tools

# Monitor root directory changes (run as root so the log under /var/log is writable)
inotifywait -m -r --format '%w%f %e' --exclude '^/(proc|sys|dev|run)' / | while read file action
do
    if [[ "$action" == *"DELETE"* || "$action" == *"CREATE"* ]]; then
        echo "$(date) - $action - $file" >> /var/log/disk_changes.log
    fi
done
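
Watching the entire root tree can exhaust the kernel's default inotify watch limit, in which case inotifywait aborts with an error about reaching the watch limit. Raising it is safe; 524288 is a commonly used value, not a requirement:

# Check and, if needed, raise the inotify watch limit
sysctl fs.inotify.max_user_watches
sudo sysctl -w fs.inotify.max_user_watches=524288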

For ext4 filesystems (common on Ubuntu):

# Check for filesystem errors (-n is a read-only check; expect spurious
# complaints while /dev/sda1 is mounted, since a clean check needs rescue mode)
sudo fsck -nf /dev/sda1

# View reserved blocks (typically 5% of the filesystem)
sudo tune2fs -l /dev/sda1 | grep -i 'reserved block count'

# Check for snapshots, bind mounts or duplicate mounts of sda1
# (/proc/mounts is world-readable, so no sudo or cat needed)
grep sda1 /proc/mounts

The spontaneous recovery suggests one of the following (a way to quantify the first case is sketched after the list):

  • A crashed process holding file descriptors released them
  • A temporary mount (possibly debugfs) was unmounted
  • Log rotation or systemd journal cleanup executed
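
A quick way to quantify the first case is to total the sizes lsof reports for deleted-but-open files. This is a rough sketch that assumes lsof's default column layout, where SIZE/OFF is the seventh field:

# Approximate space still pinned by deleted-but-open files
sudo lsof -nP +L1 2>/dev/null | awk '/deleted/ {sum += $7} END {printf "%.1f GiB held by deleted files\n", sum/1024/1024/1024}'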

Add to your /etc/crontab:

0 * * * * root (date; df -h /) >> /var/log/disk_usage.log
30 * * * * root (date; /usr/bin/du -xh --max-depth=1 / 2>/dev/null | sort -h -r) >> /var/log/disk_usage.log
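
Because both entries append, the usage log will itself grow over time. A minimal logrotate stanza keeps it bounded (the /etc/logrotate.d/disk_usage filename is just an example):

/var/log/disk_usage.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}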

Configure systemd journal limits in /etc/systemd/journald.conf:

[Journal]
SystemMaxUse=500M
RuntimeMaxUse=100M
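
The new limits only take effect once the journal daemon is restarted; you can then confirm how much space the journal occupies:

sudo systemctl restart systemd-journald
journalctl --disk-usage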

When your Linux server suddenly reports 100% disk usage (220G/220G) and then mysteriously drops back to 5% (9.3G/220G) without any files being deleted, the first priority is to catch the fluctuation as it happens:

# Real-time monitoring command I now keep running:
watch -n 5 "df -h; echo; du -sh /* 2>/dev/null | sort -rh | head -n 10"

Based on your du output showing only 3.3G usage versus df reporting 43G, these are likely suspects:

# Investigate open deleted files (common with log rotation)
sudo lsof +L1 | grep -i deleted

# Sanity-check /etc/fstab and the mount table for configuration problems
findmnt --verify

# Audit tmpfs usage
sudo du -sh /tmp /var/tmp /dev/shm
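
For a df-style summary of every tmpfs mount at once (findmnt is part of util-linux on most distributions):

findmnt -D -t tmpfs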

Your case shows classic signs of rotated log files that are still held open by running processes. Try this forensic approach:

# Identify which processes hold deleted logs open
sudo find /proc/[0-9]*/fd -type l -printf '%p -> %l\n' 2>/dev/null | grep deleted

# Example output you might see:
/proc/1234/fd/4 -> /var/log/nginx/access.log.1 (deleted)

# Force log rotation and flush
sudo systemctl restart rsyslog
sudo journalctl --vacuum-size=100M
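
If restarting the service is not an option, you can usually reclaim the space immediately by truncating the deleted file through its /proc file descriptor. The PID and fd number below are the placeholders from the example output above:

# Zero out the deleted-but-open file in place (its remaining contents are lost)
sudo truncate -s 0 /proc/1234/fd/4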

Create a cron job with this monitoring script (/usr/local/bin/disk_guardian.sh):

#!/bin/bash
THRESHOLD=90
CURRENT=$(df --output=pcent / | tail -1 | tr -dc '0-9')

if [ "$CURRENT" -ge "$THRESHOLD" ]; then
    logger "DiskGuardian: Cleaning triggered at ${CURRENT}% usage"
    
    # Rotate and compress logs, then empty any *.log files that remain
    # (aggressive: the current contents of those logs are discarded)
    logrotate -f /etc/logrotate.conf
    find /var/log -type f -name "*.log" -exec truncate -s 0 {} \;

    # Clear package manager cache (use whichever package manager exists)
    if command -v apt-get >/dev/null; then
        apt-get clean
    elif command -v yum >/dev/null; then
        yum clean all
    fi

    # Clear tmp directories
    find /tmp -type f -mtime +1 -delete
    find /var/tmp -type f -mtime +1 -delete

    # Optional: restart services that commonly hold deleted logs open
    # (adjust the list to the services actually running on this host)
    systemctl restart nginx php-fpm mysql
fi
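
To run it automatically, make the script executable and give it a root cron entry. The 10-minute interval and the /etc/cron.d/disk_guardian path are only examples:

sudo chmod +x /usr/local/bin/disk_guardian.sh
echo '*/10 * * * * root /usr/local/bin/disk_guardian.sh' | sudo tee /etc/cron.d/disk_guardian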

When the issue recurs, bring out the heavier diagnostic tools:

# 1. ncdu (NCurses Disk Usage): interactive breakdown of the root filesystem
sudo apt install ncdu
sudo ncdu -x /

# 2. Track file creations in real-time
sudo apt install inotify-tools
sudo inotifywait -m -r -e create --format '%w%f' /var | tee file_creations.log

# 3. Check for rogue containers/docker
docker system df
podman ps -a --size
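
If the CLI tools look clean but space is still missing, measure the storage directories directly. The paths below are the default graph roots for Docker and Podman; a custom data-root will live elsewhere:

sudo du -sh /var/lib/docker /var/lib/containers 2>/dev/null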

Add these to your /etc/sysctl.conf:

# Tune how aggressively the kernel reclaims filesystem cache and uses swap
# (this is memory tuning; it does not free space on the root filesystem itself)
vm.vfs_cache_pressure = 50
vm.swappiness = 10

Then cap tmpfs growth with an entry in /etc/fstab:

tmpfs /tmp tmpfs size=512M,nr_inodes=10k,mode=1777 0 0
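
Neither change applies on its own: reload the sysctl settings and mount /tmp against the new fstab entry (or simply reboot):

# Apply the sysctl changes immediately
sudo sysctl -p

# Mount /tmp with the new fstab options (errors if /tmp is already a mount; a reboot also works)
sudo mount /tmp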