How to Identify Large Files Consuming Disk Space on Linux Servers


While df -h gives you a high-level overview of disk usage, you need deeper tools to pinpoint the space hogs. Here's a more surgical approach:

du -ah / | sort -rh | head -n 20

This pipeline:

  • Scans the entire filesystem (/)
  • Outputs human-readable sizes (-h) for all files (-a)
  • Sorts results largest-first by human-readable size (-rh)
  • Shows only the top 20 offenders
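
If the server has network mounts or many pseudo-filesystems, adding du's -x flag keeps the scan on the root filesystem only; this is just a variant of the same pipeline:

du -xah / 2>/dev/null | sort -rh | head -n 20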

To target specific file extensions that often grow large:

find / -type f \( -name "*.log" -o -name "*.sql" -o -name "*.dump" \) -size +1G -exec ls -lh {} + 2>/dev/null

The -size +1G filter shows only files exceeding 1GB. Redirecting stderr (2>/dev/null) suppresses permission errors.
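
The same pattern works for other thresholds; GNU find accepts size suffixes such as k, M, and G (the path and limit below are only examples):

find /var -type f -size +100M -exec ls -lh {} + 2>/dev/null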

For a web server, focus on common storage areas:

for dir in /var/www /var/log /tmp /var/lib/mysql; do
    echo "=== $dir ==="
    du -sh "$dir"/* 2>/dev/null | sort -h
done

For interactive, visual analysis in the terminal, install and use ncdu:

sudo apt install ncdu  # Debian/Ubuntu
sudo yum install ncdu  # RHEL/CentOS
ncdu -x /

Key features:

  • Interactive navigation with arrow keys
  • Sort options (size, name, mtime)
  • Delete files directly (d key)
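
If you prefer to scan once (for example over SSH) and browse the results later, ncdu can export a scan to a file and reload it; the filename here is just an example:

ncdu -x -o /tmp/root-scan.ncdu /   # scan and save, staying on one filesystem
ncdu -f /tmp/root-scan.ncdu        # browse the saved scan later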

When dealing with multiple mounts, analyze each separately:

awk '$1 ~ "^/dev" {print $2}' /proc/mounts | while read -r mount; do
    echo "Largest files in $mount:"
    find "$mount" -type f -size +500M -exec ls -lh {} + 2>/dev/null
done
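
On systems with findmnt (part of util-linux), the same idea works without parsing /proc/mounts by hand; the filesystem types listed below are only examples:

findmnt -rn -t ext4,xfs,btrfs -o TARGET | while read -r mount; do
    echo "Largest files in $mount:"
    find "$mount" -xdev -type f -size +500M -exec ls -lh {} + 2>/dev/null
done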

Create a cron job for regular reports:

#!/bin/bash
REPORT="/var/log/disk_usage_$(date +%F).log"
{
    echo "==== Top 20 Files ===="
    du -ah / 2>/dev/null | sort -rh | head -n 20
    echo -e "\n==== Largest Log Files ===="
    find /var/log -type f -size +100M -exec ls -lh {} + 2>/dev/null
} > "$REPORT"

Schedule with crontab -e:

0 3 * * * /path/to/script.sh
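
Make sure the script is executable and referenced by its absolute path; if the report needs to read all of /, put the entry in root's crontab (sudo crontab -e):

chmod +x /path/to/script.sh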

Large files deleted but still held by processes can be found using:

lsof -nP | grep '(deleted)' | awk '{print $2, $7, $9}' | sort -k2 -rn | uniq   # PID, size in bytes, path

To free the space, either restart the holding process or truncate the file:

# Find PID holding file
lsof -nP | grep '/path/to/file (deleted)'
# Truncate safely
: > "/proc/PID/fd/FD_NUM"
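
For example, if lsof prints a line like the illustrative one below, the PID is the second column and the file descriptor number is the digits in the FD column:

# COMMAND  PID  USER  FD  TYPE  DEVICE  SIZE/OFF    NODE  NAME
# nginx    1234 root  8w  REG   253,0   5368709120  42    /var/log/nginx/access.log (deleted)
: > /proc/1234/fd/8   # FD "8w" means descriptor 8, opened for writing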

To recap, when your Linux server's storage is nearly full, these are the commands to reach for first:

# Show overall disk usage
df -h

# Scan directories for largest consumers
du -sh /* 2>/dev/null | sort -rh | head -n 20

# Find files >1GB recursively
find / -type f -size +1G -exec ls -lh {} + 2>/dev/null | sort -k5 -rh

For interactive analysis, ncdu (installed as shown earlier) is the Swiss Army knife:

sudo apt install ncdu
ncdu -x /

Navigation keys:
- Arrow keys (or j/k) to move
- Enter to open a directory
- d to delete
- n to sort by name
- s to sort by size

Common storage black holes in web environments:

# Check PHP session files
ls -lh /var/lib/php/sessions/

# Inspect MySQL binary logs
du -sh /var/lib/mysql/mysql-bin.*

# Audit WordPress uploads
find /var/www/ -type f \( -name "*.jpg" -o -name "*.mp4" \) -print0 | xargs -0 du -h | sort -rh | head
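
If the sites follow the standard WordPress layout, the uploads directories can also be summarised directly (the wildcard path assumes one site per directory under /var/www):

du -sh /var/www/*/wp-content/uploads 2>/dev/null | sort -rh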

A variation on the earlier cron script that also adds a per-directory breakdown:

#!/bin/bash
REPORT="/var/log/disk_usage_$(date +%Y%m%d).log"
echo "Top 50 files exceeding 100MB:" > "$REPORT"
find / -type f -size +100M -exec ls -lh {} + 2>/dev/null | sort -k5 -rh | head -50 >> "$REPORT"
echo -e "\nDirectory breakdown:" >> "$REPORT"
du -sh /* 2>/dev/null | sort -rh >> "$REPORT"

For log rotation and cleanup:

# Show largest log files
ls -lhS /var/log/*.log | head

# Set up logrotate configuration
cat > /etc/logrotate.d/custom << 'EOF'
/var/log/app/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
}
EOF
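
To check the new configuration without waiting for the nightly cron run, logrotate can be invoked by hand (-d is a dry run that only reports what would happen, -f forces an immediate rotation):

sudo logrotate -d /etc/logrotate.d/custom   # dry run
sudo logrotate -f /etc/logrotate.d/custom   # force a rotation now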

For LVM or complex storage setups:

# Show physical volume usage
pvdisplay -m

# Analyze thin provisioned volumes
lvs -o+metadata_percent
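
To see how much unallocated space remains in each volume group, vgs gives a quick summary:

vgs -o vg_name,vg_size,vg_free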