How to Recursively Find and Analyze Large Files and Directories in Linux


When dealing with storage issues on Linux systems, identifying large files and directories is a common troubleshooting step. This article provides practical methods to recursively scan your filesystem and pinpoint space-consuming items.

The du (disk usage) command is the most straightforward tool for this task. Here's a powerful combination:

du -ahx --max-depth=1 /path/to/directory | sort -rh | head -n 20

This command:

  • Shows human-readable sizes (-h)
  • Includes files, not just directories (-a)
  • Prevents crossing filesystem boundaries (-x)
  • Limits the listing to one directory level (--max-depth=1)
  • Sorts results largest-first (sort -rh)
  • Displays the top 20 entries (head -n 20)
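
Run against /var, for example, the top of the output looks something like this (sizes and paths are purely illustrative):

4.2G    /var
2.1G    /var/lib
1.4G    /var/log
512M    /var/cache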

For interactive exploration, install ncdu (NCurses Disk Usage):

sudo apt install ncdu  # Debian/Ubuntu
ncdu /path/to/scan

Key features:

  • Interactive navigation with arrow keys
  • Percentage-based visualization
  • Option to delete files directly
  • Fast scanning with progress indicator
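
On busy servers you can scan once and browse later: ncdu supports exporting a scan with -o and reloading it with -f:

sudo ncdu -x -o /tmp/root-scan.ncdu /   # scan and export (stay on one filesystem)
ncdu -f /tmp/root-scan.ncdu             # browse the saved scan without re-scanning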

To specifically target large files (e.g., larger than 100MB):

find /path/to/search -type f -size +100M -exec ls -lh {} + | \
awk '{ print $5 ": " $9 }' | sort -hr

Note that awk's default field splitting truncates filenames containing spaces.
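
A space-safe alternative that sidesteps the awk parsing entirely, assuming GNU findutils and the numfmt tool from coreutils:

find /path/to/search -type f -size +100M -printf '%s\t%p\n' \
    | sort -nr | head -n 20 \
    | numfmt --field=1 --to=iec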

Here's a bash script that prints a size-ranked listing, indented by directory depth:

#!/bin/bash
# Usage: ./script.sh [depth] [top] [path]
depth=${1:-3}
top=${2:-10}
scan_path=${3:-.}

du -ak "$scan_path" | sort -nr | \
awk -v depth="$depth" -v top="$top" '
{
    curr_size = $1
    # Take everything after the first field as the name,
    # so paths containing spaces survive intact
    curr_name = $0
    sub(/^[0-9]+[ \t]+/, "", curr_name)
    curr_depth = split(curr_name, path, "/") - 1

    if (curr_depth <= depth) {
        if (curr_size != prev_size) rank++   # equal sizes share a rank
        if (rank <= top) {
            indent = curr_depth * 4          # four spaces per directory level
            printf "%" indent "s", ""
            printf "%.1fMB %s\n", curr_size / 1024, curr_name
            prev_size = curr_size
        }
    }
}'
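
Saved as, say, du_ranked.sh (the name is arbitrary) and made executable, it takes depth, count, and path as optional arguments:

./du_ranked.sh 2 15 /var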

To total the space consumed by files with a particular extension:

find /var/log -name "*.log" -exec du -ch {} + | grep total$

(With very many matches, find may split them across several du invocations, producing more than one total line.)
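
To break usage down across all extensions at once, and to sidestep the multiple-total caveat, here is a short awk sketch (assumes GNU find's -printf):

find /var/log -type f -printf '%s\t%p\n' | awk -F'\t' '
{
    file = $2
    sub(/.*\//, "", file)               # strip the directory part
    n = split(file, parts, ".")
    ext = (n > 1) ? parts[n] : "(none)"
    bytes[ext] += $1                    # sum bytes per extension
}
END {
    for (e in bytes) printf "%12d  %s\n", bytes[e], e
}' | sort -nr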

For a modern alternative written in Go, try gdu:

# Install
go install github.com/dundee/gdu/v4/cmd/gdu@latest

# Usage
gdu --show-disks /

When managing server storage or debugging disk space issues, remember that modern systems often contain millions of files, making manual inspection impractical. For a full recursive scan with no depth limit, drop the -x and --max-depth options from the earlier command:

du -ah /path/to/directory | sort -rh | head -n 20

The flags are the same as before: du estimates disk usage, -a includes files (not just directories), -h prints human-readable sizes, and sort -rh orders them largest-first. This variant descends into every subdirectory and crosses filesystem boundaries, so expect it to take longer on big volumes.

find offers finer control, since you can filter by file type, name, or modification time before measuring:

find /path -type f -exec du -h {} + 2>/dev/null | sort -rh | head -n 50
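
For example, to weigh only log files untouched for more than 30 days (a typical cleanup-candidate search; adjust the filters to taste):

find /path -type f -name '*.log' -mtime +30 -exec du -h {} + 2>/dev/null | sort -rh | head -n 20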

For GUI-oriented users, GNOME's Disk Usage Analyzer (baobab) complements the ncdu workflow covered above:

# Graphical alternative
sudo apt install baobab
baobab

To examine directory structures at specific levels:

du -h --max-depth=3 / | sort -rh | head -n 15
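
For instance, a small loop comparing successive depths shows where the bulk sits at each level (a sketch; adjust the path and depths as needed):

for d in 1 2 3; do
    echo "== depth $d =="
    du -h --max-depth="$d" /var 2>/dev/null | sort -rh | head -n 5
done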

For production environments:

# Real-time monitoring
inotifywait -m -r /path -e create,delete,modify,move
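
Building on inotifywait's --format option, a small loop can report the size of each file as it appears or changes (a sketch; requires the inotify-tools package):

inotifywait -m -r -e create,modify --format '%w%f' /path |
while read -r f; do
    # Files may vanish between the event and the stat
    [ -f "$f" ] && du -h "$f"
done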

# Periodic reporting
#!/bin/bash
REPORT_FILE="/var/log/disk_usage_$(date +%Y%m%d).log"
du -ah / 2>/dev/null | sort -rh | head -100 > "$REPORT_FILE"
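
Saved as an executable script (the path below is only an example), the report pairs naturally with cron:

# Run nightly at 02:00 (add via `crontab -e` as root)
0 2 * * * /usr/local/bin/disk_report.sh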

For very large filesystems, GNU parallel can spread the per-file du calls across CPU cores:

# Parallel processing with GNU parallel (-X packs many files per du call)
find / -type f -print0 | parallel -0 -X du -h | sort -rh | head -n 50

A few cautions:

  • IO-intensive operations may impact production systems
  • Consider running during off-peak hours
  • For large filesystems, sample a subset first (see the sketch below)
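
One way to sample first is to summarize only the top-level directories before drilling into the winners:

du -xsh /* 2>/dev/null | sort -rh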