Efficient Shell Script to Calculate Total File Size by Extension in Lucene Indexes


When working with Lucene indexes distributed across multiple directories, you often need to analyze disk usage by file extension. The typical directory structure looks like:

0/index/_2z6.frq
0/index/_2z6.fnm
1/index/_1sq.frq
1/index/_1sq.fnm
...

Here's a robust one-liner that handles this efficiently. Note two details: the xargs placeholder is EXT rather than {}, so it doesn't clobber the {} that find's -exec needs, and the extension list is deduplicated with sort -u so each extension is totalled only once:

find . -type f -name "*.*" | awk -F. '{print $NF}' | sort -u | xargs -I EXT sh -c 'echo -n "EXT "; find . -type f -name "*.EXT" -exec du -ch {} + | grep "total$" | cut -f1' | sort

One caveat: when a batch of files exceeds the argument-length limit, -exec ... + runs du more than once and you get several "total" lines per extension; the expanded script below sums them explicitly.

For better readability and processing, this expanded version is preferred:

#!/bin/bash

# List every distinct lowercase extension, then total the matching files
find . -type f |
  grep -E '\.[a-z0-9]+$' |           # keep only files that have an extension
  sed -E 's/.*\.([a-z0-9]+)$/\1/' |  # strip everything up to the last dot
  sort -u |
while read -r ext; do
    # Sum the byte totals; du may emit one "total" line per batch
    size=$(find . -type f -name "*.$ext" -exec du -cb {} + |
           awk '/total$/ {s += $1} END {print s}')
    echo "$ext $size"
done | sort -k2,2 -nr
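
Running it from the top of an index tree prints one "extension total-bytes" pair per line, largest first; against the layout above it might print something like this (illustrative numbers only):

frq 393216
fnm 65536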

For very large directory structures, GNU parallel can significantly speed up the operation (again using -I EXT so parallel's substitution doesn't collide with find's {}):

find . -type f -printf '%f\n' | \
awk -F. 'NF > 1 {print $NF}' | \
sort -u | \
parallel -I EXT 'printf "%s " EXT; find . -type f -name "*.EXT" -exec du -cb {} + | awk '\''/total$/ {s += $1} END {print s}'\'''
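
If GNU parallel isn't installed, xargs -P gives similar concurrency. This is a sketch, assuming GNU xargs and coreutils' nproc; each job builds its whole output line in one printf so lines from concurrent jobs don't interleave:

find . -type f -printf '%f\n' | \
awk -F. 'NF > 1 {print $NF}' | \
sort -u | \
xargs -P "$(nproc)" -I EXT sh -c '
    # one job per extension, up to nproc jobs at a time
    size=$(find . -type f -name "*.EXT" -exec du -cb {} + |
           awk "/total\$/ {s += \$1} END {print s}")
    printf "%s %s\n" "EXT" "$size"'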

To get nicely formatted output with per-extension totals in kilobytes, let GNU find emit each file's size and basename, and aggregate in awk (this assumes filenames without whitespace, which holds for Lucene segment files):

find . -type f -printf '%s %f\n' | \
awk '{n = split($2, a, "."); if (n > 1) sum[a[n]] += $1}
     END {for (e in sum) printf "%-6s %8d KB\n", e, sum[e]/1024}' | sort
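
For truly human-readable units (K, M, G), one option is to post-process the byte totals with numfmt; a minimal sketch, assuming numfmt from GNU coreutils is available:

find . -type f -printf '%s %f\n' | \
awk '{n = split($2, a, "."); if (n > 1) sum[a[n]] += $1}
     END {for (e in sum) print sum[e], e}' | \
sort -nr | numfmt --to=iec --field=1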

Some important considerations (a combined sketch follows the list):

  • Files without extensions: filter them out with grep -E '\.[a-z0-9]+$'
  • Case sensitivity: use -iname instead of -name so .FRQ matches as well as .frq
  • Hidden directories: exclude them with -not -path '*/.*'
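
Putting all three together, the extension-listing stage might look like this (a sketch; the tr step folds mixed-case extensions into one bucket):

find . -type f -not -path '*/.*' |
  grep -iE '\.[a-z0-9]+$' |
  sed -E 's/.*\.([A-Za-z0-9]+)$/\1/' |
  tr '[:upper:]' '[:lower:]' |
  sort -u

Feed the resulting list into any of the per-extension loops above, swapping -name for -iname there as well.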



When working with Lucene indexes in a Linux environment, you'll often encounter directories containing numerous files with different extensions (.frq, .fnm, etc.). Developers frequently need to analyze disk usage patterns by file type for optimization or monitoring purposes.

Here's an efficient one-liner that provides the exact output format requested. Rather than calling stat once per file, it has GNU find print each file's size alongside its basename, so the tree is walked only once:

find . -type f -name "*.*" -printf '%s %f\n' | \
awk '{n = split($2, a, "."); size["." a[n]] += $1}
     END {for (e in size) print e, size[e]}' | \
sort

For systems with many files, this parallelized version using GNU parallel improves performance by batching the stat calls across CPU cores:

find . -type f -print0 | \
parallel -0 -m --bar stat -c '%n %s' | \
awk '{n = split($1, a, "."); if (n > 1 && a[n] !~ /\//) size["." a[n]] += $2}
     END {for (e in size) print e, size[e]}' | \
sort -k2 -n

  • find -print0: recursively locates all files, NUL-delimited so unusual names survive
  • parallel -0 -m: spreads the file list across concurrent stat invocations
  • stat -c '%n %s': prints each file's path and size in bytes
  • awk: splits the path at its last dot to isolate the extension (skipping paths whose last dot sits in a directory name) and sums the sizes

Running against a test Lucene index directory might yield:

.cfs 1024000
.fdt 524288
.fdx 32768
.fnm 65536
.frq 393216
.prx 196608
.tii 131072
.tis 262144

For very large directory trees (100k+ files), consider the following (combined in the sketch after this list):

  • Adding -maxdepth to find to limit recursion
  • Using nice to reduce CPU priority
  • Outputting to a temporary file for batch processing
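
A sketch combining all three; the depth of 3 matches the N/index/file layout shown at the top, so adjust it to your own tree:

# Walk at low priority, bounded depth, staging results in a temp file
tmp=$(mktemp)
nice -n 19 find . -maxdepth 3 -type f -printf '%s %f\n' > "$tmp"
awk '{n = split($2, a, "."); if (n > 1) size["." a[n]] += $1}
     END {for (e in size) print e, size[e]}' "$tmp" | sort -k2 -n
rm -f "$tmp"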

This technique adapts easily to other scenarios by modifying the extension-matching pattern. For example, to group rotated log files by an eight-digit date stamp in the filename (gensub is gawk-specific):

find /var/log -type f -name "*.log*" -printf '%s %f\n' | \
awk '{date = gensub(/.*([0-9]{8}).*/, "\\1", 1, $2); size[date] += $1}
     END {for (d in size) print d, size[d]}'

Files whose names carry no date stamp fall through gensub unchanged and group under their own basename.
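
Against a /var/log with dated rotations (for example syslog.20240101.log), the output might look like this (illustrative numbers only):

20240101 5242880
20240102 4718592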