When working with Lucene indexes distributed across multiple directories, you often need to analyze disk usage by file extension. The typical directory structure looks like:
0/index/_2z6.frq
0/index/_2z6.fnm
1/index/_1sq.frq
1/index/_1sq.fnm
...
Here's a one-liner that handles this (note that it re-scans the tree once per extension, so it's best suited to modest file counts):
find . -type f -name "*.*" | awk -F. '{print $NF}' | sort -u | xargs -I EXT sh -c 'echo -n "EXT "; find . -type f -name "*.EXT" -exec du -ch {} + | grep total$ | cut -f1' | sort
For better readability and easier post-processing, this expanded version works well:
#!/bin/bash
# Find all files with extensions
find . -type f | grep -E "\\.[a-z0-9]+$" | \
# Extract extensions
sed -E 's/.*\.([a-z0-9]+)$/\1/' | \
# Count occurrences
sort | uniq -c | \
# Process each extension
while read count ext; do
# Calculate total size
size=$(find . -type f -name "*.$ext" -exec du -cb {} + | awk '/total$/ {s+=$1} END {print s}')
echo "$ext $size"
done | sort -k2 -nr
For very large directory structures, GNU parallel can speed things up by running the per-extension passes concurrently:
find . -type f -printf "%f\n" | \
awk -F. '{if (NF>1) print $NF}' | \
sort | uniq | \
parallel -I @@ 'echo -n "@@ "; find . -type f -name "*.@@" -exec du -cb {} + | awk '\''/total$/ {s+=$1} END {print s}'\'' '
To get nicely aligned output with sizes in kilobytes (GNU find):
find . -type f -printf "%f\t%s\n" | awk -F'\t' '{n=split($1,a,"."); if (n>1) sum[a[n]]+=$2} END {for (e in sum) printf "%-6s %8d KB\n", e, sum[e]/1024}' | sort
Some important considerations (a short sketch combining them follows this list):
- Files without extensions: add a grep -E "\\.[a-z0-9]+$" filter to skip them
- Case sensitivity: use -iname instead of -name
- Hidden directories: add -not -path '*/.*' to exclude them
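As a quick sketch of how those filters slot into the commands above (the .frq pattern below is only an example, not part of the original commands):
# List the unique extensions, skipping hidden directories and extension-less files
find . -type f -not -path '*/.*' | grep -E "\\.[a-z0-9]+$" | sed -E 's/.*\.([a-z0-9]+)$/\1/' | sort -u
# Case-insensitive total for one extension, hidden directories excluded
find . -type f -not -path '*/.*' -iname "*.frq" -exec du -ch {} + | grep total$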
When working with Lucene indexes in a Linux environment, you'll often encounter directories containing numerous files with different extensions (.frq, .fnm, etc.). Developers frequently need to analyze disk usage patterns by file type for optimization or monitoring purposes.
Here's an efficient one-liner that provides the exact output format requested (GNU find's -printf supplies the sizes, so no per-file stat calls are needed):
find . -type f -name "*.*" -printf "%f\t%s\n" | \
awk -F'\t' '{n=split($1,a,"."); ext="."a[n]; size[ext]+=$2} END {for (e in size) print e, size[e]}' | \
sort
For systems with many files, GNU parallel can spread the per-file size lookups across CPU cores:
find . -type f -print0 | \
parallel -0 --bar 'echo -n "{/} "; stat -c %s {}' | \
awk '{n=split($1,a,"."); if (n>1) {ext="."a[n]; size[ext]+=$2}} END {for (e in size) print e, size[e]}' | \
sort -k2 -n
- find: Recursively locates all files (and, with -printf, reports each file's name and size)
- awk: Splits the filename at its last dot to isolate the extension and sums sizes per extension
- stat -c "%s": Gets a file's size in bytes (used in the parallel variant)
- parallel: Processes files concurrently
Running against a test Lucene index directory might yield:
.cfs 1024000
.fdt 524288
.fdx 32768
.fnm 65536
.frq 393216
.prx 196608
.tii 131072
.tis 262144
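If you'd rather see human-readable sizes than raw byte counts, one option (assuming GNU coreutils is available) is to append numfmt to convert the byte column; a sketch based on the one-liner above:
find . -type f -name "*.*" -printf "%f\t%s\n" | \
awk -F'\t' '{n=split($1,a,"."); size["."a[n]]+=$2} END {for (e in size) print e, size[e]}' | \
sort | numfmt --field=2 --to=iec
The same numfmt suffix works on the parallel variant's output, since both produce the same two-column extension/bytes format.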
For very large directory trees (100k+ files), consider the following (see the sketch after this list):
- Adding -maxdepth to find to limit recursion
- Using nice to reduce CPU priority
- Outputting to temporary files for batch processing
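A rough sketch of how those ideas might be combined; the depth limit, niceness value, and temp-file handling are arbitrary illustrative choices:
# Stage the raw name/size listing in a temp file, with limited depth and lowered priority
tmp=$(mktemp)
nice -n 19 find . -maxdepth 3 -type f -name "*.*" -printf "%f\t%s\n" > "$tmp"
awk -F'\t' '{n=split($1,a,"."); size["."a[n]]+=$2} END {for (e in size) print e, size[e]}' "$tmp" | sort
rm -f "$tmp"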
This technique adapts easily to other scenarios by changing the grouping key. For example, to analyze log file sizes by an embedded date stamp (gensub requires gawk):
find /var/log -type f -name "*.log*" -printf "%f\t%s\n" | \
gawk -F'\t' '{d=gensub(/.*([0-9]{8}).*/,"\\1",1,$1); size[d]+=$2} END {for (e in size) print e, size[e]}'