When dealing with filenames containing special characters on Linux, I recently hit a problem where about 500 of 10,000 files couldn't be processed by my UTF-8-based renaming script. The "unable to stat file..." errors pointed to encoding problems, but identifying the exact encodings programmatically proved challenging.
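A quick way to see what you're actually dealing with is `ls -b` (GNU coreutils): bytes your locale can't decode are printed as C-style octal escapes, while valid names display normally. Illustrative output for two files whose names both mean "café.jpg":

$ ls -b
café.jpg       # valid UTF-8: displayed as-is
caf\351.jpg    # a lone E9 byte: é in ISO-8859-1 / Windows-1252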
The `file` command won't work here (it examines file contents, not filenames), but we have several alternatives:
# 1. Using convmv to find names that are not already valid UTF-8.
#    This is a dry run: without --notest, convmv only reports, never renames.
convmv -f utf8 -t utf8 *.jpg

# 2. The enca utility can sometimes guess the encoding of the name bytes
#    (it analyses the whole ls output as one stream, so treat it as a hint)
ls | enca -L none
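As a sanity check on the `file` point above: `file -i` reports the charset of the contents, so a JPEG looks identical no matter how its name is encoded (illustrative output):

$ file -bi "caf$(printf '\351').jpg"
image/jpeg; charset=binary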
Here's a bash script that attempts to detect encodings for problematic filenames:
#!/bin/bash
# Candidate encodings to test (modify as needed). Multi-byte encodings go
# first: single-byte sets like ISO-8859-1 and KOI8-R accept every possible
# byte sequence, so a "match" there is only a weak hint and must come last.
ENCODINGS=("SHIFT_JIS" "GBK" "BIG5" "WINDOWS-1252" "KOI8-R" "ISO-8859-1")

for file in *; do
    # Skip names that are already valid UTF-8 (convmv dry run)
    if [[ $(convmv -f utf8 -t utf8 "$file" 2>&1) =~ "Skipping" ]]; then
        continue
    fi
    for enc in "${ENCODINGS[@]}"; do
        # iconv exits non-zero when the bytes are not valid in $enc,
        # which makes it a more reliable probe than convmv's exit status
        if printf '%s' "$file" | iconv -f "$enc" -t UTF-8 &>/dev/null; then
            echo "File '$file' might be in $enc encoding"
            break
        fi
    done
done
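To sanity-check the detector, seed a directory with a file whose name is known Latin-1 before running it (the script filename below is hypothetical):

# Create a file whose name contains the single byte 0xE9 (é in Latin-1)
touch "$(printf 'caf\351.jpg')"
bash detect_encodings.sh    # assumed name for the script above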
When you have files with different encodings, this Python approach can help:
import os
from chardet import detect

def detect_filename_encoding(filename):
    # os.fsencode() recovers the raw on-disk bytes of the name; encoding
    # via 'latin1' would fail on the surrogates Python uses to represent
    # undecodable filename bytes
    try:
        result = detect(os.fsencode(filename))
        return result['encoding']
    except Exception:
        return None

for root, dirs, files in os.walk('.'):
    for name in files:
        enc = detect_filename_encoding(name)
        if enc and enc.lower() != 'utf-8':
            print(f"{os.path.join(root, name)}: {enc}")
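The `os.fsencode()` call is the crux here: on Linux, Python decodes filenames using the surrogateescape error handler, and `fsencode()` reverses that to recover the raw bytes chardet needs. A minimal illustration, assuming the usual UTF-8 filesystem encoding:

import os

raw = b'caf\xe9.jpg'             # not valid UTF-8 (lone 0xE9 byte)
name = os.fsdecode(raw)          # str with a surrogate: 'caf\udce9.jpg'
assert os.fsencode(name) == raw  # round-trips back to the original bytes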
After detecting encodings, use this script to normalize all filenames:
#!/bin/bash
# Dictionary of known encoding mappings (glob_pattern -> encoding)
declare -A ENCODING_MAP=(
    ["pattern1*"]="ISO-8859-1"
    ["*spanish*"]="WINDOWS-1252"
    # Add more mappings as discovered
)

# Decode a raw name from the given encoding to UTF-8 on stdout;
# iconv exits non-zero if the bytes are invalid in that encoding
to_utf8() {
    printf '%s' "$1" | iconv -f "$2" -t UTF-8 2>/dev/null
}

process_file() {
    local file="$1" sku="$2" utf8name enc

    # Already valid UTF-8? (dry run; convmv prints "Skipping, already UTF-8")
    if convmv -f utf8 -t utf8 "$file" 2>&1 | grep -q "Skipping"; then
        mv -- "$file" "${file%.*}_${sku}.${file##*.}"
        return
    fi

    # Check against known patterns first
    for pattern in "${!ENCODING_MAP[@]}"; do
        if [[ $file == $pattern ]]; then
            # Compute the converted name with iconv rather than letting
            # convmv rename the file and then guessing what the new name is
            if utf8name=$(to_utf8 "$file" "${ENCODING_MAP[$pattern]}"); then
                mv -- "$file" "${utf8name%.*}_${sku}.${utf8name##*.}"
                return
            fi
        fi
    done

    # Fallback: try common encodings; ISO-8859-1 goes last because it
    # accepts any byte sequence and would otherwise always win
    for enc in WINDOWS-1252 GBK ISO-8859-1; do
        if utf8name=$(to_utf8 "$file" "$enc"); then
            mv -- "$file" "${utf8name%.*}_${sku}.${utf8name##*.}"
            return
        fi
    done
    echo "Could not process: $file"
}

# Main processing loop (assumes every file has an extension, e.g. .jpg)
sku_counter=1000
for file in *; do
    [[ -f $file ]] || continue    # skip directories and this script itself
    sku_counter=$((sku_counter + 1))
    process_file "$file" "SKU$sku_counter"
done
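Since this script renames files in place, it's worth rehearsing on a scratch copy first (the paths and script name here are hypothetical):

mkdir /tmp/rename-test
cp -a /data/images/. /tmp/rename-test/        # hypothetical source directory
(cd /tmp/rename-test && bash normalize.sh)    # assumed name for the script above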
At the scale of thousands of image files, it pays to isolate the problematic names first rather than probing everything. Here's a way to do that with standard Linux tools:
# First, collect filenames that contain bytes outside the ASCII range
# (LC_ALL=C makes grep byte-oriented, so invalid UTF-8 can't trip it up)
find . -type f | while IFS= read -r filename; do
    if printf '%s\n' "$filename" | LC_ALL=C grep -qP '[^\x00-\x7F]'; then
        echo "$filename" >> problematic_files.txt
    fi
done
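Since the loop only filters lines, it collapses to a single pipeline, with the usual caveat that filenames containing newlines break any line-based approach:

find . -type f | LC_ALL=C grep -P '[^\x00-\x7F]' > problematic_files.txt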
For per-name analysis, Python's `chardet` library can score each guess, this time rejecting low-confidence results:
#!/usr/bin/env python3
import os
import chardet
from pathlib import Path

def detect_filename_encoding(filename):
    # Analyse the raw on-disk bytes of the name
    result = chardet.detect(os.fsencode(filename))
    # Only trust reasonably confident guesses
    return result['encoding'] if result['confidence'] > 0.7 else None

for filepath in Path('.').glob('*'):
    enc = detect_filename_encoding(filepath.name)
    if enc and enc.lower() != 'utf-8':
        print(f"{filepath}: {enc}")
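Before filling an encoding map in by hand, a quick census of chardet's guesses shows which encodings dominate. A small sketch that reuses `detect_filename_encoding()` from the script above:

from collections import Counter
from pathlib import Path

counts = Counter()
for filepath in Path('.').rglob('*'):
    if not filepath.is_file():
        continue
    enc = detect_filename_encoding(filepath.name)  # defined in the script above
    if enc:
        counts[enc] += 1

for enc, n in counts.most_common():
    print(f"{enc}: {n}")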
Based on experience, these are the most frequent encodings you'll encounter:
- ISO-8859-1 (Latin-1)
- Windows-1252 (CP1252)
- ISO-8859-15 (Latin-9)
- MacRoman (for older Mac systems)
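A practical note on the first two: Windows-1252 and Latin-1 differ only in the 0x80-0x9F range, where CP1252 places printable punctuation (curly quotes, the euro sign) and Latin-1 has control characters. Real filenames almost never contain control characters, so a byte in that range is a strong hint of CP1252; a minimal check:

def likely_cp1252(raw: bytes) -> bool:
    # 0x80-0x9F: printable in Windows-1252, C1 control codes in ISO-8859-1
    return any(0x80 <= b <= 0x9F for b in raw)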
Here's a complete bash script that converts the names, keeps accented characters intact (in a UTF-8 locale), and tags each name with a SKU:
#!/bin/bash
SKU_PREFIX="ACME2023_"

process_file() {
    local src="$1"
    local dir name
    dir=$(dirname -- "$src")
    name=$(basename -- "$src")

    # Note: `file -bi` reports the charset of a file's *contents*, so it
    # is useless for names; instead test whether the name is valid UTF-8
    if ! printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 &>/dev/null; then
        # Not UTF-8: try the usual suspects (Latin-1 last; it matches anything)
        local enc converted
        for enc in WINDOWS-1252 ISO-8859-15 ISO-8859-1; do
            if converted=$(printf '%s' "$name" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
                name="$converted"
                break
            fi
        done
    fi

    # Build the new name: sanitise, add the SKU, lower-case the extension.
    # In a UTF-8 locale [[:alnum:]] matches accented letters, so they survive.
    local base="${name%.*}"
    local ext="${name##*.}"
    local newname="${SKU_PREFIX}${base//[^[:alnum:]_-]/_}.${ext,,}"
    mv -v -- "$src" "$dir/$newname"
}
export -f process_file
export SKU_PREFIX

# Passing the name as "$1" keeps odd filenames out of the bash -c command string
find . -type f -print0 | xargs -0 -I {} bash -c 'process_file "$1"' _ {}
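Each file is handled independently, so with thousands of images the same pipeline parallelises with GNU xargs' `-P` flag:

# Process up to 4 files concurrently (GNU xargs)
find . -type f -print0 | xargs -0 -P 4 -I {} bash -c 'process_file "$1"' _ {}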
Some filenames might contain mixed encodings or be completely malformed. For these cases, we need a fallback strategy:
# Fallback processing for stubborn files (--notest is deliberate here:
# this is the step that actually renames)
while IFS= read -r problem_file; do
    # windows-1252 and iso-8859-15 first; iso-8859-1 accepts any byte
    # sequence, so it only makes sense as the last resort
    for encoding in windows-1252 iso-8859-15 iso-8859-1; do
        if printf '%s' "$(basename -- "$problem_file")" | iconv -f "$encoding" -t UTF-8 &>/dev/null; then
            convmv -f "$encoding" -t utf-8 --notest "$problem_file"
            break
        fi
    done
done < problematic_files.txt
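As a final check, a recursive convmv dry run should report every remaining name as already UTF-8; anything it still wants to convert needs manual attention:

# Dry run (no --notest): changes nothing, only reports
convmv -f utf8 -t utf8 -r . 2>&1 | grep -v "Skipping"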