When dealing with file transfers between different systems or character encodings, you might encounter filenames containing invalid or corrupted characters like this example:
009_-_�%86ndringshåndtering.html
These problematic characters often appear as replacement characters (�) or incorrectly encoded sequences, making files difficult to handle programmatically.
On Unix/Linux systems, the only truly invalid characters in filenames are:
- Forward slash (/) - directory separator
- Null character (\0) - string terminator
However, for practical purposes, we often want to remove:
- Non-ASCII characters (when working with legacy systems)
- Control characters
- Special characters that might cause issues in specific contexts
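The categories above can be checked programmatically before deciding what to strip. The following is a minimal sketch; the function name `classify_filename_chars` and the report layout are illustrative assumptions, not part of any standard tool.

```python
import unicodedata

def classify_filename_chars(name: str) -> dict:
    """Group the characters of `name` into the categories discussed above."""
    report = {"invalid": [], "control": [], "non_ascii": []}
    for ch in name:
        if ch in ("/", "\0"):
            # The only characters a Unix filename can never contain
            report["invalid"].append(ch)
        elif unicodedata.category(ch) == "Cc":
            # Control characters (tab, newline, DEL, C1 controls, ...)
            report["control"].append(ch)
        elif ord(ch) > 127:
            # Anything outside 7-bit ASCII, including U+FFFD
            report["non_ascii"].append(ch)
    return report

print(classify_filename_chars("009_-_\ufffd%86ndringsh\u00e5ndtering.html"))
```

An empty report means the name is already plain printable ASCII without separators.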
The tr command is indeed a good solution for this problem. Here's how to use it effectively:
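# Basic ASCII-only sanitization (keeps tab, LF, CR, and printable ASCII)
echo "009_-_�%86ndringshåndtering.html" | tr -cd '\11\12\15\40-\176'
# More permissive version: drop control characters and DEL, keep bytes above 127
echo "009_-_�%86ndringshåndtering.html" | tr -d '\000-\011\013\014\016-\037\177'
The first example keeps only tab, newline, carriage return, and printable ASCII (space through tilde). The second removes control characters (keeping newline and carriage return) but leaves bytes 128-255 in place, so multibyte UTF-8 sequences survive. Note that GNU tr operates on bytes, not characters, so the first command drops every byte of a multibyte character.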
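To see what the byte-wise filtering in the first tr command does to a UTF-8 name, here is a rough Python equivalent. It is only a sketch: it assumes the name is UTF-8 encoded, and `keep_printable_ascii` is a made-up helper name.

```python
# Bytes kept by the first tr command: tab, LF, CR, and space through tilde.
KEEP = bytes([0o11, 0o12, 0o15]) + bytes(range(0o40, 0o177))

def keep_printable_ascii(name: str) -> str:
    """Drop every byte that is not in KEEP, working on the UTF-8 encoding."""
    raw = name.encode("utf-8")
    kept = bytes(b for b in raw if b in KEEP)
    return kept.decode("ascii")

print(keep_printable_ascii("009_-_\ufffd%86ndringsh\u00e5ndtering.html"))
# prints: 009_-_%86ndringshndtering.html
```

Both the replacement character (three bytes in UTF-8) and å (two bytes) disappear completely, because none of their bytes fall in the kept range.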
For batch processing multiple files, combine find with a bash loop:
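find . -depth -name '*[^ -~]*' -exec bash -c '
for f; do
  dir=${f%/*}    # parent directory
  base=${f##*/}  # final path component only, so slashes are never touched
  new=${base//[^[:alnum:]_.-]/_}
  [ "$base" != "$new" ] && mv -i -- "$f" "$dir/$new"
done
' bash {} +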
This script:
- Finds all files and directories whose names contain characters outside printable ASCII
- Replaces invalid chars with underscores
- Only renames when necessary
- Preserves alphanumerics, dots, underscores and hyphens
For more control, here's a Python 3 script:
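import os
import re
import sys

def sanitize(filename):
    # Keep letters, digits, and safe punctuation; \w is Unicode-aware in
    # Python 3, so pass flags=re.ASCII to restrict it to ASCII
    filename = re.sub(r'[^ \w\-_.]', '_', filename)
    # Remove leading/trailing spaces
    filename = filename.strip()
    # Replace spaces with underscores
    filename = filename.replace(' ', '_')
    return filename

# Walk bottom-up so directories are renamed after their contents
for root, dirs, files in os.walk(sys.argv[1], topdown=False):
    for name in files + dirs:
        old = os.path.join(root, name)
        new = os.path.join(root, sanitize(name))
        # Skip unchanged names and never overwrite an existing entry
        if old != new and not os.path.exists(new):
            os.rename(old, new)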
If you want to preserve valid Unicode while removing only corrupted characters:
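# Using Perl for Unicode-aware processing
perl -MEncode -MFile::Find -e '
finddepth(sub {
    my $bytes = $_;
    my $name = decode("UTF-8", $bytes);   # invalid bytes become U+FFFD
    $name =~ s/[^\p{Letter}\p{Number}\-_.]/_/g;
    my $new = encode("UTF-8", $name);
    rename($bytes, $new) if $bytes ne $new;
}, @ARGV)' .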
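The same idea can be sketched in Python, which on POSIX systems hands undecodable filename bytes back as lone surrogates (U+DC80 to U+DCFF). The sketch below assumes "corrupted" means either those escaped bytes or a literal U+FFFD; `strip_corrupted` and `rename_tree` are hypothetical helper names.

```python
import os

def strip_corrupted(name: str) -> str:
    """Drop U+FFFD and surrogate-escaped bytes, keep all other characters."""
    return "".join(
        ch for ch in name
        if ch != "\ufffd" and not 0xDC80 <= ord(ch) <= 0xDCFF
    )

def rename_tree(top: str) -> None:
    # Walk bottom-up so a directory is renamed only after its contents.
    for root, dirs, files in os.walk(top, topdown=False):
        for name in files + dirs:
            new = strip_corrupted(name)
            if new and new != name:
                target = os.path.join(root, new)
                # Never overwrite an existing entry
                if not os.path.exists(target):
                    os.rename(os.path.join(root, name), target)

print(strip_corrupted("009_-_\ufffd%86ndringsh\u00e5ndtering.html"))
# prints: 009_-_%86ndringshåndtering.html
```

Unlike the ASCII-only approaches, this leaves å and every other valid character untouched.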
When dealing with file processing scripts or batch operations, we often encounter filenames containing invalid characters that break our workflows. These typically occur due to:
- Character encoding mismatches during file transfers
- Corrupted metadata in source systems
- Improper string handling in generating applications
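The first cause, an encoding mismatch, is easy to reproduce. The sketch below uses a hypothetical original name, `Ændringshåndtering.html`, chosen only for illustration, and shows both directions of a Latin-1/UTF-8 mix-up.

```python
original = "Ændringshåndtering.html"  # hypothetical original name

# Case 1: Latin-1 bytes misread as UTF-8.
# Each accented letter is a single byte that is not valid UTF-8 on its own,
# so the decoder substitutes the replacement character U+FFFD.
latin1_bytes = original.encode("latin-1")
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Case 2: UTF-8 bytes misread as Latin-1.
# Each accented letter is two bytes, and each byte becomes its own character.
utf8_bytes = original.encode("utf-8")
misread_as_latin1 = utf8_bytes.decode("latin-1")

print(misread_as_utf8)
print(repr(misread_as_latin1))
```

Case 1 produces replacement characters and loses the original bytes; case 2 is reversible as long as the garbled name is stored without further changes.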
The example filename 009_-_�%86ndringshåndtering.html demonstrates a common case where Unicode characters get mangled into replacement symbols.
For Linux/macOS environments, these methods work effectively:
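# Method 1: Using tr for basic sanitization (the hyphen goes last so it is not read as a range)
echo "009_-_�%86ndringshåndtering.html" | tr -dc '[:alnum:][:space:]._-'
# Method 2: Comprehensive sanitization including accented characters
sanitize() {
  # Transliterate first, then strip whatever is still outside the safe set
  printf '%s\n' "$1" | LC_ALL=C sed -e 's/ä/a/g' -e 's/ö/o/g' -e 's/å/a/g' -e 's/[^[:alnum:]_.-]//g'
}
# Method 3: For batch processing files
LC_ALL=C find . -type f -name "*[^[:alnum:]._-]*" | while IFS= read -r file; do
  newname=$(basename "$file" | LC_ALL=C tr -cd '[:alnum:]._-')
  # Rename in place, skip names that would become empty, never overwrite
  [ -n "$newname" ] && mv -n -- "$file" "$(dirname "$file")/$newname"
done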
For cross-platform reliability:
import re
import unicodedata

def sanitize_filename(filename):
    # Normalize unicode characters (splits å into a + combining ring)
    filename = unicodedata.normalize('NFKD', filename)
    # Remove invalid characters
    filename = re.sub(r'[^\w\-_. ]', '', filename).strip()
    return filename

print(sanitize_filename("009_-_�%86ndringshåndtering.html"))
# Output: 009_-_86ndringshandtering.html
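NFKD only helps for letters that decompose into a base letter plus a combining mark. Letters such as Æ, ø, or ß have no decomposition, so they need an explicit table. A minimal sketch follows; the `EXTRA` mapping and the `to_ascii` name are illustrative assumptions, and the chosen replacements are one convention among several.

```python
import unicodedata

# Hypothetical mapping for letters that NFKD does not decompose.
EXTRA = str.maketrans({"Æ": "AE", "æ": "ae", "Ø": "O", "ø": "o", "ß": "ss"})

def to_ascii(name: str) -> str:
    """Transliterate to ASCII: explicit table first, then NFKD, then drop the rest."""
    name = name.translate(EXTRA)
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks and anything else outside ASCII are discarded here
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Ændringshåndtering.html"))
# prints: AEndringshandtering.html
```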
Special considerations for production systems:
- Preserve file extensions during renaming
- Handle duplicate filenames (add incrementing numbers)
- Maintain case sensitivity where needed
- Log all changes for audit purposes
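# Safe rename with collision handling
# (assumes $original is set and sanitize is defined as in Method 2)
safename=$(sanitize "$original")
# Split once so the counter replaces the suffix instead of piling up
case $safename in
  *.*) base=${safename%.*}; ext=.${safename##*.} ;;
  *)   base=$safename; ext= ;;
esac
counter=1
while [ -e "$safename" ]; do
  safename="${base}_${counter}${ext}"
  counter=$((counter + 1))
done
mv -- "$original" "$safename"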
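The considerations listed above (keeping the extension, numbering duplicates, logging each change) can also be sketched in Python. The helper names `unique_target` and `safe_rename` and the logger name are assumptions made for this example.

```python
import logging
import os

log = logging.getLogger("rename")

def unique_target(directory: str, name: str) -> str:
    """Return `name`, or `stem_N.ext` with the first N that is not taken."""
    stem, ext = os.path.splitext(name)  # ext keeps its leading dot
    candidate = name
    counter = 1
    while os.path.exists(os.path.join(directory, candidate)):
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    return candidate

def safe_rename(path: str, new_name: str) -> str:
    """Rename `path` to `new_name` in the same directory without overwriting."""
    directory = os.path.dirname(path)
    if os.path.basename(path) == new_name:
        return path  # nothing to do
    target = os.path.join(directory, unique_target(directory, new_name))
    os.rename(path, target)
    log.info("renamed %r -> %r", path, target)
    return target
```

There is still a small window between the existence check and the rename, so this sketch is not safe against other processes creating files in the same directory at the same time.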