How to Programmatically Remove Invalid Characters from Filenames in Linux/Unix Systems


When dealing with file transfers between different systems or character encodings, you might encounter filenames containing invalid or corrupted characters like this example:

009_-_�%86ndringshåndtering.html

These problematic characters often appear as replacement characters (�) or incorrectly encoded sequences, making files difficult to handle programmatically.
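To find such names before renaming anything, one option is to look at the raw bytes of each name. The sketch below assumes file names are expected to be UTF-8; the helper names `looks_corrupted` and `corrupted_names` are made up for this illustration.

```python
import os

REPLACEMENT = "\ufffd"  # U+FFFD, the character usually rendered as a question-mark diamond


def looks_corrupted(raw_name: bytes) -> bool:
    """Return True for names that are not valid UTF-8 or already contain U+FFFD."""
    try:
        text = raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return REPLACEMENT in text


def corrupted_names(directory: bytes = b".") -> list:
    # Passing a bytes path makes os.listdir return the raw, undecoded names.
    return [name for name in os.listdir(directory) if looks_corrupted(name)]
```

Calling `corrupted_names(b".")` returns the suspicious names in the current directory as bytes, so they can be inspected or passed to `os.rename` without any decoding step.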

On Unix/Linux systems, the only truly invalid characters in filenames are:

  • Forward slash (/) - directory separator
  • Null character (\0) - string terminator

However, for practical purposes, we often want to remove:

  • Non-ASCII characters (when working with legacy systems)
  • Control characters
  • Special characters that might cause issues in specific contexts
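The categories above can be made concrete with a small classifier. This is a minimal sketch: the function name `problem_characters` and the report layout are assumptions for illustration, and the context-dependent "special characters" category is left out because it varies by use case.

```python
import unicodedata


def problem_characters(name: str) -> dict:
    """Group the characters of a name into the categories discussed above."""
    report = {"illegal": [], "control": [], "non_ascii": []}
    for ch in name:
        if ch in ("/", "\0"):
            # The only characters a Unix file name can never contain.
            report["illegal"].append(ch)
        elif unicodedata.category(ch) == "Cc":
            # Unicode general category Cc covers the control characters.
            report["control"].append(ch)
        elif ord(ch) > 127:
            report["non_ascii"].append(ch)
    return report
```

A name such as `"a/b\tc"` would report one illegal character (the slash) and one control character (the tab).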

The tr command handles the simple cases well. GNU tr works byte by byte, so a multibyte UTF-8 character is kept or removed as a sequence of individual bytes. Here's how to use it effectively:

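# Basic ASCII-only sanitization: keep tab, newline, CR, and printable ASCII
echo "009_-_�%86ndringshåndtering.html" | tr -cd '\11\12\15\40-\176'

# More permissive version: delete only control characters and DEL, keep bytes above 127
echo "009_-_�%86ndringshåndtering.html" | tr -d '\000-\011\013-\037\177'

The first example keeps printable ASCII (space through tilde) plus tab, newline, and carriage return; the newline is kept so that the line terminator added by echo survives. The second removes control characters other than newline and keeps every byte above 127, so valid UTF-8 sequences pass through unchanged.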
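The same kind of byte-level filter can be sketched in Python with `bytes.translate`. This version keeps only printable ASCII (0x20 through 0x7E); the names `KEEP`, `DELETE`, and `ascii_only` are illustrative, not part of any library.

```python
# Bytes to keep: space (0x20) through tilde (0x7E).
KEEP = bytes(range(0x20, 0x7F))
# Every other byte value is deleted.
DELETE = bytes(b for b in range(256) if b not in KEEP)


def ascii_only(raw_name: bytes) -> bytes:
    """Drop every byte outside printable ASCII, much like tr -cd on that range."""
    return raw_name.translate(None, DELETE)
```

Because the filter works on bytes, each multibyte UTF-8 character disappears entirely instead of leaving a partial sequence behind.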

For batch processing multiple files, combine find with bash parameter expansion:

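# LC_ALL=C makes the patterns byte-based; -depth renames contents before their directory
LC_ALL=C find . -depth -name '*[^ -~]*' -exec bash -c '
  for f; do
    dir=${f%/*}      # parent directory, left untouched
    base=${f##*/}    # only the last path component is sanitized
    new=${base//[^[:alnum:]_.-]/_}
    [ "$base" != "$new" ] && mv -i -- "$f" "$dir/$new"
  done
' bash {} +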

This script:

  1. Finds every file or directory whose name contains a character outside printable ASCII (space through tilde)
  2. Replaces invalid chars with underscores
  3. Only renames when necessary
  4. Preserves alphanumerics, dots, underscores and hyphens

For more control, here's a Python 3 script:

import os
import re
import sys

def sanitize(filename):
    # Replace everything except word characters (Unicode letters, digits,
    # underscore), spaces, hyphens, and dots
    filename = re.sub(r'[^ \w\-.]', '_', filename)
    # Remove leading/trailing spaces
    filename = filename.strip()
    # Replace spaces with underscores
    filename = filename.replace(' ', '_')
    return filename

# Walk bottom-up so a directory is renamed only after its contents
for root, dirs, files in os.walk(sys.argv[1], topdown=False):
    for name in files + dirs:
        old = os.path.join(root, name)
        new = os.path.join(root, sanitize(name))
        # Skip unchanged names and never overwrite an existing entry
        if old != new and not os.path.exists(new):
            os.rename(old, new)

If you want to preserve valid Unicode letters and digits while replacing corrupted characters and other punctuation:

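# Using Perl for Unicode-aware processing (assumes names are meant to be UTF-8)
perl -e 'use Encode qw(decode encode); use File::Find;
finddepth(sub {
    my $old = $_;
    my $name = decode("UTF-8", $old);    # malformed bytes decode to U+FFFD
    (my $new = $name) =~ s/[^\p{Letter}\p{Number}\-_.]/_/g;
    rename($old, encode("UTF-8", $new)) if $new ne $name;
}, @ARGV)' .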

When dealing with file processing scripts or batch operations, we often encounter filenames containing invalid characters that break our workflows. These typically occur due to:

  • Character encoding mismatches during file transfers
  • Corrupted metadata in source systems
  • Improper string handling in generating applications

The example filename 009_-_�%86ndringshåndtering.html demonstrates a common case where Unicode characters get mangled into replacement symbols.
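One plausible way such a name arises (a guess for illustration, not something the example itself confirms) is a percent-encoded UTF-8 name that was only half decoded. The sketch below assumes the original first letter was "Æ" and reproduces the visible "�%86" prefix.

```python
from urllib.parse import quote

first_letter = "\u00c6"  # "Æ", an assumed original character

# UTF-8 encodes it as two bytes, which percent-encoding writes as two escapes.
escaped = quote(first_letter)  # '%C3%86'

# Hypothetical fault: the first escape is turned back into a raw byte,
# while the second escape is left as literal text.
half_decoded = bytes([0xC3]) + escaped[3:].encode("ascii")  # b'\xc3%86'

# A lone 0xC3 lead byte is not valid UTF-8, so a lenient decoder
# substitutes the replacement character U+FFFD for it.
shown = half_decoded.decode("utf-8", errors="replace")
print(shown == "\ufffd%86")  # True
```

If this is what happened, the name could in principle be repaired rather than stripped, but that depends on knowing the original encoding.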

For Linux/macOS environments, these methods work effectively:


# Method 1: Using tr for basic sanitization (the hyphen goes last so it is not read as a range)
echo "009_-_�%86ndringshåndtering.html" | tr -dc '[:alnum:][:space:]._-'

# Method 2: Comprehensive sanitization including accented characters
# Transliterate first, then strip whatever is still outside the safe set
sanitize() {
  printf '%s\n' "$1" | sed -e 's/ä/a/g' -e 's/ö/o/g' -e 's/å/a/g' -e 's/[^[:alnum:]_.-]//g'
}

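# Method 3: For batch processing files
find . -type f -name "*[^[:alnum:]._-]*" | while IFS= read -r file; do
  dir=$(dirname "$file")
  # Keep only ASCII letters, digits, dot, underscore, and hyphen
  newname=$(basename "$file" | LC_ALL=C tr -cd '[:alnum:]._-')
  # Rename in place; skip empty results and do not overwrite existing files
  [ -n "$newname" ] && [ "$file" != "$dir/$newname" ] && mv -n -- "$file" "$dir/$newname"
done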

For cross-platform reliability:


import re
import unicodedata

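def sanitize_filename(filename):
    # Decompose accented characters (NFKD) so that "å" becomes "a" plus a combining mark
    filename = unicodedata.normalize('NFKD', filename)
    # Remove everything except word characters, hyphens, dots, and spaces;
    # this also drops the combining marks and the replacement character
    filename = re.sub(r'[^\w\-. ]', '', filename).strip()
    return filename

print(sanitize_filename("009_-_�%86ndringshåndtering.html"))
# Output: 009_-_86ndringshandtering.html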

Special considerations for production systems:

  • Preserve file extensions during renaming
  • Handle duplicate filenames (add incrementing numbers)
  • Maintain case sensitivity where needed
  • Log all changes for audit purposes

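# Safe rename with collision handling
# Assumes $original names a file in the current directory
safename=$(sanitize "$original")
# Split the name once so suffixes do not pile up (name_1_2.ext)
case $safename in
    *.*) base=${safename%.*}; ext=.${safename##*.} ;;
    *)   base=$safename; ext= ;;
esac
candidate=$safename
counter=1
while [ -e "$candidate" ]; do
    candidate="${base}_${counter}${ext}"
    counter=$((counter + 1))
done
mv -- "$original" "$candidate"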
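
The production considerations listed above can also be combined in one Python helper. This is a sketch under stated assumptions: the names `unique_target` and `safe_rename`, the ASCII-only safe set, and the log format are all choices made for illustration.

```python
import logging
import os
import re

log = logging.getLogger(__name__)


def unique_target(directory: str, name: str) -> str:
    """Return name, or name with _1, _2, ... before the extension if it is taken."""
    stem, ext = os.path.splitext(name)
    candidate = name
    counter = 1
    while os.path.exists(os.path.join(directory, candidate)):
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    return candidate


def safe_rename(path: str) -> str:
    """Sanitize the last path component, avoid collisions, and log the change."""
    directory, name = os.path.split(path)
    # Dots are kept, so the extension survives; letter case is left untouched.
    clean = re.sub(r"[^A-Za-z0-9_.-]", "_", name)
    if clean == name:
        return path
    target = os.path.join(directory, unique_target(directory, clean))
    os.rename(path, target)
    log.info("renamed %r -> %r", path, target)
    return target
```

The existence check and the rename are separate steps, so another process could still create the target in between; a stricter version would need an atomic no-replace rename, which is outside this sketch.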