How to Detect and Convert Non-UTF-8 Filename Encodings in Linux for Bulk File Processing



While bulk-processing filenames containing special characters on Linux, I recently ran into a problem: about 500 out of 10,000 files couldn't be handled by my UTF-8-based renaming script. The "unable to stat file..." errors pointed to encoding issues, but identifying the exact encodings programmatically proved surprisingly difficult.
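
A quick, read-only way to confirm that invalid UTF-8 bytes in the names (rather than anything else) are behind the failures is to round-trip the directory listing through iconv. This check is a generic sketch, not part of the original renaming script:

# iconv exits non-zero at the first byte sequence that is not valid UTF-8
if ls | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "all filenames in this directory are valid UTF-8"
else
    echo "at least one filename contains non-UTF-8 bytes"
fi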

The file command won't work for filenames (only file contents), but we have several alternatives:

# 1. Using convmv in its default dry-run mode for preliminary detection
#    (without --notest nothing is renamed; names that are already valid
#    UTF-8 are reported as "Skipping")
convmv -f utf8 -t utf8 *.jpg

# 2. The enca utility can sometimes help
ls | enca -L none -G

Here's a bash script that attempts to detect encodings for problematic filenames:

#!/bin/bash

# Common encodings to test (modify as needed)
ENCODINGS=("ISO-8859-1" "WINDOWS-1252" "SHIFT_JIS" "GBK" "BIG5" "KOI8-R")

for file in *; do
    # Skip names that are already valid UTF-8 (convmv reports them as "Skipping")
    if [[ $(convmv -f utf8 -t utf8 "$file" 2>&1) =~ "Skipping" ]]; then
        continue
    fi

    for enc in "${ENCODINGS[@]}"; do
        # Dry run (convmv's default): a proposed "mv" line means the name
        # converts cleanly from this encoding
        if convmv -f "$enc" -t utf8 "$file" 2>&1 | grep -q "mv "; then
            echo "File '$file' might be in $enc encoding"
            break
        fi
    done
done

When you have files with different encodings, this Python approach can help:

import os
from chardet import detect

def detect_filename_encoding(raw_name):
    """Guess the encoding of a filename given as raw bytes."""
    try:
        result = detect(raw_name)
        return result['encoding']
    except Exception:
        return None

for root, dirs, files in os.walk('.'):
    for file in files:
        # os.fsencode() recovers the raw on-disk bytes of the name,
        # undoing Python's surrogateescape decoding
        enc = detect_filename_encoding(os.fsencode(file))
        if enc and enc.lower() not in ('utf-8', 'ascii'):
            print(f"{file!r}: {enc}")

After detecting encodings, use this script to normalize all filenames:

#!/bin/bash

# Dictionary of known encoding mappings (filename_pattern:encoding)
declare -A ENCODING_MAP=(
    ["pattern1*"]="ISO-8859-1"
    ["*spanish*"]="WINDOWS-1252"
    # Add more mappings as discovered
)

process_file() {
    local file="$1"
    local sku="$2"
    local newname="${file%.*}_${sku}.${file##*.}"
    local enc=""

    # Already valid UTF-8: just append the SKU
    if [[ $(convmv -f utf8 -t utf8 "$file" 2>&1) =~ "Skipping" ]]; then
        mv "$file" "$newname"
        return
    fi

    # Check against known patterns
    for pattern in "${!ENCODING_MAP[@]}"; do
        if [[ $file == $pattern ]]; then
            enc="${ENCODING_MAP[$pattern]}"
            break
        fi
    done

    # Fallback: dry-run common encodings until one would convert cleanly
    if [[ -z "$enc" ]]; then
        for candidate in "WINDOWS-1252" "ISO-8859-1" "GBK"; do
            if convmv -f "$candidate" -t utf8 "$file" 2>&1 | grep -q "mv "; then
                enc="$candidate"
                break
            fi
        done
    fi

    if [[ -z "$enc" ]]; then
        echo "Could not process: $file"
        return
    fi

    # Append the SKU first, then convert in place: convmv renames the file
    # when it converts, so this order keeps the resulting path predictable
    mv "$file" "$newname"
    convmv -f "$enc" -t utf8 --notest "$newname"
}

# Main processing loop
sku_counter=1000
for file in *; do
    sku_counter=$((sku_counter+1))
    process_file "$file" "SKU$sku_counter"
done
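
Because the loop renames files in place, it's worth rehearsing on a throwaway copy of a representative sample before unleashing it on the full set. A minimal sketch, where the scratch directory and the saved script name are only placeholders:

# Rehearse on copies first; cp -a preserves the names byte-for-byte
mkdir -p /tmp/rename_test                             # scratch location (placeholder)
cp -a -- ./*.jpg /tmp/rename_test/                    # substitute a representative sample
(cd /tmp/rename_test && bash /path/to/normalize.sh)   # placeholder path for the script above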

When you're dealing with thousands of image files containing special characters, encoding detection becomes the crucial step. Here's a more end-to-end approach using standard Linux tools; note that where file appears below it is fed the filename bytes on stdin, since pointing it at the file itself would only classify the contents.


# First, get a list of problematic files (names containing bytes outside ASCII)
find . -type f | while IFS= read -r filename; do
    if echo "$filename" | LC_ALL=C grep -qP '[^\x00-\x7F]'; then
        echo "$filename" >> problematic_files.txt
    fi
done

The chardet idea from earlier can be refined with a confidence threshold so that weak guesses are ignored:


#!/usr/bin/env python3
import os
import chardet
from pathlib import Path

def detect_filename_encoding(raw_name):
    """Guess the encoding of a raw (bytes) filename, ignoring low-confidence guesses."""
    try:
        result = chardet.detect(raw_name)
        return result['encoding'] if result['confidence'] > 0.7 else None
    except Exception:
        return None

for filepath in Path('.').glob('*'):
    # Recover the raw on-disk bytes of the name before feeding them to chardet
    raw_name = os.fsencode(filepath.name)
    enc = detect_filename_encoding(raw_name)
    if enc and enc.lower() not in ('utf-8', 'ascii'):
        print(f"{raw_name!r}: {enc}")

Based on experience, these are the most frequent encodings you'll encounter (the snippet after the list shows one way to try them against a single problematic name):

  • ISO-8859-1 (Latin-1)
  • Windows-1252 (CP1252)
  • ISO-8859-15 (Latin-9)
  • MacRoman (for older Mac systems)
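
When it isn't obvious which of these a given name uses, the quickest test is to decode the raw bytes under each candidate and see which interpretation reads sensibly. A small sketch using iconv against the first entry collected in problematic_files.txt (MACINTOSH is glibc iconv's name for MacRoman):

# Show how one problematic filename decodes under each candidate encoding
name=$(head -n 1 problematic_files.txt)
for enc in ISO-8859-1 WINDOWS-1252 ISO-8859-15 MACINTOSH; do
    printf '%-14s: ' "$enc"
    printf '%s' "$name" | iconv -f "$enc" -t UTF-8 2>/dev/null || printf '(conversion failed)'
    echo
done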

Here's a complete bash script that handles the conversion while preserving special characters and adding the SKU identifier:


#!/bin/bash
SKU_PREFIX="ACME2023_"

process_file() {
    local src="$1"
    local dir base ext encoding newname

    dir=$(dirname "$src")
    base=$(basename "$src")

    # Detect the encoding of the filename bytes themselves by feeding the
    # name to file(1) on stdin; pointing file at the path would classify contents
    encoding=$(printf '%s' "$base" | file -bi - | awk -F "=" '{print $2}')

    # Default to UTF-8 if detection fails
    [[ -z "$encoding" ]] && encoding="utf-8"

    # Convert the filename to UTF-8 if needed; convmv renames the file on disk,
    # so recompute the resulting path by applying the same conversion with iconv
    if [[ "$encoding" != "utf-8" && "$encoding" != "us-ascii" && "$encoding" != "binary" ]]; then
        convmv -f "$encoding" -t UTF-8 --notest "$src"
        base=$(printf '%s' "$base" | iconv -f "$encoding" -t UTF-8)
        src="${dir}/${base}"
    fi

    # Generate the new name: sanitized base name, SKU prefix, lower-cased extension
    ext="${base##*.}"
    base="${base%.*}"
    newname="${dir}/${SKU_PREFIX}${base//[^[:alnum:]_-]/_}.${ext,,}"

    # Perform the rename
    mv -v "$src" "$newname"
}

export -f process_file
export SKU_PREFIX

# Pass each path as a positional argument so special characters in names
# can't be re-interpreted by the inner shell
find . -type f -print0 | xargs -0 -I {} bash -c 'process_file "$1"' _ {}

Some filenames might contain mixed encodings or be completely malformed. For these cases, we need a fallback strategy:


# Fallback processing for stubborn files; read line by line so names
# containing spaces are not split apart
while IFS= read -r problem_file; do
    for encoding in iso-8859-1 windows-1252 iso-8859-15; do
        convmv -f "$encoding" -t utf-8 --notest "$problem_file" 2>/dev/null && break
    done
done < problematic_files.txt
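
After the fallback pass, it's worth re-scanning the whole tree so nothing slips through silently. A simple verification sketch that reuses the iconv round-trip idea; no output means every path is now valid UTF-8:

# List any paths that still contain non-UTF-8 byte sequences
find . -print | while IFS= read -r path; do
    printf '%s' "$path" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 || printf '%s\n' "$path"
done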