When processing text streams in Linux, particularly in CentOS environments, we often encounter HTML-encoded special characters that need conversion. Common examples include &
, <
, and "
which should be converted to &, <, and " respectively.
For basic entity decoding, we can use these approaches:
1. Using Perl One-Liner
echo ""test" & test" | perl -MHTML::Entities -pe 'decode_entities($_);'
2. With Python's html.parser
echo "<html>" | python3 -c '
import html
import sys
print(html.unescape(sys.stdin.read()))
'
For systems with recode
installed:
echo "é" | recode html..ascii
When dealing with mixed content in bash scripts:
#!/bin/bash
input=""complex" & example <tag>"
decoded=$(echo "$input" | python3 -c 'import html,sys; print(html.unescape(sys.stdin.read()))')
echo "$decoded"
- Perl solution is fastest for large streams
- Python offers most accurate HTML5 entity support
- recode handles charset conversions better
Watch out for these scenarios:
# Numeric entities
echo "&" | perl -MHTML::Entities -pe 'decode_entities($_);'
# Mixed encoding
echo "&amp;" | python3 -c 'import html; print(html.unescape(html.unescape(sys.stdin.read())))'
When processing text data in Linux pipelines, HTML special entities like &
or "
can break your workflow. These encoded characters often appear in web-scraped data, API responses, or user-generated content.
The most robust solution is to use Perl's HTML::Entities
module, which comes pre-installed on most CentOS systems:
# Basic usage echo '"test" &' | perl -MHTML::Entities -pe 'decode_entities($_);' # Pipeline example curl -s http://example.com/api | perl -MHTML::Entities -pe 'decode_entities($_);' | grep "search_term"
1. Using recode (if installed)
echo '<div>' | recode html..ascii
2. Python One-liner
echo '<tag>' | python3 -c 'import html, sys; print(html.unescape(sys.stdin.read()))'
3. PHP CLI Approach
echo '"quotes"' | php -r 'echo html_entity_decode(stream_get_contents(STDIN));'
For processing files with mixed content:
#!/bin/bash # Process a file with HTML entities input_file="data.txt" output_file="clean.txt" perl -MHTML::Entities -pe 'decode_entities($_);' < "$input_file" > "$output_file"
For large data streams, Perl generally outperforms other methods. Here's a benchmark test for 10,000 lines:
# Create test file yes '& < > "' | head -n 10000 > test.txt # Time Perl time perl -MHTML::Entities -pe 'decode_entities($_);' < test.txt > /dev/null # Time Python time python3 -c 'import html, sys; print(html.unescape(sys.stdin.read()))' < test.txt > /dev/null
Watch for these frequent offenders in your data:
& & < < > > " " ' '