How to Decode HTML Entities from Standard Input in Linux (CentOS/Bash Solutions)

When processing text streams in Linux, particularly in CentOS environments, we often encounter HTML-encoded special characters that need conversion. Common examples include &, <, and " which should be converted to &, <, and " respectively.

For basic entity decoding, we can use these approaches:

1. Using Perl One-Liner

echo "&quot;test&quot; &amp; test" | perl -MHTML::Entities -pe 'decode_entities($_);'

2. With Python's html.parser

echo "&lt;html&gt;" | python3 -c '
import html
import sys
print(html.unescape(sys.stdin.read()))
'

For systems with recode installed:

echo "&eacute;" | recode html..ascii

When dealing with mixed content in bash scripts:

#!/bin/bash
input="&quot;complex&quot; &amp; example &lt;tag&gt;"
decoded=$(echo "$input" | python3 -c 'import html,sys; print(html.unescape(sys.stdin.read()))')
echo "$decoded"

Perl solution is fastest for large streams
Python offers most accurate HTML5 entity support
recode handles charset conversions better

Watch out for these scenarios:

# Numeric entities
echo "&#38;" | perl -MHTML::Entities -pe 'decode_entities($_);'

# Mixed encoding
echo "&amp;amp;" | python3 -c 'import html; print(html.unescape(html.unescape(sys.stdin.read())))'

When processing text data in Linux pipelines, HTML special entities like & or " can break your workflow. These encoded characters often appear in web-scraped data, API responses, or user-generated content.

The most robust solution is to use Perl's HTML::Entities module, which comes pre-installed on most CentOS systems:

# Basic usage
echo '"test" &amp;' | perl -MHTML::Entities -pe 'decode_entities($_);'

# Pipeline example
curl -s http://example.com/api | perl -MHTML::Entities -pe 'decode_entities($_);' | grep "search_term"

1. Using recode (if installed)

echo '<div>' | recode html..ascii

2. Python One-liner

echo '&lt;tag&gt;' | python3 -c 'import html, sys; print(html.unescape(sys.stdin.read()))'

3. PHP CLI Approach

echo '"quotes"' | php -r 'echo html_entity_decode(stream_get_contents(STDIN));'

For processing files with mixed content:

#!/bin/bash
# Process a file with HTML entities
input_file="data.txt"
output_file="clean.txt"

perl -MHTML::Entities -pe 'decode_entities($_);' < "$input_file" > "$output_file"

For large data streams, Perl generally outperforms other methods. Here's a benchmark test for 10,000 lines:

# Create test file
yes '&amp; < > "' | head -n 10000 > test.txt

# Time Perl
time perl -MHTML::Entities -pe 'decode_entities($_);' < test.txt > /dev/null

# Time Python
time python3 -c 'import html, sys; print(html.unescape(sys.stdin.read()))' < test.txt > /dev/null

Watch for these frequent offenders in your data:

& &amp;
< &lt;
> &gt;
" &quot;
' &apos;

ServerDevWorker