When working with text data, especially in system administration or log parsing, we often encounter strings with inconsistent whitespace: a mix of tabs and multiple spaces. This makes it difficult to use tools like cut reliably for field extraction, because cut treats every single delimiter character as a field boundary.
Consider this DNS record example:
test.de. 1547 IN SOA ns1.test.de. dnsmaster.test.de. 2012090701 900 1000 6000 600
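The initial attempt, sed "s/[\t[:space:]]+/[:space:]/g", doesn't work because:
- The + quantifier is only special in extended regular expressions, which need the -E flag (or -r in older GNU sed); in basic mode only the GNU-specific \+ form works
- [:space:] in the replacement is not a character class; sed inserts it as literal text
- Different sed implementations handle escapes such as \t and character classes differently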
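The first two points can be checked directly. The following is a minimal sketch; the sample line (with a run of tabs and a run of spaces) and the variable names are made up for illustration.

```shell
#!/bin/sh
# Illustrative input: two tabs, then three spaces (not taken from a real zone file).
line=$(printf 'test.de.\t\t1547   IN')

# Basic regex: "+" is an ordinary character, so the pattern looks for
# whitespace followed by a literal plus sign and matches nothing.
bre_result=$(printf '%s\n' "$line" | sed 's/[[:space:]]+/ /g')

# Extended regex: the match now works, but "[:space:]" in the
# replacement is copied into the output as plain text.
ere_result=$(printf '%s\n' "$line" | sed -E 's/[[:space:]]+/[:space:]/g')

printf '%s\n' "$ere_result"   # prints: test.de.[:space:]1547[:space:]IN
```

With the basic-regex form, the line comes back unchanged, which is why the original command appears to do nothing.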
Here are three working approaches:
# Spaces and tabs only (GNU and BSD sed with -E)
sed -E 's/[[:blank:]]+/ /g'
# All whitespace, e.g. also carriage returns and form feeds (GNU and BSD sed with -E)
sed -E 's/[[:space:]]+/ /g'
# Portable version (basic regex, no -E needed)
sed 's/[[:space:]][[:space:]]*/ /g'
Let's see how this works with our DNS record:
echo "test.de. 1547 IN SOA ns1.test.de. dnsmaster.test.de. 2012090701 900 1000 6000 600" |
sed -E 's/[[:space:]]+/ /g' |
cut -d " " -f 1,2,5
# Output:
# test.de. 1547 ns1.test.de.
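To see the effect on input that really contains mixed whitespace, printf can be used to build the record with explicit tabs. This is only a sketch: the exact placement of tabs and spaces below is an assumption, and the variable names are illustrative.

```shell
#!/bin/sh
# Hypothetical record with tabs and multiple spaces between fields.
record=$(printf 'test.de.\t\t1547   IN \t SOA\tns1.test.de.')

# Collapse every whitespace run to one space (portable basic-regex form),
# then let cut pick fields 1, 2 and 5 using a single space as delimiter.
fields=$(printf '%s\n' "$record" |
    sed 's/[[:space:]][[:space:]]*/ /g' |
    cut -d ' ' -f 1,2,5)

printf '%s\n' "$fields"   # prints: test.de. 1547 ns1.test.de.
```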
While sed works well, sometimes other tools might be more appropriate:
# Using tr (simpler but less flexible)
tr -s '[:blank:]' ' '
# Using awk (better for complex cases)
awk '{$1=$1};1'
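Keep in mind how these solutions behave:
- sed and tr reduce leading and trailing whitespace to a single space rather than removing it; awk '{$1=$1};1' strips it entirely
- tr converts newlines to spaces if they are included in the character class, as with [:space:]; sed works line by line, so newlines are left alone
- Unicode whitespace characters may be handled differently depending on the locale and implementation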
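The difference in leading and trailing whitespace handling is easy to miss, so here is a small comparison. It is a sketch using a made-up padded string; the brackets in the output only make the edges visible.

```shell
#!/bin/sh
# Illustrative input: two leading spaces, a tab in the middle, two trailing spaces.
padded=$(printf '  a \t b  ')

sed_out=$(printf '%s\n' "$padded" | sed -E 's/[[:blank:]]+/ /g')
tr_out=$(printf '%s\n' "$padded" | tr -s '[:blank:]' ' ')
awk_out=$(printf '%s\n' "$padded" | awk '{$1=$1};1')

printf '[%s]\n' "$sed_out"   # prints: [ a b ]
printf '[%s]\n' "$tr_out"    # prints: [ a b ]
printf '[%s]\n' "$awk_out"   # prints: [a b]
```

If the first field matters for cut, the single leading space left by sed and tr shifts every field number by one, which is one reason to prefer awk for padded input.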
When processing text data in Unix/Linux environments, we often encounter messy formatting where fields are separated by inconsistent whitespace: tabs, multiple spaces, or combinations of both. This becomes particularly problematic with tools like cut that rely on consistent delimiters.
Consider this DNS zone file entry with irregular spacing:
test.de. 1547 IN SOA ns1.test.de. dnsmaster.test.de. 2012090701 900 1000 6000 600
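The initial approach, sed "s/[\t[:space:]]+/[:space:]/g", doesn't work because:
- Without -E, + is a literal character in a basic regular expression, not a quantifier
- The replacement [:space:] is inserted as literal text; character classes only have meaning inside a bracket expression in the pattern
- \t inside a bracket expression is not portable, and the POSIX class [[:space:]] already covers tabs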
POSIX Compliant Solution
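sed -e 's/[[:space:]]\{1,\}/ /g'
This handles all whitespace characters (spaces, tabs, etc.) and replaces each run of one or more with a single space. The \{1,\} interval is used because \+ is a GNU extension rather than part of POSIX basic regular expressions.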
GNU sed Extended Regex
sed -E 's/\s+/ /g'
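Because \s is a GNU extension, a script that has to run on several systems could probe for it and fall back to the POSIX form. The following is one possible sketch; the function name collapse_ws is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Probe whether this sed understands the \s shorthand.
if [ "$(printf 'a\t b\n' | sed -E 's/\s+/ /g' 2>/dev/null)" = 'a b' ]; then
    collapse_ws() { sed -E 's/\s+/ /g'; }
else
    # Fall back to the POSIX class with an interval expression.
    collapse_ws() { sed 's/[[:space:]]\{1,\}/ /g'; }
fi

printf 'test.de.\t1547   IN\n' | collapse_ws   # prints: test.de. 1547 IN
```

Either branch yields the same result for spaces and tabs, so callers do not need to know which one was chosen.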
Practical Example with DNS Data
echo "test.de. 1547 IN SOA ns1.test.de." | sed -e 's/[[:space:]]\{1,\}/ /g'
Output: test.de. 1547 IN SOA ns1.test.de.
For processing entire files and also removing the trailing space left at the end of each line (empty lines pass through unchanged):
sed -e 's/[[:space:]]\{1,\}/ /g' -e '/^$/! s/ $//' zonefile.txt
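If leading whitespace should be dropped as well, a third expression can be added. This is a sketch of one way to do it; normalize is a hypothetical helper name and the input lines are invented for illustration.

```shell
#!/bin/sh
# Collapse whitespace runs, then trim the single space that may remain
# at the start and at the end of each line.
normalize() {
    sed -e 's/[[:space:]]\{1,\}/ /g' -e 's/^ //' -e 's/ $//'
}

# Three lines of input: padded text, an empty line, a tab-separated pair.
out=$(printf '  a \t b  \n\nc\td\n' | normalize)
printf '%s\n' "$out"
```

The empty line survives as an empty line because none of the three expressions match it.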
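For large files, these alternatives may be more efficient. Note that tr should use [:blank:] here, since [:space:] includes the newline and would join every line into one:
tr -s '[:blank:]' ' ' < inputfile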
Or using awk:
awk '{$1=$1};1' file
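The choice of character class matters more for tr than for sed, because tr sees the whole stream including newlines. A short sketch with made-up two-line input shows the difference:

```shell
#!/bin/sh
# Illustrative two-line input: double space on line 1, tab on line 2.
input=$(printf 'a  b\nc\td')

# [:space:] contains the newline, so both lines are merged into one.
space_out=$(printf '%s\n' "$input" | tr -s '[:space:]' ' ')

# [:blank:] covers only space and tab, so the line structure is kept.
blank_out=$(printf '%s\n' "$input" | tr -s '[:blank:]' ' ')

printf '[%s]\n' "$space_out"   # prints: [a b c d ]
printf '%s\n' "$blank_out"     # prints "a b" and "c d" on separate lines
```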