How to Replace Multiple Spaces and Tabs with a Single Space in sed for Better Text Processing



When working with text data, especially in system administration or log parsing, we often encounter strings with inconsistent whitespace - a mix of tabs and multiple spaces. This makes it difficult to reliably use tools like cut for field extraction.

Consider this DNS record example:

test.de.          1547    IN      SOA     ns1.test.de. dnsmaster.test.de. 2012090701 900 1000 6000 600

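The initial attempt using sed "s/[\t[:space:]]+/[:space:]/g" doesn't work because:

  • Without -E (or -r in older GNU sed), sed uses basic regular expressions, where + is a literal plus sign rather than a quantifier
  • [:space:] has no special meaning in the replacement, so it would be inserted as literal text instead of a space
  • \t inside a bracket expression is redundant ([:space:] already includes the tab) and not portable: GNU sed reads it as a tab, while POSIX treats it as a backslash and the letter t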
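
The two main problems can be seen in isolation with a small sketch. The input string `a  b` is made up for illustration, and the pattern is simplified to [[:space:]] so that it behaves the same in GNU and BSD sed:

```shell
# Without -E, + is a literal plus sign: nothing matches, the line is unchanged.
printf 'a  b\n' | sed 's/[[:space:]]+/ /g'
# prints: a  b

# With -E the pattern matches, but [:space:] in the replacement is copied as text.
printf 'a  b\n' | sed -E 's/[[:space:]]+/[:space:]/g'
# prints: a[:space:]b
```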

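Here are three working approaches. The real difference between them is the character class and the regex dialect, not the platform: -E is accepted by both GNU sed and BSD (macOS) sed.

# Spaces and tabs only
sed -E 's/[[:blank:]]+/ /g'

# All whitespace, including carriage returns and form feeds
sed -E 's/[[:space:]]+/ /g'

# Portable version (basic regular expressions only, works everywhere)
sed 's/[[:space:]][[:space:]]*/ /g'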
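
As a quick check, the portable form can be wrapped in a small helper and fed a string that mixes tabs and spaces. The function name squeeze_ws and the sample input are hypothetical, chosen only for this sketch:

```shell
# Hypothetical helper: collapse every run of whitespace on stdin to one space.
squeeze_ws() {
    sed 's/[[:space:]][[:space:]]*/ /g'
}

# printf expands \t to a real tab, so the input mixes tabs and spaces.
printf 'a \t  b\t\tc\n' | squeeze_ws
# prints: a b c
```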

Let's see how this works with our DNS record:

echo "test.de.          1547    IN      SOA     ns1.test.de. dnsmaster.test.de. 2012090701 900 1000 6000 600" |
sed -E 's/[[:space:]]+/ /g' |
cut -d " " -f 1,2,5

# Output:
# test.de. 1547 ns1.test.de.

While sed works well, sometimes other tools might be more appropriate:

# Using tr (simpler but less flexible)
tr -s '[:blank:]' ' '

# Using awk (better for complex cases)
awk '{$1=$1};1'
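
If the goal is only to pull out fields, awk can arguably skip the normalization step altogether, since its default field splitting already treats any run of spaces and tabs as one separator. A minimal sketch using the DNS record from above (the variable name record is just for illustration):

```shell
record='test.de.          1547    IN      SOA     ns1.test.de. dnsmaster.test.de. 2012090701 900 1000 6000 600'

# Print fields 1, 2 and 5; the commas join them with a single space (the default OFS).
printf '%s\n' "$record" | awk '{ print $1, $2, $5 }'
# prints: test.de. 1547 ns1.test.de.
```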

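Keep in mind that these solutions:

  • Collapse leading and trailing whitespace too: sed and tr leave a single space at each end of the line, while awk removes it entirely
  • Convert newlines to spaces when tr is given [:space:], joining all lines into one (sed works one line at a time, so its pattern never sees the newline)
  • May behave differently with Unicode whitespace characters, depending on the locale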
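
The first two caveats are easy to reproduce. The following sketch uses a made-up line with padding on both sides, plus a two-line input for the newline case:

```shell
# sed and tr shrink the leading and trailing runs to one space each: " a b "
printf '  a \t b  \n' | sed -E 's/[[:blank:]]+/ /g'
printf '  a \t b  \n' | tr -s '[:blank:]' ' '

# awk rebuilds the record from its fields, so the padding disappears: "a b"
printf '  a \t b  \n' | awk '{$1=$1};1'

# With [:space:], tr also turns each newline into a space: "a b " on one line
printf 'a\nb\n' | tr -s '[:space:]' ' '
```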

When processing text data in Unix/Linux environments, we often encounter messy formatting where fields are separated by inconsistent whitespace - tabs, multiple spaces, or combinations of both. This becomes particularly problematic when trying to use tools like cut that rely on consistent delimiters.

Consider this DNS zone file entry with irregular spacing:

test.de.          1547    IN      SOA     ns1.test.de. dnsmaster.test.de. 2012090701 900 1000 6000 600

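The initial approach using sed "s/[\t[:space:]]+/[:space:]/g" doesn't work because:

  • Incorrect character class syntax: [:space:] already covers tabs, and \t inside brackets is not portable
  • Improper replacement pattern: [:space:] is only meaningful inside a bracket expression in the pattern, so the replacement should be a literal space
  • Missing POSIX-compliant repetition: + is not a quantifier in basic regular expressions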

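POSIX-Compliant Solution

sed -e 's/[[:space:]]\{1,\}/ /g'

This handles all whitespace characters (spaces, tabs, etc.) and replaces sequences of one or more with a single space. The \{1,\} interval is the POSIX basic-regex way to write "one or more"; the shorter \+ is a GNU extension and is not guaranteed to work in other sed implementations.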

GNU sed Extended Regex

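sed -E 's/\s+/ /g'

Here \s is a GNU shorthand for a whitespace character. It is not part of POSIX, so prefer [[:space:]] in scripts that must run on other systems.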
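
When a script has to run on systems where \s support is unknown, one option is to probe for it and fall back to the portable pattern. This is only a sketch: the function name squeeze and the probe string are assumptions made for this example, not an established idiom.

```shell
# Hypothetical helper: use \s when the local sed understands it,
# otherwise fall back to the portable basic-regex form.
squeeze() {
    # Probe on a known string; the function's own stdin is not touched here.
    if printf 'x \ty\n' | sed -E 's/\s+/ /g' 2>/dev/null | grep -qx 'x y'; then
        sed -E 's/\s+/ /g'
    else
        sed 's/[[:space:]][[:space:]]*/ /g'
    fi
}

printf 'a \t\t b\n' | squeeze
# prints: a b
```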

Practical Example with DNS Data

echo "test.de.          1547    IN      SOA     ns1.test.de." | sed -e 's/[[:space:]]\{1,\}/ /g'

Output: test.de. 1547 IN SOA ns1.test.de.

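To process an entire file and also strip the single trailing space left at the end of each line (empty lines pass through unchanged, since sed works line by line):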

sed -e 's/[[:space:]]\{1,\}/ /g' -e '/^$/! s/ $//' zonefile.txt
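
The command above still leaves one leading space on indented lines. A possible variation that trims both ends is sketched below; the function name normalize_ws and the sample input are hypothetical, and the extra s/^ // expression is an addition to the original command:

```shell
# Hypothetical helper: collapse whitespace runs, then trim one space at each end.
# Reads the files given as arguments, or stdin when called without arguments.
normalize_ws() {
    sed -e 's/[[:space:]]\{1,\}/ /g' -e 's/^ //' -e 's/ $//' "$@"
}

# Three lines: padded text, an empty line, and an indented line with a tab.
printf 'a \t b  \n\n  c\td\n' | normalize_ws
# prints "a b", then an empty line, then "c d"
```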

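For large files, these alternatives may be more efficient. With tr, use [:blank:] rather than [:space:], because [:space:] includes the newline and would join every line of the file into one:

tr -s '[:blank:]' ' ' < inputfile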

Or using awk:

awk '{$1=$1};1' file