Many developers encounter this puzzling behavior when using wc -c:
$ echo "a" > tmp
$ wc -c tmp
2 tmp
Despite only typing a single character, the count shows 2. This isn't a bug - it's a fundamental aspect of how text files work in Unix-like systems.
The extra count comes from the newline character (\n) that's automatically appended by most text editors and by commands like echo. To verify:
$ echo -n "a" > tmp # -n suppresses newline
$ wc -c tmp
1 tmp
The count also depends on the line-ending convention, which varies across operating systems:
# Windows (CRLF)
$ printf "a\r\n" > tmp
$ wc -c tmp
3 tmp # 'a' + CR + LF
# Unix (LF only)
$ printf "a\n" > tmp
$ wc -c tmp
2 tmp
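To check which convention a given file uses, one option is to count its carriage-return bytes. The sketch below is a minimal illustration, not a standard tool: count_cr is a hypothetical helper name, and the temporary file names are made up for the demo.

```shell
# Hypothetical helper: count the CR (\r) bytes in a file.
# tr -cd '\r' deletes every byte except CR, so wc -c counts only CRs.
# The trailing tr strips the padding some wc implementations add.
count_cr() {
    tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

tmpdir=$(mktemp -d)
printf 'a\r\n' > "$tmpdir/crlf.txt"   # CRLF line ending
printf 'a\n'   > "$tmpdir/lf.txt"     # LF line ending

crlf_count=$(count_cr "$tmpdir/crlf.txt")
lf_count=$(count_cr "$tmpdir/lf.txt")
echo "crlf.txt: $crlf_count CR byte(s)"
echo "lf.txt: $lf_count CR byte(s)"

rm -r "$tmpdir"
```

A nonzero count on a text file usually points to CRLF line endings, though a stray CR inside a line would be counted as well.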
For accurate character counts excluding newlines:
$ grep -o . tmp | wc -l
1
Or using awk:
$ awk '{n+=length} END{print n}' tmp
1
This behavior affects:
- File transfer protocols (FTP ASCII vs binary mode)
- Version control line ending conversions
- Hash calculations for file integrity checks
Understanding this nuance helps prevent subtle bugs in scripts processing text files.
When working with Unix/Linux systems, many developers encounter this puzzling behavior:
$ echo "a" > tmp
$ wc -c tmp
2 tmp
You'd expect wc -c (which counts bytes) to return 1 for a single-character file, but it shows 2. Let's dive into why this happens.
The echo command appends a newline character (\n, ASCII 10) to its output by default. Suppressing it gives the expected result:
$ echo -n "a" > tmp # -n suppresses newline
$ wc -c tmp
1 tmp
Now we get the expected count of 1. The difference comes from that invisible newline character.
Let's examine the file contents directly:
$ echo "a" > tmp
$ hexdump -C tmp
00000000 61 0a |a.|
00000002
This clearly shows two bytes: 0x61 ('a') and 0x0a (newline).
It's worth noting the difference between:
$ wc -c tmp # counts bytes
2 tmp
$ wc -m tmp # counts characters
2 tmp
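For pure ASCII text these counts match. For multibyte encodings such as UTF-8 they can differ, because wc -c always counts bytes while wc -m counts characters according to the current locale.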
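A small sketch of that difference, using the character é (U+00E9), which UTF-8 encodes as the two bytes 0xC3 0xA9. The locale name C.UTF-8 is an assumption here. It is not available on every system, and where it is missing wc -m is likely to fall back to counting bytes.

```shell
# Write the UTF-8 bytes of "é" with octal escapes so the script does not
# depend on the encoding of its own source file.
bytes=$(printf '\303\251\n' | wc -c | tr -d ' ')

# Assumption: the C.UTF-8 locale exists on this system.
chars=$(printf '\303\251\n' | LC_ALL=C.UTF-8 wc -m | tr -d ' ')

echo "bytes: $bytes"   # 2 bytes for é plus 1 for the newline
echo "chars: $chars"   # expected to be 2 under a UTF-8 locale
```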
This behavior affects many scenarios:
- File size validation
- Network protocol implementations
- Hash calculations
For example, these two files will have different MD5 sums:
$ echo -n "test" > file1
$ echo "test" > file2
$ md5sum file*
098f6bcd4621d373cade4e832627b4f6  file1
d8e8fca2dc0f896fd7cb4cb0031ba249  file2
When precise byte counts matter:
# Method 1: Use printf
$ printf "a" > tmp
$ wc -c tmp
1 tmp
# Method 2: Use echo -n
$ echo -n "a" > tmp
$ wc -c tmp
1 tmp
# Method 3: Strip newlines with tr (removes every newline, not only the trailing one)
$ echo "a" | tr -d '\n' > tmp
$ wc -c tmp
1 tmp
The Unix philosophy treats newlines as line terminators rather than separators. This means:
- Text files should end with a newline
- Many tools expect this convention
- POSIX defines a "line" as zero or more non-newline characters followed by a terminating newline
This explains why echo adds newlines by default.
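One way to check whether a file follows this convention is to look at its last byte. The sketch below is one possible approach: ends_with_newline is a hypothetical helper name, and it relies on the fact that command substitution strips trailing newlines.

```shell
# Hypothetical helper: succeed if the file is non-empty and its last byte
# is a newline. Command substitution strips trailing newlines, so a final
# "\n" yields an empty string and any other final byte a non-empty one.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

tmpdir=$(mktemp -d)
printf 'a\n' > "$tmpdir/terminated.txt"
printf 'a'   > "$tmpdir/unterminated.txt"

if ends_with_newline "$tmpdir/terminated.txt"; then terminated=yes; else terminated=no; fi
if ends_with_newline "$tmpdir/unterminated.txt"; then unterminated=yes; else unterminated=no; fi
echo "terminated.txt ends with newline: $terminated"
echo "unterminated.txt ends with newline: $unterminated"

# wc -l counts newline characters, so the unterminated file reports 0 lines
# even though it contains visible text.
unterminated_lines=$(wc -l < "$tmpdir/unterminated.txt" | tr -d ' ')
echo "unterminated.txt line count: $unterminated_lines"

rm -r "$tmpdir"
```

The check is meant for text files. A file whose last byte is NUL may be misreported, since shells typically drop NUL bytes in command substitution.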