Why Does `wc -c` Count an Extra Byte? Understanding Newline Characters in File Size Calculation

Many developers encounter this puzzling behavior when using wc -c:

$ echo "a" > tmp
$ wc -c tmp
2 tmp

Although the file holds a single visible character, the count is 2. This is not a bug; it follows from how text files work on Unix-like systems.

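The extra byte is the newline character (\n) that echo appends by default; most text editors also add one at the end of a file. To verify, suppress it with -n (supported by most shells' echo, although POSIX leaves the option implementation-defined):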

$ echo -n "a" > tmp  # -n suppresses newline
$ wc -c tmp
1 tmp

The byte count also depends on the line-ending convention, which differs between operating systems:

# Windows (CRLF)
$ printf "a\r\n" > tmp
$ wc -c tmp
3 tmp  # 'a' + CR + LF

# Unix (LF only)
$ printf "a\n" > tmp
$ wc -c tmp
2 tmp
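
To find out which convention a given file uses, one option is to count its carriage-return bytes directly. The following is a minimal sketch rather than a definitive check: the file names crlf.txt and lf.txt and the helper count_cr are hypothetical, and only printf, tr, and wc are assumed to be available.

```shell
# Hypothetical sample files: one with CRLF endings, one with LF endings.
printf 'a\r\nb\r\n' > crlf.txt
printf 'a\nb\n' > lf.txt

# Hypothetical helper: delete every byte except CR, then count what is left.
# The trailing tr strips the padding some wc implementations add.
count_cr() {
    tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

count_cr crlf.txt   # number of CR bytes in the CRLF file
count_cr lf.txt     # number of CR bytes in the LF file
```

A non-zero result suggests CRLF line endings, which add one byte per line to what wc -c reports.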

For accurate character counts excluding newlines:

$ grep -o . tmp | wc -l
1

Or using awk:

$ awk '{n+=length} END{print n}' tmp
1

This behavior affects:

File transfer protocols (FTP ASCII vs binary mode)
Version control line ending conversions
Hash calculations for file integrity checks

Understanding this nuance helps prevent subtle bugs in scripts processing text files.

Looking at the same example in more detail, here is the behavior that puzzles many developers on Unix/Linux systems:

$ echo "a" > tmp
$ wc -c tmp
2 tmp

You'd expect wc -c (which counts bytes) to return 1 for a single-character file, but it shows 2. Let's dive into why this happens.

The echo command appends a newline character (\n, ASCII 10) to its output by default. Suppressing it with -n confirms this:

$ echo -n "a" > tmp  # -n suppresses newline
$ wc -c tmp
1 tmp

Now we get the expected count of 1. The difference comes from that invisible newline character.

Let's examine the file contents directly:

$ echo "a" > tmp
$ hexdump -C tmp
00000000  61 0a                                             |a.|
00000002

This clearly shows two bytes: 0x61 ('a') and 0x0a (newline).
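
Where hexdump is not installed, od can show the same bytes. This is a small sketch that assumes a POSIX od; the file name tmp is reused from the example above.

```shell
# Recreate the file with printf so the content does not depend on echo.
printf 'a\n' > tmp

# -An drops the offset column; -tx1 prints each byte as two hex digits.
od -An -tx1 tmp

# Strip all whitespace to get a compact form that is easy to compare in scripts.
od -An -tx1 tmp | tr -d '[:space:]'
```

The exact spacing of od output differs between implementations, which is why the second command removes whitespace before any comparison.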

It's worth noting the difference between:

$ wc -c tmp  # counts bytes
2 tmp

$ wc -m tmp  # counts characters
2 tmp

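For ASCII text these counts match. For multibyte encodings such as UTF-8 they can differ, because wc -c always counts bytes while wc -m counts characters according to the current locale.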
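
The difference becomes visible as soon as a file contains a multibyte character. The sketch below assumes UTF-8 encoding and uses a hypothetical file name, utf8.txt; the octal escapes \303\251 are the two bytes that encode "é" (U+00E9).

```shell
# Write the two-byte UTF-8 encoding of "é" without a trailing newline.
printf '\303\251' > utf8.txt

# Byte count: does not depend on the locale.
wc -c < utf8.txt

# Character count: depends on the active locale (LANG / LC_ALL / LC_CTYPE).
wc -m < utf8.txt
```

In a UTF-8 locale wc -m is expected to report one character here, while wc -c reports two bytes; in the C locale the two counts typically agree.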

This behavior affects many scenarios:

File size validation
Network protocol implementations
Hash calculations

For example, these two files will have different MD5 sums:

$ echo -n "test" > file1
$ echo "test" > file2
$ md5sum file*
098f6bcd4621d373cade4e832627b4f6  file1
d8e8fca2dc0f896fd7cb4cb0031ba249  file2
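
When two checksums differ unexpectedly, comparing the files byte by byte can confirm that a trailing newline is the cause. This is a sketch under the assumption that cmp and wc are available; it reuses the file names file1 and file2 and creates them with printf so the result does not depend on how echo handles -n.

```shell
# file1 has no trailing newline; file2 has one.
printf 'test' > file1
printf 'test\n' > file2

# cmp -s prints nothing and reports the result through its exit status.
if cmp -s file1 file2; then
    echo "identical"
else
    echo "different"
fi

# The byte counts expose the one-byte gap left by the newline.
wc -c < file1
wc -c < file2
```

Any hash function will disagree on these two files for the same reason: the inputs differ by one byte.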

When precise byte counts matter:

# Method 1: Use printf
$ printf "a" > tmp
$ wc -c tmp
1 tmp

# Method 2: Use echo -n
$ echo -n "a" > tmp
$ wc -c tmp
1 tmp

# Method 3: Delete newlines with tr (removes every newline, not only the trailing one)
$ echo "a" | tr -d '\n' > tmp
$ wc -c tmp
1 tmp

Unix convention treats newlines as line terminators rather than separators. This means:

Text files should end with a newline
Many tools expect this convention
POSIX defines a "line" as zero or more non-newline characters plus a terminating newline

This explains why echo adds newlines by default.
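
A script can test whether a file follows this convention by inspecting its last byte. The helper name ends_with_newline and the sample file names below are hypothetical; this is one possible sketch, assuming a tail that supports -c.

```shell
# Hypothetical helper: succeeds if the last byte of the file is a newline.
ends_with_newline() {
    # tail -c 1 emits the final byte; wc -l counts the newlines in it (0 or 1).
    [ "$(tail -c 1 "$1" | wc -l | tr -d ' ')" = "1" ]
}

printf 'a\n' > with_nl.txt
printf 'a' > without_nl.txt

ends_with_newline with_nl.txt && echo "with_nl.txt ends with a newline"
ends_with_newline without_nl.txt || echo "without_nl.txt lacks a final newline"
```

An empty file has no last byte, so the helper reports failure for it as well.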
