Why Does `wc -c` Count an Extra Byte? Understanding Newline Characters in File Size Calculation



Many developers encounter this puzzling behavior when using wc -c:

$ echo "a" > tmp
$ wc -c tmp
2 tmp

Although only a single character was typed, wc -c reports 2. This isn't a bug; it follows from how text files are conventionally written on Unix-like systems.

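The extra byte is the newline character (\n) that most text editors and commands like echo append automatically. To verify (note that echo -n is not portable across all shells; printf "a" is the portable way to write text without a trailing newline):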

$ echo -n "a" > tmp  # -n suppresses newline
$ wc -c tmp
1 tmp

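wc -c itself behaves the same everywhere; what varies across operating systems is the line-ending convention used when the file is written:

# Windows-style line ending (CRLF)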
$ printf "a\r\n" > tmp
$ wc -c tmp
3 tmp  # 'a' + CR + LF

# Unix (LF only)
$ printf "a\n" > tmp
$ wc -c tmp
2 tmp
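
To tell which convention a given file uses, one option is to count its carriage-return bytes. The following is a minimal sketch; count_cr is a hypothetical helper name, and crlf.txt and lf.txt are scratch files created only for the demonstration.

```shell
# count_cr FILE: print the number of carriage-return (0x0d) bytes in FILE.
# (count_cr is a hypothetical helper, not a standard command.)
count_cr() {
    # tr -cd '\r' deletes every byte except CR; wc -c counts what is left.
    # $(( ... )) strips the padding some wc implementations add.
    echo $(( $(tr -cd '\r' < "$1" | wc -c) ))
}

printf 'a\r\n' > crlf.txt
printf 'a\n' > lf.txt
count_cr crlf.txt   # prints 1
count_cr lf.txt     # prints 0
```

A non-zero result suggests CRLF line endings, although a stray CR in the middle of a line would also be counted.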

To count characters without the newlines, print each character on its own line with grep -o and count the resulting lines:

$ grep -o . tmp | wc -l
1

Or using awk:

$ awk '{n+=length} END{print n}' tmp
1
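
Both commands above split on \n only, so on a file with CRLF line endings they still count the carriage return. A minimal sketch that excludes both terminator bytes follows; payload_bytes is a hypothetical helper name and tmp_crlf is a scratch file. Note that it counts bytes, not characters.

```shell
# payload_bytes FILE: print the number of bytes in FILE, excluding CR and LF.
# (payload_bytes is a hypothetical helper, not a standard command.)
payload_bytes() {
    # tr -d '\r\n' removes every CR and LF byte before wc -c counts the rest.
    echo $(( $(tr -d '\r\n' < "$1" | wc -c) ))
}

printf 'a\r\n' > tmp_crlf
payload_bytes tmp_crlf                     # prints 1
awk '{n+=length} END{print n}' tmp_crlf    # prints 2: the CR is counted
```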

Trailing newlines and line-ending differences matter in:

  • File transfer protocols (FTP ASCII vs binary mode)
  • Version control line ending conversions
  • Hash calculations for file integrity checks

Understanding this nuance helps prevent subtle bugs in scripts processing text files.


When working with Unix/Linux systems, many developers encounter this puzzling behavior:

$ echo "a" > tmp
$ wc -c tmp
2 tmp

You'd expect wc -c (which counts bytes) to return 1 for a single-character file, but it shows 2. Let's dive into why this happens.

The echo command appends a newline character (\n, ASCII 10) to its output by default. With the newline suppressed:

$ echo -n "a" > tmp  # -n suppresses newline
$ wc -c tmp
1 tmp

Now we get the expected count of 1. The difference comes from that invisible newline character.

Let's examine the file contents directly:

$ echo "a" > tmp
$ hexdump -C tmp
00000000  61 0a                                             |a.|
00000002

This clearly shows two bytes: 0x61 ('a') and 0x0a (newline).
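
Where hexdump is not installed, od (specified by POSIX) can show the same bytes. This is a minimal sketch; hex_bytes is a hypothetical helper name, and the column spacing od prints varies between implementations, which is why the helper strips it.

```shell
# hex_bytes FILE: print the bytes of FILE as one continuous hex string.
# (hex_bytes is a hypothetical helper, not a standard command.)
hex_bytes() {
    # -An omits the offset column, -v disables "*" duplicate suppression,
    # -tx1 prints one byte per hex pair; tr removes spacing and line breaks.
    od -An -v -tx1 < "$1" | tr -d ' \t\n'
    echo
}

printf 'a\n' > tmp
hex_bytes tmp   # prints 610a
```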

It's worth noting the difference between:

$ wc -c tmp  # counts bytes
2 tmp

$ wc -m tmp  # counts characters
2 tmp

For ASCII text, these counts match. For multibyte encodings such as UTF-8 they can differ, and the result of wc -m depends on the current locale.
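
A short sketch of the multibyte case, assuming the file is UTF-8 encoded. The bytes are written with octal escapes so the example does not depend on the encoding of the script itself; accent.txt is a scratch file name.

```shell
# U+00E9 (é) is encoded in UTF-8 as the two bytes 0xC3 0xA9 (octal 303 251).
printf '\303\251' > accent.txt

wc -c < accent.txt   # 2: the byte count does not depend on the locale

# The character count depends on the active locale. In a UTF-8 locale this
# typically reports 1; in the C/POSIX locale it typically reports 2.
wc -m < accent.txt
```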

This behavior affects many scenarios:

  • File size validation
  • Network protocol implementations
  • Hash calculations

For example, these two files will have different MD5 sums:

$ echo -n "test" > file1
$ echo "test" > file2
$ md5sum file*
098f6bcd4621d373cade4e832627b4f6  file1
d8e8fca2dc0f896fd7cb4cb0031ba249  file2

When precise byte counts matter:

# Method 1: Use printf
$ printf "a" > tmp
$ wc -c tmp
1 tmp

# Method 2: Use echo -n
$ echo -n "a" > tmp
$ wc -c tmp
1 tmp

# Method 3: Delete newlines with tr (removes every newline, not only the trailing one)
$ echo "a" | tr -d '\n' > tmp
$ wc -c tmp
1 tmp
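
On multi-line input, tr -d '\n' deletes every newline and joins the lines. If only the newlines at the end should go, one approach is to rely on command substitution, which strips all trailing newlines and leaves interior ones alone. A minimal sketch, with two_lines.txt as a scratch file:

```shell
printf 'a\nb\n' > two_lines.txt

# tr deletes every newline, so the lines are joined: 2 bytes ("ab").
tr -d '\n' < two_lines.txt | wc -c

# Command substitution drops only trailing newlines, so the interior
# newline survives: 3 bytes ("a", newline, "b").
printf '%s' "$(cat two_lines.txt)" | wc -c
```

This reads the whole file into a shell variable-like string, so it suits small text files rather than large or binary data.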

Unix convention treats the newline as a line terminator rather than a line separator. This means:

  • Text files should end with a newline
  • Many tools expect this convention
  • POSIX defines a "line" as zero or more non-newline characters followed by a terminating newline

This explains why echo adds newlines by default.
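
One visible consequence of the terminator convention is that wc -l counts newline characters, so a file whose last line lacks a newline appears to have one line fewer. The sketch below shows this along with a way to test for a final newline; ends_with_newline is a hypothetical helper name, the file names are scratch files, and the check assumes ordinary text files.

```shell
# ends_with_newline FILE: succeed if the last byte of FILE is a newline.
# (Hypothetical helper.) Command substitution strips a trailing newline,
# so the captured string is empty when the last byte is "\n".
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

printf 'a\n' > complete.txt
printf 'a' > incomplete.txt

wc -l < complete.txt     # 1
wc -l < incomplete.txt   # 0: there is no newline character to count

ends_with_newline complete.txt && echo "complete.txt ends with a newline"
ends_with_newline incomplete.txt || echo "incomplete.txt does not"
```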