How to Parse and Decode Quoted-Printable Maildir Messages in Linux/Python



When working with Maildir format messages (like those generated by fakemail), each email exists as a separate file containing raw headers and body. The key challenge appears when dealing with quoted-printable encoded messages containing special characters and soft line breaks.
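Because a Maildir is an ordinary directory tree (tmp/, new/ and cur/ subdirectories), Python's standard mailbox module can enumerate its messages without any manual file handling. The following is a minimal sketch, assuming single-part messages only; the function name iter_maildir_bodies and the path you pass in are illustrative, not part of any existing tool.

```python
import mailbox

def iter_maildir_bodies(maildir_path):
    """Yield (key, decoded text) for each single-part message in a Maildir."""
    # create=False: fail instead of silently creating an empty Maildir
    box = mailbox.Maildir(maildir_path, create=False)
    for key, msg in box.iteritems():
        if msg.is_multipart():
            continue  # multipart messages are skipped in this sketch
        # decode=True undoes quoted-printable or base64 and returns bytes
        payload = msg.get_payload(decode=True)
        # Assumption: fall back to UTF-8 when no charset is declared
        charset = msg.get_content_charset() or 'utf-8'
        yield key, payload.decode(charset, errors='replace')
```

Calling list(iter_maildir_bodies('/path/to/Maildir')) would return one tuple per message file found in the directory.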

The example German email demonstrates three characteristic features of quoted-printable messages:

1. Soft line breaks indicated by "=" at end of line
2. Special characters encoded as =XX hexadecimal sequences
3. UTF-8 content needing proper decoding
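To make the three features concrete, they can be undone by hand in three small steps. This is only an illustrative sketch (the helper name decode_qp_by_hand is made up, and it does no validation); real code should rely on the quopri or email modules instead.

```python
import re

def decode_qp_by_hand(raw: bytes) -> bytes:
    # Step 1: drop soft line breaks ("=" directly before CRLF or LF)
    joined = re.sub(rb'=\r?\n', b'', raw)
    # Step 2: turn each =XX escape into the single byte it stands for
    return re.sub(rb'=([0-9A-Fa-f]{2})',
                  lambda m: bytes([int(m.group(1), 16)]),
                  joined)

sample = b'abzuschlie=C3=9Fen, klicken Sie auf folg=\nenden Link'
# Step 3: interpret the resulting bytes as UTF-8
text = decode_qp_by_hand(sample).decode('utf-8')
print(text)
```

The two bytes C3 9F are the UTF-8 encoding of ß, which is why the charset step must come after the =XX step.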

Here's a complete Python solution using the standard library:

import email
from email import policy

def read_maildir_message(filepath):
    # Read bytes so the parser, not open(), deals with character sets
    with open(filepath, 'rb') as f:
        msg = email.message_from_binary_file(f, policy=policy.default)

    # Handle multipart messages: return the first text/plain part
    if msg.is_multipart():
        for part in msg.walk():
            content_type = part.get_content_type()
            if content_type == 'text/plain':
                # decode=True undoes the quoted-printable transfer encoding
                payload = part.get_payload(decode=True)
                charset = part.get_content_charset() or 'utf-8'
                return payload.decode(charset)
        return None  # no text/plain part found
    else:
        payload = msg.get_payload(decode=True)
        charset = msg.get_content_charset() or 'utf-8'
        return payload.decode(charset)
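With policy.default the parser returns an EmailMessage, whose get_body() and get_content() methods cover the same ground in fewer lines. The following is a sketch of that alternative, not a replacement for the function above; plain_text_from_bytes and the sample message are made up for illustration.

```python
from email import message_from_bytes, policy

def plain_text_from_bytes(raw: bytes):
    msg = message_from_bytes(raw, policy=policy.default)
    # get_body() finds the text/plain part in single-part and multipart mail
    part = msg.get_body(preferencelist=('plain',))
    if part is None:
        return None  # the message has no text/plain part
    # get_content() undoes the transfer encoding and decodes the charset
    return part.get_content()

raw = (b'Subject: Registrierung\r\n'
       b'MIME-Version: 1.0\r\n'
       b'Content-Type: text/plain; charset=utf-8\r\n'
       b'Content-Transfer-Encoding: quoted-printable\r\n'
       b'\r\n'
       b'Registrierung abzuschlie=C3=9Fen, klicken Sie auf folg=\r\n'
       b'enden Link\r\n')
result = plain_text_from_bytes(raw)
print(result)
```

One design difference worth noting: get_content() substitutes replacement characters for undecodable bytes instead of raising UnicodeDecodeError.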

For those working in Perl environments:

use Email::MIME;

sub parse_maildir_message {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die $!;
    my $email = Email::MIME->new(do { local $/; <$fh> });

    # body_str undoes the transfer encoding and then decodes the text
    # using the charset declared in the Content-Type header
    return $email->body_str;
}

For quick inspection without writing code:

# Using munpack (part of mpack package)
munpack message_file

# Using reformime (from maildrop package); -e needs a section number via -s
reformime -s 1 -e < message_file

# Using Python one-liner (decodes the whole file, headers included)
python3 -c "import quopri; print(quopri.decodestring(open('message_file', 'rb').read()).decode('utf-8'))"

Production code should also be prepared for:

- Malformed quoted-printable sequences
- Multiple character encodings in single message
- Very long lines (some clients don't properly soft-wrap)
- Mixed content types (HTML + plaintext)
- Messages without explicit charset declaration
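Several of these points (per-part charsets, mixed HTML and plain text, missing or unknown charset labels) can be handled by decoding every text part on its own terms. The sketch below makes two assumptions that are mine rather than the article's: UTF-8 as the default charset and latin-1 as the last-resort fallback. The name collect_text_parts is hypothetical.

```python
from email import message_from_bytes, policy

def collect_text_parts(raw: bytes):
    """Return a list of (content_type, text) for every text/* part."""
    msg = message_from_bytes(raw, policy=policy.default)
    results = []
    for part in msg.walk():
        if part.get_content_maintype() != 'text':
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True) or b''
        # Assumption: treat a missing charset declaration as UTF-8
        charset = part.get_content_charset() or 'utf-8'
        try:
            text = payload.decode(charset, errors='replace')
        except LookupError:
            # Unknown charset name: latin-1 maps every byte to a character
            text = payload.decode('latin-1')
        results.append((part.get_content_type(), text))
    return results

raw = (b'MIME-Version: 1.0\r\n'
       b'Content-Type: multipart/alternative; boundary="b1"\r\n'
       b'\r\n'
       b'--b1\r\n'
       b'Content-Type: text/plain; charset=iso-8859-1\r\n'
       b'Content-Transfer-Encoding: quoted-printable\r\n'
       b'\r\n'
       b'Gr=FC=DFe\r\n'
       b'--b1\r\n'
       b'Content-Type: text/html; charset=utf-8\r\n'
       b'Content-Transfer-Encoding: quoted-printable\r\n'
       b'\r\n'
       b'<p>Gr=C3=BC=C3=9Fe</p>\r\n'
       b'--b1--\r\n')
parts = collect_text_parts(raw)
print(parts)
```

Returning every text part, rather than the first match, leaves the choice between the HTML and plain-text versions to the caller.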

Create test cases with these challenging patterns:

1. A literal = that is neither a soft break nor an escape (e.g. an unencoded "x=1"; the correct encoding is "x=3D1")
2. Invalid hex sequences (=GH)
3. Multiple consecutive soft breaks
4. Different line ending styles (CRLF vs LF)
5. Encoded words in headers (=?utf-8?q?...?=)
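The patterns above translate directly into assertions. The sketch below records how CPython's quopri module treats each case; other decoders may behave differently, especially for the invalid escape, so treat these as a starting point for your own test suite rather than a specification.

```python
import quopri
from email.header import decode_header, make_header

# 1. A correctly encoded literal "=" decodes back to "="
assert quopri.decodestring(b'x=3D1') == b'x=1'

# 2. An invalid escape is left in place rather than raising an error
assert quopri.decodestring(b'a=GHb') == b'a=GHb'

# 3. Consecutive soft breaks all disappear
assert quopri.decodestring(b'ab=\n=\ncd') == b'abcd'

# 4. CRLF and LF soft breaks decode to the same bytes
assert quopri.decodestring(b'ab=\r\ncd') == quopri.decodestring(b'ab=\ncd')

# 5. Encoded words in headers are decoded through email.header, not quopri
subject = str(make_header(decode_header('=?utf-8?q?Gr=C3=BC=C3=9Fe?=')))
assert subject == 'Grüße'
```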

When working with mail servers or testing email functionality, many developers use Maildir format for storing individual email messages. Each message is stored as a separate file with encoded content. The quoted-printable (QP) encoding is commonly used for non-ASCII characters and line breaks in email messages.

Linux provides several command-line utilities for handling quoted-printable encoding:


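# Using formail (from procmail package) to strip the headers;
# formail does not decode quoted-printable itself, so pipe the body to a decoder
formail -I "" < mailfile | qprint -d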

# Using perl's MIME::QuotedPrint
perl -MMIME::QuotedPrint -e 'print decode_qp(join "", <>)' mailfile

# Using qprint (standalone decoder)
qprint -d < mailfile

For developers needing to process these files programmatically, here's a Python solution:


import email
from email import policy

def decode_maildir_message(filepath):
    with open(filepath, 'rb') as f:
        msg = email.message_from_binary_file(f, policy=policy.default)

    # Handle multipart messages: return the first text/plain part
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                # decode=True undoes the quoted-printable transfer encoding
                payload = part.get_payload(decode=True)
                charset = part.get_content_charset() or 'utf-8'
                return payload.decode(charset)
        return None  # no text/plain part found
    else:
        payload = msg.get_payload(decode=True)
        charset = msg.get_content_charset() or 'utf-8'
        return payload.decode(charset)

# Usage example
print(decode_maildir_message('/path/to/mailfile'))

When processing real-world email files, you might encounter these scenarios:


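# 1. Messages with soft line breaks (ending with =)
# Only needed when decoding by hand; quopri and the email package
# already remove soft line breaks. Handles both CRLF and LF endings.
def fix_soft_linebreaks(text):
    return text.replace('=\r\n', '').replace('=\n', '')

# 2. Messages with a wrong or unknown charset label
def decode_with_fallback(payload, charset):
    try:
        return payload.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # latin-1 maps every byte to a character, so this cannot fail
        return payload.decode('latin-1')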

For quick debugging or shell scripting:


#!/bin/bash

# Extract and decode the message body (everything after the first blank line)
sed '1,/^$/d' mailfile | qprint -d

# Or using awk; line breaks are kept so that soft breaks stay recognizable
awk 'body{print} /^$/{body=1}' mailfile | qprint -d

Let's process the example message from the question:


import quopri

message = """Message-ID: <1317977606.4e8ebe06ceab7@myserver.local>
Date: Fri, 07 Oct 2011 10:53:26 +0200
Subject: Registrierung
From: me@me.com
To: tt99@example.com
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Hallo,

Sie haben sich auf Meinserver.de als Benutzer regist=
riert. Um Ihre
Registrierung abzuschlie=C3=9Fen, klicken Sie auf folg=
enden Link:

http://meinserver.de/benutzer/bestaetigen/3lk6lp=
ga1kcgcg484kc8ksg"""

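# Split headers from body at the first blank line, then decode.
# quopri removes the soft line breaks itself, so no manual replace is needed.
body = message.split('\n\n', 1)[1]
decoded = quopri.decodestring(body.encode('ascii')).decode('utf-8')
print(decoded)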

The output shows "abzuschließen" with its ß restored, and the soft-wrapped words and URL are joined back into single lines.