How to Efficiently Remove Duplicate Emails from Maildir: Message-ID and Content-Based Deduplication Techniques


Maildir is a popular format for storing emails, but duplicate messages can accumulate over time - especially when using IMAP synchronization. These duplicates waste storage space and make email management more difficult.

The most reliable way to identify duplicates is by checking the Message-ID header, which should be unique for each email. Here's a Python script using the mailbox module to remove duplicates:


import mailbox
import hashlib

def remove_duplicates(maildir_path):
    seen_ids = set()
    mb = mailbox.Maildir(maildir_path)

    # Snapshot the keys before mutating the mailbox
    for key in list(mb.keys()):
        msg = mb[key]
        msg_id = msg['Message-ID']

        # Skip messages without a Message-ID: treating every missing ID
        # as a duplicate of the first one would delete unrelated mail
        if msg_id is None:
            continue

        if msg_id in seen_ids:
            mb.remove(key)
        else:
            seen_ids.add(msg_id)

    mb.close()

When Message-IDs are missing or unreliable, we need content-based comparison. Here's how to handle this:


def content_hash(msg):
    # get_payload(decode=True) returns None for multipart messages,
    # so those fall through and return None here
    body = msg.get_payload(decode=True)
    if body:
        body = body.decode('utf-8', errors='ignore').strip().lower()
        return hashlib.md5(body.encode()).hexdigest()
    return None

def remove_content_duplicates(maildir_path):
    seen_hashes = set()
    mb = mailbox.Maildir(maildir_path)

    for key in list(mb.keys()):
        msg = mb[key]
        # Don't reuse the function's name for the local variable,
        # or the call raises UnboundLocalError
        h = content_hash(msg)

        if h is None:
            continue  # multipart or empty message; nothing to compare

        if h in seen_hashes:
            mb.remove(key)
        else:
            seen_hashes.add(h)

    mb.close()

For messages with minor differences (line wrapping, encoding variations), we can implement a fuzzy comparison:


from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio() > 0.95

def fuzzy_deduplicate(maildir_path):
    mb = mailbox.Maildir(maildir_path)

    # First pass: collect keys and body text as strings, so multipart
    # payloads (which get_payload() returns as lists) don't break difflib
    messages = [(key, str(mb[key].get_payload())) for key in mb.keys()]

    # Second pass: compare pairwise -- O(n^2), acceptable for small mailboxes
    removed = set()
    for i, (key1, body1) in enumerate(messages):
        if key1 in removed:
            continue
        for key2, body2 in messages[i + 1:]:
            if key2 not in removed and similar(body1, body2):
                mb.remove(key2)
                removed.add(key2)

    mb.close()

Before deleting, you might want to review differences. Here's how to generate a diff:


import difflib

def compare_messages(msg1, msg2):
    lines1 = str(msg1).splitlines()
    lines2 = str(msg2).splitlines()
    return difflib.unified_diff(lines1, lines2, lineterm='')
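
For example, to print the diff between the first two messages in a mailbox (the Maildir path here is a placeholder):


import mailbox

mb = mailbox.Maildir('/path/to/Maildir')
key_a, key_b = list(mb.keys())[:2]  # assumes at least two messages
for line in compare_messages(mb[key_a], mb[key_b]):
    print(line)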

For large Maildirs, consider these optimizations:

  • Process in batches
  • Use a database to store hashes
  • Implement parallel processing (see the sketch below)
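
As a rough illustration of the parallel option, here's a minimal sketch that computes hashes of raw message files across worker processes. The helper names (hash_file, parallel_hashes) are illustrative, and hashing raw bytes only catches byte-identical copies:


from multiprocessing import Pool
import hashlib
import os

def hash_file(filepath):
    # Hash raw file bytes: cheap, and plain paths pickle cleanly
    # across worker processes (open Message objects would not)
    with open(filepath, 'rb') as f:
        return filepath, hashlib.md5(f.read()).hexdigest()

def parallel_hashes(maildir_path, workers=4):
    # Collect message paths from cur/ and new/ only; tmp/ holds
    # deliveries in progress and must be left alone
    paths = []
    for sub in ('cur', 'new'):
        folder = os.path.join(maildir_path, sub)
        paths.extend(os.path.join(folder, name)
                     for name in os.listdir(folder)
                     if not name.startswith('.'))
    # Call from under `if __name__ == '__main__':` on platforms
    # that use the spawn start method
    with Pool(workers) as pool:
        return dict(pool.map(hash_file, paths))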

Always:

  1. Backup your Maildir first
  2. Test with a copy of your data
  3. Consider moving duplicates to a separate folder rather than deleting (a sketch follows)
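
For the third point, a minimal sketch that quarantines a message instead of deleting it; the helper name and quarantine path are hypothetical:


import mailbox
import os

def quarantine(mb, key, quarantine_path='~/Maildir-duplicates'):
    # Copy the suspected duplicate into a separate Maildir, then
    # remove it from the original -- nothing is lost if the match
    # turns out to be a false positive
    dup = mailbox.Maildir(os.path.expanduser(quarantine_path), create=True)
    dup.add(mb[key])
    mb.remove(key)
    dup.close()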

When dealing with duplicate emails in Maildir format, we typically encounter two scenarios:

  • Exact duplicates: Messages with identical Message-IDs (RFC 2822 standard)
  • Content duplicates: Messages with different metadata but similar content bodies

The simplest approach uses Message-IDs as unique identifiers. Here's a Python script that walks the Maildir's files directly, parsing each with the email module:


import email
import hashlib
import os

def remove_duplicates_by_id(maildir_path):
    seen_ids = set()

    for root, dirs, files in os.walk(maildir_path):
        # Never descend into tmp/, where deliveries in progress live
        dirs[:] = [d for d in dirs if d != 'tmp']

        for filename in files:
            if filename.startswith('.'):
                continue

            filepath = os.path.join(root, filename)
            try:
                # Binary mode avoids decoding errors on 8-bit messages
                with open(filepath, 'rb') as f:
                    msg = email.message_from_binary_file(f)

                msg_id = msg['Message-ID']
                if msg_id is None:
                    continue  # can't deduplicate safely without an ID

                if msg_id in seen_ids:
                    os.unlink(filepath)
                else:
                    seen_ids.add(msg_id)
            except Exception as e:
                print(f"Error processing {filepath}: {e}")

For more sophisticated detection, we need to compare message bodies. Consider these factors:

  • Normalize line endings (convert all to \n)
  • Ignore header variations (Date, Received, etc.)
  • Handle different encodings (decode to Unicode first)

Here's a content hashing approach:


def get_content_hash(msg):
    # get_payload(decode=True) returns None for multipart messages
    body = msg.get_payload(decode=True)
    if body is None:
        return None

    # errors='ignore' never raises, so no fallback decode is needed
    body = body.decode('utf-8', errors='ignore')

    # Normalize whitespace and line endings
    body = ' '.join(body.split())
    return hashlib.md5(body.encode('utf-8')).hexdigest()

def remove_content_duplicates(maildir_path):
    seen_hashes = set()

    for root, dirs, files in os.walk(maildir_path):
        dirs[:] = [d for d in dirs if d != 'tmp']  # skip in-progress deliveries

        for filename in files:
            if filename.startswith('.'):
                continue

            filepath = os.path.join(root, filename)
            try:
                with open(filepath, 'rb') as f:
                    msg = email.message_from_binary_file(f)

                h = get_content_hash(msg)
                if h is None:
                    continue  # multipart or empty message

                if h in seen_hashes:
                    os.unlink(filepath)
                else:
                    seen_hashes.add(h)
            except Exception as e:
                print(f"Error processing {filepath}: {e}")

Before mass deletion, it's wise to review the candidates. This shell command lists messages sharing a common subject, newest first:


find ~/Maildir -type f -name '[^.]*' -exec grep -l "Subject: YourCommonSubject" {} + | xargs ls -lt

For visual diffing, consider using meld or diff -u on message files after stripping the headers (everything up to the first blank line).

For large mailboxes, consider these optimizations:

  • Use SQLite to track hashes efficiently
  • Implement parallel processing with multiprocessing
  • Add domain-specific rules (e.g., ignore mailing list footers)

Here's a snippet using SQLite for tracking:


import sqlite3

def create_dedupe_db():
    conn = sqlite3.connect('maildedupe.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS messages
                 (hash text primary key, path text)''')
    conn.commit()
    return conn
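
The table's primary key then turns the duplicate check into a single insert-or-fail. Here's a hypothetical helper (the name seen_before is illustrative) working on the connection returned by create_dedupe_db:


def seen_before(conn, content_hash, path):
    # The PRIMARY KEY on hash makes the INSERT fail for a repeat,
    # which is exactly the duplicate signal we want
    try:
        conn.execute('INSERT INTO messages VALUES (?, ?)',
                     (content_hash, path))
        conn.commit()
        return False
    except sqlite3.IntegrityError:
        return True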