Maildir is a popular format for storing emails, but duplicate messages can accumulate over time - especially when using IMAP synchronization. These duplicates waste storage space and make email management more difficult.
The most reliable way to identify duplicates is by checking the Message-ID header, which should be unique for each email. Here's a Python script using the mailbox module to remove duplicates:
import mailbox
import hashlib

def remove_duplicates(maildir_path):
    seen_ids = set()
    mb = mailbox.Maildir(maildir_path)
    # Snapshot the keys so removal during iteration is safe
    for key in list(mb.keys()):
        msg = mb[key]
        msg_id = msg['Message-ID']
        if msg_id is None:
            continue  # no Message-ID; don't treat as a duplicate
        if msg_id in seen_ids:
            mb.remove(key)
        else:
            seen_ids.add(msg_id)
    mb.close()
When Message-IDs are missing or unreliable, we need content-based comparison. Here's how to handle this:
def content_hash(msg):
    # Hash the normalized text of every non-multipart part
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)
        if payload:
            parts.append(payload.decode('utf-8', errors='ignore'))
    if not parts:
        return None
    body = ' '.join(parts).strip().lower()
    return hashlib.md5(body.encode('utf-8')).hexdigest()
def remove_content_duplicates(maildir_path):
    seen_hashes = set()
    mb = mailbox.Maildir(maildir_path)
    for key in list(mb.keys()):
        h = content_hash(mb[key])  # don't shadow the function name
        if h is None:
            continue  # nothing hashable; leave the message alone
        if h in seen_hashes:
            mb.remove(key)
        else:
            seen_hashes.add(h)
    mb.close()
For messages with minor differences (line wrapping, encoding variations), we can implement a fuzzy comparison:
from difflib import SequenceMatcher

def similar(a, b):
    # Treat bodies as duplicates above a 95% similarity ratio
    return SequenceMatcher(None, a, b).ratio() > 0.95

def fuzzy_deduplicate(maildir_path):
    mb = mailbox.Maildir(maildir_path)
    # First pass: collect keys with decoded text bodies
    messages = []
    for key in mb.keys():
        payload = mb[key].get_payload(decode=True)
        if payload:
            messages.append((key, payload.decode('utf-8', errors='ignore')))
    # Second pass: pairwise comparison (O(n^2), so best for small folders)
    removed = set()
    for i, (key1, body1) in enumerate(messages):
        if key1 in removed:
            continue
        for key2, body2 in messages[i + 1:]:
            if key2 not in removed and similar(body1, body2):
                mb.remove(key2)
                removed.add(key2)
    mb.close()
Before deleting, you might want to review differences. Here's how to generate a diff:
import difflib

def compare_messages(msg1, msg2):
    # Returns an iterator over unified-diff lines for the two messages
    lines1 = str(msg1).splitlines()
    lines2 = str(msg2).splitlines()
    return difflib.unified_diff(lines1, lines2, lineterm='')
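Since compare_messages returns an iterator of diff lines, reviewing a candidate pair is just a loop (key1 and key2 here are placeholder Maildir keys):

for line in compare_messages(mb[key1], mb[key2]):
    print(line)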
For large Maildirs, consider these optimizations:
- Process in batches
- Use a database to store hashes
- Implement parallel processing (see the sketch after this list)
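As a sketch of the parallel option (the hash_file and parallel_dedupe names are mine, not from any library), hashing can run across worker processes while deletion stays in a single pass. Note this hashes raw file bytes, so it only catches byte-identical copies:

import hashlib
import os
from multiprocessing import Pool

def hash_file(filepath):
    # MD5 of the raw bytes: headers count, so only exact copies match
    with open(filepath, 'rb') as f:
        return filepath, hashlib.md5(f.read()).hexdigest()

def parallel_dedupe(maildir_path):
    paths = []
    for root, dirs, names in os.walk(maildir_path):
        dirs[:] = [d for d in dirs if d != 'tmp']  # skip in-progress deliveries
        paths.extend(os.path.join(root, n)
                     for n in names if not n.startswith('.'))
    seen = set()
    with Pool() as pool:
        # Hashing is parallel; deletion stays in the parent process
        for filepath, digest in pool.map(hash_file, paths):
            if digest in seen:
                os.unlink(filepath)
            else:
                seen.add(digest)

On platforms that spawn rather than fork worker processes, wrap the call in an if __name__ == '__main__': guard.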
Always:
- Back up your Maildir first
- Test with a copy of your data
- Consider moving duplicates to a separate folder rather than deleting (a sketch follows)
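For that last point, here's a minimal sketch (the quarantine name and destination path are made up for this example) that copies a message into a separate Maildir before removing the original; you can swap it in wherever the scripts above call mb.remove(key):

def quarantine(mb, key, dest_path):
    # Copy into a separate Maildir (created if missing), then drop the original
    dest = mailbox.Maildir(dest_path)
    dest.add(mb[key])
    dest.close()
    mb.remove(key)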
The scripts above work through the mailbox module; you can also operate on the message files directly. Either way, duplicate emails in Maildir format fall into two scenarios:
- Exact duplicates: messages with identical Message-IDs (RFC 2822)
- Content duplicates: messages with different metadata but similar content bodies
Here's a file-level script that handles the first case by Message-ID:
import email
import hashlib
import os

def remove_duplicates_by_id(maildir_path):
    seen_ids = set()
    for root, dirs, files in os.walk(maildir_path):
        # Skip tmp/, which holds messages still being delivered
        dirs[:] = [d for d in dirs if d != 'tmp']
        for filename in files:
            if filename.startswith('.'):
                continue
            filepath = os.path.join(root, filename)
            try:
                with open(filepath, 'rb') as f:
                    msg = email.message_from_binary_file(f)
                msg_id = msg['Message-ID']
                if msg_id is None:
                    continue  # no ID to compare on
                if msg_id in seen_ids:
                    os.unlink(filepath)
                else:
                    seen_ids.add(msg_id)
            except Exception as e:
                print(f"Error processing {filepath}: {e}")
For more sophisticated detection, we need to compare message bodies. Consider these factors:
- Normalize line endings (convert all to \n)
- Ignore header variations (Date, Received, etc.)
- Handle different encodings (decode to Unicode first)
Here's a content hashing approach:
def get_content_hash(msg):
    # The decoded payload is None for multipart messages; skip those here
    body = msg.get_payload(decode=True)
    if body is None:
        return None
    body = body.decode('utf-8', errors='ignore')
    # Normalize whitespace and line endings
    body = ' '.join(body.split())
    return hashlib.md5(body.encode('utf-8')).hexdigest()
def remove_content_duplicates(maildir_path):
    seen_hashes = set()
    for root, dirs, files in os.walk(maildir_path):
        dirs[:] = [d for d in dirs if d != 'tmp']  # skip in-progress deliveries
        for filename in files:
            if filename.startswith('.'):
                continue
            filepath = os.path.join(root, filename)
            try:
                with open(filepath, 'rb') as f:
                    msg = email.message_from_binary_file(f)
                h = get_content_hash(msg)
                if h is None:
                    continue  # multipart or empty body
                if h in seen_hashes:
                    os.unlink(filepath)
                else:
                    seen_hashes.add(h)
            except Exception as e:
                print(f"Error processing {filepath}: {e}")
Before mass deletion, it's wise to review the candidates. This shell command lists messages sharing a given subject, newest first:
find ~/Maildir -type f -name '[^.]*' -exec grep -l "Subject: YourCommonSubject" {} + | xargs ls -lt
For visual diffing, consider using meld or diff -u on message files after stripping headers.
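One way to strip the headers first is to dump just the decoded bodies to temporary files and point meld or diff -u at those; the body_to_tempfile helper below is a sketch of mine, not a library function:

import tempfile

def body_to_tempfile(msg):
    # Write only the decoded payload, dropping all headers
    body = msg.get_payload(decode=True) or b''
    f = tempfile.NamedTemporaryFile(delete=False, suffix='.txt')
    f.write(body)
    f.close()
    return f.name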
For large mailboxes, consider these optimizations:
- Use SQLite to track hashes efficiently
- Implement parallel processing with multiprocessing
- Add domain-specific rules (e.g., ignore mailing list footers)
Here's a snippet using SQLite for tracking:
import sqlite3

def create_dedupe_db():
    # One row per unique content hash; path records the first file seen with it
    conn = sqlite3.connect('maildedupe.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS messages
                 (hash TEXT PRIMARY KEY, path TEXT)''')
    conn.commit()
    return conn
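To make the table actually track duplicates, insert each hash and let the primary-key constraint flag repeats. The is_duplicate helper below is a sketch of mine, not part of any library:

def is_duplicate(conn, content_hash, filepath):
    # A second INSERT of the same hash violates the PRIMARY KEY
    # constraint, which is exactly the duplicate signal we want
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute('INSERT INTO messages VALUES (?, ?)',
                         (content_hash, filepath))
        return False
    except sqlite3.IntegrityError:
        return True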