When working with file uploads in web applications, many developers encounter an annoying issue: Microsoft Office files (.docx, .xlsx, .pptx) get detected as ZIP archives by libmagic/file. This happens because these file formats are technically ZIP containers with specific internal structures.
The fundamental security concern is that we can't rely solely on user-provided filenames. Consider this dangerous scenario:
# Vulnerable approach - trusting user input
filename = request.files['upload'].filename  # "report.docx"
if not filename.endswith('.docx'):
    raise InvalidFileTypeError()
The most robust solution is to customize the magic pattern database. Here's how to create a proper magic pattern for .docx files:
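# Add to /etc/magic.local or your custom magic file.
# Offset 0x1E (30) is where the name of the first ZIP entry starts, so these
# rules match only when that first entry sits under word/, xl/, or ppt/.
0 string PK\x03\x04\x14\x00\x06\x00
>0x1E string word/_rels Microsoft Word 2007+
>0x1E string xl/_rels Microsoft Excel 2007+
>0x1E string ppt/_rels Microsoft PowerPoint 2007+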
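To see what those offsets refer to, here is a minimal sketch that unpacks the fixed 30-byte ZIP local file header and returns the entry name that follows it. The helper name `first_entry_name` is hypothetical, and the sample archive is built in memory purely for illustration. Office-generated files commonly put `[Content_Types].xml` first, which is why a rule anchored at offset 30 can miss real documents.

```python
import io
import struct
import zipfile

# Fixed part of a ZIP local file header: 30 bytes, little-endian.
# Fields: signature, version, flags, method, mtime, mdate, crc32,
# compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def first_entry_name(data: bytes):
    """Return the name of the first ZIP entry, or None if data is not a ZIP."""
    if len(data) < LOCAL_HEADER.size:
        return None
    fields = LOCAL_HEADER.unpack_from(data, 0)
    signature, name_length = fields[0], fields[9]
    if signature != b"PK\x03\x04":
        return None
    start = LOCAL_HEADER.size  # 30 == 0x1E, the offset used in the magic rules
    return data[start:start + name_length].decode("cp437", errors="replace")


# Illustrative archive laid out like a typical Office file.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")

print(first_entry_name(sample.getvalue()))  # [Content_Types].xml
```

Running a check like this against a few real uploads is a quick way to find out which entry names your magic rules actually need to match.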
For Python applications using python-magic, you can implement a fallback verification:
import magic

def verify_office_file(file_path):
    mime = magic.Magic(mime=True)
    detected = mime.from_file(file_path)
    if detected == 'application/zip':
        # Additional verification for Office files. The fixed ZIP header is
        # 30 bytes and entry names only start after it, so read a larger
        # chunk. 4096 bytes is a heuristic, not a guarantee.
        with open(file_path, 'rb') as f:
            header = f.read(4096)
        if b'word/_rels' in header:
            return 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
        elif b'xl/_rels' in header:
            return 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        elif b'ppt/_rels' in header:
            return 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
    return detected
If you can't modify system files, consider this dual verification approach:
def is_valid_office_file(uploaded_file):
    # First check extension
    valid_extensions = {'.docx', '.xlsx', '.pptx'}
    if not any(uploaded_file.name.lower().endswith(ext) for ext in valid_extensions):
        return False
    # Then verify content
    mime = magic.Magic()
    file_type = mime.from_buffer(uploaded_file.read(1024))
    uploaded_file.seek(0)  # Rewind for actual processing
    return ('Microsoft Word 2007+' in file_type or
            'Microsoft Excel 2007+' in file_type or
            'Microsoft PowerPoint 2007+' in file_type)
For high-traffic systems, consider these optimizations:
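- Reuse a single magic.Magic instance instead of constructing one per request, so the pattern database is loaded only once
- Implement early return when non-ZIP files are detected
- Read no more than you need: the first 4 bytes (PK\x03\x04) are enough to rule out non-ZIP files, but entry names only start at byte 30, so identifying the Office type needs a longer read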
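The early-return idea can be sketched as a four-byte signature check that runs before any heavier detection. The function name `looks_like_zip` is hypothetical, and the sketch assumes the upload is exposed as a seekable binary stream.

```python
import io

ZIP_SIGNATURE = b"PK\x03\x04"  # local file header signature


def looks_like_zip(stream) -> bool:
    """Peek at the first four bytes, then restore the stream position."""
    position = stream.tell()
    try:
        return stream.read(len(ZIP_SIGNATURE)) == ZIP_SIGNATURE
    finally:
        stream.seek(position)


# Usage: reject obvious non-ZIP uploads before calling libmagic at all.
upload = io.BytesIO(b"%PDF-1.7 ...")
if not looks_like_zip(upload):
    print("rejected early: not a ZIP container")
```

A match here only says the file might be an Office document; it still needs the content checks described above.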
For maximum security:
- Implement both magic number verification and extension checking
- Store original filename separately from actual file content
- Consider using specialized libraries like python-ooxml for thorough validation
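One way to keep the original filename separate from the stored content is to generate the storage key yourself and treat the user-supplied name as display-only metadata. This is a minimal sketch; `make_storage_record` and its dictionary keys are hypothetical names, not part of any framework.

```python
import os
import uuid


def make_storage_record(original_filename: str) -> dict:
    """Pair a server-generated storage key with a cleaned display name."""
    # Drop any directory components, whichever separator the client used.
    display_name = os.path.basename(original_filename.replace("\\", "/"))
    return {
        # Random key used on disk or in object storage; never user-controlled.
        "storage_key": uuid.uuid4().hex,
        # Kept only for display and download headers.
        "original_filename": display_name[:255],
    }


record = make_storage_record("..\\..\\uploads/report.docx")
print(record["original_filename"])  # report.docx
```

Because the storage key never derives from user input, a crafted filename cannot influence where the bytes end up.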
Modern Office documents (.docx, .xlsx, .pptx) are essentially ZIP archives containing XML files and resources. When using the file command, or libmagic through python-magic, these files get detected as application/zip rather than their actual Office document types.
In web applications handling file uploads, relying solely on the user-provided filename extension is insecure. We need proper content-type verification to:
- Prevent malicious file uploads disguised as Office documents
- Ensure correct MIME type for downloads
- Maintain proper file type validation in the database
The solution involves creating custom magic rules to properly identify Office Open XML documents. Here's how to implement it:
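# Add to /etc/magic.local or a custom magic source file. Do not edit
# /usr/share/misc/magic.mgc by hand: it is a compiled database.
# Offset 30 is where the first ZIP entry's name starts, so each rule matches
# only when the named part is the first entry in the archive.
0 string PK\x03\x04\x14\x00\x06\x00
>30 string word/document.xml Microsoft Word 2007+
>30 string ppt/presentation.xml Microsoft PowerPoint 2007+
>30 string xl/workbook.xml Microsoft Excel 2007+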
Here's how to implement custom file detection in Python:
import magic

def detect_filetype(file_path):
    mime = magic.Magic(mime=True)
    file_type = mime.from_file(file_path)
    if file_type == 'application/zip':
        # Additional verification for Office documents. Entry names are
        # stored uncompressed in the archive, but the main part is rarely
        # the first entry, so scan a larger chunk (4096 bytes is a heuristic).
        with open(file_path, 'rb') as f:
            header = f.read(4096)
        if b'word/document.xml' in header:
            return 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
        elif b'xl/workbook.xml' in header:
            return 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        elif b'ppt/presentation.xml' in header:
            return 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
    return file_type
When storing files in a database, include both the verified content type and original filename:
CREATE TABLE uploaded_files (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(255),
    content_type VARCHAR(100),
    file_data BYTEA,
    verified BOOLEAN DEFAULT FALSE
);
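As a rough illustration of writing to such a table, the sketch below uses Python's built-in sqlite3 module with an in-memory database. That is an assumption made only so the example runs anywhere: the column types are adapted to SQLite (INTEGER, TEXT, BLOB) rather than the PostgreSQL types above, and `store_upload` is a hypothetical helper.

```python
import sqlite3

# SQLite adaptation of the uploaded_files table, for illustration only.
SCHEMA = """
CREATE TABLE uploaded_files (
    id INTEGER PRIMARY KEY,
    filename TEXT,
    content_type TEXT,
    file_data BLOB,
    verified INTEGER DEFAULT 0
)
"""


def store_upload(conn, filename, content_type, data, verified):
    """Insert one upload using bound parameters and return its row id."""
    cursor = conn.execute(
        "INSERT INTO uploaded_files (filename, content_type, file_data, verified)"
        " VALUES (?, ?, ?, ?)",
        (filename, content_type, data, int(verified)),
    )
    conn.commit()
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
row_id = store_upload(
    conn,
    "report.docx",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    b"PK\x03\x04",
    verified=True,
)
```

The content_type stored here should be the value your own detection produced, not the Content-Type header sent by the client.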
Alternative approaches:
- File command wrapper: Create a shell script that first checks for Office documents specifically
- Dedicated libraries: Use python-ooxml or similar libraries for more thorough validation
- Content sniffing: Combine magic numbers with ZIP content inspection
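The content-sniffing option can be sketched with the standard zipfile module, which reads the archive's central directory instead of guessing from the first bytes. The function name `sniff_ooxml` is hypothetical. The check assumes a valid package contains `[Content_Types].xml` plus one of the main parts named earlier. It does not parse the XML, so treat it as a first filter rather than full validation.

```python
import io
import zipfile

# Main part of each document type, mapped to the MIME type to report.
MAIN_PARTS = {
    "word/document.xml":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def sniff_ooxml(stream):
    """Return an OOXML MIME type, or None if the stream is not recognised."""
    position = stream.tell()
    try:
        with zipfile.ZipFile(stream) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None
    finally:
        stream.seek(position)  # leave the stream where the caller had it
    if "[Content_Types].xml" not in names:
        return None  # a plain ZIP, not an OOXML package
    for part, mime_type in MAIN_PARTS.items():
        if part in names:
            return mime_type
    return None
```

A reasonable pipeline is to run libmagic first and call a function like this only when it reports application/zip.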
Always remember:
- Magic number detection isn't foolproof against deliberately malformed files
- Consider virus scanning for uploaded Office documents
- Implement size limits to prevent ZIP bomb attacks
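A size limit against ZIP bombs can be sketched by summing the uncompressed sizes declared in the archive before extracting anything. The function name `within_zip_limits` and both thresholds are assumptions chosen for illustration. Declared sizes can be forged, so extraction code should also cap how many bytes it actually reads.

```python
import io
import zipfile

MAX_ENTRIES = 2000                          # illustrative limit
MAX_TOTAL_UNCOMPRESSED = 200 * 1024 * 1024  # illustrative limit: 200 MiB


def within_zip_limits(stream, max_entries=MAX_ENTRIES,
                      max_total=MAX_TOTAL_UNCOMPRESSED) -> bool:
    """Check declared entry count and total uncompressed size."""
    position = stream.tell()
    try:
        with zipfile.ZipFile(stream) as archive:
            entries = archive.infolist()
    except zipfile.BadZipFile:
        return False
    finally:
        stream.seek(position)  # leave the stream where the caller had it
    if len(entries) > max_entries:
        return False
    return sum(info.file_size for info in entries) <= max_total
```

This check is cheap because it reads only the central directory, so it can run before virus scanning or any parsing of the document.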