How to Make libmagic/file Correctly Detect .docx, .xlsx, and .pptx Files Instead of ZIP Format



When working with file uploads in web applications, many developers run into an annoying issue: Microsoft Office files (.docx, .xlsx, .pptx) are often reported as plain ZIP archives by libmagic and the file command, particularly with older magic databases or when only a short buffer is inspected. This happens because these formats are ZIP containers that differ only in the parts stored inside them.
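
To make the ZIP relationship concrete, here is a minimal sketch that builds a small archive in memory with Python's standard zipfile module and checks its first four bytes. The archive is only a stand-in for a real .docx (the helper name and the placeholder contents are assumptions, and the result is not a valid document), but the signature is the same one any ZIP-based detection sees first.

```python
import io
import zipfile

ZIP_SIGNATURE = b"PK\x03\x04"  # local file header signature shared by ZIP archives


def build_fake_docx() -> bytes:
    """Build a stand-in archive with OOXML-style entry names (not a valid document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


data = build_fake_docx()
starts_like_zip = data[:4] == ZIP_SIGNATURE          # container starts like any ZIP
zipfile_agrees = zipfile.is_zipfile(io.BytesIO(data))  # and parses as one
print(starts_like_zip, zipfile_agrees)
```

Nothing in those first bytes says "Word" or "Excel"; that information only appears in the entry names further into the file.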

The fundamental security concern is that we can't rely solely on user-provided filenames. Consider this dangerous scenario:


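# Vulnerable approach: trusts the client-supplied filename
filename = request.files['upload'].filename  # e.g. "report.docx"
if not filename.endswith('.docx'):
    raise InvalidFileTypeError()
# Any file renamed to *.docx passes this check; the content is never inspected.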

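One solution is to extend the magic pattern database. The rule below matches the ZIP local file header and then tests the name of the first archive entry, which starts at offset 0x1E (30). It therefore only matches archives whose first entry sits under word/, xl/, or ppt/. Files written by Microsoft Office usually begin with [Content_Types].xml instead, so test the rule against your own samples: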


# Add to /etc/magic.local or your custom magic file
0       string          PK\x03\x04\x14\x00\x06\x00
>0x1E   string          word/_rels         Microsoft Word 2007+
>0x1E   string          xl/_rels           Microsoft Excel 2007+
>0x1E   string          ppt/_rels          Microsoft PowerPoint 2007+
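
The rules above test the bytes at offset 0x1E, which is where the first entry name of a ZIP archive begins. As a rough illustration of what sits at that offset, the following sketch parses the fixed 30-byte local file header with the standard struct module. The function name first_entry_name is hypothetical, and the sample archive is a placeholder rather than a real document.

```python
import io
import struct
import zipfile

# Fixed part of a ZIP local file header: 30 bytes, little-endian.
# signature, version, flags, method, time, date, crc32, compressed size,
# uncompressed size, name length, extra field length
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def first_entry_name(data: bytes):
    """Return the name of the first archive entry, or None if data is not a ZIP."""
    if len(data) < LOCAL_HEADER.size:
        return None
    fields = LOCAL_HEADER.unpack_from(data, 0)
    signature, name_length = fields[0], fields[9]
    if signature != b"PK\x03\x04":
        return None
    # The entry name follows the fixed header, i.e. it starts at offset 30 (0x1E).
    raw_name = data[LOCAL_HEADER.size:LOCAL_HEADER.size + name_length]
    return raw_name.decode("utf-8", errors="replace")


# Placeholder archive that mimics the entry order Office typically writes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/_rels/document.xml.rels", "<Relationships/>")

print(first_entry_name(buffer.getvalue()))
```

Running a helper like this over a few real uploads is a quick way to see which name a rule anchored at offset 0x1E would actually be compared against.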

For Python applications using python-magic, you can implement a fallback verification:


import magic

def verify_office_file(file_path):
    mime = magic.Magic(mime=True)
    detected = mime.from_file(file_path)
    
    if detected == 'application/zip':
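        # Additional verification for Office files. Entry names only begin at
        # byte 30 of a local file header, so a 30-byte read contains no name at
        # all. Scan a larger prefix instead (a heuristic, not a full ZIP parse).
        with open(file_path, 'rb') as f:
            header = f.read(2048)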
            if b'word/_rels' in header:
                return 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
            elif b'xl/_rels' in header:
                return 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
            elif b'ppt/_rels' in header:
                return 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
    
    return detected

If you can't modify system files, consider this dual verification approach:


def is_valid_office_file(uploaded_file):
    # First check extension
    valid_extensions = {'.docx', '.xlsx', '.pptx'}
    if not any(uploaded_file.name.lower().endswith(ext) for ext in valid_extensions):
        return False
    
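    # Then verify content. The python-magic documentation recommends passing
    # at least the first 2048 bytes; shorter buffers can be misidentified.
    mime = magic.Magic()
    file_type = mime.from_buffer(uploaded_file.read(2048))
    uploaded_file.seek(0)  # Rewind for actual processing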
    
    return ('Microsoft Word 2007+' in file_type or 
            'Microsoft Excel 2007+' in file_type or 
            'Microsoft PowerPoint 2007+' in file_type)

For high-traffic systems, consider these optimizations:

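  • Create one magic.Magic instance and reuse it, since each new instance loads the magic database again
  • Return early when a file does not start with the ZIP signature
  • Read only as much as each check needs (4 bytes for the PK\x03\x04 signature; entry names only begin at byte 30)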
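
A minimal sketch of the first two points, assuming python-magic is installed for the final detection step. The helper names (get_detector, starts_with_zip_signature, detect) are hypothetical, and the 2048-byte buffer size follows the python-magic recommendation rather than anything specific to Office files.

```python
import io
from functools import lru_cache

ZIP_SIGNATURE = b"PK\x03\x04"


@lru_cache(maxsize=1)
def get_detector():
    """Create the python-magic detector once and reuse it on later calls."""
    import magic  # imported lazily so the signature check works without it
    return magic.Magic(mime=True)


def starts_with_zip_signature(stream) -> bool:
    """Read 4 bytes, restore the stream position, and compare with the ZIP signature."""
    start = stream.tell()
    prefix = stream.read(4)
    stream.seek(start)
    return prefix == ZIP_SIGNATURE


def detect(stream):
    """Return a MIME type for ZIP-based uploads, or None for anything else."""
    if not starts_with_zip_signature(stream):
        return None  # early return: libmagic is never invoked for non-ZIP data
    start = stream.tell()
    data = stream.read(2048)
    stream.seek(start)
    return get_detector().from_buffer(data)


print(detect(io.BytesIO(b"%PDF-1.7 not an office file")))
```

Whether the early return pays off depends on how many uploads are not ZIP-based in the first place; measure before relying on it.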

For maximum security:

  1. Implement both magic number verification and extension checking
  2. Store original filename separately from actual file content
  3. Consider using specialized libraries like python-ooxml for thorough validation
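
Point 2 can be sketched as follows. This is an illustration under assumptions, not a prescribed layout: the StoredUpload type, the plan_storage helper, and the choice of a UUID-based storage name are all hypothetical.

```python
import uuid
from dataclasses import dataclass
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


@dataclass(frozen=True)
class StoredUpload:
    storage_name: str       # server-generated name, used for the stored content
    original_filename: str  # client-supplied name, kept only as metadata


def plan_storage(original_filename: str) -> StoredUpload:
    """Pick a server-side name that does not reuse any part of the client path."""
    extension = PurePosixPath(original_filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {extension!r}")
    return StoredUpload(f"{uuid.uuid4().hex}{extension}", original_filename)


plan = plan_storage("../../tmp/Quarterly Report.DOCX")
print(plan.storage_name, "|", plan.original_filename)
```

The client-supplied name is still available for display or download headers, but it never decides where or under which name the bytes are written.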

Modern Office documents (.docx, .xlsx, .pptx) are essentially ZIP archives containing XML files and resources. When you use the file command, or libmagic through python-magic, these files are often detected as application/zip rather than as their actual Office document types.

In web applications handling file uploads, relying solely on the user-provided filename extension is insecure. We need proper content-type verification to:

  • Prevent malicious file uploads disguised as Office documents
  • Ensure correct MIME type for downloads
  • Maintain proper file type validation in the database

The solution involves creating custom magic rules to properly identify Office Open XML documents. Here's how to implement it:

# Add to /etc/magic.local or another custom magic source file.
# (/usr/share/misc/magic.mgc is a compiled database; do not edit it by hand.)
0       string          PK\x03\x04\x14\x00\x06\x00
>30     string          word/document.xml          Microsoft Word 2007+
>30     string          ppt/presentation.xml       Microsoft PowerPoint 2007+
>30     string          xl/workbook.xml            Microsoft Excel 2007+
>>0x1E leshort 0x0000 (Office Open XML Document)

Here's how to implement custom file detection in Python:

import magic

def detect_filetype(file_path):
    mime = magic.Magic(mime=True)
    file_type = mime.from_file(file_path)
    
    if file_type == 'application/zip':
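        # Additional verification for Office documents. This is a heuristic:
        # entry names are stored as plain bytes in the archive, but a name that
        # first appears beyond the 1024 bytes read here will be missed.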
        with open(file_path, 'rb') as f:
            header = f.read(1024)
            if b'word/document.xml' in header:
                return 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
            elif b'xl/workbook.xml' in header:
                return 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
            elif b'ppt/presentation.xml' in header:
                return 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
    
    return file_type

When storing files in a database, include both the verified content type and original filename:

CREATE TABLE uploaded_files (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(255),
    content_type VARCHAR(100),
    file_data BYTEA,
    verified BOOLEAN DEFAULT FALSE
);
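
As a rough illustration of writing to such a table, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The column types are adapted to SQLite (INTEGER, TEXT, BLOB instead of SERIAL, VARCHAR, BYTEA), and the save_upload helper is hypothetical; a production system would use its own database driver.

```python
import sqlite3

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE uploaded_files (
        id INTEGER PRIMARY KEY,
        filename TEXT,
        content_type TEXT,
        file_data BLOB,
        verified INTEGER DEFAULT 0
    )
    """
)


def save_upload(conn, filename, content_type, data, verified):
    """Insert one upload; content_type should come from content checks, not the client."""
    cursor = conn.execute(
        "INSERT INTO uploaded_files (filename, content_type, file_data, verified) "
        "VALUES (?, ?, ?, ?)",
        (filename, content_type, data, int(verified)),
    )
    conn.commit()
    return cursor.lastrowid


row_id = save_upload(connection, "report.docx", DOCX_MIME, b"PK\x03\x04", True)
row = connection.execute(
    "SELECT filename, content_type, verified FROM uploaded_files WHERE id = ?",
    (row_id,),
).fetchone()
print(row)
```

Parameterized queries keep the client-supplied filename from being interpreted as SQL.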

Alternative approaches include:

  • File command wrapper: Create a shell script that first checks for Office documents specifically
  • Dedicated libraries: Use python-ooxml or similar libraries for more thorough validation
  • Content sniffing: Combine magic numbers with ZIP content inspection
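
The content-sniffing idea can be sketched with the standard zipfile module, which reads the archive's entry list instead of scanning a byte prefix. The sniff_ooxml name is hypothetical, and the part names checked here are the defaults Microsoft Office writes; a stricter check would follow the relationships declared in _rels/.rels rather than assume fixed paths.

```python
import io
import zipfile

# Default main-part paths, mapped to the MIME types used earlier in this article.
OOXML_MAIN_PARTS = {
    "word/document.xml": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def sniff_ooxml(stream):
    """Return the OOXML MIME type suggested by the entry names, or None."""
    try:
        with zipfile.ZipFile(stream) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None  # not a readable ZIP archive at all
    if "[Content_Types].xml" not in names:
        return None  # a ZIP archive, but not laid out like an OOXML package
    for part, mime_type in OOXML_MAIN_PARTS.items():
        if part in names:
            return mime_type
    return None


# Placeholder archive standing in for an uploaded spreadsheet.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("xl/workbook.xml", "<workbook/>")
sample.seek(0)
print(sniff_ooxml(sample))
```

This only inspects names, so it confirms the package layout, not that the XML inside is a well-formed document.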

Finally, keep in mind:

  • Magic number detection isn't foolproof against deliberately malformed files
  • Uploaded Office documents should still be virus scanned
  • Size limits help prevent ZIP bomb attacks
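
A size-limit check might look like the following sketch. The limits are placeholder values chosen for illustration, and the within_zip_limits name is hypothetical. The sizes come from the archive's own directory and are therefore only declared values, so code that actually extracts entries should also enforce limits while reading.

```python
import io
import zipfile

MAX_TOTAL_UNCOMPRESSED = 50 * 1024 * 1024  # assumed limit: 50 MiB
MAX_ENTRIES = 2000                          # assumed limit on entry count
MAX_RATIO = 100                             # assumed uncompressed/compressed ratio


def within_zip_limits(stream, max_total=MAX_TOTAL_UNCOMPRESSED,
                      max_entries=MAX_ENTRIES, max_ratio=MAX_RATIO) -> bool:
    """Check declared entry sizes against the limits before any extraction."""
    try:
        with zipfile.ZipFile(stream) as archive:
            entries = archive.infolist()
    except zipfile.BadZipFile:
        return False
    if len(entries) > max_entries:
        return False
    total = sum(entry.file_size for entry in entries)
    compressed = sum(entry.compress_size for entry in entries)
    if total > max_total:
        return False
    if compressed > 0 and total / compressed > max_ratio:
        return False
    return True


# A highly compressible entry: 1 MiB of zero bytes, deflated.
suspicious = io.BytesIO()
with zipfile.ZipFile(suspicious, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("zeros.bin", b"\x00" * (1024 * 1024))
suspicious.seek(0)
print(within_zip_limits(suspicious))
```

Legitimate Office files compress well too, so a ratio threshold needs tuning against real documents before it is used to reject uploads.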