How to Recursively Update Content-Type for Specific File Extensions in Amazon S3 Buckets



When working with static files in Amazon S3, we often encounter improperly set Content-Type headers, particularly with JSON and Markdown files that default to text/plain. While setting proper defaults for new uploads solves future problems, we still need to address existing objects.

The key to updating metadata is S3's copy operation. Even when copying an object onto itself, we can modify metadata such as Content-Type. Note that MetadataDirective='REPLACE' replaces the object's user-defined metadata as well, so any custom metadata you want to keep must be passed again in the copy request. Here's the basic pattern:

s3.copy_object(
    Bucket=bucket_name,
    Key=object_key,
    CopySource={'Bucket': bucket_name, 'Key': object_key},
    ContentType=new_content_type,
    MetadataDirective='REPLACE'
)

We'll use Python with Boto3 to handle the recursive processing: paginate through the bucket listing, match each key against the target extensions, and copy each match onto itself with the corrected Content-Type:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def update_content_types(bucket, prefix='', extensions=None):
    # Map of lowercase extensions to the Content-Type to apply, e.g. {'.json': 'application/json'}
    extensions = extensions or {}
    paginator = s3.get_paginator('list_objects_v2')
    
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            
            # Skip directories
            if key.endswith('/'):
                continue
                
            # Check file extension
            for ext, content_type in extensions.items():
                if key.lower().endswith(ext.lower()):
                    try:
                        s3.copy_object(
                            Bucket=bucket,
                            Key=key,
                            CopySource={'Bucket': bucket, 'Key': key},
                            ContentType=content_type,
                            MetadataDirective='REPLACE'
                        )
                        print(f"Updated {key} to {content_type}")
                    except ClientError as e:
                        print(f"Error updating {key}: {e}")
                    break

# Example usage
update_content_types(
    bucket='my-bucket',
    extensions={
        '.json': 'application/json',
        '.md': 'text/markdown'
    }
)

For buckets with millions of objects:

  • Add error handling for throttling and retries
  • Consider parallel processing with threads (see the sketch after this list)
  • Log progress to track processed files
  • Run during low-traffic periods
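
Here is a minimal sketch of the first three points, assuming the extension matching shown above has already produced a mapping of object keys to the desired Content-Type (copy_in_place and update_in_parallel are illustrative helper names, not Boto3 APIs):

import boto3
from botocore.config import Config
from concurrent.futures import ThreadPoolExecutor, as_completed

# Adaptive retry mode lets Boto3 back off automatically when S3 throttles requests
s3 = boto3.client('s3', config=Config(retries={'max_attempts': 10, 'mode': 'adaptive'}))

def copy_in_place(bucket, key, content_type):
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={'Bucket': bucket, 'Key': key},
        ContentType=content_type,
        MetadataDirective='REPLACE'
    )

def update_in_parallel(bucket, keys_to_types, max_workers=10):
    # keys_to_types: {object key: desired Content-Type}, built with the same
    # extension matching shown earlier
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(copy_in_place, bucket, key, ct): key
                   for key, ct in keys_to_types.items()}
        for count, future in enumerate(as_completed(futures), start=1):
            key = futures[future]
            try:
                future.result()
            except Exception as e:
                print(f"Error updating {key}: {e}")
            if count % 1000 == 0:
                print(f"Processed {count} objects")

The adaptive retry mode tells Boto3 to back off automatically on throttling errors, which keeps the thread pool from amplifying a SlowDown response.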

For extremely large buckets, consider Amazon S3 Batch Operations:

  1. Create a manifest of all objects needing updates
  2. Create a Lambda function to handle the Content-Type update (a sketch follows this list)
  3. Execute the batch job through AWS CLI or Console
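
For step 2, a rough sketch of such a Lambda handler is shown below. The event and response fields follow the S3 Batch Operations invocation schema, and the CONTENT_TYPES mapping is an assumption for this example; verify both against the AWS documentation for your job configuration:

import urllib.parse
import boto3

s3 = boto3.client('s3')

# Assumed mapping for this example; adjust to the extensions you need to fix
CONTENT_TYPES = {'.json': 'application/json', '.md': 'text/markdown'}

def handler(event, context):
    # S3 Batch Operations passes one task per invocation; the key arrives URL-encoded
    task = event['tasks'][0]
    bucket = task['s3BucketArn'].split(':::')[-1]
    key = urllib.parse.unquote_plus(task['s3Key'])

    result_code, result_string = 'Succeeded', ''
    try:
        content_type = next((ct for ext, ct in CONTENT_TYPES.items()
                             if key.lower().endswith(ext)), None)
        if content_type:
            s3.copy_object(
                Bucket=bucket,
                Key=key,
                CopySource={'Bucket': bucket, 'Key': key},
                ContentType=content_type,
                MetadataDirective='REPLACE'
            )
            result_string = f'Set Content-Type to {content_type}'
    except Exception as e:
        result_code, result_string = 'PermanentFailure', str(e)

    return {
        'invocationSchemaVersion': event['invocationSchemaVersion'],
        'treatMissingKeysAs': 'PermanentFailure',
        'invocationId': event['invocationId'],
        'results': [{
            'taskId': task['taskId'],
            'resultCode': result_code,
            'resultString': result_string
        }]
    }

Returning a resultCode of PermanentFailure, rather than raising, lets the job's completion report record which objects failed.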

After running the script, verify changes with:

aws s3api head-object --bucket my-bucket --key path/to/file.json

Check the ContentType field in the output to confirm the update was successful.
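
For example, a successful update would show something along these lines (the values here are placeholders):

{
    "AcceptRanges": "bytes",
    "ContentLength": 1024,
    "ContentType": "application/json",
    "LastModified": "2024-01-01T00:00:00+00:00"
}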


When dealing with static files in Amazon S3, you might encounter situations where files aren't served with the correct Content-Type headers. This is particularly common with file types like JSON and Markdown, which often default to text/plain instead of their proper MIME types (application/json and text/markdown, respectively).

Incorrect Content-Type headers can cause:

  • Browser handling issues (showing raw JSON instead of formatted output)
  • API compatibility problems
  • SEO impacts for documentation sites

Here's how to update Content-Type headers for existing files in bulk using AWS CLI and Python:

Option 1: Using AWS CLI

aws s3 cp s3://your-bucket/ s3://your-bucket/ --recursive \
--exclude "*" \
--include "*.json" \
--metadata-directive REPLACE \
--content-type "application/json"
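
The command handles one content type per pass, so Markdown files need a second run with the include pattern and type adjusted:

aws s3 cp s3://your-bucket/ s3://your-bucket/ --recursive \
--exclude "*" \
--include "*.md" \
--metadata-directive REPLACE \
--content-type "text/markdown"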

Option 2: Python Script with Boto3

For more control and error handling:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def update_content_types(bucket_name, extensions_mapping):
    paginator = s3.get_paginator('list_objects_v2')
    
    for ext, content_type in extensions_mapping.items():
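        # Note: this lists the entire bucket once per extension; for very large
        # buckets, match all extensions within a single listing pass instead
        # (as in the first script above)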
        operation_parameters = {
            'Bucket': bucket_name,
            'Prefix': ''
        }
        
        for page in paginator.paginate(**operation_parameters):
            if 'Contents' not in page:
                continue
                
            for obj in page['Contents']:
                if obj['Key'].endswith(ext):
                    try:
                        s3.copy_object(
                            Bucket=bucket_name,
                            Key=obj['Key'],
                            CopySource={'Bucket': bucket_name, 'Key': obj['Key']},
                            MetadataDirective='REPLACE',
                            ContentType=content_type,
                            ACL='bucket-owner-full-control'
                        )
                        print(f"Updated {obj['Key']} to {content_type}")
                    except ClientError as e:
                        print(f"Error updating {obj['Key']}: {e}")

# Example usage
update_content_types('your-bucket', {
    '.json': 'application/json',
    '.md': 'text/markdown'
})

When dealing with large buckets:

  • Run updates during off-peak hours
  • Consider using S3 Batch Operations for extremely large buckets
  • Monitor AWS costs as these operations count against your request quotas

If you're serving the bucket through CloudFront, you can also correct the header at the edge instead of rewriting the objects:

  • Set up Lambda@Edge to modify response headers at request time
  • Use CloudFront Functions to rewrite the Content-Type header on viewer responses