Cost-Effective S3 Batch Operations: Migrating Millions of Files Between Buckets in Same AWS Region


When dealing with large-scale S3 data migrations within the same AWS region, we face two primary constraints:

  • Data transfer between S3 buckets in the same region is free ($0.00 per GB as of 2023 pricing)
  • Request charges still apply for S3 operations (PUT, COPY, LIST, DELETE), roughly estimated in the sketch below
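
As a rough back-of-the-envelope check, the sketch below estimates the request bill for a COPY-based migration. The ~$0.005 per 1,000 PUT/COPY/LIST requests figure is an assumption based on us-east-1 S3 Standard pricing; verify current rates before relying on the numbers.

object_count = 5_000_000                  # objects to migrate (example figure)
request_rate = 0.005 / 1000               # assumed ~$0.005 per 1,000 PUT/COPY/LIST requests

copy_cost = object_count * request_rate               # one COPY request per object
list_cost = (object_count / 1000) * request_rate      # LIST returns up to 1,000 keys per call

print(f"COPY: ${copy_cost:,.2f}  LIST: ${list_cost:,.2f}")
# roughly $25 for COPY and a few cents for LIST at 5 million objects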

For minimal-cost migration of millions of objects, consider these AWS-native approaches:

1. S3 Batch Operations

The most efficient method for large-scale transfers:

aws s3control create-job \
--account-id YOUR_ACCOUNT_ID \
--operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::destination-bucket"}}' \
--manifest '{"Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]}, "Location": {"ObjectArn": "arn:aws:s3:::source-bucket/manifest.csv", "ETag": "EXAMPLEETAG12345"}}' \
--report '{"Bucket": "arn:aws:s3:::report-bucket", "Prefix": "reports", "Format": "Report_CSV_20180820", "Enabled": true, "ReportScope": "AllTasks"}' \
--priority 10 \
--role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/batch-operations-role \
--region us-east-1
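
The job above expects a CSV manifest listing every object to copy. Here is a minimal sketch of generating and uploading one, assuming the source-bucket/manifest.csv location from the command and the two-column Bucket,Key format (no header row) declared in the manifest spec; the printed ETag is the value create-job needs.

import csv
import io

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# Build a Bucket,Key manifest (no header row) from a full bucket listing.
buffer = io.StringIO()
writer = csv.writer(buffer)
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='source-bucket'):
    for obj in page.get('Contents', []):
        writer.writerow(['source-bucket', obj['Key']])

# Upload the manifest; create-job needs both its ARN and its ETag.
response = s3.put_object(
    Bucket='source-bucket',
    Key='manifest.csv',
    Body=buffer.getvalue().encode('utf-8')
)
print('Manifest ETag:', response['ETag'].strip('"'))

For buckets holding millions of objects, an S3 Inventory report can serve as the manifest directly and avoids this listing pass.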

2. AWS SDK Parallel Processing

For programmatic control with Python (boto3):

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3', region_name='us-east-1')

def copy_object(source_key):
    copy_source = {'Bucket': 'source-bucket', 'Key': source_key}
    s3.copy_object(
        CopySource=copy_source,
        Bucket='destination-bucket',
        Key=f'new-prefix/{source_key}'
    )

# Get object list (consider S3 inventory for large buckets)
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='source-bucket')

with ThreadPoolExecutor(max_workers=50) as executor:
    for page in pages:
        if 'Contents' in page:
            for obj in page['Contents']:
                executor.submit(copy_object, obj['Key'])
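
One caveat worth noting: copy_object only accepts source objects up to 5 GB. If larger objects are in scope, a variant along these lines (reusing the s3 client and placeholder names from the block above) hands the work to boto3's managed copy, which switches to multipart copy automatically above the configured threshold.

from boto3.s3.transfer import TransferConfig

# Keep single-request COPY for small objects; multipart copy kicks in above 5 GB.
transfer_config = TransferConfig(multipart_threshold=5 * 1024 ** 3)

def copy_any_size(source_key):
    s3.copy(
        CopySource={'Bucket': 'source-bucket', 'Key': source_key},
        Bucket='destination-bucket',
        Key=f'new-prefix/{source_key}',
        Config=transfer_config
    )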

To further reduce costs:

  • S3 Inventory First: Generate an S3 Inventory report of the source bucket to plan the migration
  • Request Minimization: Use COPY operations instead of download/upload round trips
  • Lifecycle Policies: Set expiration rules for source objects after verification (see the sketch after this list)
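
A minimal sketch of that last bullet, assuming the entire source bucket should expire seven days after verification; the bucket name, empty prefix, and seven-day window are placeholders to adjust.

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# Expire source objects a week after the migration has been verified.
s3.put_bucket_lifecycle_configuration(
    Bucket='source-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-migrated-source-objects',
            'Filter': {'Prefix': ''},
            'Status': 'Enabled',
            'Expiration': {'Days': 7}
        }]
    }
)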

After migration, validate using:

aws s3 sync s3://source-bucket/ s3://destination-bucket/ --dryrun

When moving files between S3 buckets in the same AWS region, you'll encounter several cost factors:

  • PUT/COPY/LIST requests: Charged per thousand operations
  • Storage: Duplicate storage charges while both source and destination copies exist
  • Data transfer: Free for same-region transfers

For same-region transfers, these approaches minimize costs:

1. S3 Batch Operations with Copy

The most cost-effective solution for millions of files:

aws s3control create-job \
--account-id YOUR_ACCOUNT_ID \
--operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::destination-bucket"}}' \
--manifest '{"Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]}, "Location": {"ObjectArn": "arn:aws:s3:::source-bucket/manifest.csv", "ETag": "manifest-etag"}}' \
--report '{"Bucket": "arn:aws:s3:::report-bucket", "Prefix": "reports", "Format": "Report_CSV_20180820"}' \
--priority 10 \
--role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/batch-operations-role
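
Once the job is created, its progress can be polled. A short sketch, assuming the JobId returned by the create-job call above and the same account-ID placeholder:

import boto3

s3control = boto3.client('s3control', region_name='us-east-1')

# JobId comes from the create-job response.
response = s3control.describe_job(
    AccountId='YOUR_ACCOUNT_ID',
    JobId='your-job-id'
)
job = response['Job']
print(job['Status'], job['ProgressSummary'])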

2. AWS CLI Sync Command

For smaller transfers or incremental updates:

aws s3 sync s3://source-bucket s3://destination-bucket \
--storage-class STANDARD \
--exclude "*" \
--include "*.txt" \
--metadata-directive COPY

When dealing with millions of files, two further approaches are worth considering:

Parallel Processing with S3DistCp

Using EMR for extremely large datasets:

s3-dist-cp \
--src s3://source-bucket/prefix \
--dest s3://destination-bucket/prefix \
--deleteOnSuccess \
--targetSize=128 \
--groupBy='.*(\d\d\d\d-\d\d-\d\d).*'
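
Note that s3-dist-cp runs as a step on an EMR cluster rather than as a standalone CLI. A sketch of submitting the copy above to an existing cluster, assuming a placeholder cluster ID and that command-runner.jar is available (it is on standard EMR releases):

import boto3

emr = boto3.client('emr', region_name='us-east-1')

# Submit the s3-dist-cp copy as a step on an existing cluster (ID is a placeholder).
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',
    Steps=[{
        'Name': 's3-dist-cp bucket copy',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                's3-dist-cp',
                '--src', 's3://source-bucket/prefix',
                '--dest', 's3://destination-bucket/prefix'
            ]
        }
    }]
)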

Lambda-Based Transfer

For event-driven processing with S3 triggers:

import urllib.parse

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        # S3 event notifications URL-encode object keys, so decode before copying
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        copy_source = {'Bucket': source_bucket, 'Key': key}
        s3.copy_object(
            Bucket='destination-bucket',
            Key=key,
            CopySource=copy_source,
            MetadataDirective='COPY'
        )

Further ways to trim the bill:

  • Infrequent Access: Set lifecycle policies on the destination after the transfer (see the sketch after this list)
  • Request batching: Use manifests to minimize API calls
  • Timing: Schedule transfers during off-peak hours
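
A minimal sketch of the Infrequent Access point, assuming objects should transition to S3 Standard-IA 30 days after landing in the destination bucket (30 days matches the minimum age S3 requires before that transition):

import boto3

s3 = boto3.client('s3')

# Transition migrated objects to Standard-IA 30 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket='destination-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'transition-to-standard-ia',
            'Filter': {'Prefix': ''},
            'Status': 'Enabled',
            'Transitions': [{'Days': 30, 'StorageClass': 'STANDARD_IA'}]
        }]
    }
)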

Always verify transfers with:

aws s3 ls s3://destination-bucket --recursive | wc -l
aws s3 ls s3://source-bucket --recursive | wc -l
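
If you prefer to script the check, a small sketch along these lines (same placeholder bucket names) compares total bytes as well as object counts; for a stronger guarantee, diff the Batch Operations completion report or an S3 Inventory of each bucket.

import boto3

s3 = boto3.client('s3')

def bucket_summary(bucket):
    """Return (object_count, total_bytes) for a bucket."""
    count = total = 0
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            count += 1
            total += obj['Size']
    return count, total

print('source     :', bucket_summary('source-bucket'))
print('destination:', bucket_summary('destination-bucket'))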