When dealing with large-scale S3 data migrations within the same AWS region, we face two primary constraints:
- Data transfer between S3 buckets in the same region is free ($0.00 per GB as of 2023 pricing)
- Request costs still apply, billed per 1,000 operations (PUT, COPY, LIST); DELETE requests are free (a rough estimate of these charges follows below)
- Storage is billed on both copies until the source objects are deleted
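A minimal sketch of that estimate, assuming the us-east-1 Standard rate of $0.005 per 1,000 PUT/COPY requests (2023 pricing; check the current price list):

# Rough request-cost estimate for a same-region, server-side copy.
OBJECT_COUNT = 10_000_000
COPY_PRICE_PER_1000 = 0.005  # USD per 1,000 COPY requests (assumed us-east-1 Standard rate)

copy_cost = OBJECT_COUNT / 1000 * COPY_PRICE_PER_1000
print(f"Estimated COPY request cost: ${copy_cost:,.2f}")  # about $50 for 10 million objects
# Data transfer adds $0.00 because source and destination are in the same region.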
For minimal-cost migration of millions of objects, consider these AWS-native approaches:
1. S3 Batch Operations
The most efficient method for large-scale transfers:
aws s3control create-job \
--account-id YOUR_ACCOUNT_ID \
--operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::destination-bucket"}}' \
--manifest '{"Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]}, "Location": {"ObjectArn": "arn:aws:s3:::source-bucket/manifest.csv", "ETag": "EXAMPLEETAG12345"}}' \
--report '{"Bucket": "arn:aws:s3:::report-bucket", "Prefix": "reports", "Format": "Report_CSV_20180820", "Enabled": true, "ReportScope": "AllTasks"}' \
--priority 10 \
--role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/batch-operations-role \
--region us-east-1
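S3 Batch Operations reads the objects to copy from the CSV manifest referenced above (one bucket,key row per object, no header). Below is a minimal sketch for building and uploading that manifest from a bucket listing, reusing the same placeholder bucket names; for very large buckets, an S3 Inventory report is the cheaper source of keys:

import csv
import io
import boto3

s3 = boto3.client('s3')

# Write one "bucket,key" row per object; csv handles quoting of awkward keys.
buf = io.StringIO()
writer = csv.writer(buf)
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='source-bucket'):
    for obj in page.get('Contents', []):
        writer.writerow(['source-bucket', obj['Key']])

# Upload the manifest; the returned ETag is what goes into --manifest above.
resp = s3.put_object(Bucket='source-bucket', Key='manifest.csv',
                     Body=buf.getvalue().encode('utf-8'))
print('Manifest ETag:', resp['ETag'])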
2. AWS SDK Parallel Processing
For programmatic control with Python (boto3):
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3', region_name='us-east-1')

def copy_object(source_key):
    # Server-side COPY: no data leaves S3, so only the COPY request is billed
    copy_source = {'Bucket': 'source-bucket', 'Key': source_key}
    s3.copy_object(
        CopySource=copy_source,
        Bucket='destination-bucket',
        Key=f'new-prefix/{source_key}'
    )

# Get the object list (consider S3 Inventory for large buckets)
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='source-bucket')

# boto3 clients are thread-safe, so one client can be shared across workers
with ThreadPoolExecutor(max_workers=50) as executor:
    for page in pages:
        for obj in page.get('Contents', []):
            executor.submit(copy_object, obj['Key'])
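One caveat: copy_object is limited to objects of 5 GB or less. For larger objects, boto3's managed copy switches to multipart UploadPartCopy automatically; a short sketch with a hypothetical key name:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Managed copy: falls back to multipart UploadPartCopy for large objects,
# so it is not subject to copy_object's 5 GB single-request ceiling.
s3.copy(
    CopySource={'Bucket': 'source-bucket', 'Key': 'large-file.bin'},  # hypothetical key
    Bucket='destination-bucket',
    Key='new-prefix/large-file.bin',
    Config=TransferConfig(max_concurrency=10)  # parallel part copies
)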
To keep request and storage costs down:
- S3 Inventory first: generate an S3 Inventory report to plan the migration and build the manifest, rather than LISTing millions of keys
- Request minimization: use server-side COPY operations instead of download/re-upload
- Lifecycle policies: set expiration rules on the source objects after verification (see the sketch after this list)
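For that last point, a minimal sketch that expires everything in the source bucket; the rule ID and the 7-day window are placeholders:

import boto3

s3 = boto3.client('s3')

# Expiration.Days counts from each object's creation date, so existing objects
# older than 7 days are queued for deletion at the next daily lifecycle run.
# Apply this only after the destination bucket has been verified.
s3.put_bucket_lifecycle_configuration(
    Bucket='source-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-after-migration',
            'Filter': {'Prefix': ''},
            'Status': 'Enabled',
            'Expiration': {'Days': 7}
        }]
    }
)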
After the bulk copy, a dry-run sync reports any differences without copying anything:
aws s3 sync s3://source-bucket/ s3://destination-bucket/ --dryrun
Beyond the two approaches above, a few other options fit specific scenarios:
3. AWS CLI Sync Command
For smaller transfers or incremental updates:
aws s3 sync s3://source-bucket s3://destination-bucket \
--storage-class STANDARD \
--exclude "*" \
--include "*.txt" \
--metadata-directive COPY
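In this example the --exclude "*" / --include "*.txt" pair restricts the sync to .txt keys; drop both to copy everything. The CLI issues 10 concurrent S3 requests by default, which can be raised for large syncs with aws configure set default.s3.max_concurrent_requests 50, at the cost of more local CPU and bandwidth.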
4. Parallel Processing with S3DistCp
For extremely large datasets, S3DistCp running on an EMR cluster parallelizes the copy across the cluster's nodes; keep in mind the EMR instances themselves are billed for the duration of the job:
s3-dist-cp \
--src s3://source-bucket/prefix \
--dest s3://destination-bucket/prefix \
--deleteOnSuccess \
--targetSize=128 \
--groupBy='.*(\d{4}-\d{2}-\d{2}).*'
Here --groupBy concatenates objects whose keys match the regex (a YYYY-MM-DD date in this example) into output files of roughly --targetSize MiB, and --deleteOnSuccess removes each source object once its copy succeeds.
5. Lambda-Based Transfer
An S3 trigger on the source bucket fires for each newly created object and copies it across, which keeps the buckets in sync during a cutover; it complements, rather than replaces, a bulk copy of the existing objects:
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        # S3 event keys are URL-encoded, so decode before using them
        key = unquote_plus(record['s3']['object']['key'])
        copy_source = {'Bucket': source_bucket, 'Key': key}
        s3.copy_object(
            Bucket='destination-bucket',
            Key=key,
            CopySource=copy_source,
            MetadataDirective='COPY'
        )
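To wire this up, the source bucket needs an event notification pointing at the function, and S3 needs permission to invoke it. A rough sketch with a placeholder function ARN (the function itself must already exist):

import boto3

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')

FUNCTION_ARN = 'arn:aws:lambda:us-east-1:YOUR_ACCOUNT_ID:function:copy-to-destination'  # placeholder

# Allow S3 to invoke the function...
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId='allow-s3-invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::source-bucket'
)

# ...then have the source bucket fire it for every newly created object.
s3.put_bucket_notification_configuration(
    Bucket='source-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': FUNCTION_ARN,
            'Events': ['s3:ObjectCreated:*']
        }]
    }
)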
A few further cost and throughput considerations:
- Storage class: after the transfer, lifecycle rules can move objects to Infrequent Access or Intelligent-Tiering if access patterns allow
- Request batching: drive bulk jobs from a manifest (S3 Inventory or Batch Operations) rather than repeated LIST calls
- Timing: schedule large jobs outside your application's peak hours so the copy does not compete with production traffic for per-prefix request throughput
As a final check, compare object counts between source and destination:
aws s3 ls s3://destination-bucket --recursive | wc -l
aws s3 ls s3://source-bucket --recursive | wc -l
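Counting lines only confirms totals. If the counts disagree, a key-level diff narrows down what is missing; this sketch assumes keys were copied unchanged (adjust for any prefix remapping) and is meant as a spot check, since it holds every key in memory:

import boto3

s3 = boto3.client('s3')

def list_keys(bucket):
    # Collect every key via paginated LIST calls; for very large buckets,
    # compare S3 Inventory reports instead of listing everything in memory.
    keys = set()
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            keys.add(obj['Key'])
    return keys

missing = list_keys('source-bucket') - list_keys('destination-bucket')
print(f'{len(missing)} objects not yet copied')
for key in sorted(missing)[:20]:  # show a sample
    print(key)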