Optimizing AWS S3 Backup Strategies: Handling 3M Files with Rsync and Cross-Region Replication


While Amazon S3 boasts 99.999999999% (11 nines) durability, this doesn't eliminate the need for backups. Durability protects against hardware failures, but human errors (like accidental deletions), malicious attacks, or regional outages still require backup solutions. For business-critical assets like product images, a multi-layered approach is recommended.

When dealing with 3 million small files, traditional rsync becomes inefficient due to:

  • Latency in S3 API calls for each file comparison
  • Network overhead from numerous small requests
  • Lack of parallelization in default rsync

1. Initial Full Backup (AWS CLI)

Seed the backup bucket with a one-time full copy:

aws s3 cp s3://source-bucket/ s3://backup-bucket/ --recursive
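
If the full copy needs to be re-run or resumed, aws s3 sync only transfers objects that differ, and raising the AWS CLI's S3 concurrency settings helps with millions of small objects (bucket names are placeholders):

# Tune the CLI for many small objects, then sync only what changed
aws configure set default.s3.max_concurrent_requests 100
aws configure set default.s3.max_queue_size 10000
aws s3 sync s3://source-bucket/ s3://backup-bucket/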

2. S3 Versioning + Lifecycle Policies

Enable versioning and set lifecycle rules:

aws s3api put-bucket-versioning --bucket your-bucket \
--versioning-configuration Status=Enabled

aws s3api put-bucket-lifecycle-configuration \
--bucket your-bucket \
--lifecycle-configuration file://lifecycle.json
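
The lifecycle.json file is not shown above; a minimal sketch that expires noncurrent versions after 90 days (adjust the rule to your retention policy) could look like:

{
  "Rules": [
    {
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
    }
  ]
}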

3. Cross-Region Replication (CRR)

Configure replication rules (CRR requires versioning to be enabled on both the source and destination buckets):

{
  "Role": "arn:aws:iam::account-id:role/CRRRole",
  "Rules": [
    {
      "Status": "Enabled",
      "Priority": 1,
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::destination-bucket" },
      "Filter": { "Prefix": "" }
    }
  ]
}
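
Assuming the JSON above is saved as replication.json, apply it with (role and bucket names are placeholders):

aws s3api put-bucket-replication \
    --bucket source-bucket \
    --replication-configuration file://replication.json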

For incremental backups using rsync:

#!/bin/bash
# Mount S3 via s3fs
s3fs source-bucket /mnt/source -o passwd_file=${HOME}/.passwd-s3fs \
-o use_cache=/tmp -o allow_other -o umask=0002

# Use parallel rsync
find /mnt/source -type f | parallel -j 20 rsync -azR {} /backup/destination/

# Alternative: use s5cmd for high-speed, parallel S3-to-S3 transfers
s5cmd sync 's3://source-bucket/*' s3://backup-bucket/

Implement checksum verification. Keep in mind that an ETag equals the object's MD5 only for single-part uploads; multipart uploads produce composite ETags, so a mismatch on a large object is not necessarily corruption:

aws s3api list-objects --bucket source-bucket --query 'Contents[].{Key: Key, ETag: ETag}' > source_etags.json
aws s3api list-objects --bucket backup-bucket --query 'Contents[].{Key: Key, ETag: ETag}' > backup_etags.json
diff source_etags.json backup_etags.json
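
A quicker sanity check before diffing ETags is to compare object counts and total bytes; note that listing 3M keys this way takes a while:

aws s3 ls s3://source-bucket --recursive --summarize | tail -n 2
aws s3 ls s3://backup-bucket --recursive --summarize | tail -n 2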

Consider these S3 storage classes for backups:

  • S3 Standard-IA for backups that are rarely read but may be needed back within milliseconds
  • S3 Glacier Instant Retrieval for archival copies
  • S3 One Zone-IA for non-critical copies (roughly 20% cheaper than Standard-IA, but stored in a single AZ)
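
The storage class can be set directly on the copy; for example, with the AWS CLI (STANDARD_IA here is one of the valid --storage-class values):

aws s3 sync s3://source-bucket/ s3://backup-bucket/ --storage-class STANDARD_IA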

Even with 11 nines of durability, real-world scenarios demand additional protection. Consider these cases where S3 alone isn't enough:

  • Accidental deletions (even with versioning enabled)
  • Ransomware or malicious overwrites
  • Region-wide outages (though extremely rare)
  • Compliance requirements mandating offline copies

Your current approach highlights a common pain point: rsync performs a full scan of all 3 million files on every run, creating unnecessary overhead. Here's why:


# Typical rsync command (inefficient for large S3 mounts)
rsync -avz --delete /mnt/s3-bucket/ /backup/location/

The fundamental issues:

  1. Filesystem metadata operations are expensive over S3 mounts
  2. No native change tracking between syncs
  3. Network latency compounds with millions of files

Option 1: S3 Batch Operations + Lambda


# Sample AWS CLI command for a Batch Operations copy job
# (create-job also requires a role ARN and a priority)
aws s3control create-job \
    --account-id 123456789012 \
    --role-arn arn:aws:iam::123456789012:role/BatchOperationsRole \
    --priority 10 \
    --operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::backup-bucket"}}' \
    --manifest '{"Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]}, "Location": {"ObjectArn": "arn:aws:s3:::manifest-bucket/manifest.csv", "ETag": "exampleETag"}}' \
    --report '{"Bucket": "arn:aws:s3:::report-bucket", "Prefix": "reports", "Format": "Report_CSV_20180820", "Enabled": true, "ReportScope": "AllTasks"}'
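
The manifest referenced above is a plain CSV of bucket,key rows matching the declared fields; the keys below are purely illustrative:

source-bucket,images/0001.jpg
source-bucket,images/0002.jpg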

Option 2: S3 Inventory + Incremental Sync

First generate inventory reports:


aws s3api put-bucket-inventory-configuration \
    --bucket source-bucket \
    --id config-id \
    --inventory-configuration file://inventory-config.json
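
The inventory-config.json file is not shown above; a minimal daily CSV inventory (destination bucket ARN and prefix are placeholders) might look like:

{
  "Id": "config-id",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::source-bucket",
      "Format": "CSV",
      "Prefix": "inventory"
    }
  },
  "Schedule": { "Frequency": "Daily" },
  "OptionalFields": ["Size", "LastModifiedDate", "ETag"]
}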

Then process it with a script along these lines (the inventory path is elided as in the original, and the column names must match your inventory's configured fields):


import os

import boto3
import pandas as pd

s3 = boto3.client('s3')

# S3 Inventory CSVs have no header row; list the columns in the order
# they were configured for the inventory.
INVENTORY_COLUMNS = ['Bucket', 'Key', 'Size', 'LastModifiedDate', 'ETag']

def incremental_sync():
    # Download the latest inventory (reading s3:// paths with pandas
    # requires the s3fs package; the exact path is elided here)
    inventory = pd.read_csv('s3://source-bucket/inventory/...',
                            names=INVENTORY_COLUMNS)

    # Compare with the previous state; on the first run there is none
    try:
        previous = pd.read_csv('local_state.csv')
    except FileNotFoundError:
        previous = pd.DataFrame(columns=INVENTORY_COLUMNS)

    # Keep only keys that are new or whose ETag changed since the last run
    merged = inventory.merge(previous[['Key', 'ETag']],
                             on=['Key', 'ETag'], how='left', indicator=True)
    changes = merged[merged['_merge'] == 'left_only']

    # Sync only the changed objects
    for _, row in changes.iterrows():
        local_path = os.path.join('/backup', row['Key'])
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file('source-bucket', row['Key'], local_path)

    # Update the local state for the next run
    inventory.to_csv('local_state.csv', index=False)

if __name__ == '__main__':
    incremental_sync()

Comparing backup strategies:

Strategy                      Frequency   Cost Estimate   Recovery Time
S3 Cross-Region Replication   Real-time   $$$             Minutes
Glacier Deep Archive          Monthly     $               Hours
EC2-based EBS Snapshots       Daily       $$              Minutes

Implement a CloudWatch alarm like this to catch backup failures; note that the S3 NumberOfObjects storage metric is reported once per day, so detection can lag by up to a day:


aws cloudwatch put-metric-alarm \
    --alarm-name Backup-Failure \
    --metric-name NumberOfObjects \
    --namespace AWS/S3 \
    --dimensions Name=BucketName,Value=backup-bucket Name=StorageType,Value=AllStorageTypes \
    --statistic Average \
    --period 86400 \
    --evaluation-periods 1 \
    --threshold 100 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:Backup-Alerts