Efficient Incremental Backups to S3: Technical Comparison of rsync, s3rsync, and Duplicity Implementations


Unlike a traditional filesystem, Amazon S3 is an object store accessed through stateless REST API calls; there is no server-side process that can participate in a delta-transfer protocol. This fundamental architectural difference creates unique challenges for the incremental backup operations that tools like rsync were specifically designed to handle.

Option 1: s3rsync wrapper approach


# Basic s3rsync usage example
s3rsync -a --delete /local/path/ s3://bucket/path/

This method essentially translates rsync operations into S3 API calls. While convenient, it introduces potential points of failure through the translation layer.
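
For a rough point of comparison, the AWS CLI's own sync command performs a similar one-way incremental copy, though it decides what to upload from object size and modification time rather than rsync-style checksums. A minimal equivalent of the command above, assuming the AWS CLI is installed and configured:


# Approximate AWS CLI equivalent (compares size/mtime, not checksums)
aws s3 sync --delete /local/path/ s3://bucket/path/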

Option 2: Duplicity's native S3 support


# Duplicity S3 backup command
duplicity /path/to/source s3://bucket_name/path/to/target

Duplicity stores signature files (librsync checksums of the backed-up files) alongside the backup volumes in S3 and keeps a cached copy locally. On subsequent runs it compares the current local files against these cached signatures to build incremental volumes, without having to re-read the remote objects.
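
For example, after the initial full run, later invocations against the same target URL (same placeholder paths as above) produce incrementals and let you inspect the resulting chain:


# Subsequent incremental run against the same target
duplicity incremental /path/to/source s3://bucket_name/path/to/target

# Inspect the chain of full and incremental backups stored in S3
duplicity collection-status s3://bucket_name/path/to/target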

The rsync approach requires:

  • Full directory tree scanning for each operation
  • HTTP HEAD requests to check existing objects
  • Multipart uploads for large files

Duplicity's method involves:

  • Local signature database maintenance
  • Periodic full backups (chain breaks; see the sketch after this list)
  • GPG encryption overhead
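
Those periodic fulls and the resulting chains can be managed explicitly; a minimal sketch, with the retention count and placeholder paths purely illustrative:


# Start a new chain automatically once the last full backup is older than 30 days
duplicity --full-if-older-than 30D /path/to/source s3://bucket_name/path/to/target

# Keep only the three most recent full chains and delete older ones
duplicity remove-all-but-n-full 3 --force s3://bucket_name/path/to/target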

Here's a Python snippet demonstrating a lightweight S3 incremental backup approach using boto3:


import boto3
import hashlib
import os

from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def calculate_etag(file_path):
    """MD5 of the file, matching the ETag S3 assigns to single-part,
    non-KMS-encrypted uploads."""
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b''):
            md5.update(chunk)
    return md5.hexdigest()

def sync_directory(local_path, bucket):
    for root, dirs, files in os.walk(local_path):
        for file in files:
            local_file = os.path.join(root, file)
            s3_key = os.path.relpath(local_file, local_path)

            # Skip the upload if the object exists and its ETag matches the local MD5
            try:
                head = s3.head_object(Bucket=bucket, Key=s3_key)
                if head['ETag'].strip('"') == calculate_etag(local_file):
                    continue  # Identical file already in S3
            except ClientError:
                pass  # Object doesn't exist remotely (or HEAD was denied)

            # Upload new or changed file
            s3.upload_file(local_file, bucket, s3_key)
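
One caveat: the ETag comparison above only holds for objects uploaded in a single part (and not with SSE-KMS). For multipart uploads, S3's ETag is the MD5 of the concatenated per-part digests with a part-count suffix. A sketch of that calculation, extending the script above; the chunk size must match the part size used at upload time (boto3's default is 8 MB):


def calculate_multipart_etag(file_path, chunk_size=8 * 1024 * 1024):
    """Approximate the ETag of a multipart upload: MD5 of the concatenated
    per-part MD5 digests, suffixed with the number of parts."""
    part_digests = []
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            part_digests.append(hashlib.md5(chunk).digest())
    if not part_digests:
        return hashlib.md5(b'').hexdigest()  # empty file: plain MD5 ETag
    if len(part_digests) == 1:
        return part_digests[0].hex()  # single-part uploads get a plain MD5 ETag
    combined = hashlib.md5(b''.join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"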

When to choose s3rsync:

  • Require exact rsync behavior
  • Have existing rsync workflows
  • Can accept third-party dependencies

When Duplicity shines:

  • Need native S3 support
  • Want encryption capabilities
  • Prefer self-contained solution

For large-scale operations, AWS S3 Batch Operations can process millions of objects; its jobs consume a CSV or S3 Inventory manifest rather than a rule document. The related rule-filter pattern below (as used in S3 replication and lifecycle configurations, abridged here) shows how objects can be scoped by prefix and tag:


# Rule filter scoping objects by prefix and tag (abridged)
{
  "Rules": [
    {
      "ID": "BackupRule",
      "Status": "Enabled",
      "Filter": {
        "And": {
          "Prefix": "photos/",
          "Tags": [{"Key": "Backup", "Value": "Incremental"}]
        }
      }
    }
  ]
}
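
An actual S3 Batch Operations manifest, by contrast, is just a CSV listing the objects to process, one bucket/key pair per line (with an optional version ID); the bucket and keys here are illustrative:


# Example CSV manifest for an S3 Batch Operations job
my-bucket,photos/2023/img_0001.jpg
my-bucket,photos/2023/img_0002.jpg
my-bucket,photos/2023/img_0003.jpg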

When backing up large datasets like image repositories to S3, we face a fundamental protocol mismatch. rsync's delta transfer needs an rsync process on the remote end (reached over SSH or the rsync daemon) to compute and exchange block checksums, while S3 exposes only whole-object HTTP operations. This creates several technical constraints:


// Pseudo-code of rsync's delta algorithm
function calculate_delta(local_file, remote_file) {
  // Requires remote file access and checksum comparison
  // S3 has no native protocol for this operation
}
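
To make this concrete, here is a simplified Python sketch (not rsync's actual implementation) of the rolling weak checksum the sender slides over its file to find blocks the receiver already holds; with S3 there is no remote process to play the receiver's role:


# Simplified rolling weak checksum in the spirit of rsync's algorithm
M = 1 << 16

def weak_checksum(block):
    """Checksum of a full block: a = byte sum, b = position-weighted sum."""
    a = sum(block) % M
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b

# The rolled value matches a fresh computation over the shifted window
data = b"example data for a rolling window"
a, b = weak_checksum(data[0:8])
a, b = roll(a, b, data[0], data[8], 8)
assert (a, b) == weak_checksum(data[1:9])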

The s3rsync method essentially wraps the traditional rsync protocol with S3 compatibility layers. Under the hood:

  • Maintains local manifest files tracking file states
  • Uses multipart uploads for large files
  • Implements custom checksum comparison

# Example s3rsync command
s3rsync --checksum --delete /local/path/ s3://bucket/path/

Duplicity takes a different architectural approach by:

  • Storing incremental chain metadata in S3 itself
  • Using GPG encryption by default
  • Implementing its own diff algorithm

# Duplicity backup command example
duplicity full --encrypt-key=ABCD1234 /data s3://bucket/backup
duplicity incremental --encrypt-key=ABCD1234 /data s3://bucket/backup

Metric           s3rsync                     Duplicity
Initial Backup   Faster (parallel uploads)   Slower (full encryption)
Incrementals     Requires local cache        Self-contained in S3
Restores         Simple file retrieval       Must rebuild from chain

For production-grade implementations, consider these optimizations:


# AWS CLI configuration for multipart transfers
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB

# Throttle Duplicity's upload bandwidth by wrapping it with trickle (KB/s)
trickle -u 1024 duplicity /data s3://bucket/backup
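
The same multipart tuning applies to the boto3 script shown earlier; a sketch using boto3's TransferConfig, with illustrative file and bucket names:


import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Mirror the CLI settings above: 64 MB threshold, 16 MB parts
transfer_config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=16 * 1024 * 1024,
    max_concurrency=8,  # parallel part uploads per file
)

# Drop-in replacement for the upload call in sync_directory() above
s3.upload_file('/data/photos/img_0001.jpg', 'my-bucket', 'photos/img_0001.jpg',
               Config=transfer_config)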

For large-scale operations, you might implement your own delta detection by comparing local checksums against a listing of what is already in S3:


#!/bin/bash
# Generate local checksums
find /data -type f -exec md5sum {} + > local_checksums.txt

# Compare with S3 inventory
aws s3api list-objects-v2 --bucket my-bucket \
  --query "Contents[].{Key:Key,ETag:ETag}" \
  --output json > s3_inventory.json

# Custom delta calculation
python3 calculate_deltas.py local_checksums.txt s3_inventory.json
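
The calculate_deltas.py helper is referenced but not shown above; a minimal, hypothetical sketch of it, assuming single-part uploads (so ETags are plain MD5s) and keys that mirror paths under /data:


import json
import sys

def load_local(path, prefix='/data/'):
    """Parse md5sum output into {key: md5}, stripping the local prefix."""
    local = {}
    with open(path) as f:
        for line in f:
            md5, _, filename = line.strip().partition('  ')
            local[filename[len(prefix):]] = md5
    return local

def load_remote(path):
    """Parse the list-objects-v2 JSON into {key: etag}."""
    with open(path) as f:
        objects = json.load(f) or []
    return {obj['Key']: obj['ETag'].strip('"') for obj in objects}

if __name__ == '__main__':
    local = load_local(sys.argv[1])
    remote = load_remote(sys.argv[2])
    # Anything missing remotely or with a different checksum needs uploading
    for key, md5 in sorted(local.items()):
        if remote.get(key) != md5:
            print(key)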