When dealing with multi-terabyte PostgreSQL databases, traditional backup approaches quickly become problematic. Writing a pg_dump file to local disk before uploading it requires temporary space comparable to the size of the dump - an expensive requirement when working with cloud instances.
Direct streaming to S3 offers three key advantages:
- Eliminates local storage bottlenecks
- Shortens the backup window by overlapping the dump and the upload
- Provides immediate cloud storage without intermediate steps
Method 1: Using pipes with AWS CLI
pg_dump -U username -d dbname -Fc | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).dump
Key points:
- -Fc specifies the custom format (compressed by default)
- The pipe (|) streams the dump directly to S3 without temp files
- The date stamp ensures unique filenames
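One caveat about very large streams: when aws s3 cp reads from stdin it has to choose a multipart chunk size without knowing the final object size, and the AWS CLI documents a --expected-size hint for streams larger than roughly 50 GB. A minimal sketch, assuming a dump of around 200 GB (the size and bucket name are placeholders, not values from this setup):
# Hint the stream size so the CLI can pick a large enough part size
# (~200 GB expressed in bytes is only an example)
pg_dump -U username -d dbname -Fc | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).dump \
    --expected-size 214748364800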
Method 2: Multi-part Upload for Huge Databases
# Note: split writes the parts to local disk before they are uploaded
pg_dump -U username -d dbname | \
split --bytes=500MB - backup_part_ && \
for part in backup_part_*; do
    aws s3 cp "$part" s3://your-bucket/multipart/"$part"
done
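Since the parts are a plain-format SQL dump, restoring means downloading them and replaying them through psql in order (split's default suffixes sort lexicographically, so the shell glob preserves the order). A rough sketch; the bucket path and target database name are illustrative:
# Pull the parts back down, then replay them in order
# (restored_db is a placeholder target database)
aws s3 cp s3://your-bucket/multipart/ . --recursive && \
cat backup_part_* | psql -U username -d restored_db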
Network bandwidth often becomes the limiting factor. Consider these optimizations:
- Use pg_dump --jobs=N for parallel dumps (PostgreSQL 9.3+); parallel dumps require the directory format, so they cannot be piped - see the sketch after this list
- Compress during transfer: pg_dump | gzip | aws s3 cp -
- For EC2 instances, use S3 VPC endpoints to keep backup traffic on the AWS network
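Because a parallel dump can only write the directory format, the usual workaround is to dump to fast local or scratch storage and then upload the directory. A sketch under those assumptions; the scratch path, job count, and S3 prefix are placeholders:
# Parallel dump to directory format (cannot be piped), then upload the directory
# (/mnt/scratch, -j 4, and the S3 prefix are illustrative)
pg_dump -Fd -j 4 -U username -d dbname -f /mnt/scratch/dbname_dump && \
aws s3 sync /mnt/scratch/dbname_dump s3://your-bucket/parallel/$(date +%Y%m%d)/ && \
rm -rf /mnt/scratch/dbname_dump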
Always:
- Use IAM roles instead of access keys
- Enable S3 server-side encryption
- Set proper bucket policies
- Consider pg_dump --blobs if large objects (binary data) need to be included in the dump
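If the bucket does not already enforce encryption by default, the upload itself can request it. A minimal sketch using the CLI's --sse flags; the KMS key alias is a placeholder for whatever key you actually use:
# Request SSE-KMS on the uploaded object (alias/db-backups is a placeholder)
pg_dump -U username -d dbname -Fc | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).dump \
    --sse aws:kms --sse-kms-key-id alias/db-backups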
For enterprise-grade solutions, consider wal-g, which provides:
- Incremental backups
- Point-in-time recovery
- Built-in compression/encryption
wal-g backup-push /var/lib/postgresql/9.6/main \
--config=/etc/wal-g/config.json
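Restores go through the same tool. A minimal sketch, assuming wal-g is pointed at the same bucket and data directory as the push above:
# List available backups, then fetch the most recent one into the data directory
wal-g backup-list --config=/etc/wal-g/config.json
wal-g backup-fetch /var/lib/postgresql/9.6/main LATEST \
    --config=/etc/wal-g/config.json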
When dealing with large PostgreSQL databases (9.6.3 in this case), traditional backup methods that write dump files to local storage before transferring to S3 become problematic. The primary issues are:
- Local disk space consumption during backup operations
- Extended backup windows due to the two-step process (dump, then upload)
- Potential storage bottlenecks on the database server
The optimal approach leverages Unix pipes to stream pg_dump output directly to S3 without intermediate storage. Here's the technical breakdown:
# Basic streaming command structure
pg_dump -Fc -Z 9 -h localhost -U username dbname | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).dump
For production environments, we should enhance reliability and monitoring:
#!/bin/bash
# Fail on unset variables and make the pipeline's status reflect a pg_dump
# failure, not just the exit code of aws s3 cp
set -euo pipefail

# Configuration
DB_HOST="localhost"
DB_USER="backup_user"
DB_NAME="production_db"
S3_BUCKET="db-backups-$(date +%Y-%m)"
BACKUP_FILE="pgdump_$(date +%Y%m%d_%H%M%S).custom"

# Execute streaming backup
if pg_dump -Fc -Z 9 -h "$DB_HOST" -U "$DB_USER" "$DB_NAME" | \
   aws s3 cp - "s3://${S3_BUCKET}/${BACKUP_FILE}"; then
    echo "Backup successful: ${BACKUP_FILE}"
    # Add SNS notification or CloudWatch metric here
else
    echo "Backup failed!" >&2
    exit 1
fi
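To fill in the "Add SNS notification" placeholder and run the script unattended, something along these lines works; the topic ARN, script path, and schedule are placeholders, not values from this setup:
# Inside the success branch: publish a notification (topic ARN is a placeholder)
aws sns publish \
    --topic-arn arn:aws:sns:us-east-1:123456789012:db-backups \
    --message "Backup successful: ${BACKUP_FILE}"

# Crontab entry to run the backup nightly at 02:00 and keep a log
0 2 * * * /usr/local/bin/pg_s3_backup.sh >> /var/log/pg_backup.log 2>&1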
Performance considerations:
- Compression: -Z 9 gives maximum compression (it compresses table data in the custom and directory formats, and gzips the entire output in plain format); if compression becomes the CPU bottleneck, see the sketch after this list
- Network: Ensure sufficient bandwidth between the DB server and S3
- Parallelism: For very large DBs, consider pg_dump's -j option (directory format only, so it cannot be streamed)
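If single-threaded compression turns out to be the CPU bottleneck, a parallel compressor such as pigz keeps the streaming approach while using more cores. pigz is a substitution not mentioned above, so treat this as an optional variant:
# Parallel gzip compression of a plain-format dump (pigz and -p 4 are illustrative)
pg_dump -h localhost -U username dbname | pigz -p 4 | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).sql.gz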
If AWS CLI isn't available, s3cmd works similarly:
pg_dump -Fc dbname | s3cmd put - s3://bucket/backup.dump
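s3cmd also exposes a multipart chunk-size knob for large streamed uploads; a rough sketch, assuming s3cmd has already been set up with credentials (s3cmd --configure):
# Larger chunks reduce the number of parts for very big dumps
# (the 100 MB chunk size is only an example)
pg_dump -Fc dbname | s3cmd put - s3://bucket/backup.dump --multipart-chunk-size-mb=100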
Always validate your backups:
aws s3 cp s3://your-bucket/backup.dump - | pg_restore -l
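Listing the table of contents only proves the archive is readable; a stronger periodic check is to restore into a scratch database. A sketch, with restore_test as a throwaway database name:
# Stream the archive straight into a disposable database, then drop it
createdb restore_test
aws s3 cp s3://your-bucket/backup.dump - | pg_restore -d restore_test --no-owner
dropdb restore_test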
- Use IAM roles instead of access keys when possible
- Enable S3 server-side encryption (SSE-S3 or SSE-KMS)
- Use S3 lifecycle rules for backup retention, and bucket policies to control who can read or delete the backups
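As noted above, retention is best enforced with a lifecycle rule rather than a bucket policy. A minimal sketch that expires objects under a backups/ prefix after 30 days; the prefix and retention period are examples only:
# Expire backups after 30 days (prefix and day count are illustrative)
aws s3api put-bucket-lifecycle-configuration \
    --bucket your-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "expire-old-backups",
        "Status": "Enabled",
        "Filter": {"Prefix": "backups/"},
        "Expiration": {"Days": 30}
      }]
    }'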