When dealing with multi-terabyte PostgreSQL databases, traditional backup approaches quickly become problematic. Writing a pg_dump file to local disk before uploading it requires temporary space comparable to the size of the dump - an expensive requirement when working with cloud instances.
Direct streaming to S3 offers three key advantages:
- Eliminates local storage bottlenecks
- Shortens the backup window by overlapping the dump and the upload
- Provides immediate cloud storage without intermediate steps
Method 1: Using pipes with AWS CLI
pg_dump -U username -d dbname -Fc | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).dump
Key points:
- -Fc specifies the custom format (compressed by default)
- The pipe (|) streams the dump directly to S3 without temp files
- The date stamp ensures unique filenames
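One caveat about very large streams: when aws s3 cp reads from stdin it has to choose a multipart chunk size without knowing the final object size, and the AWS CLI documents a --expected-size hint for streams larger than roughly 50 GB. A minimal sketch, assuming a dump of around 200 GB (the size and bucket name are placeholders, not values from this setup):
# Hint the stream size so the CLI can pick a large enough part size
# (~200 GB expressed in bytes is only an example)
pg_dump -U username -d dbname -Fc | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).dump \
    --expected-size 214748364800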
Method 2: Multi-part Upload for Huge Databases
# Note: split writes the parts to local disk before they are uploaded
pg_dump -U username -d dbname | \
split --bytes=500MB - backup_part_ && \
for part in backup_part_*; do
    aws s3 cp "$part" s3://your-bucket/multipart/"$part"
done
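Since the parts are a plain-format SQL dump, restoring means downloading them and replaying them through psql in order (split's default suffixes sort lexicographically, so the shell glob preserves the order). A rough sketch; the bucket path and target database name are illustrative:
# Pull the parts back down, then replay them in order
# (restored_db is a placeholder target database)
aws s3 cp s3://your-bucket/multipart/ . --recursive && \
cat backup_part_* | psql -U username -d restored_db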
Network bandwidth often becomes the limiting factor. Consider these optimizations:
- Use pg_dump --jobs=N for parallel dumps (PostgreSQL 9.3+); parallel dumps require the directory format, so they cannot be piped - see the sketch after this list
- Compress during transfer: pg_dump | gzip | aws s3 cp -
- For EC2 instances, use S3 VPC endpoints to keep backup traffic on the AWS network
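Because a parallel dump can only write the directory format, the usual workaround is to dump to fast local or scratch storage and then upload the directory. A sketch under those assumptions; the scratch path, job count, and S3 prefix are placeholders:
# Parallel dump to directory format (cannot be piped), then upload the directory
# (/mnt/scratch, -j 4, and the S3 prefix are illustrative)
pg_dump -Fd -j 4 -U username -d dbname -f /mnt/scratch/dbname_dump && \
aws s3 sync /mnt/scratch/dbname_dump s3://your-bucket/parallel/$(date +%Y%m%d)/ && \
rm -rf /mnt/scratch/dbname_dump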
Always:
- Use IAM roles instead of access keys
- Enable S3 server-side encryption
- Set proper bucket policies
- Consider pg_dump --blobs if large objects (binary data) need to be included in the dump
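If the bucket does not already enforce encryption by default, the upload itself can request it. A minimal sketch using the CLI's --sse flags; the KMS key alias is a placeholder for whatever key you actually use:
# Request SSE-KMS on the uploaded object (alias/db-backups is a placeholder)
pg_dump -U username -d dbname -Fc | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).dump \
    --sse aws:kms --sse-kms-key-id alias/db-backups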
For enterprise-grade solutions, consider wal-g, which provides:
- Incremental backups
- Point-in-time recovery
- Built-in compression/encryption
wal-g backup-push /var/lib/postgresql/9.6/main \
--config=/etc/wal-g/config.json
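Restores go through the same tool. A minimal sketch, assuming wal-g is pointed at the same bucket and data directory as the push above:
# List available backups, then fetch the most recent one into the data directory
wal-g backup-list --config=/etc/wal-g/config.json
wal-g backup-fetch /var/lib/postgresql/9.6/main LATEST \
    --config=/etc/wal-g/config.json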
When dealing with large PostgreSQL databases (9.6.3 in this case), traditional backup methods that write dump files to local storage before transferring to S3 become problematic. The primary issues are:
- Local disk space consumption during backup operations
- Extended backup windows due to the two-step process (dump, then upload)
- Potential storage bottlenecks on the database server
The optimal approach leverages Unix pipes to stream pg_dump output directly to S3 without intermediate storage. Here's the technical breakdown:
# Basic streaming command structure
pg_dump -Fc -Z 9 -h localhost -U username dbname | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).dump
For production environments, we should enhance reliability and monitoring:
#!/bin/bash
# Fail on unset variables and make the pipeline's status reflect a pg_dump
# failure, not just the exit code of aws s3 cp
set -euo pipefail

# Configuration
DB_HOST="localhost"
DB_USER="backup_user"
DB_NAME="production_db"
S3_BUCKET="db-backups-$(date +%Y-%m)"
BACKUP_FILE="pgdump_$(date +%Y%m%d_%H%M%S).custom"

# Execute streaming backup
if pg_dump -Fc -Z 9 -h "$DB_HOST" -U "$DB_USER" "$DB_NAME" | \
   aws s3 cp - "s3://${S3_BUCKET}/${BACKUP_FILE}"; then
    echo "Backup successful: ${BACKUP_FILE}"
    # Add SNS notification or CloudWatch metric here
else
    echo "Backup failed!" >&2
    exit 1
fi
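To fill in the "Add SNS notification" placeholder and run the script unattended, something along these lines works; the topic ARN, script path, and schedule are placeholders, not values from this setup:
# Inside the success branch: publish a notification (topic ARN is a placeholder)
aws sns publish \
    --topic-arn arn:aws:sns:us-east-1:123456789012:db-backups \
    --message "Backup successful: ${BACKUP_FILE}"

# Crontab entry to run the backup nightly at 02:00 and keep a log
0 2 * * * /usr/local/bin/pg_s3_backup.sh >> /var/log/pg_backup.log 2>&1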
Performance considerations:
- Compression: -Z 9 gives maximum compression (it compresses table data in the custom and directory formats, and gzips the entire output in plain format); if compression becomes the CPU bottleneck, see the sketch after this list
- Network: Ensure sufficient bandwidth between the DB server and S3
- Parallelism: For very large DBs, consider pg_dump's -j option (directory format only, so it cannot be streamed)
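If single-threaded compression turns out to be the CPU bottleneck, a parallel compressor such as pigz keeps the streaming approach while using more cores. pigz is a substitution not mentioned above, so treat this as an optional variant:
# Parallel gzip compression of a plain-format dump (pigz and -p 4 are illustrative)
pg_dump -h localhost -U username dbname | pigz -p 4 | \
aws s3 cp - s3://your-bucket/backup-$(date +%Y%m%d).sql.gz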
If AWS CLI isn't available, s3cmd works similarly:
pg_dump -Fc dbname | s3cmd put - s3://bucket/backup.dump
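s3cmd also exposes a multipart chunk-size knob for large streamed uploads; a rough sketch, assuming s3cmd has already been set up with credentials (s3cmd --configure):
# Larger chunks reduce the number of parts for very big dumps
# (the 100 MB chunk size is only an example)
pg_dump -Fc dbname | s3cmd put - s3://bucket/backup.dump --multipart-chunk-size-mb=100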
Always validate your backups:
aws s3 cp s3://your-bucket/backup.dump - | pg_restore -l
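Listing the table of contents only proves the archive is readable; a stronger periodic check is to restore into a scratch database. A sketch, with restore_test as a throwaway database name:
# Stream the archive straight into a disposable database, then drop it
createdb restore_test
aws s3 cp s3://your-bucket/backup.dump - | pg_restore -d restore_test --no-owner
dropdb restore_test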
- Use IAM roles instead of access keys when possible
- Enable S3 server-side encryption (SSE-S3 or SSE-KMS)
- Use S3 lifecycle rules for backup retention, and bucket policies to control who can read or delete the backups
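As noted above, retention is best enforced with a lifecycle rule rather than a bucket policy. A minimal sketch that expires objects under a backups/ prefix after 30 days; the prefix and retention period are examples only:
# Expire backups after 30 days (prefix and day count are illustrative)
aws s3api put-bucket-lifecycle-configuration \
    --bucket your-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "expire-old-backups",
        "Status": "Enabled",
        "Filter": {"Prefix": "backups/"},
        "Expiration": {"Days": 30}
      }]
    }'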