When working with AWS S3 for cryptocurrency data storage, I encountered a puzzling billing scenario. Despite maintaining minimal actual storage (around 0.5GB), my AWS bill showed storage consumption of nearly 4TB. Here's what I discovered about this common but often misunderstood situation.
AWS calculates storage costs based on TimedStorage-ByteHrs, which measures the cumulative byte-hours of storage used during the billing period. The formula essentially works like this:
Total Storage Cost = (total byte-hours accumulated over the month) / (1024^3) / (hours in the month) * price_per_GB_month
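For intuition, here is a rough conversion sketch in Python; the 730 hours per month and the price per GB-month are illustrative assumptions, not your actual rate:
# Rough sketch: turn accumulated byte-hours into an approximate monthly charge.
# HOURS_PER_MONTH and PRICE_PER_GB_MONTH are illustrative assumptions.
HOURS_PER_MONTH = 730
PRICE_PER_GB_MONTH = 0.023  # example S3 Standard rate, not a quote

def monthly_storage_cost(byte_hours: float) -> float:
    gb_hours = byte_hours / 1024**3          # bytes -> GB
    gb_months = gb_hours / HOURS_PER_MONTH   # average GB held over the month
    return gb_months * PRICE_PER_GB_MONTH

# e.g. roughly 4 TB held for the entire month
print(monthly_storage_cost(4 * 1024**4 * HOURS_PER_MONTH))  # ~94 USD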
The most likely explanation for the discrepancy is S3 versioning. When versioning is enabled:
- Every object modification creates a new version
- All versions contribute to storage calculations
- Deleting an object only adds a "delete marker"; the older versions stay (and keep accruing charges) until they are permanently removed
For my cryptocurrency data pipeline with frequent CSV updates, this meant:
# Example of version accumulation
for i in {1..1440}; do
  aws s3 cp data.csv s3://my-bucket/data.csv  # Creates new version each time
done

# Actual storage might show:
aws s3 ls --summarize --human-readable --recursive s3://my-bucket
# => Total Objects: 1 (shows only current version)
# => Total Size: 10MB
To check for versioned objects:
aws s3api list-object-versions --bucket my-bucket --query 'Versions[].{Key:Key,Size:Size}'
This might reveal hundreds or thousands of versions for your frequently updated files.
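To see how much all of those versions add up to, here is a rough boto3 sketch (the bucket name is a placeholder):
# Total bytes held by *all* versions, not just the current ones
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_object_versions")

total = 0
for page in paginator.paginate(Bucket="my-bucket"):
    for version in page.get("Versions", []):
        total += version["Size"]

print(f"All versions combined: {total / 1024**3:.2f} GB")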
Option 1: Suspend Versioning
If version history isn't required (note that versioning can never be fully disabled once enabled, only suspended):
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Suspended
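Suspending only stops new versions from being created; versions that already exist keep accruing charges until you delete them. Here is a rough boto3 sketch for purging noncurrent versions; the bucket name is a placeholder and the deletion is permanent, so test it on non-critical data first:
# Permanently delete noncurrent versions (destructive; bucket name is a placeholder)
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_object_versions")

for page in paginator.paginate(Bucket="my-bucket"):
    for version in page.get("Versions", []):
        if not version["IsLatest"]:  # keep the current version of each key
            s3.delete_object(
                Bucket="my-bucket",
                Key=version["Key"],
                VersionId=version["VersionId"],
            )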
Option 2: Implement Lifecycle Rules
For cases where versioning is needed but costs must be controlled:
{
  "Rules": [
    {
      "ID": "RemoveOldVersions",
      "Status": "Enabled",
      "Prefix": "",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 1
      }
    }
  ]
}
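To apply it, one option is a short boto3 call; this sketch assumes the rule above is saved locally as lifecycle.json and uses a placeholder bucket name:
# Apply the lifecycle rule above (assumes it is saved as lifecycle.json)
import json

import boto3

with open("lifecycle.json") as f:
    lifecycle = json.load(f)

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration=lifecycle,
)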
Option 3: Alternative Storage Pattern
Instead of overwriting the same file, consider timestamped filenames:
# Python example
import datetime
import subprocess

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"data_{timestamp}.csv"
aws_path = f"s3://my-bucket/{filename}"

# Each upload goes to a unique key, so no single object accumulates versions
subprocess.run(["aws", "s3", "cp", "data.csv", aws_path], check=True)
Set up AWS Cost Explorer with these filters:
- Service: Amazon S3
- Usage Type: TimedStorage-ByteHrs
- Group By: API Operation
This helps identify exactly which operations contribute most to storage costs.
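The same breakdown is available programmatically through the Cost Explorer API. A sketch with boto3, where the dates are placeholders and usage amounts for TimedStorage line items are typically reported in the GB-month pricing unit:
# Pull S3 usage grouped by usage type from Cost Explorer (dates are placeholders)
import boto3

ce = boto3.client("ce")
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UsageQuantity", "UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Simple Storage Service"],
        }
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    print(group["Keys"], group["Metrics"]["UsageQuantity"]["Amount"])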
While versioning is the most common culprit, also consider:
- Multipart uploads that weren't completed properly
- S3 replication configurations
- Glacier Deep Archive transition rules
Always cross-verify with both the AWS console and CLI tools for complete visibility.
When examining AWS S3 billing, the key metric to understand is TimedStorage-ByteHrs. This measures storage consumption aggregated over time, not instantaneous usage. Let me break down the math for your specific case:
// Sample calculation for 15-minute interval CSV updates
const dailyWrites = 24 * (60 / 15);  // 96 writes/day
const fileSize = 10 * 1024 * 1024;   // 10MB in bytes
const dailyByteHours = dailyWrites * fileSize * 0.25;  // Hours active
// 96 * 10,485,760 * 0.25 = ~251,658,240 byte-hours/day
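That figure assumes each overwrite replaces the previous one. With versioning enabled, every prior version keeps accruing byte-hours as well. A rough illustration in Python, using the same assumed write rate and file size over a 30-day month:
# Rough illustration: byte-hours compound when every version is retained
writes_per_day = 96
file_size = 10 * 1024 * 1024  # 10MB
days = 30

byte_hours = 0
versions = 0
for day in range(days):
    versions += writes_per_day               # old versions never go away
    byte_hours += versions * file_size * 24  # every retained version bills all day

print(f"{byte_hours / 1024**3 / 730:.1f} GB-months")  # roughly 14 GB-months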
Common culprits for inflated storage metrics include:
- Object Versioning: Off by default, but often switched on by replication requirements, backup tooling, or infrastructure templates
- Storage Class Transitions: Objects moving between S3 Standard/IA/Glacier
- Incomplete Multipart Uploads: Leftover parts consuming space
Check versioning status with AWS CLI:
aws s3api get-bucket-versioning --bucket YOUR_BUCKET_NAME
Create this Python script to audit actual storage usage:
import boto3

def check_bucket_usage(bucket_name):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)

    total_size = 0
    last_modified = None
    # Note: this iterates only the *current* versions, so it won't reflect
    # storage held by noncurrent versions in a versioned bucket.
    for obj in bucket.objects.all():
        total_size += obj.size
        if last_modified is None or obj.last_modified > last_modified:
            last_modified = obj.last_modified

    print(f"Actual storage used: {total_size/1024/1024:.2f} MB")
    print(f"Last modified object: {last_modified}")

check_bucket_usage('your-bucket-name')
- Implement S3 Lifecycle Policies to automatically transition/expire objects
- Set up CloudWatch Metrics for storage tracking:
# Alarm when the daily average BucketSizeBytes exceeds 1GB (1073741824 bytes)
aws cloudwatch put-metric-alarm \
  --alarm-name "S3-Storage-Spike" \
  --metric-name BucketSizeBytes \
  --namespace AWS/S3 \
  --statistic Average \
  --period 86400 \
  --evaluation-periods 1 \
  --threshold 1073741824 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=BucketName,Value=your-bucket Name=StorageType,Value=StandardStorage
- Regularly clean up failed multipart uploads (see the sketch below)
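A rough boto3 sketch for that last point; the bucket name and the seven-day cutoff are placeholder assumptions:
# Find and abort incomplete multipart uploads older than a week
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

paginator = s3.get_paginator("list_multipart_uploads")
for page in paginator.paginate(Bucket="your-bucket"):
    for upload in page.get("Uploads", []):
        if upload["Initiated"] < cutoff:
            s3.abort_multipart_upload(
                Bucket="your-bucket",
                Key=upload["Key"],
                UploadId=upload["UploadId"],
            )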
For your cryptocurrency data collection system, consider this optimized architecture:
# Sample Lambda function for optimized S3 writes
# (pandas needs s3fs and pyarrow available to read/write s3:// paths)
from datetime import datetime

import boto3
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket = 'your-bucket'

    # Aggregate the staged CSVs before writing (event['csv_files'] holds object keys)
    df = pd.concat(
        [pd.read_csv(f"s3://{bucket}/{key}") for key in event['csv_files']]
    )

    # Write a single compressed Parquet file instead of many small CSVs
    df.to_parquet(
        f"s3://{bucket}/{datetime.now().isoformat()}.parquet.gzip",
        compression='gzip',
    )

    # Clean up the temporary CSV objects
    for key in event['csv_files']:
        s3.delete_object(Bucket=bucket, Key=key)