Debugging EBS gp3 Throughput Credit Exhaustion in AWS RDS PostgreSQL Instances


When working with AWS RDS PostgreSQL on gp3 volumes, many developers assume the 125 MiB/s baseline throughput applies uniformly. However, the reality involves a burst bucket mechanism that can significantly impact performance:

# Sample CloudWatch query for EBS metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name EBSByteBalance% \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --start-time $(date -d "3 days ago" +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -d "now" +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average

gp3 volumes operate with a token bucket algorithm where (a quick arithmetic check of these figures follows the list):

  • Initial burst balance: 1,024,000,000 credits (equal to 3,000 MB/s for 5 minutes)
  • Accumulation rate: 3,000 MB/s per TB of volume size (600 MB/s for 200GB)
  • Minimum baseline: 125 MB/s regardless of volume size
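Taking the figures above at face value, the refill rate scales linearly with volume size; this is an illustrative shell check of that arithmetic, not an official formula:

# Refill rate implied by the accumulation figure above (illustrative only)
for VOLUME_GB in 200 400 1000; do
  echo "${VOLUME_GB}GB volume refills at $(( 3000 * VOLUME_GB / 1000 )) MB/s"
done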

When you see consistent EBSByteBalance% depletion, start by identifying the queries responsible for the I/O:

-- PostgreSQL query to identify high-I/O operations
-- (on PostgreSQL 12 and earlier, use total_time instead of total_exec_time)
SELECT query, calls, total_exec_time, rows,
       shared_blks_hit, shared_blks_read
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;

Then apply one or more of these remediations:

  1. Volume Scaling: Increase to 400GB+ for a higher baseline (500 MiB/s on RDS for PostgreSQL)
  2. Provisioned IOPS: Use the --iops parameter during modification (see the sketch after this list)
  3. Workload Distribution: Implement read replicas for analytic queries
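
As a sketch of option 2, one modification call can raise provisioned IOPS and, on gp3, throughput as well; the values below are illustrative, not recommendations, and --storage-throughput is specified in MiB/s:

# Raise provisioned IOPS and gp3 storage throughput together (example values)
aws rds modify-db-instance \
  --db-instance-identifier your-db-instance \
  --iops 6000 \
  --storage-throughput 400 \
  --apply-immediately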

Create a CloudWatch alarm on the credit balance:

aws cloudwatch put-metric-alarm \
  --alarm-name "RDS-EBS-Credit-Low" \
  --metric-name EBSByteBalance% \
  --namespace AWS/RDS \
  --statistic Average \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --threshold 20 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 3 \
  --period 300 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:MyAlarmNotification


When working with AWS RDS gp3 volumes, it's crucial to understand how the burst credit system operates. While gp3 volumes below 400GB do provide a baseline throughput of 125MiB/s, this is only available when you have sufficient burst credits in your EBSByteBalance%.


# Sample CloudWatch query to check credit balance
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name EBSByteBalance% \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ" --date="-3 days") \
  --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --period 3600 \
  --statistics Average \
  --output json
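
If you only want the raw numbers, the CLI's built-in JMESPath filter can trim this down; the filter string here is just one way to slice the output:

# Same query, reduced to the hourly averages in timestamp order
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name EBSByteBalance% \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ" --date="-3 days") \
  --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --period 3600 \
  --statistics Average \
  --query 'sort_by(Datapoints,&Timestamp)[].Average' \
  --output text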

The key misunderstanding here is that the 125MiB/s baseline isn't free: it still consumes credits when used. The baseline is simply the maximum throughput you can achieve while consuming credits at the standard rate. Your instance appears to be:

  • Consistently operating at 5-7MiB/s, which is above the true baseline (the throughput query below can confirm this)
  • Experiencing spikes that rapidly deplete credits
  • Not getting sufficient idle time to replenish credits
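
To confirm the sustained rate, pull the instance's actual throughput from CloudWatch; ReadThroughput and WriteThroughput are reported in bytes per second, and the instance identifier is a placeholder:

# Average read throughput (bytes/second) over the same three-day window
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReadThroughput \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ" --date="-3 days") \
  --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --period 3600 \
  --statistics Average
# Repeat with --metric-name WriteThroughput and add the two averages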

Here are three approaches to stabilize your RDS performance:


# Option 1: Increase volume size to cross the 400GB threshold for a higher baseline
aws rds modify-db-instance \
  --db-instance-identifier your-db-instance \
  --allocated-storage 400 \
  --apply-immediately

# Option 2: Provision additional IOPS (costs extra); 6000 is an example value
aws rds modify-db-instance \
  --db-instance-identifier your-db-instance \
  --iops 6000 \
  --apply-immediately

# Option 3: Implement a read replica for load distribution
aws rds create-db-instance-read-replica \
  --db-instance-identifier replica-instance \
  --source-db-instance-identifier your-db-instance

Set up proactive monitoring to prevent future incidents:


# CloudWatch alarm for credit balance
aws cloudwatch put-metric-alarm \
  --alarm-name RDS-Credit-Balance-Low \
  --alarm-description "EBS burst credits below 20%" \
  --metric-name EBSByteBalance% \
  --namespace AWS/RDS \
  --statistic Average \
  --period 300 \
  --threshold 20 \
  --comparison-operator LessThanThreshold \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:MyTopic
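
The --alarm-actions ARN assumes an SNS topic already exists; if not, a minimal setup looks like this (the topic name and email address are placeholders):

# Create the notification topic and subscribe an address to it
aws sns create-topic --name MyTopic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-west-2:123456789012:MyTopic \
  --protocol email \
  --notification-endpoint ops@example.com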

Beyond storage configuration, consider these database-level optimizations:

  • Review slow queries with pg_stat_statements
  • Adjust work_mem and maintenance_work_mem parameters (see the parameter group sketch below)
  • Implement connection pooling to reduce overhead
  • Schedule heavy operations during off-peak hours
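
On RDS, work_mem and maintenance_work_mem are changed through a custom DB parameter group rather than postgresql.conf; a minimal sketch, assuming a custom group named your-db-params is already attached to the instance (the value is illustrative, and work_mem is specified in kilobytes):

# Raise work_mem to 64MB (65536 KB); work_mem is dynamic, so it applies immediately
aws rds modify-db-parameter-group \
  --db-parameter-group-name your-db-params \
  --parameters "ParameterName=work_mem,ParameterValue=65536,ApplyMethod=immediate"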