Optimizing ElastiCache Redis: Preventing Swap Usage and Unexpected Restarts in Production Environments


When examining the swap behavior in AWS ElastiCache Redis instances, we observe a critical pattern: swap usage spikes trigger automatic node restarts, causing complete cache flushing. This manifests in CloudWatch metrics as sudden drops in BytesUsedForCache correlating with SwapUsage peaks.

Sample CloudWatch metric pattern:
BytesUsedForCache (stable) → SwapUsage (grows) 
→ NodeRestart event → BytesUsedForCache (drops to 0)

The cache.r3.2xlarge instance type provides 62495129600 bytes (about 58 GiB) of maxmemory. Key Redis configurations include:

  • maxmemory-policy allkeys-lru (appropriate for cache-only workloads)
  • reserved-memory 2500000000 (2.5 GB set aside for overhead)
  • A healthy mem_fragmentation_ratio between 1.00 and 1.05
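
To confirm what a running cluster is actually using, these settings can be read back from its cache parameter group. A minimal boto3 sketch, assuming a hypothetical parameter group name my-redis-params:

import boto3

client = boto3.client("elasticache")
wanted = {"maxmemory-policy", "reserved-memory", "maxmemory-samples"}

# Page through the parameter group and print the values discussed above.
paginator = client.get_paginator("describe_cache_parameters")
for page in paginator.paginate(CacheParameterGroupName="my-redis-params"):
    for param in page["Parameters"]:
        if param["ParameterName"] in wanted:
            print(param["ParameterName"], "=", param.get("ParameterValue"))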

Contrary to AWS's claim that Redis shouldn't swap, we've identified several potential triggers:

1. Memory pressure from host OS (not just Redis process)
2. Over-aggressive swappiness settings in the ElastiCache AMI
3. Memory allocation patterns during eviction
4. Transient spikes from monitoring processes

Memory Tuning Approach:

# Redis configuration adjustments (applied via a custom cache parameter group,
# since ElastiCache restricts the CONFIG command on managed nodes)
config set maxmemory-policy allkeys-lru
config set maxmemory-samples 5         # fewer samples per eviction: faster, slightly less precise LRU
config set activedefrag yes            # active defragmentation (Redis 4.0+)
config set lazyfree-lazy-eviction yes  # evict asynchronously to avoid blocking
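
On ElastiCache the equivalent change goes through a parameter group rather than CONFIG SET. A minimal boto3 sketch, assuming a custom (non-default) group named my-redis-params attached to the cluster and an engine version that exposes these parameters:

import boto3

client = boto3.client("elasticache")

# Hypothetical custom parameter group; default groups cannot be modified.
client.modify_cache_parameter_group(
    CacheParameterGroupName="my-redis-params",
    ParameterNameValues=[
        {"ParameterName": "maxmemory-policy", "ParameterValue": "allkeys-lru"},
        {"ParameterName": "maxmemory-samples", "ParameterValue": "5"},
        {"ParameterName": "activedefrag", "ParameterValue": "yes"},
        {"ParameterName": "lazyfree-lazy-eviction", "ParameterValue": "yes"},
    ],
)

Some parameters take effect immediately while others only apply after a node restart, so check each parameter's change type before rolling this out.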

Proactive Monitoring Script:

#!/bin/bash
# Alert when ElastiCache swap usage crosses a threshold (run from cron or similar).
SWAP_THRESHOLD=1073741824  # 1GB in bytes
SWAP_USAGE=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/ElastiCache \
  --metric-name SwapUsage \
  --dimensions Name=CacheClusterId,Value=your-cluster-id \
  --statistics Maximum \
  --period 60 \
  --start-time "$(date -u +"%Y-%m-%dT%H:%M:%SZ" --date '-5 minutes')" \
  --end-time "$(date -u +"%Y-%m-%dT%H:%M:%SZ")" \
  --query 'max(Datapoints[].Maximum)' \
  --output text)

# No datapoints in the window comes back as "None"
if [ -z "$SWAP_USAGE" ] || [ "$SWAP_USAGE" = "None" ]; then
  exit 0
fi

if (( $(echo "$SWAP_USAGE > $SWAP_THRESHOLD" | bc -l) )); then
  # Trigger proactive cache warming or scaling
  echo "High swap detected: $SWAP_USAGE" | mail -s "Redis Swap Alert" admin@example.com
fi

For production-critical caches:

  • Implement Redis Cluster with sharding
  • Enable Multi-AZ with automatic failover
  • Set up backup retention policies
  • Consider memory-optimized instance types (cache.r6g, Graviton2-based)

Diagnostic checklist when investigating swap growth:

  1. Verify vm.swappiness on the Redis node's OS (requires AWS support access)
  2. Monitor used_memory_peak against maxmemory (see the sketch after this list)
  3. Check for memory leaks in custom Lua scripts
  4. Review client connection patterns
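
For item 2, a minimal redis-py sketch along these lines works from any host that can reach the cluster (the endpoint is a placeholder):

import redis

# Placeholder endpoint; use your cluster's primary endpoint.
r = redis.Redis(host="your-cluster-endpoint", port=6379)

mem = r.info("memory")
maxmemory = mem.get("maxmemory", 0)
peak = mem.get("used_memory_peak", 0)

# A historical peak close to maxmemory means evictions (and swap risk) are likely.
if maxmemory and peak / maxmemory > 0.9:
    print(f"used_memory_peak is {peak / maxmemory:.0%} of maxmemory - investigate eviction pressure")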

When our production ElastiCache Redis cluster (cache.r3.2xlarge) started experiencing periodic restarts, we initially suspected memory pressure. The AWS CloudWatch metrics revealed a clear pattern: swap usage spikes consistently preceded node restarts, with the BytesUsedForCache metric dropping to zero each time.

# Sample CloudWatch metrics output (simplified)
{
    "Timestamp": "2023-09-22T07:34:47Z",
    "SwapUsage": 2147483648,
    "BytesUsedForCache": 0,
    "MetricName": "RedisMetrics"
}

Our initial configuration included:

  • maxmemory-policy allkeys-lru (switched from volatile-lru)
  • reserved-memory 2500000000 (2.5GB)
  • maxmemory-samples 10

Despite these settings, the mem_fragmentation_ratio remained healthy (1.00-1.05), suggesting memory fragmentation wasn't the root cause.

The core issue appears to stem from how ElastiCache manages the host OS memory allocation. Even with reserved memory configured, the underlying Linux system was still allocating swap space. Here's how we diagnosed it:

# Sample Redis INFO command output (memory section)
used_memory_human:58.5G
used_memory_rss_human:59.2G
used_memory_peak_human:58.9G
total_system_memory_human:62.4G
mem_fragmentation_ratio:1.01

With used_memory_rss sitting within a few hundred megabytes of total system memory, any transient allocation on the host (monitoring agents, forks during eviction) had nowhere to go but swap.

Solution 1: Memory Pressure Alarms
We created a CloudWatch alarm to trigger before swap usage reaches critical levels:

# Alarm at a 1 GB swap threshold (1073741824 bytes)
aws cloudwatch put-metric-alarm \
    --alarm-name "Redis-Swap-Warning" \
    --metric-name SwapUsage \
    --namespace AWS/ElastiCache \
    --dimensions Name=CacheClusterId,Value=your-cluster-id \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 1073741824 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:Redis-Alerts

Solution 2: Optimized Eviction Policy
We moved from pure LRU to TTL-aware eviction: keys carry explicit TTLs, volatile-ttl evicts the ones closest to expiry first, and active expiration runs at maximum effort:

# Redis configuration updates
config set maxmemory-policy volatile-ttl  # evict keys with the nearest expiry first
config set maxmemory-samples 5
config set active-expire-effort 10        # maximum effort reclaiming expired keys (range 1-10)
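
One caveat: volatile-ttl only considers keys that actually carry a TTL, so writes need an expiry. A minimal redis-py sketch (endpoint, key name, and TTL are illustrative):

import redis

r = redis.Redis(host="your-cluster-endpoint", port=6379)  # placeholder endpoint

# Every write carries an expiry so the key is a candidate for volatile-ttl eviction;
# keys written without a TTL are never evicted under this policy.
r.set("session:1234", "serialized-session-data", ex=3600)  # 1-hour TTL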

Solution 3: Client-Side Memory Monitoring
We added this Python snippet to our clients:

import redis
from datetime import datetime

def check_redis_memory(r):
    info = r.info('memory')
    if info['used_memory'] > 0.9 * info['total_system_memory']:
        print(f"{datetime.now()} - WARNING: High memory usage detected")
        # Implement client-side cache reduction logic here
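
A typical call site, assuming a shared client instance (the endpoint is a placeholder):

r = redis.Redis(host="your-cluster-endpoint", port=6379)
check_redis_memory(r)  # call periodically, e.g. alongside regular cache writes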

Solution 4: Scheduled Memory Optimization
A Lambda function that scans recent ElastiCache events for swap activity as part of nightly maintenance:

import boto3

def lambda_handler(event, context):
    client = boto3.client('elasticache')

    # Pull cache-cluster events from the last 24 hours (Duration is in minutes)
    response = client.describe_events(
        SourceType='cache-cluster',
        Duration=1440
    )

    if any('swap' in ev['Message'].lower() for ev in response['Events']):
        # Trigger proactive scaling or alert (e.g. publish to an SNS topic)
        pass
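
To run this nightly, the handler can be invoked by a scheduled Amazon EventBridge (CloudWatch Events) rule, with the pass branch replaced by an SNS publish or a scaling action appropriate to your environment.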

After implementing these changes, our swap usage decreased by 92% and we eliminated unexpected restarts. Key takeaways:

  • AWS's default swap configuration may be too aggressive for memory-intensive workloads
  • Combine CloudWatch alerts with client-side monitoring for defense in depth
  • Test memory policies under production-equivalent load patterns