Troubleshooting the AWS S3 Sync --exclude Option: Why Git Folders Aren’t Being Excluded



When using aws s3 sync, many developers encounter unexpected behavior with the --exclude option. The documentation suggests simple glob patterns should work, but in practice, excluding directories like .git proves more challenging than expected.

The fundamental issue lies in how the AWS CLI interprets exclude patterns. It does not use shell globbing: each pattern is matched against the file's path relative to the source directory, and * matches everything, including the / between directories. Here's why your attempts didn't work:

# These patterns fail because:
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude '*.git/*'  # Too broad: also matches any directory whose name ends in .git
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude '*/.git/*' # Requires a parent directory, so /var/www/.git itself is missed
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude '.git'     # Matches only an entry named exactly .git, not the files inside it

After extensive testing, I've found these approaches actually work:

# Method 1: Exclude the .git directory at the top of the source tree
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude '.git/*'

# Method 2: Also exclude .git directories nested at any depth
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude '.git/*' --exclude '*/.git/*'

# Method 3: Combined include/exclude approach (later filters take precedence)
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude '*' \
  --include '*' \
  --exclude '.git/*' --exclude '*/.git/*'
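
Before trusting any pattern with a real backup, the sync can be previewed with --dryrun, which lists the planned transfers without performing them. The sketch below filters that preview for leftover .git paths. It assumes the preview lines look like "(dryrun) upload: <local path> to <s3 uri>", and the helper name git_leaks is hypothetical, not part of the AWS CLI.

```shell
#!/usr/bin/env bash
# git_leaks: read `aws s3 sync --dryrun` output on stdin and print any
# planned transfer that still touches a .git directory.
# Hypothetical helper; it only filters text and never calls AWS itself.
git_leaks() {
  # Match ".git/" at the start of a path or right after "/" or a space,
  # so a directory such as repo.git/ is not reported.
  grep -E '(^|[/ ])\.git/' || true
}

# Usage (needs configured AWS credentials):
#   aws s3 sync /var/www s3://backup-bucket/var/www/ \
#     --exclude '.git/*' --exclude '*/.git/*' --dryrun | git_leaks
```

If the excludes are doing their job, the helper prints nothing.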

The AWS S3 sync command uses a specific pattern matching implementation:

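  • Patterns are relative to the source directory
  • * matches everything, including the / between directories, so it spans any number of directory levels
  • ** is not a separate wildcard; it behaves like a single *
  • ? matches any single character; [sequence] and [!sequence] match characters in or not in a set
  • Order of include/exclude flags matters (filters later in the command take precedence)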
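
Candidate patterns can be tried against sample paths without touching S3. This is a minimal sketch, assuming bash case matching is a close enough stand-in for the CLI's matcher (in both, * also matches "/"). The helper names matches and report are hypothetical, and the paths are written relative to the source directory.

```shell
#!/usr/bin/env bash
# Approximate the CLI's exclude matching with bash `case` patterns.
# Assumption: `case` globbing is close enough for illustration; it is
# not the CLI's own matcher.

# matches PATTERN PATH -> exit status 0 if PATH matches PATTERN.
matches() {
  local pattern=$1 path=$2
  case "$path" in
    $pattern) return 0 ;;   # unquoted on purpose: treat it as a glob
    *) return 1 ;;
  esac
}

# report PATTERN PATH... -> print the paths that PATTERN would exclude.
report() {
  local pattern=$1; shift
  local path
  for path in "$@"; do
    if matches "$pattern" "$path"; then
      printf '%s excludes %s\n' "$pattern" "$path"
    fi
  done
}

paths=(.git/config site/.git/HEAD a/b/.git/HEAD repo.git/HEAD index.html)

report '.git/*'   "${paths[@]}"   # only the top-level .git
report '*/.git/*' "${paths[@]}"   # only nested .git directories
report '*.git/*'  "${paths[@]}"   # both, plus repo.git/
```

Running it shows why '.git/*' and '*/.git/*' are usually paired: neither covers the other's case.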

For more complex scenarios, consider these approaches:

# Exclude multiple directory types, at the top level and nested at any depth
aws s3 sync /var/www s3://backup-bucket/var/www/ \
  --exclude '.git/*'  --exclude '*/.git/*' \
  --exclude 'cache/*' --exclude '*/cache/*' \
  --exclude 'tmp/*'   --exclude '*/tmp/*'

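# Using an exclude file: the CLI has no --exclude-from option, so bash builds the flags
args=()
while IFS= read -r p; do [ -n "$p" ] && args+=(--exclude "$p"); done < exclude-file.txt
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude ".git*" "${args[@]}"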

Where exclude-file.txt contains:

*.log
*.tmp
*/cache/*
*/node_modules/*

Note that pattern matching behavior may vary between AWS CLI versions. The solutions above work with:

  • AWS CLI 1.x (tested with 1.1.1 through 1.16.x)
  • AWS CLI 2.x (all current versions)
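
Since behavior can depend on the installed release, it helps to record which major version is in use. The sketch below parses the banner printed by aws --version, assumed here to start with "aws-cli/MAJOR.MINOR.PATCH". The function name cli_major is hypothetical.

```shell
#!/usr/bin/env bash
# cli_major: print the major version from an `aws --version` banner.
# Assumes the banner starts with "aws-cli/MAJOR.MINOR.PATCH".
cli_major() {
  local banner=$1
  local version=${banner#aws-cli/}   # drop the "aws-cli/" prefix
  printf '%s\n' "${version%%.*}"     # keep everything before the first dot
}

# Usage: older v1 releases may print the banner on stderr, hence 2>&1.
#   cli_major "$(aws --version 2>&1)"
```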

When using aws s3 sync on EC2 instances, many developers struggle with getting the exclusion patterns to work as expected. The core issue lies in how the AWS CLI interprets glob patterns and relative paths.

The AWS CLI documentation describes a small set of pattern symbols that resemble Linux glob patterns, but there are important nuances:

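# These patterns DON'T work as expected:
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude '*.git/*'   # also matches directories whose names merely end in .git
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude '*/.git/*'  # misses a .git directory directly under /var/www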

After extensive testing, here are the patterns that actually work:

# Solution 1: Top-level .git directory only
aws s3 sync /var/www s3://backup-bucket/var/www/ --exclude '.git/*'

# Solution 2: Multiple patterns approach (top-level and nested .git directories)
aws s3 sync /var/www s3://backup-bucket/var/www/ \
    --exclude '.git/*' \
    --exclude '*/.git/*'

# Solution 3: Combining include and exclude filters (later filters take precedence)
aws s3 sync /var/www s3://backup-bucket/var/www/ \
    --exclude "*" \
    --include "*" \
    --exclude ".git/*"

The AWS CLI processes patterns differently than standard shell globbing:

  • Single asterisk (*) matches everything, including the / between path segments
  • Double asterisk (**) is not a separate wildcard; it behaves like a single *
  • Patterns are evaluated relative to the source directory
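
Solution 3 depends on filter order: everything is included by default, and when several filters match a path, the one that appears later in the command decides. The sketch below illustrates that rule in bash, again assuming case matching as a stand-in for the CLI's matcher. The helper decide and its "exclude:PATTERN" / "include:PATTERN" argument format are hypothetical.

```shell
#!/usr/bin/env bash
# decide PATH FILTER... -> prints "include" or "exclude" for PATH.
# Each FILTER is "exclude:PATTERN" or "include:PATTERN", listed in the
# same order as the flags on the command line.
decide() {
  local path=$1; shift
  local verdict=include   # everything is included by default
  local filter kind pattern
  for filter in "$@"; do
    kind=${filter%%:*}      # text before the first colon
    pattern=${filter#*:}    # text after the first colon
    case "$path" in
      $pattern) verdict=$kind ;;   # a later match overrides earlier ones
    esac
  done
  printf '%s\n' "$verdict"
}

decide .git/config 'exclude:*' 'include:*' 'exclude:.git/*'   # prints exclude
decide index.html  'exclude:*' 'include:*' 'exclude:.git/*'   # prints include
decide index.html  'include:*' 'exclude:*'                    # prints exclude
```

The third call shows why a trailing --exclude "*" would undo every earlier --include.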

For complex scenarios, consider these approaches:

# Exclude multiple directories
aws s3 sync /var/www s3://backup-bucket/var/www/ \
    --exclude '.git/*'  --exclude '*/.git/*' \
    --exclude 'cache/*' --exclude '*/cache/*' \
    --exclude 'tmp/*'   --exclude '*/tmp/*'

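# Using an exclude file (the CLI has no --exclude-from option, so bash builds the flags)
args=()
while IFS= read -r p; do args+=(--exclude "$p"); done \
    < <(grep -v -e '^#' -e '^$' exclude.txt)
aws s3 sync /var/www s3://backup-bucket/var/www/ "${args[@]}"

Sample exclude.txt contents:

# Exclude patterns
.git/*
*/.git/*
*/node_modules/*
*.log
*.tmp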

Behavior varies across AWS CLI versions. For best results:

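# Check the installed version
aws --version
# Upgrade AWS CLI v1 (v2 is not installed through pip; re-run its installer instead)
pip install --upgrade awscli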