How to Programmatically Count Files in an Amazon S3 Bucket or Folder



When working with large S3 buckets containing thousands of files, the AWS Management Console becomes impractical for obtaining file counts. The web interface paginates results and doesn't provide aggregate statistics, making manual counting impossible for production-scale buckets.

The quickest method is the AWS Command Line Interface (CLI) with the list-objects command:

aws s3api list-objects --bucket YOUR_BUCKET_NAME --prefix "folder/path/" --output json --query "length(Contents[])"

This command returns the exact count of objects under the given prefix; the CLI paginates through all result pages automatically. Two caveats: if the prefix matches no objects, Contents is absent and the length() query fails rather than returning 0, and Contents only includes current objects. For buckets with versioning enabled, use list-object-versions instead to count every version.
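
If you do need a version count, here is a minimal Boto3 sketch using the list_object_versions paginator (the bucket and prefix values are placeholders):

import boto3

def count_object_versions(bucket_name, prefix=''):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_object_versions')
    count = 0
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # Each page lists up to 1,000 versions; add page.get('DeleteMarkers', [])
        # as well if delete markers should be included in the total
        count += len(page.get('Versions', []))
    return count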

For integration into applications, here are implementations in popular languages:

Python (Boto3)

import boto3

def count_s3_objects(bucket_name, prefix=''):
    # The collection paginates automatically behind the scenes,
    # issuing one LIST request per 1,000 keys
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    return sum(1 for _ in bucket.objects.filter(Prefix=prefix))

# Usage
file_count = count_s3_objects('your-bucket', 'target-folder/')
print(f"Total files: {file_count}")

Node.js

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function countObjects(bucket, prefix = '') {
  let count = 0;
  let isTruncated = true;
  let marker;
  
  while (isTruncated) {
    const params = {
      Bucket: bucket,
      Prefix: prefix,
      Marker: marker
    };
    
    const data = await s3.listObjects(params).promise();
    count += data.Contents.length;
    isTruncated = data.IsTruncated;
    // listObjects (V1) returns no continuation token here, so resume
    // from the last key of the current page
    if (isTruncated) marker = data.Contents.slice(-1)[0].Key;
  }
  
  return count;
}

// Usage
countObjects('your-bucket', 'path/to/folder/')
  .then(count => console.log(`Total objects: ${count}`));

For buckets containing millions of objects:

  • Use S3 Inventory for daily reports
  • Implement parallel requests for faster counting
  • Consider AWS Athena for SQL-based queries (sketched after this list)
  • Cache results when possible
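
As a sketch of the Athena option: assuming S3 Inventory is already delivering reports and an Athena table (hypothetically named s3_inventory here) has been defined over them, a count query can be launched with Boto3:

import boto3

athena = boto3.client('athena')

# The table name and results bucket below are placeholders for your own
# Athena-over-S3-Inventory setup
query = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM s3_inventory "
                "WHERE key LIKE 'target-folder/%'",
    ResultConfiguration={'OutputLocation': 's3://your-query-results/'}
)
print(query['QueryExecutionId'])

Once the execution finishes, the result can be read back with get_query_results.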

For teams needing frequent statistics:

  1. Set up CloudWatch metrics with S3 Storage Lens
  2. Create Lambda functions triggered by S3 events (sketched below)
  3. Use AWS Glue crawlers for metadata collection
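
As a rough sketch of the Lambda approach, a function subscribed to ObjectCreated and ObjectRemoved events can maintain a running count in DynamoDB, avoiding LIST calls entirely (the table name and key schema here are hypothetical):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('s3-object-counts')  # hypothetical table, partition key "bucket"

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # ObjectCreated:* events increment the count, ObjectRemoved:* decrement it
        delta = 1 if record['eventName'].startswith('ObjectCreated') else -1
        table.update_item(
            Key={'bucket': bucket},
            UpdateExpression='ADD objectCount :d',
            ExpressionAttributeValues={':d': delta}
        )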


Beyond the single command shown above, the AWS CLI offers several ways to count objects:


# Basic count of all objects in a bucket (--recursive prints one line per object)
aws s3 ls s3://your-bucket-name --recursive | wc -l

# Count objects under a specific prefix (folder)
aws s3 ls s3://your-bucket-name/your-folder/ --recursive | wc -l

# More robust method: query the API directly instead of counting output lines
aws s3api list-objects --bucket your-bucket-name --prefix "your-folder/" \
--query "length(Contents[])" --output text
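
# If a quick interactive answer is enough, ls can also summarize for you:
# the end of the output includes "Total Objects:" and "Total Size:" lines
aws s3 ls s3://your-bucket-name --recursive --summarize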

For more complex requirements, the Boto3 SDK offers greater flexibility:


import boto3

def count_s3_objects(bucket_name, prefix=''):
    s3 = boto3.client('s3')
    # The paginator follows continuation tokens automatically, fetching
    # up to 1,000 keys per page
    paginator = s3.get_paginator('list_objects_v2')

    count = 0
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        if 'Contents' in page:
            count += len(page['Contents'])

    return count

# Example usage
total_files = count_s3_objects('your-bucket-name', 'your-folder/')
print(f"Total files: {total_files}")

When dealing with extremely large buckets:

  • Use S3 Inventory for regular reporting
  • Consider AWS Athena for SQL-like queries on S3 metadata
  • Implement CloudWatch Metrics for monitoring
  • Be aware of API request costs at scale (rough numbers follow this list)
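
To put the cost point in rough numbers: each LIST call returns at most 1,000 keys, so counting 100 million objects takes about 100,000 LIST requests; at the typical us-east-1 price of around $0.005 per 1,000 requests, that is only about $0.50 per full count, but repeating it frequently across many buckets adds up.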

For production environments, note that paginated listing within a single prefix is inherently sequential, because each continuation token comes from the previous page; real parallelism therefore has to come from splitting the key space across prefixes:


# Parallel counting across prefixes. Listing within one prefix is
# sequential (each continuation token depends on the previous page),
# so the key space is split by prefix and the prefixes are counted
# concurrently.
import boto3
from concurrent.futures import ThreadPoolExecutor

def count_prefix(s3, bucket_name, prefix):
    paginator = s3.get_paginator('list_objects_v2')
    return sum(len(page.get('Contents', []))
               for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix))

def parallel_count(bucket_name, prefixes, max_workers=10):
    # Boto3 clients are thread-safe, so a single client can be shared
    s3 = boto3.client('s3')
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return sum(executor.map(lambda p: count_prefix(s3, bucket_name, p),
                                prefixes))

# Example usage: split the bucket by known top-level prefixes
# total = parallel_count('your-bucket-name', ['logs/2023/', 'logs/2024/'])