When serving static files like images from a web server, the underlying filesystem becomes your invisible constraint. Most Linux filesystems (ext4, XFS, etc.) technically support millions of files per directory, but performance degrades significantly beyond certain thresholds.
# Benchmarking a directory with 300k files vs partitioned:
$ time ls /mnt/storage/300k_files/ | wc -l # 12.7 seconds
$ time ls /mnt/storage/partitioned/001/ | wc -l # 0.03 seconds
- Directory Scan Overhead: Enumerating the directory with readdir() is O(n), so every listing walks all 300K entries
- Cache Pollution: Directory entries compete for limited filesystem cache space
- Metadata Contention: Single directory mutex bottlenecks in kernel VFS layer
- Backup Challenges: rsync/tar operations may timeout or fail
For your 300,000 PNG files, consider these partitioning approaches:
// Hash-based directory structure (recommended):
function getStoragePath($id) {
    // Fan files out across 256 x 256 subdirectories using the first
    // four hex characters of the ID's MD5 hash.
    $hash = md5($id);
    return sprintf(
        '/storage/%s/%s/%s.png',
        substr($hash, 0, 2),
        substr($hash, 2, 2),
        $id
    );
}
// Example: ID 12345 → /storage/82/7c/12345.png (md5('12345') = 827ccb0e...)
Alternative numeric partitioning:
// Numeric directory partitioning:
/png/000/000001.png
/png/000/000999.png
/png/001/001000.png
...
/png/299/299999.png
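A minimal sketch of the mapping implied by that layout, assuming IDs are plain integers and each directory holds 1,000 files (the helper name is just illustrative):

import os

def numeric_path(file_id, base_dir="/png"):
    """Map a numeric ID to /png/NNN/NNNNNN.png with 1,000 files per directory."""
    shard = file_id // 1000              # 0..299 for IDs up to 299,999
    return os.path.join(base_dir, f"{shard:03d}", f"{file_id:06d}.png")

# numeric_path(1)      -> /png/000/000001.png
# numeric_path(299999) -> /png/299/299999.png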
For Nginx static file serving with partitioned directories:
server {
    # Map /images/<2 hex>/<2 hex>/<name>.png onto the sharded tree on disk
    location ~ ^/images/([a-f0-9]{2})/([a-f0-9]{2})/(.+\.png)$ {
        alias /storage/$1/$2/$3;
        expires 30d;
        access_log off;
    }
}
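This location only matches when the published URL already carries the two hash prefixes, so the application must generate links that include them. A small sketch of a URL builder that mirrors getStoragePath() above (the /images prefix is taken from the location block; the function name is hypothetical):

import hashlib

def image_url(file_id):
    """Build /images/<h0h1>/<h2h3>/<id>.png to match the Nginx location regex."""
    h = hashlib.md5(str(file_id).encode()).hexdigest()
    return f"/images/{h[0:2]}/{h[2:4]}/{file_id}.png"

# image_url(12345) -> /images/82/7c/12345.png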
Bash script to reorganize existing files:
#!/bin/bash
# Move an existing flat directory into the md5-prefix layout used by getStoragePath().
for file in /old_dir/*.png; do
    filename=$(basename "$file")
    id="${filename%.png}"
    # Hash the bare ID (no extension, no trailing newline) so the shard
    # matches what getStoragePath() computes.
    hash=$(printf '%s' "$id" | md5sum | cut -c1-4)
    mkdir -p "/new_dir/${hash:0:2}/${hash:2:2}"
    mv "$file" "/new_dir/${hash:0:2}/${hash:2:2}/$filename"
done
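After the migration it is worth checking that the hash prefixes spread files roughly evenly. A quick sketch that counts files per top-level shard, assuming the /new_dir layout from the script above:

import os
from collections import Counter

counts = Counter()
for shard in os.listdir("/new_dir"):
    for sub in os.listdir(os.path.join("/new_dir", shard)):
        counts[shard] += len(os.listdir(os.path.join("/new_dir", shard, sub)))

# With 300K files across 256 top-level shards, expect roughly 1,170 per shard.
print(min(counts.values()), max(counts.values()), counts.most_common(3))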
Key metrics to verify improvement:
# Check inode cache efficiency
grep -E 'dentry|inode' /proc/slabinfo
# Monitor directory lookup latency
strace -ttT -e trace=getdents,getdents64 ls /your/dir >/dev/null
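To quantify per-file lookup cost (as opposed to listing time), a small sketch that times open()/close() over a sample of paths in the flat versus sharded layout; the sample lists are placeholders you supply:

import time

def avg_open_ms(paths):
    """Average open()+close() latency in milliseconds over the given paths."""
    start = time.perf_counter()
    for p in paths:
        with open(p, "rb"):
            pass
    return (time.perf_counter() - start) * 1000 / len(paths)

# Compare avg_open_ms(flat_sample) vs avg_open_ms(sharded_sample); drop the page
# cache first (echo 3 > /proc/sys/vm/drop_caches as root) for a cold-cache comparison.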
Storing 300,000+ static files (like images) in a single directory creates multiple performance bottlenecks:
- Filesystem Limitations: ext3 (and ext4 without dir_index) falls back to linear directory scans, so lookups trend toward O(n); even with hashed indexes, metadata-heavy operations like stat() and open() across 300K entries become expensive.
- Inode Cache Pressure: The kernel's directory entry (dentry) cache gets overwhelmed, leading to frequent cache misses.
- Backup Challenges: Tools like rsync or tar struggle with huge directories, often failing or consuming excessive RAM.
At YouTube-scale operations, we observed:
# Typical symptoms:
1. 500-1000ms latency for simple file requests
2. Kernel soft lockups during directory scans
3. "Too many open files" errors despite high ulimit settings
Implement a predictable sharding scheme. Here's a Python example for distributing 300K images:
import os
import hashlib

def get_sharded_path(filename, base_dir="/var/www/images", depth=2):
    """Hash-based directory sharding: /base/<h0h1>/<h2h3>/<filename>."""
    hexdigest = hashlib.md5(filename.encode()).hexdigest()
    path_parts = [base_dir] + [hexdigest[i*2:(i+1)*2] for i in range(depth)]
    return os.path.join(*path_parts, filename)

# Usage:
filepath = get_sharded_path("12345.png")
# e.g. /var/www/images/<xx>/<yy>/12345.png, where <xx><yy> are the first
# four hex characters of md5("12345.png")
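A usage sketch for writing an uploaded file through the function above (the source path and helper name are hypothetical):

import os
import shutil

def store_image(src, filename):
    """Copy an image into its sharded location, creating shard directories on demand."""
    dest = get_sharded_path(filename)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(src, dest)
    return dest

# store_image("/tmp/upload.png", "12345.png")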
For the sharded structure, optimize NGINX with open file cache:
http {
    # Cache open file descriptors and metadata for frequently requested images
    open_file_cache max=10000 inactive=30s;
    open_file_cache_valid 60s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    server {
        location /images/ {
            # Serve from disk if present, otherwise hand off to the app
            try_files $uri @backend;
        }
    }
}
For cloud-native solutions:
- Object Storage: Use S3/GCS with a CDN (CloudFront, Cloudflare)
- Database-backed: Store metadata in PostgreSQL with the ltree extension
- Content-Addressable: Use SHA-256 hashes of the file contents as filenames (like Git); see the sketch below
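For the content-addressable option, a minimal sketch in the spirit of Git's object store: the file is named after the SHA-256 of its own bytes, so identical uploads deduplicate automatically (the base directory and naming are illustrative):

import hashlib
import os

def content_addressed_path(src, base_dir="/var/www/cas"):
    """Derive /cas/<first 2 hex>/<full sha256>.png from the file's contents."""
    h = hashlib.sha256()
    with open(src, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    return os.path.join(base_dir, digest[:2], digest + ".png")

# content_addressed_path("/tmp/upload.png")
#   -> /var/www/cas/<2 hex chars>/<64-char digest>.png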
Benchmark with ab (ApacheBench) before/after sharding:
# Before:
ab -n 1000 -c 50 http://example.com/images/123.png
→ 87% requests > 500ms
# After sharding:
ab -n 1000 -c 50 http://example.com/images/12/34/123.png
→ 99% requests < 50ms