Working with ext3 filesystems at scale reveals a fundamental limitation: directory lookup performance degrades badly as the number of files in a single directory grows. With ext3's default linear directory indexing, operations like readdir() or stat() become O(n) scans of the directory entries.
Based on production experience with web-scale applications:
- 1-10k files: Near instantaneous operations (1-10ms)
- 10-50k files: Noticeable lag (50-200ms)
- 50-100k files: UI-freezing delays (300-1000ms)
- 100k+ files: Often triggers filesystem hangs
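These figures depend on hardware, cache state, and mount options; a minimal timing sketch along these lines (the directory path is a placeholder) reproduces the measurement on your own system:

import os
import time

def time_directory_scan(path):
    # Time a full listing plus a stat() of every entry, the pattern behind the numbers above
    start = time.monotonic()
    names = os.listdir(path)                    # forces a full readdir() of the directory
    for name in names:
        os.stat(os.path.join(path, name))       # each stat() is another directory lookup
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{path}: {len(names)} entries in {elapsed_ms:.1f} ms")

# Example (hypothetical path):
# time_directory_scan("/data/flat_directory")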
For storing 3 million files, we recommend a 3-level hashed directory structure:
import hashlib

def get_storage_path(filename, base="/data"):
    # Generate a 3-level hash structure (2 hex characters per level)
    md5 = hashlib.md5(filename.encode()).hexdigest()
    return f"{base}/{md5[0:2]}/{md5[2:4]}/{md5[4:6]}/{filename}"

# Example usage:
storage_path = get_storage_path("document12345.pdf")
# Returns something like /data/a1/b2/c3/document12345.pdf (exact prefixes depend on the MD5 of the name)
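Because MD5 output is effectively uniform, names spread evenly across the hash buckets regardless of how similar they are; a quick check like this sketch (the synthetic names are placeholders) confirms that before committing to the layout:

from collections import Counter
import hashlib

def first_level_distribution(filenames):
    # Count how many names land in each first-level bucket (first two hex characters)
    return Counter(hashlib.md5(name.encode()).hexdigest()[:2] for name in filenames)

counts = first_level_distribution(f"document{i}.pdf" for i in range(100000))
print(min(counts.values()), max(counts.values()))   # min and max should be close for a uniform spread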
Combine the directory structure optimization with these ext3 tweaks:
# /etc/fstab optimizations
/dev/sdb1 /data ext3 noatime,data=writeback 0 2
# dir_index is a filesystem feature, not a mount option; enable it and rebuild existing indexes
tune2fs -O dir_index /dev/sdb1
e2fsck -fD /dev/sdb1    # run on the unmounted filesystem
# Favour keeping dentry/inode caches in memory (/proc/sys/fs/inode-max no longer exists on modern kernels)
echo 50 > /proc/sys/vm/vfs_cache_pressure
A Python script to reorganize existing flat directories:
import os
import hashlib
from pathlib import Path

def migrate_files(source_dir, target_dir):
    # Uses get_storage_path() defined above to compute each hashed destination
    for filename in os.listdir(source_dir):
        src = os.path.join(source_dir, filename)
        if os.path.isfile(src):
            dest = get_storage_path(filename, target_dir)
            Path(dest).parent.mkdir(parents=True, exist_ok=True)
            os.rename(src, dest)    # same-filesystem move; use shutil.move() across devices
For extreme-scale systems:
- Consider XFS, or ext4 with the dir_nlink feature (which lifts ext3's ~32,000-subdirectory limit)
- Implement a content-addressable storage layer (see the sketch below)
- Use distributed filesystems like Ceph for petabyte-scale deployments
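As a rough illustration of the content-addressable idea, here is a minimal sketch (the cas_store name and /cas base path are assumptions for the example): each file is stored under the hash of its contents, so identical files are deduplicated automatically and the directory fan-out comes for free from the hash prefix.

import hashlib
import shutil
from pathlib import Path

def cas_store(source_path, base="/cas"):
    # Hash the file contents (not the name), streaming in 1 MiB chunks
    h = hashlib.sha256()
    with open(source_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    # Fan out by hash prefix, the same trick as the hashed directory scheme above
    dest = Path(base) / digest[:2] / digest[2:4] / digest
    if not dest.exists():                       # identical content is stored only once
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_path, dest)
    return str(dest)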
When working with filesystems like ext3, directory performance degrades significantly as the number of files increases. The default linear directory indexing in ext3 becomes inefficient beyond roughly 10,000-15,000 files in a single directory. At 3 million files, operations like ls or readdir() become painfully slow because the system must scan the entire directory linearly.
The most effective approach is implementing a hashed directory structure. Here's a Python example for creating a 2-level deep structure:
import os
import hashlib

def get_hashed_path(filename, base_dir, depth=2):
    # One hex character of the MD5 digest per directory level (16-way fan-out per level)
    h = hashlib.md5(filename.encode()).hexdigest()
    path = base_dir
    for i in range(depth):
        path = os.path.join(path, h[i])
    return path

def create_hashed_file(filename, content, base_dir):
    dest_path = get_hashed_path(filename, base_dir)
    os.makedirs(dest_path, exist_ok=True)
    with open(os.path.join(dest_path, filename), 'w') as f:
        f.write(content)
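For completeness, a matching read helper (a sketch; the function name is mine) shows that the hash recomputes the location from the name alone, so no separate index is needed:

def read_hashed_file(filename, base_dir):
    # Recompute the hashed directory from the file name and read the file back
    dest_path = get_hashed_path(filename, base_dir)
    with open(os.path.join(dest_path, filename)) as f:
        return f.read()

# Example usage:
# create_hashed_file("report.txt", "hello", "/data")
# print(read_hashed_file("report.txt", "/data"))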
For optimal performance with ext3:
- First-level directories: 16-36 subdirectories (0-9, a-z)
- Second-level: Another 16-36 subdirectories
- Target files per directory: Keep under 10,000 for good performance
For a system needing to store 3 million files, a 3-level structure (16x16x16) would yield:
- 4096 leaf directories (16^3)
- ~732 files per directory (3,000,000 / 4096)
This maintains excellent performance while being simple to implement.
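The same arithmetic works for any file count; a small helper like this (the function name is illustrative) makes it easy to pick a fan-out and depth up front:

def leaf_dir_load(total_files, fanout=16, depth=3):
    # Assuming an even hash spread, files per leaf directory = total / fanout**depth
    leaves = fanout ** depth
    return leaves, total_files / leaves

leaves, per_dir = leaf_dir_load(3_000_000)
print(leaves, round(per_dir))   # 4096 leaf directories, ~732 files each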
When migrating existing files, use a script like this bash example:
#!/bin/bash
BASE_DIR="/data"
NEW_BASE="/new_structure"

# Bucket by the first two characters of each file name (a simpler, non-hashed variant)
find "$BASE_DIR" -type f -print0 | while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    dir1=${filename:0:1}
    dir2=${filename:1:1}
    mkdir -p "$NEW_BASE/$dir1/$dir2"
    mv "$file" "$NEW_BASE/$dir1/$dir2/$filename"
done
After implementation, monitor performance with:
# Time directory listing
time ls /path/to/directory | wc -l
# Check inode usage
df -i
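If you would rather watch this from application code, os.statvfs exposes the same inode counters that df -i reports; a minimal sketch (the mount point is a placeholder):

import os

def inode_usage(mount_point="/data"):
    st = os.statvfs(mount_point)
    used = st.f_files - st.f_ffree              # total inodes minus free inodes
    return used, st.f_files

used, total = inode_usage()
print(f"{used}/{total} inodes used ({used / total:.1%})")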