Working with ext3 filesystems at scale reveals a fundamental limitation: directory lookup performance degrades badly as the number of files in a single directory grows. With ext3's default linear directory indexing, operations like readdir() or stat() become O(n) scans of the directory entries.
Based on production experience with web-scale applications:
- 1-10k files: Near instantaneous operations (1-10ms)
- 10-50k files: Noticeable lag (50-200ms)
- 50-100k files: UI-freezing delays (300-1000ms)
- 100k+ files: Often triggers filesystem hangs
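These figures depend on hardware, cache state, and mount options; a minimal timing sketch along these lines (the directory path is a placeholder) reproduces the measurement on your own system:

import os
import time

def time_directory_scan(path):
    # Time a full listing plus a stat() of every entry, the pattern behind the numbers above
    start = time.monotonic()
    names = os.listdir(path)                    # forces a full readdir() of the directory
    for name in names:
        os.stat(os.path.join(path, name))       # each stat() is another directory lookup
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{path}: {len(names)} entries in {elapsed_ms:.1f} ms")

# Example (hypothetical path):
# time_directory_scan("/data/flat_directory")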
For storing 3 million files, we recommend a 3-level hashed directory structure:
import hashlib

def get_storage_path(filename, base="/data"):
    # Generate a 3-level hash structure (2 hex characters per level)
    md5 = hashlib.md5(filename.encode()).hexdigest()
    return f"{base}/{md5[0:2]}/{md5[2:4]}/{md5[4:6]}/{filename}"

# Example usage:
storage_path = get_storage_path("document12345.pdf")
# Returns something like /data/a1/b2/c3/document12345.pdf (exact prefixes depend on the MD5 of the name)
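Because MD5 output is effectively uniform, names spread evenly across the hash buckets regardless of how similar they are; a quick check like this sketch (the synthetic names are placeholders) confirms that before committing to the layout:

from collections import Counter
import hashlib

def first_level_distribution(filenames):
    # Count how many names land in each first-level bucket (first two hex characters)
    return Counter(hashlib.md5(name.encode()).hexdigest()[:2] for name in filenames)

counts = first_level_distribution(f"document{i}.pdf" for i in range(100000))
print(min(counts.values()), max(counts.values()))   # min and max should be close for a uniform spread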
Combine the directory structure optimization with these ext3 tweaks:
# /etc/fstab optimizations
/dev/sdb1 /data ext3 noatime,data=writeback 0 2
# dir_index is a filesystem feature, not a mount option; enable it and rebuild existing indexes
tune2fs -O dir_index /dev/sdb1
e2fsck -fD /dev/sdb1    # run on the unmounted filesystem
# Favour keeping dentry/inode caches in memory (/proc/sys/fs/inode-max no longer exists on modern kernels)
echo 50 > /proc/sys/vm/vfs_cache_pressure
A Python script to reorganize existing flat directories:
import os
import hashlib
from pathlib import Path

def migrate_files(source_dir, target_dir):
    # Uses get_storage_path() defined above to compute each hashed destination
    for filename in os.listdir(source_dir):
        src = os.path.join(source_dir, filename)
        if os.path.isfile(src):
            dest = get_storage_path(filename, target_dir)
            Path(dest).parent.mkdir(parents=True, exist_ok=True)
            os.rename(src, dest)    # same-filesystem move; use shutil.move() across devices
For extreme-scale systems:
- Consider XFS, or ext4 with the dir_nlink feature (which lifts ext3's ~32,000-subdirectory limit)
- Implement a content-addressable storage layer (see the sketch below)
- Use distributed filesystems like Ceph for petabyte-scale deployments
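As a rough illustration of the content-addressable idea, here is a minimal sketch (the cas_store name and /cas base path are assumptions for the example): each file is stored under the hash of its contents, so identical files are deduplicated automatically and the directory fan-out comes for free from the hash prefix.

import hashlib
import shutil
from pathlib import Path

def cas_store(source_path, base="/cas"):
    # Hash the file contents (not the name), streaming in 1 MiB chunks
    h = hashlib.sha256()
    with open(source_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    # Fan out by hash prefix, the same trick as the hashed directory scheme above
    dest = Path(base) / digest[:2] / digest[2:4] / digest
    if not dest.exists():                       # identical content is stored only once
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_path, dest)
    return str(dest)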
When working with filesystems like ext3, directory performance degrades significantly as the number of files increases. The default linear directory indexing in ext3 becomes inefficient beyond roughly 10,000-15,000 files in a single directory. At 3 million files, operations like ls or readdir() become painfully slow because the system must scan the entire directory linearly.
The most effective approach is implementing a hashed directory structure. Here's a Python example for creating a 2-level deep structure:
import os
import hashlib

def get_hashed_path(filename, base_dir, depth=2):
    # One hex character of the MD5 digest per directory level (16-way fan-out per level)
    h = hashlib.md5(filename.encode()).hexdigest()
    path = base_dir
    for i in range(depth):
        path = os.path.join(path, h[i])
    return path

def create_hashed_file(filename, content, base_dir):
    dest_path = get_hashed_path(filename, base_dir)
    os.makedirs(dest_path, exist_ok=True)
    with open(os.path.join(dest_path, filename), 'w') as f:
        f.write(content)
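For completeness, a matching read helper (a sketch; the function name is mine) shows that the hash recomputes the location from the name alone, so no separate index is needed:

def read_hashed_file(filename, base_dir):
    # Recompute the hashed directory from the file name and read the file back
    dest_path = get_hashed_path(filename, base_dir)
    with open(os.path.join(dest_path, filename)) as f:
        return f.read()

# Example usage:
# create_hashed_file("report.txt", "hello", "/data")
# print(read_hashed_file("report.txt", "/data"))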
For optimal performance with ext3:
- First-level directories: 16-36 subdirectories (0-9, a-z)
- Second-level: Another 16-36 subdirectories
- Target files per directory: Keep under 10,000 for good performance
For a system needing to store 3 million files, a 3-level structure (16x16x16) would yield:
- 4096 leaf directories (16^3)
- ~732 files per directory (3,000,000 / 4096)
This maintains excellent performance while being simple to implement.
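The same arithmetic works for any file count; a small helper like this (the function name is illustrative) makes it easy to pick a fan-out and depth up front:

def leaf_dir_load(total_files, fanout=16, depth=3):
    # Assuming an even hash spread, files per leaf directory = total / fanout**depth
    leaves = fanout ** depth
    return leaves, total_files / leaves

leaves, per_dir = leaf_dir_load(3_000_000)
print(leaves, round(per_dir))   # 4096 leaf directories, ~732 files each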
When migrating existing files, use a script like this bash example:
#!/bin/bash
BASE_DIR="/data"
NEW_BASE="/new_structure"

# Bucket by the first two characters of each file name (a simpler, non-hashed variant)
find "$BASE_DIR" -type f -print0 | while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    dir1=${filename:0:1}
    dir2=${filename:1:1}
    mkdir -p "$NEW_BASE/$dir1/$dir2"
    mv "$file" "$NEW_BASE/$dir1/$dir2/$filename"
done
After implementation, monitor performance with:
# Time directory listing
time ls /path/to/directory | wc -l
# Check inode usage
df -i
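If you would rather watch this from application code, os.statvfs exposes the same inode counters that df -i reports; a minimal sketch (the mount point is a placeholder):

import os

def inode_usage(mount_point="/data"):
    st = os.statvfs(mount_point)
    used = st.f_files - st.f_ffree              # total inodes minus free inodes
    return used, st.f_files

used, total = inode_usage()
print(f"{used}/{total} inodes used ({used / total:.1%})")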