As developers working with storage systems, filesystems, or fault-tolerant applications, we often need to test how our software behaves when underlying block devices fail. Real hardware failures are unpredictable, making controlled testing challenging. Here are several reliable methods to simulate I/O errors.
The device-mapper's 'flakey' target is perfect for simulating intermittent failures:
# Create a flakey device: passes I/O for 1 second, then fails all I/O for 2 seconds, repeating
sudo dmsetup create flakey-device --table "0 102400 flakey /dev/sdb1 0 1 2"
This creates a device that will:
- Pass I/O through normally for 1 second (the up interval)
- Fail all I/O with EIO for the next 2 seconds (the down interval)
- Repeat this cycle for as long as the device exists (see the quick check below)
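A quick way to confirm the timing is to read through the device while watching the kernel log; reads issued during a down window fail with EIO:

# Read through the flakey device; expect intermittent failures
sudo dd if=/dev/mapper/flakey-device of=/dev/null bs=4096 count=5000 iflag=direct
dmesg | tail   # look for "Buffer I/O error on dev dm-..." lines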
If you're testing kernel modules or low-level systems:
# Enable fault injection for block requests (requires CONFIG_FAIL_MAKE_REQUEST
# and debugfs mounted)
echo 100 > /sys/kernel/debug/fail_make_request/probability   # fail 100% of eligible requests
echo -1 > /sys/kernel/debug/fail_make_request/times          # no limit on failure count
echo 1 > /sys/block/sdX/make-it-fail                          # opt this device in
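If you only want a single process to see failures, the fault-injection framework also has a task filter. A minimal sketch, assuming the same debugfs interface is available (note the exec, so the workload runs in the task that opted in):

# Limit injection to tasks that opt in via /proc/<pid>/make-it-fail
echo 1 > /sys/kernel/debug/fail_make_request/task-filter
bash -c 'echo 1 > /proc/self/make-it-fail; exec dd if=/dev/sdX of=/dev/null bs=4096 count=100'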
For application-level testing without root access:
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>

/* Fail every read with EIO; a usable test fs also needs getattr/open/readdir,
   omitted here for brevity */
static int read_error(const char *path, char *buf, size_t size,
                      off_t offset, struct fuse_file_info *fi)
{
    return -EIO; /* Simulate read error */
}

static struct fuse_operations ops = {
    .read = read_error,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &ops, NULL); /* Mount with: ./errorfs /mnt/fuse -f */
}
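To try it, build against libfuse3; the package name varies by distro (libfuse3-dev on Debian/Ubuntu is an assumption here), and errorfs.c is the file name used for the sketch above:

gcc -Wall errorfs.c $(pkg-config fuse3 --cflags --libs) -o errorfs
mkdir -p /mnt/fuse
./errorfs /mnt/fuse -f   # -f keeps it in the foreground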
To detect actual media errors (note that badblocks scans the device for existing bad sectors rather than simulating failures):

# Read-only scan, 4 KiB blocks, stop after the first bad block found
sudo badblocks -sv -b 4096 -e 1 /dev/sdX
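If you need a genuine uncorrectable sector rather than a scan, hdparm can create one on drives that support WRITE_UNCORRECTABLE_EXT. This is destructive, and the sector number below is only an example:

# DESTRUCTIVE: flag LBA 1000000 as uncorrectable (example sector number)
sudo hdparm --make-bad-sector 1000000 --yes-i-know-what-i-am-doing /dev/sdX
# Undo it later by rewriting the sector
sudo hdparm --repair-sector 1000000 --yes-i-know-what-i-am-doing /dev/sdX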
For distributed systems testing, combine dm-flakey with network partitioning:
# On node1: make the local disk flaky with dm-flakey
sudo dmsetup create flaky-disk --table "0 $(blockdev --getsz /dev/sdb) flakey /dev/sdb 0 1 1"

# On node2: drop iSCSI traffic (port 3260) to simulate a network partition
sudo iptables -A INPUT -p tcp --dport 3260 -j DROP
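A quick sanity check from node1 that the partition is in effect, using bash's /dev/tcp (assumes node2 resolves to the peer's address):

timeout 3 bash -c 'cat < /dev/null > /dev/tcp/node2/3260' \
    || echo "port 3260 unreachable - partition active"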
Remember to clean up test devices after use:

sudo dmsetup remove flakey-device
sudo umount /mnt/fuse
# If you added the iptables partition rule, delete it as well
sudo iptables -D INPUT -p tcp --dport 3260 -j DROP
Testing error handling is crucial for robust storage systems. Developers need to verify how their applications behave when disks fail or return errors. Here are common scenarios where you might want to simulate I/O errors:
- Filesystem error handling verification
- Database recovery testing
- Distributed storage system fault tolerance checks
- RAID controller failure scenarios
The most reliable method is using Linux's Device Mapper to create a virtual block device that produces errors:
# Create a 1 GiB backing image on a loop device
dd if=/dev/zero of=error_image bs=1M count=1024
losetup /dev/loop0 error_image

# Set up dm-error: the dm table needs two lines, one per segment,
# so pipe them in with printf rather than a single --table string
printf '0 2097152 linear /dev/loop0 0\n2097152 512 error\n' | dmsetup create error-dev
This creates a device where the first 1 GiB (2097152 sectors of 512 bytes) works normally, followed by a 512-sector region that always fails. Any I/O touching the error region returns EIO.
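You can demonstrate the boundary directly with dd; the last sector of the linear region reads fine while the next one returns EIO:

# Last good sector (succeeds)
sudo dd if=/dev/mapper/error-dev of=/dev/null bs=512 count=1 skip=2097151
# First error sector (fails with EIO)
sudo dd if=/dev/mapper/error-dev of=/dev/null bs=512 count=1 skip=2097152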
For more dynamic control, use the kernel's fault injection capabilities:
# Make the device eligible for injected failures
echo 1 > /sys/block/sdX/make-it-fail

# Configure failure parameters under debugfs (probability, times, etc.)
echo 10 > /sys/kernel/debug/fail_make_request/probability   # fail ~10% of requests
echo 5 > /sys/kernel/debug/fail_make_request/times          # stop after 5 failures
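When the run is finished, turn injection back off so the device behaves normally again:

echo 0 > /sys/block/sdX/make-it-fail
echo 0 > /sys/kernel/debug/fail_make_request/probability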
For ext2/3/4 filesystems, debugfs can corrupt specific blocks:
debugfs -w /dev/sdX1
debugfs:  clri /path/to/file     # Clears the file's inode
debugfs:  freeb <block> <count>  # Marks in-use blocks as free, corrupting the filesystem
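After corrupting the filesystem, confirm the damage is actually detectable; a read-only check is safe:

# Force a consistency check without modifying anything; it should report the damage
sudo e2fsck -fn /dev/sdX1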
When testing network storage (iSCSI, NBD), use tc for network errors:
# Drop 10% of packets and corrupt 5% of them on eth0
sudo tc qdisc add dev eth0 root netem loss 10% corrupt 5%
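netem settings persist until removed, so inspect and delete the qdisc once the test is done:

tc qdisc show dev eth0            # confirm the netem rule is active
sudo tc qdisc del dev eth0 root   # restore normal networking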
When implementing these methods:
- Always use separate test machines or VMs
- Monitor dmesg for actual error messages
- Test both read and write failures
- Combine with stress-ng for realistic load (see the sketch after this list)
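As a sketch of the stress-ng suggestion, something like the following drives mixed disk I/O against a filesystem backed by one of the error devices above; /mnt/test is an assumed mount point:

# 4 workers writing/reading 256M each for 60 s; watch dmesg in parallel
stress-ng --hdd 4 --hdd-bytes 256M --timeout 60s --temp-path /mnt/test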
Here's how to test PostgreSQL failure recovery:
# Create an error device: the first 1048576 sectors (512 MiB) map to /dev/sdb1,
# then a 2048-sector (1 MiB) region returns errors (length chosen for illustration)
printf '0 1048576 linear /dev/sdb1 0\n1048576 2048 error\n' | dmsetup create pg-error

# Mount as the PostgreSQL data directory
mount /dev/mapper/pg-error /var/lib/postgresql/14/main

# The database will start failing when it touches the error region
systemctl start postgresql
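While the test runs, watch both the kernel log and PostgreSQL's own log for the resulting errors; the journald unit name varies by distro, so postgresql below is an assumption:

# Kernel-side view of the failing device
dmesg -w | grep -i 'i/o error' &
# PostgreSQL's view (unit name is distro-dependent)
journalctl -u postgresql -f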