When automating server maintenance through Ansible, a simple variable interpolation error transformed rm -rf {foo}/{bar} into the nuclear option rm -rf /. The script executed with root privileges across all production servers, including mounted backup storage. Within seconds, the entire infrastructure was wiped clean.
The --no-preserve-root flag bypassed rm's default refusal to operate on the root directory. Note that this protection only covers the literal path /; a glob such as /* expands to root's contents and sidesteps it entirely. Consider these dangerous variations that could trigger similar disasters:
# Dangerous patterns to avoid
rm -rf "$undefined_var/"*                  # empty variable turns this into rm -rf /*
rm -rf "${MISSING_VAR:-/fallback}"/subdir  # a silent fallback hides the missing variable and may target the wrong path
find / -name "*.log" -delete               # sweeps the entire filesystem without proper path restrictions
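Many of these mistakes can be stopped before rm ever runs by enabling bash strict mode, which makes an unset variable a fatal error instead of an empty string. A minimal sketch; BACKUP_DIR and the tmp/ subpath are placeholders:
#!/usr/bin/env bash
# Strict mode: abort on errors, unset variables, and failed pipeline stages
set -euo pipefail
# ${VAR:?} fails loudly instead of silently expanding to an empty string
backup_dir="${BACKUP_DIR:?BACKUP_DIR must be set}"
# Second guard directly at the dangerous call, plus -- to end option parsing
rm -rf --one-file-system -- "${backup_dir:?}/tmp/"*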
When you realize what happened:
- Power off affected machines immediately to prevent further filesystem writes (see the SysRq sketch below)
- Detach storage devices but don't remount or run fsck
- Document exact commands and timeline for recovery specialists
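If there is no way to cut power physically, one low-level way to carry out the first step is the kernel's magic SysRq interface; a sketch, assuming SysRq is enabled on the affected host:
# Emergency stop via magic SysRq (no clean shutdown, no further writes)
echo 1 > /proc/sys/kernel/sysrq   # allow all SysRq functions for this boot
echo u > /proc/sysrq-trigger      # 'u': emergency remount of all filesystems read-only
echo o > /proc/sysrq-trigger      # 'o': immediate power off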
Professional data recovery services can sometimes recover deleted files and reconstruct filesystems using tools like:
# For ext4 filesystems (run from a recovery environment, against the unmounted device)
debugfs /dev/sda1              # open the filesystem (read-only by default)
lsdel                          # list deleted inodes
stat <inode>                   # inspect a candidate inode (size, deletion time)
dump <inode> /recovery/file    # copy its contents to a file outside the damaged filesystem
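If debugfs turns up little (ext4 often zeroes block pointers on deletion), a journal-based tool such as extundelete may recover more. A sketch, assuming the damaged partition is /dev/sda1, unmounted; the directory path is a placeholder:
# Journal-based recovery attempt; results land in ./RECOVERED_FILES
extundelete /dev/sda1 --restore-all
extundelete /dev/sda1 --restore-directory var/www   # or limit recovery to one directory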
Implement these safeguards in your automation code:
# Safe deletion template
target_dir="${VALIDATED_PATH:?Path validation failed}"   # ${VAR:?} aborts if unset or empty
[[ "$target_dir" != "/" ]] || { logger "ABORT: Root deletion attempt"; exit 1; }
rm -rf --one-file-system -- "${target_dir%/}/"*          # delete the directory's contents only; never cross mount points
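Before trusting the template in automation, a manual pre-flight that expands the same glob without deleting anything is cheap insurance; a sketch reusing the target_dir variable from above:
# Pre-flight: print exactly what the glob would remove, then require confirmation
printf 'Would remove: %s\n' "${target_dir%/}/"*
read -r -p "Proceed with deletion? [y/N] " answer
[[ "$answer" == "y" ]] || { echo "Aborted."; exit 1; }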
The only systems that escaped destruction followed the 3-2-1 rule (three copies, on two different media, one off-site):
- Immutable S3 buckets with object locking (see the sketch after this list)
- Air-gapped tape backups
- Git-annex repositories with distributed verification
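For the first item, S3 Object Lock has to be enabled when the bucket is created and can then enforce a default retention that, in compliance mode, no credentials can shorten. A sketch with the AWS CLI; the bucket name and the 30-day window are placeholders:
# Sketch: bucket with Object Lock enabled and a 30-day compliance-mode default retention
aws s3api create-bucket --bucket example-immutable-backups --object-lock-enabled-for-bucket
aws s3api put-object-lock-configuration --bucket example-immutable-backups \
  --object-lock-configuration '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}}'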
To recap the full incident: every sysadmin's nightmare became reality when an Ansible playbook with undefined variables executed rm -rf {foo}/{bar} across 1,535 production servers. The variables expanded to empty strings, effectively turning the command into rm -rf /. To make matters worse, the script had previously mounted the backup storage.
When you realize what's happened, take these steps immediately:
# 1. Isolate affected systems
while read -r host; do
  ssh "$host" "systemctl isolate rescue.target"
done < server_list
# 2. Unmount all filesystems
umount -a
# 3. Stop all non-critical services
systemctl list-units --type=service --no-legend --plain | awk '{print $1}' | \
  grep -Ev 'crond|ssh|rsyslog' | xargs -r systemctl stop
Option 1: Filesystem-Level Recovery
For ext4 filesystems, try the following (ideally against a clone of the disk; see Option 2):
debugfs -w /dev/sda1                      # -w opens the filesystem read-write, required for undeletion
debugfs: lsdel                            # list deleted inodes
debugfs: undel <inode> /path/to/recover   # relink the deleted inode at the given path
Option 2: Block-Level Restoration
# Using ddrescue to clone the damaged drive onto a spare disk of at least the same size
ddrescue -f -n /dev/sda /dev/sdb rescue.log      # first pass: copy the easy areas, skip scraping (-n)
ddrescue -d -f -r3 /dev/sda /dev/sdb rescue.log  # second pass: direct I/O (-d), retry bad areas 3 times (-r3)
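Once the clone finishes, do all further recovery work against the copy and leave the original untouched; a sketch, assuming the clone landed on /dev/sdb with the filesystem on /dev/sdb1:
# Work on the clone, never the original
blockdev --setro /dev/sdb    # mark the cloned disk read-only at the block layer
debugfs /dev/sdb1            # inode-level inspection of the cloned filesystem
testdisk /dev/sdb            # interactive partition and file recovery on the clone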
Implement these safeguards in your Ansible playbooks:
- name: Critical file operations
  block:
    - name: Verify variables before use
      ansible.builtin.assert:
        that:
          - foo is defined and foo | length > 0
          - bar is defined and bar | length > 0
        fail_msg: "Critical variables undefined or empty!"

    - name: Dry run - confirm the target directory exists before deleting
      ansible.builtin.command: "test -d {{ foo }}/{{ bar }}"
      register: dry_run
      changed_when: false
      failed_when: false

    - name: Actual removal (only if the dry run passed)
      ansible.builtin.command: rm -rf --one-file-system -- "{{ foo }}/{{ bar }}"
      when: dry_run.rc == 0
  rescue:
    - name: Abort the play
      ansible.builtin.fail:
        msg: "Aborting: variable validation or removal failed"
Implement the 3-2-1 rule with these improvements:
- Immutable backups using S3 Object Lock or similar
- Air-gapped backups not mounted during operations
- Regular recovery drills using ansible-playbook --check
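A recovery drill can be as simple as running the same play in check mode against a staging inventory with real variable values supplied; a sketch with placeholder inventory, playbook, and variable values:
# Dry run against staging; --diff shows what would change without changing it
ansible-playbook -i inventories/staging cleanup.yml --check --diff \
  -e foo=/var/app -e bar=releases/old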
After such an event:
- Document everything for post-mortem
- Implement peer-review for critical operations
- Create a "break glass" procedure for emergencies