Disaster Recovery After Accidental `rm -rf /` on Production Servers: Technical Deep Dive


When automating server maintenance through Ansible, a simple variable interpolation error transformed rm -rf {foo}/{bar} into the nuclear option rm -rf /. The script executed with root privileges across all production servers, including mounted backup storage. Within seconds, the entire infrastructure was wiped clean.
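
To see how the expansion goes wrong, reproduce it in plain shell with both variables empty; the interpolated path collapses to a bare slash. (The echo below is deliberate so nothing is actually deleted.)

# Reproduce the expansion failure safely -- echo only prints the command
foo=""; bar=""
echo rm -rf "${foo}/${bar}"    # prints: rm -rf /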

GNU rm refuses to operate recursively on / by default; the --no-preserve-root flag bypassed that failsafe. Consider these dangerous variations that could trigger similar disasters:

# Dangerous patterns to avoid
rm -rf "$undefined_var/"*                  # unset variable expands to "/"*, deleting every top-level directory
rm -rf ${MISSING_VAR:-/fallback}/subdir    # unquoted expansion silently falls back to a hard-coded path nobody validated
find / -name "*.log" -delete               # walks the entire filesystem; no path restriction and no -xdev
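
One cheap layer of defence is making the shell itself refuse to run with unset variables or unmatched globs. A minimal sketch of a defensive preamble (bash-specific; build_dir is just an illustrative variable name):

# Defensive preamble: die on unset variables, failed commands, and unmatched globs
set -euo pipefail         # -u turns any reference to an unset variable into a fatal error
shopt -s failglob         # a glob that matches nothing becomes an error instead of a literal pattern
rm -rf -- "${build_dir:?build_dir must be set}/"*   # :? aborts with a message if unset or empty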

When you realize what happened:

  1. Power off affected machines immediately to prevent further filesystem writes (see the sketch after this list)
  2. Detach storage devices but don't remount or run fsck
  3. Document exact commands and timeline for recovery specialists
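
A hedged sketch of step 1, assuming the servers expose IPMI-style out-of-band management (hostname and credentials below are placeholders). A hard power-off is preferable to a clean shutdown here, because a clean shutdown keeps writing to disk:

# Hard power-off without a clean shutdown (placeholder hostname/credentials)
ipmitool -I lanplus -H ipmi-web01.example.com -U admin -P "$IPMI_PASS" chassis power off
# If you only have a local shell and magic SysRq is enabled, power off immediately:
echo o > /proc/sysrq-trigger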

Professional data recovery services can sometimes recover deleted files and reconstruct partition metadata using tools like:

# For EXT4 filesystems (run from a recovery environment against the affected device)
debugfs /dev/sda1             # opens an interactive debugfs prompt (read-only by default)
lsdel                         # list deleted inodes still present in the inode table
stat <inode>                  # inspect a candidate; debugfs expects the inode number in angle brackets, e.g. <131074>
dump <inode> /recovery/file   # copy that inode's data blocks out to a file on separate storage

Implement these safeguards in your automation code:

# Safe deletion template
target_dir="${VALIDATED_PATH:?Path validation failed}"   # :? aborts here if VALIDATED_PATH is unset or empty
[[ "$target_dir" != "/" ]] || { logger "ABORT: Root deletion attempt"; exit 1; }
# The glob removes the directory's contents, never the directory itself, and
# --one-file-system keeps rm from descending into other mounted filesystems;
# without the check above, a "/" value would still expand this to /*
rm -rf --one-file-system -- "${target_dir%/}/"*
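
An optional extra guard worth bolting onto the template above (a sketch, assuming GNU realpath is available): canonicalize the path and refuse anything suspiciously shallow.

# Extra guard: resolve symlinks and refuse paths shallower than two components
resolved=$(realpath -e -- "$target_dir") || { logger "ABORT: path does not resolve"; exit 1; }
slashes=$(printf '%s' "$resolved" | tr -cd '/')    # keep only the '/' characters
(( ${#slashes} >= 2 )) || { logger "ABORT: refusing shallow path '$resolved'"; exit 1; }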

The only systems that escaped destruction were the ones backed by the 3-2-1 rule (three copies, two different media, one copy off-site):

  • Immutable S3 buckets with object locking
  • Air-gapped tape backups
  • Git-annex repositories with distributed verification

Every sysadmin's nightmare became reality when an Ansible playbook with undefined variables executed rm -rf {foo}/{bar} across 1,535 production servers. Both variables expanded to empty strings, turning the command into rm -rf /. To make matters worse, the script had previously mounted the backup storage, so the backups went down with everything else.
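
Since the mounted backup volume is what turned a bad outage into a catastrophe, one small guard is refusing destructive maintenance while backup storage is attached. A sketch, assuming the backups live under /mnt/backup (adjust for your layout):

# Refuse to run destructive maintenance while the backup volume is mounted
if mountpoint -q /mnt/backup; then
    echo "Backup storage is mounted; unmount it before destructive maintenance" >&2
    exit 1
fi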

When you realize what's happened, take these steps immediately:

# 1. Isolate affected systems (drop them into rescue mode so nothing else writes to disk)
for host in $(cat server_list); do
    ssh "$host" "systemctl isolate rescue.target"
done

# 2. Unmount everything that can be unmounted
umount -a

# 3. Stop all non-critical services
systemctl list-units --type=service --state=running --no-legend --plain | \
    awk '{print $1}' | grep -Ev 'crond|ssh|rsyslog' | xargs -r systemctl stop
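
umount -a will refuse to detach the root filesystem and anything busy. A fallback sketch is flipping what remains to read-only so no further metadata gets overwritten (the last line requires magic SysRq to be enabled):

# Fallback when filesystems cannot be unmounted: stop further writes
mount -o remount,ro /             # remount the root filesystem read-only
echo u > /proc/sysrq-trigger      # emergency remount of all filesystems read-only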

Option 1: Filesystem-Level Recovery

For EXT4 filesystems, try:

debugfs -w /dev/sda1                       # -w opens the filesystem read-write, which undel needs
debugfs: lsdel                             # list deleted inodes
debugfs: undel <inode> /path/to/recover    # relink a deleted inode under a new path on the same filesystem
# Note: ext4 zeroes extent information on delete, so lsdel/undel often come up
# empty; cloning the disk first (Option 2) and working on the copy is safer.

Option 2: Block-Level Restoration

# Using GNU ddrescue to clone the damaged drive onto a spare disk of equal or larger size
ddrescue -f -n /dev/sda /dev/sdb rescue.log      # first pass: -n skips the slow scraping phase and grabs the easy data
ddrescue -d -f -r3 /dev/sda /dev/sdb rescue.log  # second pass: direct I/O, retry bad areas 3 times, resuming from rescue.log
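
From this point on, all recovery work should happen against the clone, never the original. A sketch of mounting the cloned partition read-only for inspection (device and mount-point names are assumptions):

# Inspect the clone, never the original drive
mkdir -p /mnt/clone
mount -o ro,noload /dev/sdb1 /mnt/clone    # 'noload' skips ext4 journal replay, so nothing gets written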

Implement these safeguards in your Ansible playbooks:

- name: Critical file operations
  block:
    - name: Verify variables before use
      assert:
        that:
          - foo is defined and foo | length > 0
          - bar is defined and bar | length > 0
        fail_msg: "Critical variables undefined or empty!"

    - name: Dry run - verify the target exists before deleting anything
      command: ls -ld "{{ foo }}/{{ bar }}"
      register: dry_run
      changed_when: false

    - name: Actual removal (only if the dry run succeeded)
      command: rm -rf --one-file-system "{{ foo }}/{{ bar }}"
      when: dry_run.rc == 0
  rescue:
    - fail:
        msg: "Aborting: variable validation or dry run failed"

Implement the 3-2-1 rule with these improvements:

  • Immutable backups using S3 Object Lock or similar (see the sketch after this list)
  • Air-gapped backups not mounted during operations
  • Regular recovery drills, plus ansible-playbook --check rehearsals before destructive runs
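
A minimal sketch of the first bullet using the AWS CLI (the bucket name, the 30-day retention window, and the default region are assumptions). COMPLIANCE mode means nobody, not even the root account, can shorten the lock:

# Object Lock must be enabled when the bucket is created
aws s3api create-bucket --bucket example-immutable-backups \
    --object-lock-enabled-for-bucket

# Enforce a default 30-day COMPLIANCE retention on every new object
aws s3api put-object-lock-configuration --bucket example-immutable-backups \
    --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":30}}}'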

After such an event:

  1. Document everything for post-mortem
  2. Implement peer-review for critical operations
  3. Create a "break glass" procedure for emergencies