Top 10 Sysadmin Horror Stories: From rm -rf Disasters to Production DB Deletions


One sysadmin shared how a simple cleanup script turned catastrophic when it contained:

# Oops! A stray space splits the path in two
rm -rf / var/log/apache2/*

The extra space between / and var made the command target the root directory. They caught it after 30 seconds, but critical system files were already gone.
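
To see why that stray space is so destructive, look at how the shell tokenizes the command. A minimal Python illustration using shlex (the paths are simply the ones from the story):

import shlex

# The stray space splits the intended path into two separate arguments:
print(shlex.split("rm -rf / var/log/apache2/*"))
# ['rm', '-rf', '/', 'var/log/apache2/*']  -> "/" is now its own deletion target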

A DBA meant to drop a test database but accidentally targeted production:

-- Meant to run on test-server-01
DROP DATABASE customer_data;
-- Executed on prod-db-01 instead

4 hours of transaction data lost before restoring from backups.
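
A common safeguard is to refuse destructive statements unless the script can prove it is talking to the intended host. A minimal sketch, assuming a hypothetical allowlist and run_sql helper:

import socket

ALLOWED_DROP_HOSTS = {"test-server-01"}  # hypothetical allowlist of hosts where DROP is OK

def guarded_drop(database: str) -> None:
    host = socket.gethostname()
    if host not in ALLOWED_DROP_HOSTS:
        raise RuntimeError(f"Refusing to DROP {database} on {host}")
    run_sql(f"DROP DATABASE {database};")  # hypothetical SQL helper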

A storage admin was expanding a LUN but formatted the wrong device:

# Meant to format /dev/sdh1
mkfs.ext4 /dev/sdi1
# Oops - that was the backup server's disk

Bonus points: The backup server was handling that night's backups.

A Python script for cleaning old VMs had a logic error:

# Bug in age calculation: the negative day count pushes the cutoff 30 days into the future
if vm.created_at < datetime.now() - timedelta(days=-30):
    vm.delete()  # Every VM predates that future cutoff, so it ran on all of them

An engineer testing an HA configuration took down both nodes:

# Testing failover
systemctl stop haproxy@node1
systemctl stop haproxy@node2  # Oops, forgot to start node1 first

All external services were down for 18 minutes.
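
A guard that refuses to stop a node unless its peer is confirmed healthy would have caught this. A rough sketch, assuming a hypothetical /healthz endpoint on each node:

import subprocess
import urllib.request

def peer_is_healthy(peer: str) -> bool:
    # Hypothetical health check; adjust to whatever your load balancer actually exposes
    try:
        with urllib.request.urlopen(f"http://{peer}:8080/healthz", timeout=3) as resp:
            return resp.status == 200
    except OSError:
        return False

def safe_stop(unit: str, peer: str) -> None:
    if not peer_is_healthy(peer):
        raise RuntimeError(f"Peer {peer} is not healthy; refusing to stop {unit}")
    subprocess.run(["systemctl", "stop", unit], check=True)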

  • Always double-check paths and device names
  • Implement --dry-run flags for destructive operations (see the sketch after this list)
  • Use confirmation prompts for production systems
  • Test automation scripts with non-destructive flags first
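
Here is a minimal sketch of what a --dry-run flag can look like for the VM-cleanup case above; the vsphere client and the 30-day cutoff are placeholders, not a real API:

import argparse
from datetime import datetime, timedelta

parser = argparse.ArgumentParser(description="Delete VMs older than 30 days")
parser.add_argument("--dry-run", action="store_true",
                    help="Only print what would be deleted")
args = parser.parse_args()

cutoff = datetime.now() - timedelta(days=30)   # note: positive day count this time
for vm in vsphere.get_vms():                   # hypothetical vSphere client
    if vm.created_at < cutoff:
        if args.dry_run:
            print(f"[dry-run] would delete {vm.name}")
        else:
            vm.delete()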


During a routine Exchange Server migration, I accidentally ran:

Get-Mailbox -Database "Old_DB" | Remove-Mailbox -Confirm:$false

...forgetting the -WhatIf flag. 347 executives lost all emails since 2018. The restore from backup took 19 hours.

A junior admin meant to format a test SAN volume but targeted /dev/sda instead of /dev/sdb:

# DO NOT TRY THIS:
mkfs.ext4 /dev/sda

Pro tip: Always triple-check device IDs with lsblk -f before formatting.
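
One way to enforce that habit is a small wrapper that prints the lsblk -f output for the target and makes you retype the device path before mkfs runs. A sketch, with the device path and filesystem as placeholders:

import subprocess

def confirm_and_format(device: str) -> None:
    # Show exactly what lives on the target before touching it
    subprocess.run(["lsblk", "-f", device], check=True)
    typed = input(f"Type the device path to confirm formatting {device}: ")
    if typed.strip() != device:
        raise SystemExit("Device path mismatch; aborting.")
    subprocess.run(["mkfs.ext4", device], check=True)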

While cleaning up old records, an admin made this BIND config "optimization":

zone "prod.example.com" {
    type master;
    file "/etc/bind/db.empty"; // Oops
};

...took down 200 microservices. Moral: Never push live DNS changes without running named-checkconf (and named-checkzone) first.
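
A pre-flight script that validates both the config and the zone data before reloading would catch a mistake like this. A rough sketch, assuming rndc manages the server; the zone file path is a placeholder:

import subprocess

def preflight_and_reload(zone: str, zone_file: str) -> None:
    # Validate named.conf syntax, then the zone data itself, before touching the live server
    subprocess.run(["named-checkconf"], check=True)
    subprocess.run(["named-checkzone", zone, zone_file], check=True)
    subprocess.run(["rndc", "reload", zone], check=True)

preflight_and_reload("prod.example.com", "/etc/bind/db.prod.example.com")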

A tired DBA executed this against production instead of staging:

mysql> DROP DATABASE transactions_primary;
Query OK, 83,491,227 rows affected

Point-in-time recovery saved us, but not before 14 minutes of payment processing failures.
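
Point-in-time recovery for MySQL usually means restoring the last full dump and then replaying binary logs up to just before the bad statement. A condensed sketch; the dump name, binlog path, and timestamp are all placeholders:

import subprocess

# 1. Restore the most recent full dump (placeholder file name)
subprocess.run("mysql < /backups/full_dump_0300.sql", shell=True, check=True)

# 2. Replay binlogs up to the moment before the DROP (placeholder timestamp and path)
subprocess.run(
    "mysqlbinlog --stop-datetime='2024-01-15 14:02:00' "
    "/var/lib/mysql/binlog.000412 | mysql",
    shell=True, check=True,
)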

A Python cleanup script with faulty logic:

for vm in vsphere.get_vms():
    if vm.name.startswith('temp_'):
        pass  # the delete was meant to live here...
    vm.delete()  # ...but sat one indent level out, so it ran at 3 AM against ALL VMs

Lesson: Always test destructive automation with --dry-run first.

  • Backup verification: Regularly test-restore archives (tar -xvzf backup.tgz) on an isolated system (see the sketch below)
  • Change windows: Never run risky ops during business hours
  • Terminal discipline: Type dangerous commands behind a leading # SAFETY CHECK: comment, and only remove the # after a final review
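
A scheduled check along these lines keeps backup verification honest. A minimal sketch using Python's tarfile module; the archive path is a placeholder:

import tarfile
import tempfile

def verify_backup(archive: str = "/backups/backup.tgz") -> None:
    # Extract into a throwaway directory on an isolated box and count the members
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            members = tar.getmembers()
            tar.extractall(path=scratch)
        print(f"Verified {len(members)} entries extracted cleanly from {archive}")

verify_backup()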