One sysadmin shared how a simple cleanup script turned catastrophic when it contained:
# Oops! Missing the directory path
rm -rf / var/log/apache2/*
The extra space between `/` and `var` made the command target the root directory. They caught it after 30 seconds, but critical system files were already gone.
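A guard like the following would have refused the accidental root target. This is a minimal Python sketch, assuming the cleanup only ever needs to touch /var/log/apache2; the allow-list and script wiring are hypothetical:

```python
import os
import shutil
import sys

# Hypothetical allow-list: the only tree this cleanup script may touch.
ALLOWED_ROOT = "/var/log/apache2"

def safe_remove(path: str) -> None:
    """Delete a directory tree, but only if it sits strictly inside ALLOWED_ROOT."""
    resolved = os.path.realpath(path)
    inside = os.path.commonpath([resolved, ALLOWED_ROOT]) == ALLOWED_ROOT
    if not inside or resolved == ALLOWED_ROOT:
        sys.exit(f"Refusing to delete {resolved!r}: outside {ALLOWED_ROOT}")
    shutil.rmtree(resolved)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: cleanup.py <path>")
    safe_remove(sys.argv[1])
```

With this check, the stray-space version (effectively asking to delete `/`) dies with an error instead of eating the filesystem.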
A DBA meant to drop a test database but accidentally targeted production:
-- Meant to run on test-server-01
DROP DATABASE customer_data;
-- Executed on prod-db-01 instead
4 hours of transaction data lost before restoring from backups.
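One hedge against the wrong-host mistake is to issue destructive SQL only through a small wrapper that checks the target first. A minimal sketch, assuming the statement is run from a script rather than an interactive prompt; the host names mirror the story and the allow-list is hypothetical:

```python
import sys

# Hypothetical allow-list of hosts where destructive statements are permitted.
EXPECTED_TEST_HOSTS = {"test-server-01"}

def guard_destructive(target_host: str, statement: str) -> None:
    """Abort unless a destructive statement is aimed at a known test host."""
    if target_host not in EXPECTED_TEST_HOSTS:
        sys.exit(f"Refusing to run {statement!r} against {target_host}: not a test host")
    print(f"OK to run on {target_host}: {statement}")
    # ...hand the statement to the actual database client here

if __name__ == "__main__":
    guard_destructive("prod-db-01", "DROP DATABASE customer_data;")  # would abort
```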
A storage admin was expanding a LUN but formatted the wrong device:
# Meant to format /dev/sdh1
mkfs.ext4 /dev/sdi1
# Oops - that was the backup server's disk
Bonus points: The backup server was handling that night's backups.
A Python script for cleaning old VMs had a logic error:
# Bug in age calculation: days=-30 puts the cutoff 30 days in the FUTURE
if vm.created_at < datetime.now() - timedelta(days=-30):
    vm.delete()  # Ran on all VMs due to the negative time delta
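For reference, the intended check keeps the cutoff in the past so that only genuinely old VMs match. A minimal sketch, assuming the vm objects expose created_at as naive datetimes like the original script:

```python
from datetime import datetime, timedelta

def select_expired(vms, max_age_days: int = 30):
    """Return only the VMs created before the cutoff; the cutoff lies in the past."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [vm for vm in vms if vm.created_at < cutoff]
```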
An engineer testing HA configuration took down both nodes:
# Testing failover
systemctl stop haproxy@node1
systemctl stop haproxy@node2 # Oops, forgot to start node1 first
All external services were down for 18 minutes.
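A pre-flight check that refuses to stop one node while its peer is down is cheap insurance. A minimal Python sketch, assuming both units are visible to the same systemd instance (in a real HA pair they may live on separate machines, in which case the check would go over SSH or an API):

```python
import subprocess
import sys

def is_active(unit: str) -> bool:
    """True if the systemd unit reports 'active' (exit code 0 from systemctl is-active)."""
    return subprocess.run(["systemctl", "is-active", "--quiet", unit]).returncode == 0

def stop_if_peer_healthy(unit: str, peer: str) -> None:
    """Refuse to stop a unit while its HA peer is not serving traffic."""
    if not is_active(peer):
        sys.exit(f"Refusing to stop {unit}: peer {peer} is not active")
    subprocess.run(["systemctl", "stop", unit], check=True)

if __name__ == "__main__":
    stop_if_peer_healthy("haproxy@node2", peer="haproxy@node1")
```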
- Always double-check paths and device names
- Implement --dry-run flags for destructive operations
- Use confirmation prompts for production systems (see the sketch after this list)
- Test automation scripts with non-destructive flags first
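The --dry-run and confirmation-prompt points fit in a few lines. A minimal sketch of a hypothetical cleanup tool; the flag names and the delete target are illustrative, not taken from any of the scripts above:

```python
import argparse
import shutil
import sys

def main() -> None:
    parser = argparse.ArgumentParser(description="destructive cleanup with safety rails")
    parser.add_argument("path", help="directory tree to delete")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would be deleted without touching anything")
    parser.add_argument("--production", action="store_true",
                        help="target is a production system; require typed confirmation")
    args = parser.parse_args()

    if args.dry_run:
        print(f"[dry-run] would delete {args.path}")
        return

    if args.production:
        answer = input(f"Type the exact path to confirm deletion of {args.path}: ")
        if answer != args.path:
            sys.exit("Confirmation did not match; aborting.")

    shutil.rmtree(args.path)

if __name__ == "__main__":
    main()
```

Making the dry run the default behaviour, and destruction the thing you have to ask for explicitly, is the cheapest safety rail of all.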
During a routine Exchange Server migration, I accidentally ran:
Get-Mailbox -Database "Old_DB" | Remove-Mailbox -Confirm:$false
...forgetting the -WhatIf flag. 347 executives lost all emails since 2018. The restore from backup took 19 hours.
A junior admin meant to format a test SAN volume but targeted /dev/sda instead of /dev/sdb:
# DO NOT TRY THIS:
mkfs.ext4 /dev/sda
Pro tip: Always triple-check device IDs with lsblk -f before formatting.
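That triple-check is easy to wrap in a script so it cannot be skipped. A minimal Python sketch; the script name is hypothetical and the mkfs call only marks where the real format step would go:

```python
import subprocess
import sys

def confirm_and_format(device: str) -> None:
    """Show the device's identity via lsblk, then require the name to be retyped."""
    # -d: the device itself, no partitions; -o: show identifying columns
    subprocess.run(["lsblk", "-d", "-o", "NAME,SIZE,MODEL,SERIAL,MOUNTPOINT", device],
                   check=True)
    answer = input(f"Re-type the device to format ({device}): ")
    if answer != device:
        sys.exit("Device name mismatch; aborting.")
    subprocess.run(["mkfs.ext4", device], check=True)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: confirm_and_format.py /dev/sdX")
    confirm_and_format(sys.argv[1])
```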
During a cleanup of old records, this BIND config "optimization":
zone "prod.example.com" {
type master;
file "/etc/bind/db.empty"; // Oops
};
...took down 200 microservices. Moral: Never edit live DNS without named-checkconf.
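One way to honour that moral is a pre-flight wrapper that validates both the config and the zone file before any reload. A minimal Python sketch; the zone file path is hypothetical, and the rndc reload step stands in for however the change actually gets rolled out:

```python
import subprocess
import sys

def check_and_reload(zone: str, zone_file: str) -> None:
    """Reload a zone only after named-checkconf and named-checkzone both pass."""
    checks = [
        ["named-checkconf"],                   # validate the main BIND configuration
        ["named-checkzone", zone, zone_file],  # confirm the zone file actually loads
    ]
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"{' '.join(cmd)} failed; not reloading.")
    subprocess.run(["rndc", "reload", zone], check=True)

if __name__ == "__main__":
    # Hypothetical zone file path; the zone name mirrors the snippet above.
    check_and_reload("prod.example.com", "/etc/bind/db.prod.example.com")
```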
A tired DBA executed this against production instead of staging:
mysql> DROP DATABASE transactions_primary;
Query OK, 83,491,227 rows affected
Point-in-time recovery saved us, but not before 14 minutes of payment processing failures.
A Python cleanup script with faulty logic:
for vm in vsphere.get_vms():
    if vm.name.startswith('temp_'):
        vm.delete()  # Ran at 3AM against ALL VMs
Lesson: Always test destructive automation with --dry-run first.
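Applied to the loop above, that fix can be as small as making the safe path the default. A sketch, assuming the same vsphere client and get_vms() call as the original script:

```python
def cleanup_temp_vms(vsphere, dry_run: bool = True) -> None:
    """Delete VMs whose names start with 'temp_'; by default only report what would go."""
    for vm in vsphere.get_vms():
        if vm.name.startswith("temp_"):
            if dry_run:
                print(f"[dry-run] would delete {vm.name}")
            else:
                vm.delete()

# cleanup_temp_vms(client) is harmless by default; destruction requires an
# explicit cleanup_temp_vms(client, dry_run=False).
```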
- Backup verification: Regularly test tar -xvzf backup.tgz on isolated systems (see the sketch after this list)
- Change windows: Never run risky ops during business hours
- Terminal discipline: Prefix dangerous commands with # SAFETY CHECK: comments
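The backup-verification habit is easy to schedule. A minimal sketch using Python's tarfile module to walk backup.tgz, a lighter-weight version of the tar -xvzf test in the list; the archive path is whatever your backup job actually produces:

```python
import sys
import tarfile

def verify_backup(archive_path: str) -> None:
    """Fail loudly if the gzipped tar archive cannot be opened and walked end to end."""
    try:
        with tarfile.open(archive_path, mode="r:gz") as tar:
            members = tar.getmembers()  # forces a read of every member header
    except (tarfile.TarError, OSError) as exc:
        sys.exit(f"Backup verification FAILED for {archive_path}: {exc}")
    print(f"{archive_path}: {len(members)} members readable")

if __name__ == "__main__":
    verify_backup(sys.argv[1] if len(sys.argv) > 1 else "backup.tgz")
```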