When acrid smoke permeates a server room, the immediate challenge isn't just putting out fires; it's identifying the source before critical systems fail. During my last incident, we spent precious hours ruling out possibilities before discovering a failing UPS battery module. The experience revealed critical gaps in our diagnostic toolkit.
Traditional sniff tests fail when odors disperse uniformly. Here's a Python script that correlates temperature anomalies with rack locations:
from thermal_cam import FlirLepton  # assumed local wrapper module for the camera

def detect_thermal_anomalies(threshold=80):
    """Return racks whose hottest reading exceeds threshold, hottest first."""
    lepton = FlirLepton()
    rack_map = {
        'A1': (0, 0, 600, 800),  # Coordinates for rack A1 in camera view
        'B2': (600, 0, 1200, 800)
    }
    hotspots = []
    for rack, coords in rack_map.items():
        temp_profile = lepton.get_region_temp(*coords)
        max_temp = max(temp_profile)  # compute once, reuse below
        if max_temp > threshold:
            hotspots.append({
                'rack': rack,
                'max_temp': max_temp,
                'position': temp_profile.index(max_temp)
            })
    return sorted(hotspots, key=lambda x: x['max_temp'], reverse=True)
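The camera wrapper hides how a rack's bounding box becomes a temperature reading. Here is a minimal sketch of that step, assuming the camera delivers a frame as a list of rows of Celsius values and that boxes are `(x0, y0, x1, y1)` pixel coordinates. Both are assumptions, and `region_max` and `find_hotspots` are hypothetical helpers, not part of any camera library:

```python
def region_max(frame, box):
    """Return (max_temp, (x, y)) for the pixels inside box.

    frame: list of rows, each a list of temperatures (assumed layout).
    box:   (x0, y0, x1, y1), with x1 and y1 exclusive, clipped to the frame.
    """
    x0, y0, x1, y1 = box
    best_temp, best_pos = None, None
    for y in range(max(y0, 0), min(y1, len(frame))):
        row = frame[y]
        for x in range(max(x0, 0), min(x1, len(row))):
            if best_temp is None or row[x] > best_temp:
                best_temp, best_pos = row[x], (x, y)
    return best_temp, best_pos


def find_hotspots(frame, rack_map, threshold=80.0):
    """List racks whose hottest pixel exceeds threshold, hottest first."""
    hotspots = []
    for rack, box in rack_map.items():
        temp, pos = region_max(frame, box)
        if temp is not None and temp > threshold:
            hotspots.append({"rack": rack, "max_temp": temp, "position": pos})
    return sorted(hotspots, key=lambda h: h["max_temp"], reverse=True)


# Tiny 4x2 frame: the left half is rack A1, the right half is rack B2.
frame = [
    [30.0, 31.0, 45.0, 92.5],
    [29.5, 84.0, 44.0, 46.0],
]
racks = {"A1": (0, 0, 2, 2), "B2": (2, 0, 4, 2)}
for spot in find_hotspots(frame, racks):
    print(spot)
```

Returning the pixel position alongside the temperature makes it easier to tell which shelf of a rack to inspect first.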
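Most modern UPS units expose SNMP interfaces. This Bash snippet polls critical metrics every 30 seconds, using OIDs from APC's PowerNet MIB:
#!/bin/bash
# -Oqve prints only the value, with enums as integers
STATUS_OID=.1.3.6.1.4.1.318.1.1.1.2.1.1.0  # upsBasicBatteryStatus (2 = batteryNormal)
TEMP_OID=.1.3.6.1.4.1.318.1.1.1.2.2.2.0    # upsAdvBatteryTemperature (°C)
while true; do
    battery_status=$(snmpget -v 2c -c public -Oqve ups_host "$STATUS_OID")
    temp_reading=$(snmpget -v 2c -c public -Oqve ups_host "$TEMP_OID")
    # Alert if the status is anything but normal or the battery runs hot
    if [[ "$battery_status" != "2" ]] || [[ "${temp_reading:-0}" -gt 50 ]]; then
        notify-send -u critical "UPS ALERT: status $battery_status at ${temp_reading}C"
        play alert.wav
    fi
    sleep 30
done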
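A single bad poll can be a transient SNMP timeout rather than a failing battery. One way to reduce false alarms is to require several consecutive abnormal polls before alerting. The sketch below assumes the readings have already been fetched and parsed to integers. `UpsWatch`, its thresholds, and the convention that status 2 means normal are illustrative assumptions:

```python
class UpsWatch:
    """Alert only after `required` consecutive abnormal polls."""

    def __init__(self, max_temp_c=50, normal_status=2, required=3):
        self.max_temp_c = max_temp_c
        self.normal_status = normal_status
        self.required = required
        self.bad_streak = 0

    def update(self, status, temp_c):
        """Record one poll; return True when the alert should fire."""
        abnormal = status != self.normal_status or temp_c > self.max_temp_c
        # Any normal poll resets the streak
        self.bad_streak = self.bad_streak + 1 if abnormal else 0
        return self.bad_streak >= self.required


watch = UpsWatch()
polls = [(2, 41), (2, 55), (2, 56), (2, 58), (2, 40)]
print([watch.update(status, temp) for status, temp in polls])
```

With a 30-second poll interval, `required=3` delays the alert by about a minute. That is a trade-off to tune, since a smoldering battery may not leave much time.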
Each device type creates distinct power draw patterns when failing. This Arduino sketch monitors PDU branches:
#include "EmonLib.h"

EnergyMonitor emon;

void trigger_alarm(int raw_temp);  // Forward declaration

void setup() {
    Serial.begin(9600);
    emon.current(0, 111.1);  // Input pin, calibration factor
}

void loop() {
    double Irms = emon.calcIrms(1480);  // RMS current over 1480 samples
    if (Irms > 15.0 || Irms < 2.0) {  // Thresholds for 20A circuit
        trigger_alarm(analogRead(A1));  // Raw ADC value from additional temp sensor
    }
    delay(5000);
}

void trigger_alarm(int raw_temp) {
    // Implement notification logic
}
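Fixed thresholds catch gross overloads but miss a branch that drifts away from its own normal draw. A possible complement, sketched here in Python on the receiving side, is to compare each reading against a rolling median. `BranchMonitor`, the window size, and the 25% deviation limit are assumptions for illustration, not values from the sketch above:

```python
from collections import deque
from statistics import median


class BranchMonitor:
    """Flag current readings that deviate sharply from a rolling baseline."""

    def __init__(self, window=12, max_deviation=0.25):
        self.history = deque(maxlen=window)
        self.max_deviation = max_deviation

    def check(self, irms):
        """Return True if irms differs from the rolling median by more than
        max_deviation (as a fraction). The first reading only seeds the baseline."""
        anomalous = False
        if self.history:
            baseline = median(self.history)
            if baseline > 0:
                anomalous = abs(irms - baseline) / baseline > self.max_deviation
        if not anomalous:
            self.history.append(irms)  # keep outliers out of the baseline
        return anomalous


monitor = BranchMonitor()
readings = [8.1, 8.0, 8.2, 8.1, 11.9, 8.0, 4.2]
print([monitor.check(r) for r in readings])
```

Because outliers never enter the baseline, a legitimate permanent load change (a new server on the branch, say) keeps alarming until the monitor is reset. That is deliberate in this sketch but worth revisiting for production use.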
When seconds count, follow this mental checklist:
- Check the UPS LCD and network status first (the source in roughly 80% of cases)
- Isolate PDU branches while monitoring load changes
- Listen for capacitor whine (high-pitched squealing)
- Look for discolored rack rails near hot components
- Verify redundant power supplies aren't fighting each other over load sharing
Combine multiple data streams into a real-time dashboard using this Prometheus config snippet:
# Routes each target through the blackbox exporter: this reports whether the
# endpoint is reachable (probe_success), not the endpoint's own metrics
- job_name: 'rack_health'
  metrics_path: '/probe'
  static_configs:
    - targets:
        - 'ups1:9100'  # SNMP exporter
        - 'pdu1:9091'  # Custom exporter
        - 'thermal-cam:8080'
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115
Create alert rules that trigger when multiple sensors agree:
groups:
  - name: burning.alerts
    rules:
      - alert: ElectricalFireRisk
        # "and on ()" ignores labels, so series from different exporters can combine
        expr: |
          (ups_battery_temp > 45 or pdu_outlet_temp > 60)
          and on () (rate(pdu_current_fluctuations[5m]) > 10)
          and on () (thermal_camera_hotspots > 2)
        for: 2m
        labels:
          severity: 'critical'
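Before deploying a rule like this, it can help to replay recorded samples through the same logic and see when it would have fired. The sketch below mirrors the rule's condition in plain Python and approximates the `for: 2m` hold. The dictionary keys and both function names are assumptions made for this example, and real Prometheus evaluation timing will differ in detail:

```python
def fire_risk(sample):
    """Mirror of the rule's boolean condition for one sample (a dict)."""
    hot = sample["ups_battery_temp"] > 45 or sample["pdu_outlet_temp"] > 60
    unstable = sample["current_fluctuation_rate"] > 10
    return hot and unstable and sample["thermal_camera_hotspots"] > 2


def firing_times(samples, hold_seconds=120):
    """Return timestamps at which the alert would be firing.

    samples: list of (timestamp_seconds, sample_dict), in time order.
    The condition must hold continuously for hold_seconds before firing.
    """
    firing, pending_since = [], None
    for ts, sample in samples:
        if fire_risk(sample):
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= hold_seconds:
                firing.append(ts)
        else:
            pending_since = None  # any clear sample resets the hold
    return firing


calm = {"ups_battery_temp": 38, "pdu_outlet_temp": 40,
        "current_fluctuation_rate": 2, "thermal_camera_hotspots": 0}
risky = {"ups_battery_temp": 52, "pdu_outlet_temp": 40,
         "current_fluctuation_rate": 14, "thermal_camera_hotspots": 3}
timeline = [(0, calm), (60, risky), (120, risky), (180, risky), (240, calm)]
print(firing_times(timeline))
```

Replaying a past incident this way shows how long the hold period would have delayed the page, which is useful when choosing between 2 minutes and something shorter.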
When a burning smell permeates a server room, time becomes critical. Unlike typical hardware failures that trigger alerts, smoldering components often don't show immediate symptoms in monitoring systems. During a recent incident, we spent hours tracing a burning smell that ultimately came from a failing UPS battery module, dangerously close to our production database server.
Here's a systematic approach to identify the source:
// Pseudocode for emergency diagnostic procedure (helper functions are placeholders)
function locateBurningComponent(equipment, threshold) {
    // Step 1: Isolate power zones
    const powerZones = identifyPDUCoverage();
    // Step 2: Thermal scanning
    const thermalData = scanWithIRCamera();
    const hotSpots = thermalData.filter(temp => temp > threshold);
    // Step 3: Airflow analysis
    const airflowPatterns = mapServerRoomVentilation();
    const smellConcentration = measureParticulateDensity();
    // Step 4: Cross-reference with monitoring
    const healthStatus = pollSNMP(equipment);
    return triangulateSource(hotSpots, smellConcentration, healthStatus);
}
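The final triangulation step is left abstract above. One simple way to approach it is a weighted score per rack, sketched here in Python. The input shapes, the weights, and the function name `triangulate_source` are assumptions for illustration; a real deployment would need weights tuned against past incidents:

```python
def triangulate_source(hot_spots, smell_levels, health, weights=(0.5, 0.3, 0.2)):
    """Rank racks by a weighted score of three signals.

    hot_spots:    {rack: max temperature}
    smell_levels: {rack: particulate or VOC reading, any consistent unit}
    health:       {rack: True if SNMP reports the equipment healthy}
    Each numeric signal is scaled to 0..1 by its largest value so units
    do not need to match. Racks missing from a signal score 0 for it.
    """
    def scaled(values):
        peak = max(values.values(), default=0)
        return {k: (v / peak if peak > 0 else 0.0) for k, v in values.items()}

    temps, smells = scaled(hot_spots), scaled(smell_levels)
    w_temp, w_smell, w_health = weights
    racks = set(hot_spots) | set(smell_levels) | set(health)
    scores = {
        rack: w_temp * temps.get(rack, 0.0)
        + w_smell * smells.get(rack, 0.0)
        + w_health * (0.0 if health.get(rack, True) else 1.0)
        for rack in racks
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = triangulate_source(
    hot_spots={"A1": 41.0, "B2": 82.0},
    smell_levels={"A1": 120, "B2": 300, "C3": 150},
    health={"A1": True, "B2": False, "C3": True},
)
for rack, score in ranking:
    print(rack, round(score, 2))
```

The ranking gives responders an order in which to check racks rather than a single verdict, which suits the uncertainty of each individual signal.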
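The thermal and air-quality checks can be implemented with a Python monitoring script:
import smbus  # I2C access
from environmental_sensors import AirQualitySensor, ThermalArray  # assumed local driver module

class BurnDetector:
    VOC_LIMIT = 500   # alert above this VOC reading
    TEMP_LIMIT = 85   # alert above this temperature in any thermal cell

    def __init__(self):
        self.bus = smbus.SMBus(1)  # I2C bus 1
        self.air_sensor = AirQualitySensor(0x48)
        self.thermal_cam = ThermalArray(0x60)

    def check_for_burn(self):
        """Trigger an alert and return True if either sensor crosses its limit."""
        voc_level = self.air_sensor.read_voc()
        temp_grid = self.thermal_cam.scan()
        if voc_level > self.VOC_LIMIT or any(t > self.TEMP_LIMIT for t in temp_grid):
            self.trigger_alert()
            return True
        return False

    def trigger_alert(self):
        # Integrate with existing monitoring
        pass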
For critical server rooms, consider this equipment matrix:
Device | Placement | Detection Capability |
---|---|---|
IR Thermal Camera | Ceiling-mounted | 2°C resolution |
VOC Sensor | Intake vents | 1ppm sensitivity |
Acoustic Monitor | Near UPS | Ultrasonic detection |
Major cloud providers implement layered detection:
- Phase 1: Particulate sensors trigger at 0.3 microns
- Phase 2: Machine learning analyzes thermal patterns
- Phase 3: Automated circuit isolation for suspect racks
For smaller setups, a Raspberry Pi with a BME680 sensor can provide basic VOC monitoring at minimal cost.
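The BME680 reports gas resistance in ohms rather than a ppm figure, and that resistance falls as VOC levels rise. A common pattern is to record a clean-air baseline and alarm on a large drop from it. The sketch below shows only that logic and assumes the readings have already been taken with whatever driver you use. The function names and the 50% drop limit are illustrative assumptions:

```python
from statistics import mean


def clean_air_baseline(burn_in_readings, tail=50):
    """Average the last `tail` burn-in readings (ohms) as the clean-air baseline."""
    if not burn_in_readings:
        raise ValueError("need at least one burn-in reading")
    return mean(burn_in_readings[-tail:])


def voc_alarm(gas_resistance, baseline, max_drop=0.5):
    """Return True when resistance has fallen more than max_drop below baseline."""
    drop = (baseline - gas_resistance) / baseline
    return drop > max_drop


# Made-up readings in ohms, for illustration only
baseline = clean_air_baseline([148_000, 151_000, 150_000, 151_000])
print(voc_alarm(140_000, baseline), voc_alarm(60_000, baseline))
```

The sensor needs a warm-up period before its readings settle, so the baseline should come from readings taken after that, in air you know to be clean.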