After a decade of successful Munin deployment across 100+ nodes, I've hit the infrastructure monitoring wall that many growing operations face. The classic pull-based architecture shows its limitations when:
- Node count exceeds 100 servers
- Client-side processing creates timeouts during peak loads
- Plugin overhead becomes unmanageable
The ideal replacement should handle:
```python
# Pseudocode for monitoring system requirements
class MonitoringRequirements:
    def __init__(self):
        self.push_based = True
        self.horizontal_scaling = True
        self.metric_cardinality = "high"
        self.storage_backend = "time-series optimized"
        self.query_language = "PromQL or similar"
```
Prometheus + Grafana emerges as the strongest candidate for these technical requirements:
```yaml
# Sample prometheus.yml configuration for high-volume scraping
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    scrape_interval: 30s
    static_configs:
      - targets: ['node-exporter1:9100', 'node-exporter2:9100']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(node_memory_.*|node_cpu_.*)'
        action: keep
```
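Before reloading Prometheus, the config is worth a sanity check; a minimal example, assuming promtool ships alongside your Prometheus binary and the file lives at the usual path:

```bash
# Validate the scrape configuration before (re)loading Prometheus
promtool check config /etc/prometheus/prometheus.yml
```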
Transition phases should include:
- Parallel run period (2-4 weeks)
- Metric mapping between systems
- Gradual plugin replacement
Example metric conversion:
```
# Munin plugin output vs Prometheus exporter output

Munin format:
memory.value 3848932

Prometheus format:
# HELP node_memory_used_bytes Memory used in bytes
# TYPE node_memory_used_bytes gauge
node_memory_used_bytes 3848932
```
| Solution | Architecture | Data Model |
|---|---|---|
| InfluxDB + Telegraf | Push-based | Time-series |
| Graphite | Push-based (metrics sent to carbon) | Time-series (Whisper) |
| VictoriaMetrics | Push/Pull | Prometheus-compatible |
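To illustrate the push path in that last row: recent VictoriaMetrics releases accept metrics over HTTP in Prometheus exposition format, so a minimal push looks roughly like this (single-node VictoriaMetrics assumed on its default port 8428; hostname is a placeholder):

```bash
# Push one sample in Prometheus exposition format to single-node VictoriaMetrics
curl -X POST 'http://victoria-metrics:8428/api/v1/import/prometheus' \
  --data-binary 'node_memory_used_bytes 3848932'
```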
For handling 100+ nodes efficiently:
```javascript
// Node.js collector example with batched pushes (prom-client)
const { collectDefaultMetrics, Registry, Pushgateway } = require('prom-client');

const registry = new Registry();
collectDefaultMetrics({ register: registry, prefix: 'node_' });

// Push the whole registry to a Pushgateway every 30s instead of being scraped
// (gateway URL is a placeholder for your own deployment)
const gateway = new Pushgateway('http://pushgateway:9091', {}, registry);

setInterval(async () => {
  try {
    // prom-client v14+: pushAdd() returns a Promise
    await gateway.pushAdd({ jobName: 'node_collector' });
  } catch (err) {
    console.error('metrics push failed:', err);
  }
}, 30000);
```
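For the push side of that example, a Pushgateway has to be running somewhere the collectors can reach, and Prometheus scrapes the gateway like any other target. A quick way to stand one up for testing, assuming Docker and the official prom/pushgateway image:

```bash
# Run a Pushgateway for the batched collectors to push into
docker run -d --name pushgateway -p 9091:9091 prom/pushgateway

# Pushed metrics show up on the gateway's /metrics endpoint
curl -s http://localhost:9091/metrics | head
```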
After a decade of reliable service, Munin's centralized polling architecture hits its limits at around 100+ nodes. Under load, the classic 5-minute cron-based collection starts to break down:
```
# Typical munin-update behavior under load
/usr/bin/munin-update --debug 2>&1 | logger -t munin-update

# Timeout issues manifest as:
ERROR: Node server42.example.com timed out.
WARNING: Processing took 317 seconds (max 300)
```
For production environments with 100+ servers, we need:
- Push-based metrics collection (not pull)
- Distributed processing architecture
- Efficient storage backends
- Horizontal scaling capabilities
1. Prometheus + Grafana
```ini
# Sample node_exporter systemd unit
# (node_exporter replaces munin-node as the per-host agent; Prometheus scrapes it on :9100)
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.filesystem.ignored-mount-points="^/(sys|proc|dev|run)($|/)"

[Install]
WantedBy=multi-user.target
```
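Assuming the unit file is installed as /etc/systemd/system/node_exporter.service, enabling it and confirming that metrics are exposed takes a few commands:

```bash
# Enable the exporter and confirm it is serving metrics on :9100
systemctl daemon-reload
systemctl enable --now node_exporter
curl -s http://localhost:9100/metrics | grep -m1 node_load1
```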
2. InfluxDB + Telegraf
The TICK stack handles high-cardinality data far better than Munin's RRD files:
```toml
# telegraf.conf snippet for system metrics
[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.system]]
  # No per-plugin interval needed - telegraf batches on the global agent interval

# Metrics also need an output; URL, token, org and bucket below are placeholders
[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = "$INFLUX_TOKEN"
  organization = "ops"
  bucket = "telegraf"
```
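A quick way to verify the input plugins before worrying about the output side is Telegraf's one-shot test mode, which gathers once and prints line protocol to stdout without sending anything:

```bash
# Gather once and print metrics in line protocol, without writing to any output
telegraf --config /etc/telegraf/telegraf.conf --test
```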
3. VictoriaMetrics
For Munin-like simplicity with modern scaling:
```bash
# VictoriaMetrics cluster deployment example (vmstorage + vminsert)
docker network create vm-net

# Start the storage node first so vminsert can resolve it by name
docker run -d --name vmstorage --network vm-net \
  -v /data/victoria-data:/storage \
  victoriametrics/vmstorage \
  -retentionPeriod=12    # months by default

# Insert frontend: receives writes on :8480 and fans them out to storage nodes
docker run -d --name vminsert --network vm-net \
  -p 8480:8480 \
  victoriametrics/vminsert \
  -storageNode=vmstorage:8400
```
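That example only covers the write path; to read the data back (for example from Grafana), the cluster version also needs a vmselect frontend. A minimal sketch, reusing the vm-net network from above:

```bash
# Query frontend: Grafana would point at http://vmselect:8481/select/0/prometheus
docker run -d --name vmselect --network vm-net \
  -p 8481:8481 \
  victoriametrics/vmselect \
  -storageNode=vmstorage:8401
```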
Transition gradually using these approaches:
```bash
#!/bin/bash
# Parallel collection during the transition

# Keep collecting legacy Munin data as before
/usr/share/munin/munin-update

# Expose converted metrics to Prometheus on a second port via the textfile
# collector. Note: this directory must contain *.prom files in Prometheus
# exposition format, not raw Munin plugin state (the path below is an example).
node_exporter --web.listen-address=":9101" \
  --collector.textfile.directory="/var/lib/node_exporter/textfile"
```
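The missing piece in that script is converting Munin plugin output into .prom files. A rough sketch of one such conversion, assuming a standard Munin plugin that prints `field.value N` lines (plugin name and output path are illustrative):

```bash
# Hypothetical converter: run a Munin plugin and rewrite "field.value N" lines
# into Prometheus exposition format for the textfile collector to pick up
/etc/munin/plugins/load \
  | awk -F'\\.value ' '{ printf "munin_%s %s\n", $1, $2 }' \
  > /var/lib/node_exporter/textfile/munin_load.prom
```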
Key metrics to preserve from the existing Munin setup:
- Disk growth trends (per partition)
- System load vs CPU cores ratio
- Memory usage patterns (not just %)
- Network throughput saturation
```promql
# Munin-style load graph: load average normalized by CPU core count
avg(
  node_load1{instance=~"web.*"}
    / on(instance)
  count(node_cpu_seconds_total{mode="system"}) by (instance)
)
```
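The other metrics in the list translate similarly; Munin's disk growth trend graphs, for instance, map onto predict_linear() over the filesystem metrics. A hedged sketch, querying the Prometheus HTTP API directly (the server address is a placeholder):

```bash
# Predicted free bytes per mount point 24h from now, based on the last 6h trend
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"}[6h], 24 * 3600)'
```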