When dealing with 12+ CentOS 5.8 servers shipping logs via a Logstash → Redis → Elasticsearch pipeline, we encountered a perfect-storm scenario:

```
# Critical error from kernel logs
Dec 19 00:44:45 logstash01 kernel: [736965.925863] Killed process 23429 (redis-server)
total-vm:5493112kB, anon-rss:4248840kB, file-rss:108kB
```
The current setup has these key characteristics:
- Shippers tailing `/var/log/*/*.log` on web servers
- Redis as transport queue (single node)
- Elasticsearch showing 141 blocked threads during the incident
- Swap usage at 95% (3813 MB allocated, 3628 MB used)
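Before tuning anything, it is worth confirming the same symptoms on your own hosts. A rough diagnostic sketch; the process names, local Redis instance, and use of jstack are assumptions about this setup:

```bash
#!/bin/bash
# Rough health snapshot for the Redis/indexer host.

# Swap pressure (the incident showed ~95% utilisation)
free -m | awk '/Swap:/ && $2 > 0 {printf "swap: %d/%d MB (%.0f%%)\n", $3, $2, $3*100/$2}'

# Count BLOCKED threads in an Elasticsearch thread dump (requires a JDK with jstack)
ES_PID=$(pgrep -f org.elasticsearch | head -n1)
[ -n "$ES_PID" ] && jstack "$ES_PID" | grep -c 'java.lang.Thread.State: BLOCKED'

# Redis memory footprint versus its configured ceiling
redis-cli info memory | grep -E '^(used_memory_human|maxmemory_human)'
```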
The unexpected Apache behavior likely stems from file-handle contention:

```
# File handle contention scenario (pgrep -d, joins the PIDs so lsof accepts them)
lsof -p "$(pgrep -d, httpd)" | grep access.log
# Typical output shows multiple processes holding the same file open
```

When log shippers can't ship, they keep file handles open, causing:
- FD exhaustion (compare `cat /proc/sys/fs/file-nr` against `cat /proc/sys/fs/file-max`; see the check below)
- Inode lock contention during high write volumes
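To turn that comparison into something cron can run, a rough sketch; the 80% threshold and the process names are assumptions:

```bash
#!/bin/bash
# Rough FD-pressure check against the system-wide limit.
read allocated _unused max < /proc/sys/fs/file-nr   # fields: allocated, free, limit
if [ "$allocated" -ge $(( max * 80 / 100 )) ]; then
    echo "WARNING: ${allocated}/${max} system file handles in use" >&2
fi

# Which httpd / logstash processes hold the most descriptors (may need root)
for pid in $(pgrep -d' ' -x httpd; pgrep -d' ' -f logstash); do
    printf '%6d  %s\n' "$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)" "$(ps -p "$pid" -o comm=)"
done | sort -rn | head
```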
To prevent further OOM kills of Redis:

```
# redis.conf critical settings
maxmemory 4gb
maxmemory-policy allkeys-lru
```

```
# kernel overcommit setting (goes in /etc/sysctl.conf, not redis.conf)
vm.overcommit_memory = 1
```

Additional measures:
- Redis cluster with sharding instead of a single node
- Alerting on `redis-cli info memory` (see the sketch after this list)
- Separate Redis instances for different log priorities
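For that `redis-cli info memory` alerting, a minimal sketch; the host, the 85% threshold, and the mail-based notification are all assumptions to adapt to your own monitoring:

```bash
#!/bin/bash
# Hypothetical alert for Redis memory pressure.
HOST=${1:-127.0.0.1}
used=$(redis-cli -h "$HOST" info memory | awk -F: '/^used_memory:/ {print $2}' | tr -d '\r')
max=$(redis-cli -h "$HOST" config get maxmemory | awk 'NR==2')
used=${used:-0}; max=${max:-0}

if [ "$max" -gt 0 ] && [ "$used" -ge $(( max * 85 / 100 )) ]; then
    echo "Redis on $HOST is using $used of $max bytes (>85%)" \
        | mail -s 'redis memory alert' ops@example.com
fi
```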
From the thread dumps, we need multiple fixes:

```
# elasticsearch.yml optimizations
thread_pool.bulk.queue_size: 1000
thread_pool.index.queue_size: 1000
indices.memory.index_buffer_size: 30%
```

Indexing pattern recommendations (a template sketch follows this list):
- Daily indices with proper mapping templates
- Disable the `_all` field when it is not needed
- Increase the refresh interval to 30s during peaks
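Those recommendations can be rolled into one index template. A sketch using the template API syntax of the ES 2.x/5.x era this stack dates from; the index pattern and `_default_` mapping are assumptions, and `_all` no longer exists in recent releases:

```bash
# Hypothetical daily-index template: slower refresh plus _all disabled by default
curl -s -XPUT 'http://localhost:9200/_template/logstash-daily' \
     -H 'Content-Type: application/json' -d '
{
  "template": "logstash-*",
  "settings": {
    "refresh_interval": "30s"
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }
    }
  }
}'
```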
For high-volume environments:

```
# Sample Filebeat -> Kafka -> Logstash pipeline (Filebeat side)
filebeat.prospectors:
  - paths: ["/var/log/*/*.log"]
    fields: {type: "syslog"}

output.kafka:
  hosts: ["kafka01:9092"]
  topic: "raw-logs"
  partition.round_robin:
    reachable_only: false
```
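If Kafka is introduced, it is worth confirming events actually land on the topic before pointing the Logstash indexers at it. A quick smoke test; the flags assume the Kafka 0.10.1+ console tools, and the install path and broker address are assumptions:

```bash
# Read a handful of raw events straight off the topic Filebeat publishes to
/opt/kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server kafka01:9092 \
    --topic raw-logs \
    --from-beginning --max-messages 5
```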
Key calculations for stable operations:

```
# Redis sizing (1.2 = ~20% headroom)
Required Memory = (Avg Event Size * Peak EPS * Retention Seconds) * 1.2

# Elasticsearch nodes
Minimum Data Nodes = (Daily GB Volume * Replicas) / (30 * 0.8)
```
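Plugging illustrative numbers into the Redis formula (every input value below is an assumption, not a measurement):

```bash
#!/bin/bash
# Worked example of the Redis sizing formula.
avg_event_bytes=500       # average serialized log event
peak_eps=4000             # peak events/second across all shippers
retention_seconds=900     # queue must absorb a 15-minute indexer outage

required=$(( avg_event_bytes * peak_eps * retention_seconds * 12 / 10 ))   # * 1.2 headroom
echo "Required Redis memory: $(( required / 1024 / 1024 )) MB"
# 500 B * 4000 eps * 900 s * 1.2 ≈ 2.2 GB, comfortably under the 4 GB maxmemory above
```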
When dealing with high-traffic web servers generating massive log volumes (load averages above 500), our centralized logging architecture built on Logstash, Redis, and Elasticsearch exhibited critical failures. The OOM killer terminated Redis at 5.4GB memory usage, causing cascading failures across the logging pipeline and even affecting Apache's ability to serve requests.
Current setup:

```
Web Servers (12x CentOS 5.8)
  → Logstash Shippers (tail /var/log/*/*.log)
  → Redis Queue (on logstash01)
  → Logstash Indexer
  → Elasticsearch Cluster
```
From the system logs and thread dumps:
- 141 blocked Elasticsearch threads with FUTEX_WAIT states
- Redis OOM kill at 4.2GB anonymous RSS usage
- Apache performance degradation during log accumulation
The observed Apache performance issues during log pileups stem from what happens when multiple processes access the same log file:
1. The Logstash shipper keeps a file descriptor open for tailing.
2. Apache needs to acquire an exclusive lock for its writes.
3. Linux kernel inotify events create additional contention.
4. Under heavy load, this becomes a blocking operation.

Solution: use the copytruncate pattern and read completed copies instead of tailing the live file directly:
```
input {
  file {
    # CentOS default path; Debian/Ubuntu would be /var/log/apache2/access.log
    path => "/var/log/httpd/access_log"
    sincedb_path => "/dev/null"
    start_position => "beginning"
    type => "apache"
    codec => multiline {
      pattern => "^%{TIMESTAMP_ISO8601}"
      negate => true
      what => "previous"
    }
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/tmp/access.log.completed"
  }
}
```
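The logrotate half of that copytruncate pattern is not shown above; a hypothetical `/etc/logrotate.d` entry for a CentOS httpd layout might look like the sketch below, where the schedule and retention are assumptions:

```bash
# Rotate copies for the read-mode input while Apache keeps writing to the same inode.
cat > /etc/logrotate.d/httpd <<'EOF'
/var/log/httpd/*log {
    daily
    rotate 14
    copytruncate
    compress
    delaycompress
    missingok
    notifempty
}
EOF
```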
Key configuration adjustments for redis.conf:

```
# Memory management
maxmemory 4gb
maxmemory-policy allkeys-lru
maxmemory-samples 10

# Persistence tradeoffs
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error no
rdbcompression yes
rdbchecksum yes

# Performance tuning
tcp-backlog 511
timeout 0
tcp-keepalive 60
```
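Alongside redis.conf, the Redis documentation recommends a couple of OS-level settings. A sketch, noting that the transparent-huge-pages knob does not exist on the 2.6.18 kernel CentOS 5.8 ships:

```bash
# Run as root; persisting via /etc/sysctl.conf is an assumption about your config management.
sysctl -w vm.overcommit_memory=1
grep -q '^vm.overcommit_memory' /etc/sysctl.conf || echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf

# On newer kernels, THP hurts fork()/BGSAVE latency, so disable it if present
if [ -w /sys/kernel/mm/transparent_hugepage/enabled ]; then
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
```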
Modify elasticsearch.yml:

```
thread_pool:
  index:
    size: 30
    queue_size: 1000
  search:
    size: 20
    queue_size: 1000
  bulk:
    size: 20
    queue_size: 1000

indices.memory.index_buffer_size: 30%
indices.fielddata.cache.size: 40%
```
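These thread pool values are static settings, so they only take effect after a node restart. One way to confirm what each node actually loaded afterwards; the host and port are assumptions:

```bash
# Dump the effective bulk pool configuration from the nodes-info API
curl -s 'http://localhost:9200/_nodes/thread_pool?pretty' | grep -A 4 '"bulk"'
```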
For production-grade resilience:

```
                         +---------------+
                         | Load Balancer |
                         +-------┬-------+
                                 |
              +------------------+------------------+
              |                  |                  |
     +--------v-------+ +--------v-------+ +--------v-------+
     | Redis Cluster  | | Redis Cluster  | | Redis Cluster  |
     | (Master+Slave) | | (Master+Slave) | | (Master+Slave) |
     +--------+-------+ +--------+-------+ +--------+-------+
              |                  |                  |
     +--------v-------+ +--------v-------+ +--------v-------+
     |    Logstash    | |    Logstash    | |    Logstash    |
     |  Indexer Pool  | |  Indexer Pool  | |  Indexer Pool  |
     +--------+-------+ +--------+-------+ +--------+-------+
              |                  |                  |
     +--------v-------+ +--------v-------+ +--------v-------+
     | Elasticsearch  | | Elasticsearch  | | Elasticsearch  |
     |   Data Node    | |   Data Node    | |   Data Node    |
     +----------------+ +----------------+ +----------------+
```
Essential metrics to track:

```
# Redis
redis-cli info memory
redis-cli info stats
redis-cli info persistence

# Elasticsearch
GET /_nodes/stats/thread_pool
GET /_cat/thread_pool?v
GET /_nodes/stats/indices
```
Implement Logstash backpressure detection:

```
filter {
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}

output {
  if "metric" in [tags] {
    exec {
      command => "check_backpressure.sh %{[events][count]}"
    }
  }
}
```
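The `check_backpressure.sh` script itself is not shown above; a minimal sketch of what it might do, where the state-file location, the logging target, and the "counter stopped growing" heuristic are assumptions:

```bash
#!/bin/bash
# Hypothetical check_backpressure.sh: Logstash passes in %{[events][count]}; if the counter
# stops growing between runs, the pipeline has stopped draining.
CURRENT=${1:?usage: check_backpressure.sh <event_count>}
STATE=/tmp/logstash_event_count

PREVIOUS=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$CURRENT" > "$STATE"

if [ $(( CURRENT - PREVIOUS )) -le 0 ]; then
    logger -t logstash-backpressure "event counter stalled at $CURRENT"
fi
```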