Solving High-Volume Logstash Scaling: Redis OOM Killer and Elasticsearch Bottlenecks in Centralized Logging Systems


When dealing with 12+ CentOS 5.8 servers shipping logs via a Logstash-Redis-Elasticsearch pipeline, we encountered a perfect-storm scenario:

# Critical error from kernel logs
Dec 19 00:44:45 logstash01 kernel: [736965.925863] Killed process 23429 (redis-server) total-vm:5493112kB, anon-rss:4248840kB, file-rss:108kB

The current setup has these key characteristics:

  • Shippers tailing /var/log/*/*.log on web servers
  • Redis as transport queue (single node)
  • Elasticsearch showing 141 blocked threads during the incident
  • Swap usage at 95% (3813MB allocated, 3628MB used)
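
These symptoms can be confirmed directly on logstash01; the commands below assume default paths and ports:

free -m                                                  # swap nearly exhausted
grep -i 'killed process' /var/log/messages | tail -n 5   # OOM killer entries for redis-server
curl -s 'http://localhost:9200/_nodes/hot_threads'       # what the busiest Elasticsearch threads are doing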

The unexpected Apache behavior likely stems from:

# File handle contention scenario (pgrep -d, emits a comma-separated PID list, which lsof -p expects)
lsof -p "$(pgrep -d, httpd)" | grep access.log
# Typical output shows every httpd worker holding an open descriptor on the same file

When log shippers can't ship, they maintain open file handles, causing:

  • FD exhaustion (compare cat /proc/sys/fs/file-nr against /proc/sys/fs/file-max; see the check after this list)
  • Inode lock contention during high write volumes
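
A quick way to check descriptor usage, run on an affected web server:

cat /proc/sys/fs/file-nr                  # allocated, unused, and maximum system-wide handles
lsof -p "$(pgrep -d, httpd)" | wc -l      # descriptors currently held by the Apache workers
ulimit -n                                 # per-process limit for the current shell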

To keep the OOM killer from terminating Redis:

# redis.conf critical settings
maxmemory 4gb
maxmemory-policy allkeys-lru

# kernel setting (a sysctl, not a redis.conf directive) recommended for reliable background saves
vm.overcommit_memory = 1
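
The overcommit setting lives in the kernel, not in redis.conf; a minimal way to apply it (as root):

sysctl -w vm.overcommit_memory=1                        # apply immediately
echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf     # persist across reboots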

Additional measures:

  • A sharded Redis cluster instead of a single node
  • Alerting on redis-cli info memory output (see the sketch after this list)
  • Separate Redis instances for different log priorities
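
A minimal sketch of that memory alerting, assuming redis-cli is on the PATH and using an illustrative warning threshold of 3.5 GB (just under the 4 GB maxmemory):

#!/bin/bash
# Warn before Redis approaches maxmemory; run from cron every minute.
USED=$(redis-cli info memory | awk -F: '/^used_memory:/ {print $2}' | tr -d '\r')
LIMIT=$((3500 * 1024 * 1024))
if [ "$USED" -gt "$LIMIT" ]; then
    logger -t redis-memory "used_memory ${USED} bytes exceeds warning threshold ${LIMIT}"
fi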

From the thread dumps, we need multiple fixes:

# elasticsearch.yml optimizations
thread_pool.bulk.queue_size: 1000
thread_pool.index.queue_size: 1000
indices.memory.index_buffer_size: 30%

Indexing pattern recommendations (a template sketch follows the list):

  • Daily indices with proper mapping templates
  • Disable the _all field when it is not needed
  • Increase the refresh interval to 30s during peaks
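
One way to implement all three recommendations for daily logstash-* indices, assuming a pre-5.x cluster where the _all field and _default_ mappings still exist:

curl -XPUT 'http://localhost:9200/_template/logstash_daily' -d '{
  "template": "logstash-*",
  "settings": { "refresh_interval": "30s" },
  "mappings": {
    "_default_": { "_all": { "enabled": false } }
  }
}'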

For high-volume environments, a buffered Kafka transport is more resilient than a single Redis node:

# Sample Filebeat -> Kafka -> Logstash pipeline
filebeat.prospectors:
- paths: ["/var/log/*/*.log"]
  fields: {type: "syslog"}

output.kafka:
  hosts: ["kafka01:9092"]
  topic: "raw-logs"
  partition.round_robin:
    reachable_only: false
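
Before pointing the Logstash indexers at the topic, it is worth confirming events actually arrive (kafka-console-consumer.sh ships with the Kafka distribution; the install path here is illustrative):

/opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server kafka01:9092 \
  --topic raw-logs --from-beginning --max-messages 5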

Key calculations for stable operations:

# Redis sizing (1.2 adds ~20% headroom for Redis per-entry overhead)
Required Memory = (Avg Event Size * Peak EPS * Retention Seconds) * 1.2

# Elasticsearch data nodes (assuming each node sustains ~30 GB/day of indexing at 80% utilization)
Minimum Data Nodes = (Daily GB Volume * Replicas) / (30 * 0.8)
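
A worked example with hypothetical numbers (500-byte events, 5,000 events/s peak, a one-hour Redis buffer):

awk 'BEGIN { printf "Redis buffer: %.1f GiB\n", 500 * 5000 * 3600 * 1.2 / 1024^3 }'
# Redis buffer: 10.1 GiB -- more than the 4 GB maxmemory above, so at these rates
# either shorten the buffer window or raise maxmemory on a larger host.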

When dealing with high-traffic web servers generating massive log volumes (load averages spiking past 500), our centralized Logstash-Redis-Elasticsearch logging architecture exhibited critical failures. The OOM killer terminated Redis at roughly 5.4GB of virtual memory, causing cascading failures across the logging pipeline and even affecting Apache's ability to serve requests.

Current setup:

Web Servers (12x CentOS 5.8)
  → Logstash Shippers (tail /var/log/*/*.log)
    → Redis Queue (on logstash01)
      → Logstash Indexer
        → Elasticsearch Cluster

From the system logs and thread dumps:

  1. 141 blocked Elasticsearch threads with FUTEX_WAIT states
  2. Redis OOM kill at 4.2GB anonymous RSS usage
  3. Apache performance degradation during log accumulation

The observed Apache performance issues during log pileups stem from:

# When multiple processes touch the same log file:
1. The Logstash shipper keeps a file descriptor open for tailing
2. Apache workers keep appending to that same file on every request
3. The shipper's constant polling and reads add contention on the same inode
4. Under heavy load, these log writes become blocking operations

Solution: rotate with logrotate's copytruncate and have Logstash read completed files instead of tailing them live:

input {
  file {
    path => "/var/log/apache2/access.log"
    sincedb_path => "/dev/null"                # do not persist read positions between runs
    start_position => "beginning"
    type => "apache"
    codec => multiline {
      pattern => "^%{TIMESTAMP_ISO8601}"
      negate => true
      what => "previous"
    }
    mode => "read"                             # read each file to completion instead of tailing
    file_completed_action => "log"             # record finished files rather than deleting them
    file_completed_log_path => "/tmp/access.log.completed"
  }
}
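
The rotation side of that pattern belongs to logrotate; a minimal sketch, with the path and schedule as placeholders:

cat > /etc/logrotate.d/apache-logstash <<'EOF'
/var/log/apache2/access.log {
    daily
    rotate 7
    compress
    delaycompress
    # copytruncate truncates in place so Apache never loses its file descriptor
    copytruncate
    missingok
    notifempty
}
EOF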

Key configuration adjustments for redis.conf:

# Memory management
maxmemory 4gb
maxmemory-policy allkeys-lru
maxmemory-samples 10

# Persistence tradeoffs
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error no
rdbcompression yes
rdbchecksum yes

# Performance tuning
tcp-backlog 511
timeout 0
tcp-keepalive 60
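
After restarting Redis (or applying the same values with CONFIG SET), confirm the limits are live and keep an eye on queue depth; the list key name is whatever the shipper's redis output uses ("logstash" here is only an example):

redis-cli config get maxmemory
redis-cli config get maxmemory-policy
redis-cli llen logstash    # queue depth; sustained growth means the indexers are falling behind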

Modify elasticsearch.yml:

thread_pool:
  index:
    size: 30
    queue_size: 1000
  search:
    size: 20
    queue_size: 1000
  bulk:
    size: 20
    queue_size: 1000

indices.memory.index_buffer_size: 30%
indices.fielddata.cache.size: 40%
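
After a rolling restart, the effect of these settings is easiest to judge from the rejected counts; any steadily growing "rejected" value means the queues are still overflowing:

watch -n 10 "curl -s 'http://localhost:9200/_cat/thread_pool?v'"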

For production-grade resilience:

                    +---------------+
                    | Load Balancer |
                    +-------┬-------+
                            |
         +------------------+------------------+
         |                  |                  |
+--------v-------+ +--------v-------+ +--------v-------+
| Redis Cluster  | | Redis Cluster  | | Redis Cluster  |
| (Master+Slave) | | (Master+Slave) | | (Master+Slave) |
+--------+-------+ +--------+-------+ +--------+-------+
         |                  |                  |
+--------v-------+ +--------v-------+ +--------v-------+
| Logstash       | | Logstash       | | Logstash       |
| Indexer Pool   | | Indexer Pool   | | Indexer Pool   |
+--------+-------+ +--------+-------+ +--------+-------+
         |                  |                  |
+--------v-------+ +--------v-------+ +--------v-------+
| Elasticsearch  | | Elasticsearch  | | Elasticsearch  |
| Data Node      | | Data Node      | | Data Node      |
+----------------+ +----------------+ +----------------+

Essential metrics to track:

# Redis
redis-cli info memory
redis-cli info stats
redis-cli info persistence

# Elasticsearch
GET /_nodes/stats/thread_pool
GET /_cat/thread_pool?v
GET /_nodes/stats/indices

Implement Logstash backpressure detection:

filter {
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}

output {
  if "metric" in [tags] {
    exec {
      command => "check_backpressure.sh %{[events][count]}"
    }
  }
}
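
The check_backpressure.sh script referenced above is site-specific and not shown; a minimal hypothetical version just records the cumulative count so an external monitor can alert when growth stalls (a stalled counter means events have stopped flowing):

#!/bin/bash
# Hypothetical check_backpressure.sh: log the cumulative event count emitted by the
# metrics filter so a monitoring system can alert when the value stops increasing.
logger -t logstash-events "cumulative_event_count=${1:-0}"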