Optimizing Elasticsearch Disk Usage: Why Your Data Isn’t Compressing and How to Fix It


When working with Elasticsearch 1.3.2 on CentOS 6.5, many developers expect automatic compression of stored data. The documentation suggests compression ratios of 50 to 95%, but in practice you can see the opposite: the index taking up roughly 4x the space of the original files.

# Example disk usage comparison
622M    logstash-2014.10.07/  # ES storage
173M    original log files    # Raw data

While Elasticsearch does enable compression by default, several factors can affect its effectiveness:

  • Index settings inheritance
  • Field data types and mappings
  • Shard configuration
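
Before changing anything, it helps to confirm what the index is actually configured with; both calls below work on 1.x as well as on current releases:

# Check the live settings and mappings of the daily index
GET /logstash-2014.10.07/_settings
GET /logstash-2014.10.07/_mapping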

Elasticsearch exposes a few codec-related settings in elasticsearch.yml, but check them against your version: index.codec: best_compression only exists from 2.0 onwards, and the old index.store.compress.* flags date from the pre-1.0 era, when stored-field compression was still opt-in:

index.codec: best_compression        # ES 2.0+ only
index.store.compress.stored: true    # Legacy flag; stored fields are always compressed on 1.x
index.store.compress.tv: true        # Legacy flag; term vector compression on old releases

For per-index settings (recommended):

PUT /your_index/_settings
{
  "index.codec": "best_compression",
  "index.store.compress.stored": true,
  "index.number_of_replicas": 0  # Temporarily for testing
}
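
Note that index.codec is a static setting: it can only be set at index creation or while the index is closed, and a new codec only applies to segments written afterwards. On a release that supports best_compression (2.0+), switching an existing index looks roughly like this:

POST /your_index/_close

PUT /your_index/_settings
{ "index.codec": "best_compression" }

POST /your_index/_open

# Rewrite the existing segments with the new codec
POST /your_index/_forcemerge?max_num_segments=1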

Your configuration shows:

index.number_of_shards: 5
index.number_of_replicas: 1

This immediately doubles the storage footprint (primary + replica). Consider:

  1. Reducing the shard count if your dataset is small; shard count cannot be changed on an existing index, so this means reindexing (see the _cat sketch after this list for the current layout)
  2. Running a force merge after changing compression settings (the endpoint is _optimize on 1.x and _forcemerge from 2.x on)
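
A quick way to see the current shard layout and how much each shard holds is the _cat/shards endpoint (available since 1.0):

# One line per shard: primaries and replicas with document counts and on-disk size
GET /_cat/shards/logstash-2014.10.07?v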

Create a test index with different settings (the syntax below is for recent Elasticsearch versions; on 1.x you would wrap properties in a mapping type and use string instead of text):

PUT /compression_test
{
  "settings": {
    "index.number_of_shards": 1,
    "index.codec": "best_compression",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_analyzer",
        "store": true
      }
    }
  }
}
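
To see the codec's effect, index the same sample of documents into this index and into one created without the best_compression line, then compare on-disk sizes. A minimal sketch (the sample document and the second index name, compression_test_default, are made up for illustration and assume a recent cluster):

# Index a sample document (use a realistic batch of logs for a meaningful comparison)
POST /compression_test/_doc/1
{ "content": "Oct  7 12:00:01 web01 sshd[1234]: Accepted publickey for deploy from 10.0.0.5" }

# Compare store sizes of the two test indices
GET /compression_test,compression_test_default/_stats/store?human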

Use the Indices Stats API to verify:

GET /_stats/store?human&pretty

Key metrics to watch (compare the primaries and total sections to see the replica overhead):

  • store.size (store.size_in_bytes is the same value in raw bytes)
  • segments.count
  • segments.memory_in_bytes
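
A compact way to eyeball these numbers across all Logstash indices is the _cat/indices endpoint; the column list is optional and can be adjusted:

GET /_cat/indices/logstash-*?v&h=index,pri,rep,docs.count,pri.store.size,store.size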

For long-term storage on newer clusters (ILM arrived in Elasticsearch 6.6; a 1.x workaround follows the policy):

PUT /_ilm/policy/cold_storage_policy
{
  "policy": {
    "phases": {
      "cold": {
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}
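
ILM only exists from Elasticsearch 6.6 onwards. On 1.3.x a rough stand-in is a scheduled _optimize of indices that are no longer being written to, for example from cron (the schedule, host and date pattern below are illustrative):

# /etc/cron.d/es-optimize (illustrative): merge yesterday's daily index down to one segment at 03:00
0 3 * * * root curl -s -XPOST "localhost:9200/logstash-$(date -d yesterday +\%Y.\%m.\%d)/_optimize?max_num_segments=1"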

Combine it with an index template so new indices pick up both the lifecycle policy and the codec:

PUT /_template/compressed_template
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.lifecycle.name": "cold_storage_policy",
    "index.codec": "best_compression"
  }
}
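
You can confirm the template is registered with a quick GET; it will be applied to any index whose name matches logs-* at creation time:

GET /_template/compressed_template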

When comparing 173MB of raw log files with Elasticsearch's 622MB index for the same data, any engineer would raise an eyebrow. Let me walk through what's happening under the hood and how to fix it. On disk, each shard is a Lucene index made up of files like these:

index/
├── segments_1        # Segment metadata
├── _0.cfs            # Compound file (bundles small segment files)
├── _0.cfe            # Compound file entries
├── _0.si             # Segment info
├── _0.fdt            # Stored fields data (often the largest file)
├── _0.fdx            # Stored fields index
├── _0.fnm            # Field names/infos
└── _0.pos            # Term position data
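
To see which of these file types dominates, run du inside a shard directory. The path below assumes an RPM install with the default data path and cluster name, so adjust it for your setup:

# Largest files in shard 0 of the daily index (the path is an assumption for a default CentOS/RPM install)
du -ah /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2014.10.07/0/index | sort -rh | head -20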

The space consumption comes from more than the raw bytes of each event: _source keeps the original JSON, the _all field (enabled by default on 1.x) indexes every value a second time, and analyzed string fields add terms, positions and norms on top.

While Elasticsearch 1.3.2 already compresses stored fields by default, we can optimize further:

# In elasticsearch.yml
index:
  store:
    compress:
      stored: true     # Legacy flag; stored fields are always LZ4-compressed on 1.x
    type: niofs        # Store implementation choice (mmapfs is the usual default on 64-bit Linux)

  codec:
    postings_format:
      type: pulsing    # Inlines postings for rare terms into the term dictionary; dropped after the 1.x line

Proper field mapping significantly impacts storage. Existing fields cannot be remapped in place, so apply these when an index is created or, better, through a template so the next daily Logstash index picks them up (see the template sketch after this mapping; logs stands in for your document type):

PUT /logstash-2014.10.07/_mapping/logs
{
  "properties": {
    "message": {
      "type": "string",
      "index": "analyzed",
      "norms": { "enabled": false },  # Saves space
      "index_options": "docs"         # Minimal indexing
    },
    "timestamp": {
      "type": "date",
      "doc_values": true              # More efficient than fielddata
    }
  }
}
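
Because existing fields cannot be remapped, the practical route on 1.x is an index template that every new daily index inherits. A sketch, assuming the Logstash document type is logs and that you can live without the _all field (disabling it is one of the bigger space savings on 1.x):

PUT /_template/logstash_compact
{
  "template": "logstash-*",
  "mappings": {
    "logs": {
      "_all": { "enabled": false },
      "properties": {
        "message": {
          "type": "string",
          "norms": { "enabled": false },
          "index_options": "docs"
        }
      }
    }
  }
}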

For time-series data like logs:

# Create a new index with optimized settings
PUT /logstash-optimized
{
  "settings": {
    "index.refresh_interval": "30s",
    "number_of_shards": 3,
    "number_of_replicas": 0,       # Disable replicas temporarily
    "store.throttle.max_bytes_per_sec": "50mb"
  }
}
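
Elasticsearch 1.3.2 has no _reindex API (that arrived in 2.3), so the data has to be copied into the new index by re-running Logstash against the original files or with a scan/scroll plus bulk script. Once the copy finishes, turn replicas back on:

PUT /logstash-optimized/_settings
{
  "index.number_of_replicas": 1
}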

Implement regular optimization routines:

# Merge segments (run during low traffic); the endpoint is _optimize on 1.x,
# renamed to _forcemerge in 2.x
POST /logstash-2014.10.07/_optimize?max_num_segments=3

# Clear caches (helpful before taking measurements)
POST /_cache/clear
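
After the merge completes, re-check the segment count and store size; both should drop once the old segments have been cleaned up:

GET /logstash-2014.10.07/_segments
GET /logstash-2014.10.07/_stats/store?human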

For ongoing management, consider implementing a hot-warm architecture with different compression settings for older indices.