When working with Elasticsearch 1.3.2 on CentOS 6.5, many developers expect stored data to be compressed automatically. The documentation suggests compression ratios of 50-95%, yet in practice you can see the opposite: the index growing to nearly 4x the size of the raw data.
```
# Example disk usage comparison
622M    logstash-2014.10.07/    # ES storage
173M    original log files      # Raw data
```
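To reproduce the comparison yourself, measure the index directory and the raw logs directly; the paths below are examples for a default RPM install and depend on your `path.data`, cluster name, and log location:

```
# Elasticsearch index on disk (default data path for the RPM install)
du -sh /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2014.10.07

# Raw log files (example location)
du -sh /var/log/myapp/
```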
While Elasticsearch does enable compression by default, several factors can affect its effectiveness:
- Index settings inheritance
- Field data types and mappings
- Shard configuration
Try these settings in your `elasticsearch.yml`:

```
# Only recognized by Elasticsearch 2.0+ (Lucene 5.0); 1.3.2 does not have this codec
index.codec: best_compression

# Legacy pre-0.90 settings; stored fields have been LZ4-compressed by default since 0.90,
# so these add nothing on 1.3.2
index.store.compress.stored: true
index.store.compress.tv: true
```
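Settings in `elasticsearch.yml` only apply to indices created after the change; confirm what an existing index actually uses by reading its settings back:

```
GET /logstash-2014.10.07/_settings?pretty
```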
For per-index settings (recommended; the replica count is dropped to 0 only temporarily for testing, and `index.codec` is a static setting that can only be changed on a closed index in the versions that support it):

```
PUT /your_index/_settings
{
  "index.codec": "best_compression",
  "index.store.compress.stored": true,
  "index.number_of_replicas": 0
}
```
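Once you have finished measuring, restore the replica so you are not trading durability for disk space:

```
PUT /your_index/_settings
{
  "index.number_of_replicas": 1
}
```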
Your configuration shows:

```
index.number_of_shards: 5
index.number_of_replicas: 1
```
This immediately doubles storage requirements (primary + replica). Consider:
- Reducing the shard count if your dataset is small
- Running `_optimize` (renamed to `_forcemerge` in 2.x) after changing compression settings, as shown in the sketch after this list
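A minimal sketch of the 1.x merge call, assuming it runs once writes to the daily index have stopped:

```
POST /logstash-2014.10.07/_optimize?max_num_segments=1
```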
Create a test index with different settings. The request below uses newer (2.x+) syntax; on 1.3.2, drop `index.codec`, wrap the mappings in a document type, and use `"type": "string"` instead of `"type": "text"`:

```
PUT /compression_test
{
  "settings": {
    "index.number_of_shards": 1,
    "index.codec": "best_compression",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_analyzer",
        "store": true
      }
    }
  }
}
```
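To make the size numbers meaningful, load the same sample of log lines into the test index before measuring; a minimal sketch, assuming a `logs` document type and a made-up log line:

```
POST /compression_test/logs
{
  "content": "2014-10-07 12:00:00 INFO sample log line for the compression test"
}
```

For a realistic comparison, bulk-load a sizeable chunk of the real logs (via the `_bulk` API or a temporary Logstash output) rather than a handful of documents.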
Use the Indices Stats API to verify:

```
GET /_stats/store?human&pretty
```
Key metrics to watch:
- `store.size` vs `store.size_in_bytes`
- `segments.count`
- `segments.memory_in_bytes`

(The segment figures come from the full `GET /_stats?human&pretty` response or the dedicated `GET /_segments` API rather than the store-only call above.)
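For a quick per-index overview, the cat API (available since 1.0) is easier to scan than the JSON stats:

```
GET /_cat/indices/logstash-*?v&h=index,docs.count,store.size,pri.store.size
```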
For long-term storage on newer clusters, an ILM policy can force-merge old indices automatically (ILM requires Elasticsearch 6.6+; on 1.3.2 the equivalent is a cron job or Curator calling `_optimize`, as sketched in the maintenance section further below):

```
PUT /_ilm/policy/cold_storage_policy
{
  "policy": {
    "phases": {
      "cold": {
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}
```
Combine with an index template (again newer syntax; legacy 1.x templates use a single `"template": "logs-*"` key instead of `index_patterns` and have no lifecycle setting):

```
PUT /_template/compressed_template
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.lifecycle.name": "cold_storage_policy",
    "index.codec": "best_compression"
  }
}
```
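Whichever syntax your version uses, read the template back after creating it to confirm the pattern and settings are what you expect:

```
GET /_template/compressed_template?pretty
```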
When comparing 173MB of raw log files with Elasticsearch's 622MB index for the same data, any engineer would raise an eyebrow. Let me walk through what's happening under the hood and how to fix it.
On disk each shard is a Lucene index, and its space is spread across segment files like these:

```
index/
├── segments_1   # Segment metadata
├── _0.cfs       # Compound file (compressed)
├── _0.cfe       # Compound file entries
├── _0.si        # Segment info
├── _0.fdt       # Stored field data (often the largest)
├── _0.fdx       # Stored field index
├── _0.fnm       # Field names
└── _0.pos       # Position data
```
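You can see where the bytes actually go by listing a shard's segment files by size; the path below is an example for a default RPM install and depends on your `path.data` and cluster name:

```
ls -lhS /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2014.10.07/0/index | head -20
```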
The space consumption comes from more than the raw data itself: the inverted index, the stored `_source` of every document, norms and term positions, and any replica copies all add up.
While Elasticsearch 1.3.2 has compression enabled by default, we can optimize further:
```
# In elasticsearch.yml
index:
  store:
    compress:
      stored: true     # Legacy setting; stored fields are LZ4-compressed by default since 0.90
    type: niofs        # Store implementation; affects I/O behaviour, not the compression ratio
  codec:
    postings_format:
      type: pulsing    # Expert-only postings tweak for rare terms; removed in later Lucene versions
```
Proper field mapping significantly impacts storage. Disabling norms and indexing only document occurrences (`index_options: docs`) saves space on analyzed fields, at the cost of scoring and phrase queries on them, and `doc_values` keeps the date field off the fielddata heap. On 1.x the mapping must be addressed by its document type (`logs` is assumed here; use whatever type Logstash writes):

```
PUT /logstash-2014.10.07/_mapping/logs
{
  "logs": {
    "properties": {
      "message": {
        "type": "string",
        "index": "analyzed",
        "norms": { "enabled": false },
        "index_options": "docs"
      },
      "timestamp": {
        "type": "date",
        "doc_values": true
      }
    }
  }
}
```
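Since most of these attributes can only be set when a field is first created, it is worth baking the mapping into an index template so every new daily Logstash index picks it up automatically; a sketch using 1.x template syntax, with the template name and `logs` type being assumptions:

```
PUT /_template/logstash_compressed
{
  "template": "logstash-*",
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 0
  },
  "mappings": {
    "logs": {
      "properties": {
        "message": {
          "type": "string",
          "index": "analyzed",
          "norms": { "enabled": false },
          "index_options": "docs"
        },
        "timestamp": {
          "type": "date",
          "doc_values": true
        }
      }
    }
  }
}
```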
For time-series data like logs, create a new index with optimized settings (replicas are disabled only for the initial bulk load):

```
PUT /logstash-optimized
{
  "settings": {
    "index.refresh_interval": "30s",
    "number_of_shards": 3,
    "number_of_replicas": 0,
    "store.throttle.max_bytes_per_sec": "50mb"
  }
}
```
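After the data has been copied across (1.3.2 has no `_reindex` API, so that means re-feeding the logs through Logstash or a small script), restore the replica and a normal refresh interval:

```
PUT /logstash-optimized/_settings
{
  "index.number_of_replicas": 1,
  "index.refresh_interval": "1s"
}
```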
Implement regular optimization routines:

```
# Force merge segments (run during low traffic); the API is _optimize on 1.x,
# renamed to _forcemerge in 2.x
POST /logstash-2014.10.07/_optimize?max_num_segments=3

# Clear caches (helpful before taking measurements)
POST /_cache/clear
```
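A minimal cron sketch of such a routine on 1.3.2, merging yesterday's daily index once it has stopped receiving writes (host and schedule are assumptions):

```
# /etc/cron.d/es-optimize - merge yesterday's logstash index at 03:00
0 3 * * * root curl -s -XPOST "http://localhost:9200/logstash-$(date -d yesterday +\%Y.\%m.\%d)/_optimize?max_num_segments=1"
```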
For ongoing management, consider implementing a hot-warm architecture with different compression settings for older indices.