Elasticsearch 5.0 Disk Usage Tuning
Posted by Adam Vanderbush, August 29, 2017
This post is about tuning Elasticsearch disk usage. In it, we discuss disk usage tuning techniques, strategies, and recommendations specific to Elasticsearch 5.0 and later.
For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”
We have already covered general tips and methods for performance tuning in the three-part tutorial series “The Authoritative Guide to Elasticsearch Performance Tuning,” which explains at each step the most relevant system configuration settings and metrics.
We have also covered how to achieve maximum indexing throughput and reduce monitoring and management load in the three-part tutorial series “How to Maximize Elasticsearch Indexing Performance.”
This post focuses on disk usage tuning techniques, strategies and recommendations specific to Elasticsearch 5.0 or onwards.
Index Mapping
By default, Elasticsearch indexes most fields and adds doc values to them so that they can be searched and aggregated out of the box. For instance, if you have a numeric field called name that you need to run histograms on but never need to filter on, you can safely disable indexing on this field in your mappings:
curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d'{ "mappings": { "type": { "properties": { "name": { "type": "integer", "index": false } } } } }'
text fields store normalization factors in the index in order to be able to score documents. Text fields are not used for sorting and are seldom used for aggregations (although the significant terms aggregation is a notable exception). If you need to index structured content such as email addresses, hostnames, status codes, or tags, it is likely that you should use a keyword field instead.
curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "my_type": { "properties": { "tags": { "type": "keyword" } } } } }'
If you only need matching capabilities on a text field but do not care about the produced scores, you can configure Elasticsearch to not write norms to the index:
curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "type": { "properties": { "name": { "type": "text", "norms": false } } } } }'
text fields also store frequencies and positions in the index by default. Frequencies are used to compute scores and positions are used to run phrase queries. If you do not need to run phrase queries, you can tell Elasticsearch to not index positions:
curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "type": { "properties": { "name": { "type": "text", "index_options": "freqs" } } } } }'
Furthermore, if you do not care about scoring either, you can configure Elasticsearch to index only the matching documents for every term by disabling norms and setting index_options to docs. You will still be able to search on this field, but phrase queries will raise errors and scoring will assume that terms appear only once in every document.
curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "type": { "properties": { "name": { "type": "text", "norms": false, "index_options": "docs" } } } } }'
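To see whether mapping changes like these actually pay off, it helps to compare the on-disk size of an index before and after loading the same documents. The request below is a minimal sketch using the cat indices API; the index name index simply matches the examples above, and the selected columns are our choice for illustration.

```shell
# Show document count and store size for the example index.
# store.size includes replicas; pri.store.size counts primary shards only.
curl -XGET 'ES_HOST:ES_PORT/_cat/indices/index?v&h=index,docs.count,store.size,pri.store.size'
```

Reported sizes shift as segments merge in the background, so treat small differences between two runs with caution.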
Don’t Use Default Dynamic String Mappings
The default dynamic string mappings will index string fields both as text and keyword. This is wasteful if you only need one of them. Typically an id field will only need to be indexed as a keyword while a body field will only need to be indexed as a text field.
This can be avoided by either configuring explicit mappings on string fields or setting up dynamic templates that will map string fields as either text or keyword.
For instance, here is a template that can be used in order to only map string fields as keyword:
curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "type": { "dynamic_templates": [ { "strings": { "match_mapping_type": "string", "mapping": { "type": "keyword" } } } ] } } }'
If we wanted to map all integer fields as integer instead of long, and all string fields as both text and keyword, we could use the following template:
curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "my_type": { "dynamic_templates": [ { "integers": { "match_mapping_type": "long", "mapping": { "type": "integer" } } }, { "strings": { "match_mapping_type": "string", "mapping": { "type": "text", "fields": { "raw": { "type": "keyword", "ignore_above": 256 } } } } } ] } } }'
With this template in place, indexing the following document maps the my_integer field as an integer and the my_string field as text, with a keyword multi-field.
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d '{ "my_integer": 5, "my_string": "Some string" }'
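To confirm that the dynamic templates were applied as intended, we can read back the mapping that Elasticsearch generated after indexing the document. This is a small sketch using the get mapping API against the same my_index used above.

```shell
# Fetch the mapping generated for my_index after the document was indexed.
# If the templates matched, my_integer should appear as "integer" and
# my_string as "text" with a "raw" sub-field of type "keyword".
curl -XGET 'ES_HOST:ES_PORT/my_index/_mapping?pretty'
```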
Disable _all
The _all field is a special catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter, which is then analyzed and indexed, but not stored. This means that it can be searched, but not retrieved. The _all field indexes the value of all fields of a document and can use significant space. If you never need to search against all fields at the same time, it can be disabled.
Here, the _all field in type_1 is enabled and the _all field in type_2 is completely disabled.
curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "type_1": { "properties": {...} }, "type_2": { "_all": { "enabled": false }, "properties": {...} } } }'
Use best_compression
The _source and stored fields can easily take a non-negligible amount of disk space. They can be compressed more aggressively by using the best_compression codec. By default, stored data is compressed with LZ4, but the static index setting index.codec can be set to best_compression, which uses DEFLATE for a higher compression ratio at the expense of slower stored fields performance.
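As a minimal sketch of the setting described above, the request below creates an index with the best_compression codec. The index name my_index is only a placeholder matching the other examples in this post. Because index.codec is a static setting, it has to be supplied when the index is created (or while the index is closed).

```shell
# Create an index whose stored fields are compressed with DEFLATE
# instead of the default LZ4.
curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{ "settings": { "index": { "codec": "best_compression" } } }'
```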
Use the Smallest Sufficient Numeric Type
As far as integer types (byte, short, integer and long) are concerned, you should pick the smallest type which is enough for your use case. This will help indexing and searching be more efficient. Note, however, that because storage is optimized based on the actual values that are stored, picking one type over another will have no impact on storage requirements.
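One way to make this choice concrete is to compare the smallest and largest values you expect against the range of each type. The helper below is a hypothetical shell function written for this post (it is not part of Elasticsearch); it assumes the usual signed 8-, 16-, 32- and 64-bit ranges for byte, short, integer and long.

```shell
# Hypothetical helper: print the smallest integer field type that can
# hold every value between $1 (expected minimum) and $2 (expected maximum).
smallest_int_type() {
  if [ "$1" -ge -128 ] && [ "$2" -le 127 ]; then
    echo byte
  elif [ "$1" -ge -32768 ] && [ "$2" -le 32767 ]; then
    echo short
  elif [ "$1" -ge -2147483648 ] && [ "$2" -le 2147483647 ]; then
    echo integer
  else
    echo long
  fi
}

smallest_int_type 0 100      # prints: byte
smallest_int_type -5 30000   # prints: short
smallest_int_type 0 86400    # prints: integer
```

It is usually worth leaving some headroom: a field that is close to the limit of a type today may overflow it as the data grows.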
For floating-point types, it is often more efficient to store floating-point data into an integer using a scaling factor, which is what the scaled_float type does under the hood. For instance, a price field could be stored in a scaled_float with a scaling_factor of 100. All APIs would work as if the field was stored as a double, but under the hood Elasticsearch would be working with the number of cents, price*100, which is an integer. This is mostly helpful to save disk space since integers are way easier to compress than floating points. scaled_float is also fine to use in order to trade accuracy for disk space.
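The accuracy trade-off is easier to see with numbers. The snippet below is a rough illustration of the arithmetic only, not Elasticsearch's implementation: to_scaled is a hypothetical helper that multiplies a non-negative value by the scaling factor and rounds to the nearest integer, which is the kind of value a scaled_float field would keep.

```shell
# Hypothetical helper: scale a non-negative value ($1) by a scaling
# factor ($2) and round to the nearest integer.
to_scaled() {
  awk -v v="$1" -v f="$2" 'BEGIN { printf "%d\n", int(v * f + 0.5) }'
}

to_scaled 19.99 100    # prints: 1999
to_scaled 19.994 100   # prints: 1999 (the extra digit is lost)
to_scaled 19.996 100   # prints: 2000
```

With a scaling_factor of 100, any two prices that differ by less than a cent can end up as the same stored value, so pick the factor according to the precision you actually need.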
If scaled_float is not a good fit, then you should pick the smallest type that is enough for the use case among the floating-point types: double, float and half_float.
Below is an example of configuring a mapping with numeric fields:
curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "my_type": { "properties": { "number_of_bytes": { "type": "integer" }, "time_in_seconds": { "type": "float" }, "price": { "type": "scaled_float", "scaling_factor": 100 } } } } }'