This post is about tuning Elasticsearch disk usage. It covers tuning techniques, strategies, and recommendations that apply to Elasticsearch 5.0 and later.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

We have already covered general tips and methods for performance tuning in the three-part tutorial series “The Authoritative Guide to Elasticsearch Performance Tuning”, explaining at each step the most relevant system configuration settings and metrics.

We have also covered general tips and methods for achieving maximum indexing throughput and reducing monitoring and management load in the three-part tutorial series “How to Maximize Elasticsearch Indexing Performance”.

This post focuses on disk usage tuning techniques, strategies, and recommendations specific to Elasticsearch 5.0 and later.

Index Mapping

By default, Elasticsearch indexes and adds doc values to most fields so that they can be searched and aggregated out of the box. For instance, if you have a numeric field called name that you need to run histograms on but never need to filter on, you can safely disable indexing on this field in your mappings, since aggregations read doc values rather than the index:

curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "type": {
     "properties": {
       "name": {
         "type": "integer",
         "index": false
       }
     }
   }
 }
}'

Fields of type text store normalization factors (norms) in the index so that documents can be scored. Text fields are not used for sorting and are seldom used for aggregations (the significant terms aggregation is a notable exception). If you need to index structured content such as email addresses, hostnames, status codes, or tags, you should most likely use a keyword field instead.

curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "my_type": {
     "properties": {
       "tags": {
         "type":  "keyword"
       }
     }
   }
 }
}'

If you only need matching capabilities on a text field and do not care about the scores it produces, you can configure Elasticsearch not to write norms to the index:

curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "type": {
     "properties": {
       "name": {
         "type": "text",
         "norms": false
       }
     }
   }
 }
}'

Fields of type text also store frequencies and positions in the index by default. Frequencies are used to compute scores, and positions are used to run phrase queries. If you do not need to run phrase queries, you can tell Elasticsearch not to index positions:

curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "type": {
     "properties": {
       "name": {
         "type": "text",
         "index_options": "freqs"
       }
     }
   }
 }
}'

Furthermore, if you do not care about scoring either, you can configure Elasticsearch to index only the matching documents for every term. You will still be able to search on this field, but phrase queries will raise errors and scoring will assume that terms appear only once in every document.

curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "type": {
     "properties": {
       "name": {
         "type": "text",
         "norms": false,
         "index_options": "docs"
       }
     }
   }
 }
}'

Don’t Use Default Dynamic String Mappings

The default dynamic string mappings index string fields both as text and as keyword. This is wasteful if you only need one of them. Typically, an id field only needs to be indexed as a keyword, while a body field only needs to be indexed as text.

This can be disabled by either configuring explicit mappings on string fields or setting up dynamic templates that will map string fields as either text or keyword.

For instance, here is a template that can be used in order to only map string fields as keyword:

curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "type": {
     "dynamic_templates": [
       {
         "strings": {
           "match_mapping_type": "string",
           "mapping": {
             "type": "keyword"
           }
         }
       }
     ]
   }
 }
}'

If we wanted to map all integer fields as integer instead of long, and all string fields as both text and keyword, we could use the following template:

curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "my_type": {
     "dynamic_templates": [
       {
         "integers": {
           "match_mapping_type": "long",
           "mapping": {
             "type": "integer"
           }
         }
       },
       {
         "strings": {
           "match_mapping_type": "string",
           "mapping": {
             "type": "text",
             "fields": {
               "raw": {
                 "type":  "keyword",
                 "ignore_above": 256
               }
             }
           }
         }
       }
     ]
   }
 }
}'

When the following document is indexed, the my_integer field is mapped as an integer and the my_string field is mapped as a text field with a keyword multi-field.

curl -XPUT 'ES_HOST:ES_PORT/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d '{
 "my_integer": 5,
 "my_string": "Some string"
}'
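To check that the dynamic templates were applied as described, one option is to read the mapping back after indexing the document. This is a minimal sketch that assumes the my_index index and the document above already exist on your cluster; ES_HOST and ES_PORT are the same placeholders used in the other examples.

```shell
# Fetch the current mapping of my_index (placeholder host and port, as in the examples above).
# The response is expected to show my_integer with type "integer" and
# my_string with type "text" plus a "raw" sub-field of type "keyword".
curl -XGET 'ES_HOST:ES_PORT/my_index/_mapping?pretty'
```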

Disable _all

The _all field is a special catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter, which is then analyzed and indexed, but not stored. This means that it can be searched, but not retrieved. The _all field indexes the value of all fields of a document and can use significant space. If you never need to search against all fields at the same time, it can be disabled.

Here, the _all field is left enabled (the default) in type_1 and is completely disabled in type_2.

curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "type_1": {
     "properties": {...}
   },
   "type_2": {
     "_all": {
       "enabled": false
     },
     "properties": {...}
   }
 }
}'

Use best_compression

The _source and stored fields can easily take up a non-negligible amount of disk space. They can be compressed more aggressively with the best_compression codec. By default, stored data is compressed with LZ4; setting the static index setting index.codec to best_compression switches to DEFLATE, which gives a higher compression ratio at the expense of slower stored fields performance.
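Because index.codec is a static setting, the simplest place to set it is at index creation time. The request below is a minimal sketch, assuming an index named my_index that does not exist yet; the index name is a placeholder, and ES_HOST and ES_PORT follow the other examples in this post.

```shell
# Create an index whose stored fields are compressed with DEFLATE
# instead of the default LZ4. "my_index" is a placeholder name.
curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{
 "settings": {
   "index": {
     "codec": "best_compression"
   }
 }
}'
```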

Use the Smallest Sufficient Numeric Type

As far as integer types (byte, short, integer and long) are concerned, you should pick the smallest type which is enough for your use-case. This will help indexing and searching be more efficient. Note however that given that storage is optimized based on the actual values that are stored, picking one type over another one will have no impact on storage requirements.

For floating-point types, it is often more efficient to store floating-point data as an integer using a scaling factor, which is what the scaled_float type does under the hood. For instance, a price field could be stored in a scaled_float with a scaling_factor of 100. All APIs would work as if the field were stored as a double, but under the hood Elasticsearch would be working with the number of cents, price*100, which is an integer. This mostly helps save disk space, since integers are much easier to compress than floating-point numbers. scaled_float is also a reasonable choice when you are willing to trade accuracy for disk space.
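The arithmetic behind this can be sketched in a few lines of shell. This is only an illustration of the idea (multiply by the scaling factor, round to the nearest integer, divide back when reading), not Elasticsearch's actual implementation. The helper names scale and unscale are hypothetical, the prices are made-up sample values, and the sketch assumes awk is available.

```shell
# Hypothetical helper: multiply a value by the scaling factor and round
# to the nearest integer, as a scaled_float field is assumed to do at index time.
scale() {
  awk -v p="$1" -v f="$2" 'BEGIN { printf "%.0f", p * f }'
}

# Hypothetical helper: divide the stored integer by the scaling factor
# to recover the value the field represents (two decimals for factor 100).
unscale() {
  awk -v s="$1" -v f="$2" 'BEGIN { printf "%.2f", s / f }'
}

scaling_factor=100

# A price with two decimals survives the round trip unchanged.
stored=$(scale 19.99 "$scaling_factor")
echo "19.99 is stored as $stored and read back as $(unscale "$stored" "$scaling_factor")"

# A price with more precision than the factor allows is rounded:
# this is the accuracy that is traded for disk space.
stored=$(scale 19.999 "$scaling_factor")
echo "19.999 is stored as $stored and read back as $(unscale "$stored" "$scaling_factor")"
```

With a scaling_factor of 100, anything beyond two decimal places is lost, so the factor should be chosen to match the precision you actually need.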

If scaled_float is not a good fit, then you should pick the smallest type that is enough for the use-case among the floating-point types: double, float and half_float.

Below is an example of configuring a mapping with numeric fields:

curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "my_type": {
     "properties": {
       "number_of_bytes": {
         "type": "integer"
       },
       "time_in_seconds": {
         "type": "float"
       },
       "price": {
         "type": "scaled_float",
         "scaling_factor": 100
       }
     }
   }
 }
}'

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
