We have already discussed general tips and methods for performance tuning in our three-part tutorial series, “The Authoritative Guide to Elasticsearch Performance Tuning,” explaining at each step the most relevant system configuration settings and metrics. The parts of that series are as follows:

The Authoritative Guide to Elasticsearch Performance Tuning (Part 1)

The Authoritative Guide to Elasticsearch Performance Tuning (Part 2)

The Authoritative Guide to Elasticsearch Performance Tuning (Part 3)

The aim of this tutorial is to recommend performance tuning techniques and strategies specific to Elasticsearch 5.0 and onwards.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Elasticsearch 5.0.0 was a major release after the Elasticsearch 2.x line, and it has something for everyone. It is part of a wider release of the Elastic Stack that lines up the version numbers of all the stack products: Kibana, Logstash, Beats, and Elasticsearch are all at version 5.0 now. It is the fastest, safest, most resilient, and easiest-to-use version of Elasticsearch ever, and it comes with a boatload of enhancements and new features.

Avoid Large Documents

Large documents are usually not practical and put more stress on the network, memory, and disk, even for search requests that do not request the _source, since Elasticsearch needs to fetch the _id of the document in all cases, and the cost of getting this field is bigger for large documents due to how the filesystem cache works. Indexing such a document can use an amount of memory that is a multiplier of the original size of the document. Proximity search (phrase queries, for instance) and highlighting also become more expensive, since their cost depends directly on the size of the original document.

In order to perform highlighting, the actual content of the field is required. If the field in question is stored (has store set to true in the mapping), it will be used; otherwise, the actual _source will be loaded and the relevant field will be extracted from it.

curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d '{
   "query" : {
       "match": { "content": "qbox" }
   },
   "highlight" : {
       "fields" : {
           "content" : {}
       }
   }
}'
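
If you would rather have highlighting read the field directly instead of extracting it from _source, the field can be marked as stored in the mapping. Here is a minimal sketch; the my_index and my_type names are illustrative:

curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d '{
  "mappings": {
    "my_type": {
      "properties": {
        "content": {
          "type": "text",
          "store": true
        }
      }
    }
  }
}'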

The default value of http.max_content_length is 100MB, so Elasticsearch will refuse to index any document larger than that. You might decide to increase this particular setting, but Lucene still has a hard limit of about 2GB.
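
Since http.max_content_length is a static node setting, raising it is done in elasticsearch.yml on every node rather than through the REST API. A minimal sketch; the 500mb value is only an illustration:

# elasticsearch.yml
# Raise the maximum HTTP request body size from the default of 100mb.
# Documents must still stay well under Lucene's ~2GB hard limit.
http.max_content_length: 500mb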

It is sometimes useful to reconsider what the unit of information should be. For instance, the fact that you want to make books searchable doesn’t necessarily mean that a document should consist of a whole book. It might be a better idea to use chapters or even paragraphs as documents, and then have a property in these documents that identifies which book they belong to. Not only does this avoid the issues with large documents, it also makes for a better search experience.
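
For example, a per-chapter document can carry a field identifying the book it belongs to, so results can still be filtered or grouped by book. A minimal sketch, assuming an illustrative books index with a chapter type:

curl -XPUT 'localhost:9200/books/chapter/1?pretty' -H 'Content-Type: application/json' -d '{
  "book_id": "moby-dick",
  "book_title": "Moby-Dick",
  "chapter": 1,
  "content": "Call me Ishmael. ..."
}'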

Avoid Fetching Large Result Sets

Elasticsearch is designed as a search engine, which makes it very good at getting back the top documents that match a query. However, it is not as good for workloads that fall into the database domain, such as retrieving all documents that match a particular query. If you need to do this, make sure to use the Scroll API.

In order to use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the “search context” alive, e.g. ?scroll=1m.

curl -XPOST 'localhost:9200/twitter/tweet/_search?scroll=1m&pretty' -H 'Content-Type: application/json' -d '{
   "size": 100,
   "query": {
       "match" : {
           "title" : "qbox"
       }
   }
}'

The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.

curl -XPOST 'localhost:9200/_search/scroll?pretty' -H 'Content-Type: application/json' -d '{
   "scroll" : "1m",
   "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}'
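
Search contexts are not free: each open scroll holds resources on the cluster until it expires, so it is good practice to release it explicitly once you have fetched everything. A minimal sketch, reusing the scroll_id from the example above:

curl -XDELETE 'localhost:9200/_search/scroll?pretty' -H 'Content-Type: application/json' -d '{
  "scroll_id" : ["DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="]
}'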

Please refer to “Searching and Fetching Large Datasets in Elasticsearch Efficiently” for an in-depth guide to the Scroll API.

Avoid Sparsity and Maximize Density of Documents

The data structures behind Lucene, which Elasticsearch relies on to index and store data, work best with dense data, i.e., when all documents have the same fields. This is especially true for fields that have norms enabled (the default for text fields) or doc values enabled (the default for numeric, date, ip, and keyword fields).

If an index has M documents, norms will require M bytes of storage per field, even for fields that appear in only a small fraction of the documents in the index. The situation is slightly more complex with doc values, because doc values can be encoded in several ways depending on the type of field and on the actual data the field stores, but the problem is very similar. Fielddata, which was used in Elasticsearch before 2.0 until it was replaced with doc values, also suffered from this issue, except that the impact was only on the memory footprint, since fielddata was not explicitly materialized on disk.

The most notable impact of sparsity is on storage requirements, but it also affects indexing and search speed, since these bytes for documents that do not have a field still need to be written at index time and skipped over at search time. Sparsity also affects the efficiency of the inverted index (used to index text/keyword fields) and dimensional points (used to index geo_point and numeric fields), though to a lesser extent.

Here are some recommendations that can help avoid sparsity:

Avoid Putting Unrelated Data in the Same Index

We should avoid putting documents with totally different structures into the same index. It is often better to put such documents into different indices, and we could also consider giving fewer shards to these smaller indices, since they will contain fewer documents overall.
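
For instance, a small index can be created with a single shard instead of the default five. A minimal sketch; the transactions index name and shard counts are illustrative:

curl -XPUT 'localhost:9200/transactions?pretty' -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'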

This rule does not apply in the case where you need to use parent/child relations between your documents since this feature is only supported on documents that live in the same index.

Normalize Document Structures

If you really need to put different kinds of documents in the same index, there may still be opportunities to reduce sparsity. For instance, if all documents in the index have a timestamp field, but some call it timestamp and others call it creation_date, it would help to rename one of them so that all documents use the same field name for the same data.
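
If the data has already been indexed under the old name, one way to normalize it is the Reindex API combined with a small script that renames the field on the fly. A minimal sketch; the logs_old and logs_new index names are illustrative:

curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d '{
  "source": { "index": "logs_old" },
  "dest": { "index": "logs_new" },
  "script": {
    "inline": "if (ctx._source.creation_date != null) { ctx._source.timestamp = ctx._source.remove(\"creation_date\") }",
    "lang": "painless"
  }
}'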

Avoid Types

Let’s consider this mapping of two types in the data index:

{
   "data": {
      "mappings": {
         "people": {
            "properties": {
               "name": {
                  "type": "string",
               },
               "address": {
                  "type": "string"
               }
            }
         },
         "transactions": {
            "properties": {
               "timestamp": {
                  "type": "date",
                  "format": "strict_date_optional_time"
               },
               "message": {
                  "type": "string"
               }
            }
         }
      }
   }
}

Each type defines two fields ("name"/"address" and "timestamp"/"message" respectively). It may look like they are independent, but under the covers Lucene will create a single mapping which would look something like this:

{
   "data": {
      "mappings": {
        "_type": {
          "type": "string",
          "index": "not_analyzed"
        },
        "name": {
          "type": "string"
        }
        "address": {
          "type": "string"
        }
        "timestamp": {
          "type": "long"
        }
        "message": {
          "type": "string"
        }
      }
   }
}

The mappings are essentially flattened into a single, global schema for the entire index. That’s the reason why two types cannot define conflicting fields: Lucene wouldn’t know what to do when the mappings are flattened together.

Types might sound like a good way to store multiple tenants in a single index. They are not: given that types store everything in a single index, having multiple types with different fields in that index will also cause problems due to sparsity, as described above. If your types do not have very similar mappings, you might want to consider moving them to dedicated indices.

Disable Norms and Doc_Values on Sparse Fields

If none of the above recommendations apply in your case, you might want to check whether you actually need norms and doc_values on your sparse fields. Norms can be disabled if producing scores is not necessary on a field; this is typically true for fields that are only used for filtering. doc_values can be disabled on fields that are used neither for sorting nor for aggregations. Beware that this decision should not be made lightly: these parameters cannot be changed on a live index, so you would have to reindex if you later realize that you do need norms or doc_values.

Norms can be disabled (but not re-enabled) after the fact, using the PUT mapping API like so:

curl -XPUT 'localhost:9200/my_index/_mapping/my_type?pretty' -H 'Content-Type: application/json' -d '{
 "properties": {
   "title": {
     "type": "text",
     "norms": false
   }
 }
}'

Note: Norms will not be removed instantly, but will be removed as old segments are merged into new segments as you continue indexing new documents. Any score computation on a field that has had its norms removed might return inconsistent results, since some documents will no longer have norms while other documents might still have them.
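
If you need the old norms gone sooner, a force merge will rewrite the remaining segments; note that this is an expensive operation best reserved for indices that are no longer being written to. A minimal sketch:

curl -XPOST 'localhost:9200/my_index/_forcemerge?max_num_segments=1&pretty'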

All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:

curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "my_type": {
     "properties": {
       "status_code": {
         "type":       "keyword"
       },
       "session_id": {
         "type":       "keyword",
         "doc_values": false
       }
     }
   }
 }
}'

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
