Elasticsearch 5.0.0 had really been a major release after Elasticsearch 2.x version and it does have something for everyone. It is a part of a wider release of the Elastic Stack which lines-up version numbers of all the stack products. Kibana, Logstash, Beats, Elasticsearch - are all version 5.0 now. It is the fastest, safest, most resilient, easiest to use version of Elasticsearch ever, and it comes with a boatload of enhancements and new features.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

We have already discussed the “The Authoritative Guide to Elasticsearch Performance Tuning” in a three part tutorial series to introduce some general tips and methods for performance tuning, explaining at each step the most relevant system configuration settings and metrics. The tutorials already covered in the same context are as follows:

The indexing decisions are quite important, and they have big impact on how you can search your data. If it's a string field, should it be tokenized and normalized? If so, how? If a numeric field, what precision is required? There are many more field types like date-time fields, geospatial shapes, and parent/child relationships that require special care.

We have already discussed the “How to Maximize Elasticsearch Indexing Performance” in a three part tutorial series to introduce some general tips and methods to achieve maximum indexing throughput and reduce monitoring and management load. The tutorials already covered in the same context are as follows:

The aim of this tutorial is to recommend some Search Tuning techniques, strategies and recommendations specific to Elasticsearch 5.0 or onwards.

Document Modeling

Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values. For instance, the following document:

curl -XPUT 'localhost:9200/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d '{
 "group" : "fans",
 "user" : [
   {
     "first" : "John",
     "last" :  "Smith"
   },
   {
     "first" : "Alice",
     "last" :  "White"
   }
 ]
}'

would be transformed internally into a document that looks more like this:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

If you need to index arrays of objects and to maintain the independence of each object in the array, you should use the nested datatype instead of the object datatype. Internally, nested objects index each object in the array as a separate hidden document, meaning that each nested object can be queried independently of the others, with the nested query:

curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "my_type": {
     "properties": {
       "user": {
         "type": "nested"
       }
     }
   }
 }
}'
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d '{
 "group" : "fans",
 "user" : [
   {
     "first" : "John",
     "last" :  "Smith"
   },
   {
     "first" : "Alice",
     "last" :  "White"
   }
 ]
}'
curl -XGET 'ES_HOST:ES_PORT/my_index/_search?pretty' -H 'Content-Type: application/json' -d '{
 "query": {
   "nested": {
     "path": "user",
     "query": {
       "bool": {
         "must": [
           { "match": { "user.first": "Alice" }},
           { "match": { "user.last":  "Smith" }}
         ]
       }
     }
   }
 }
}'
curl -XGET 'ES_HOST:ES_PORT/my_index/_search?pretty' -H 'Content-Type: application/json' -d '{
 "query": {
   "nested": {
     "path": "user",
     "query": {
       "bool": {
         "must": [
           { "match": { "user.first": "Alice" }},
           { "match": { "user.last":  "White" }}
         ]
       }
     },
     "inner_hits": {
       "highlight": {
         "fields": {
           "user.first": {}
         }
       }
     }
   }
 }
}'

Nested objects are useful when there is one main entity, like a blogpost, with a limited number of closely related but less important entities, such as comments. It is useful to be able to find blog posts based on the content of the comments, and the nested query and filter provide for fast query-time joins.

The disadvantages of the nested model are as follows:

  • To add, change, or delete a nested document, the whole document must be reindexed. This becomes more costly the more nested documents there are.

  • Search requests return the whole document, not just the matching nested documents. Although there are plans afoot to support returning the best -matching nested documents with the root document, this is not yet supported.

Sometimes you need a complete separation between the main document and its associated entities. This separation is provided by the parent-child relationship.

A parent-child relationship can be established between documents in the same index by making one mapping type the parent of another:

curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "my_parent": {},
   "my_child": {
     "_parent": {
       "type": "my_parent"
     }
   }
 }
}'
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_parent/1?pretty' -H 'Content-Type: application/json' -d '{
 "text": "This is a parent document"
}'
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_child/2?parent=1&pretty' -H 'Content-Type: application/json' -d '{
 "text": "This is a child document"
}'
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_child/3?parent=1&refresh=true&pretty' -H 'Content-Type: application/json' -d '{
 "text": "This is another child document"
}'
curl -XGET 'ES_HOST:ES_PORT/my_index/my_parent/_search?pretty' -H 'Content-Type: application/json' -d '{
 "query": {
   "has_child": {
     "type": "my_child",
     "query": {
       "match": {
         "text": "child document"
       }
     }
   }
 }
}'

Parent-child joins can be a useful technique for managing relationships when index-time performance is more important than search-time performance, but it comes at a significant cost. Parent-child queries can be 5 to 10 times slower than the equivalent nested query!

Global Ordinals and Latency

Parent-child uses global ordinals to speed up joins. Regardless of whether the parent-child map uses an in-memory cache or on-disk doc values, global ordinals still need to be rebuilt after any change to the index.

The more parents in a shard, the longer global ordinals will take to build. Parent-child is best suited to situations where there are many children for each parent, rather than many parents and few children.

Global ordinals, by default, are built lazily: the first parent-child query or aggregation after a refresh will trigger building of global ordinals. This can introduce a significant latency spike for your users. You can use eager_global_ordinals to shift the cost of building global ordinals from query time to refresh time, by mapping the _parent field as follows:

curl -XPUT 'ES_HOST:ES_PORT/company -d ‘{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch",
        "fielddata": {
          "loading": "eager_global_ordinals"
        }
      }
    }
  }
}’

Here, Global ordinals for the _parent field will be built before a new segment becomes visible to search.

With many parents, global ordinals can take several seconds to build. In this case, it makes sense to increase the refresh_interval so that refreshes happen less often and global ordinals remain valid for longer. This will greatly reduce the CPU cost of rebuilding global ordinals every second.

Multi Generations

The ability to join multiple generations (see Grandparents and Grandchildren) sounds attractive until you think of the costs involved:

  • The more joins you have, the worse performance will be.

  • Each generation of parents needs to have their string _id fields stored in memory, which can consume a lot of RAM.

As you consider your relationship schemes and whether parent-child is right for you, consider this advice about parent-child relationships:

  • Use parent-child relationships sparingly, and only when there are many more children than parents.

  • Avoid using multiple parent-child joins in a single query.

  • Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.

  • Keep the parent IDs short, so that they compress better in doc values, and use less memory when transiently loaded.

The conclusion is that documents should be modeled so that search-time operations are as cheap as possible. In particular, joins should be avoided. nested can make queries several times slower and parent-child relations can make queries hundreds of times slower. So if the same questions can be answered without joins by denormalizing documents, significant speedups can be expected.

Give Memory to the Filesystem Cache

When running Elasticsearch, memory is one of the key resources you’ll want to closely monitor. Elasticsearch and Lucene utilize all of the available RAM on your nodes in two ways: JVM heap and the file system cache. Elasticsearch runs in the Java Virtual Machine (JVM), which means that JVM garbage collection duration and frequency will be other important areas to monitor.

JVM heap

Elasticsearch stresses the importance of a JVM heap size that’s “just right”—you don’t want to set it too big, or too small, for reasons described below. In general, Elasticsearch’s rule of thumb is allocating less than 50 percent of available RAM to JVM heap, and never going higher than 32 GB.

The less heap memory you allocate to Elasticsearch, the more RAM remains available for Lucene, which relies heavily on the file system cache to serve requests quickly. However, you also don’t want to set the heap size too small because you may encounter out-of-memory errors or reduced throughput as the application faces constant short pauses from frequent garbage collections.

Elasticsearch’s default installation sets a JVM heap size of 1 gigabyte, which is too small for most use cases. You can export your desired heap size as an environment variable and restart Elasticsearch:   

export ES_HEAP_SIZE=10g

The other option is to set the JVM heap size (with equal minimum and maximum sizes to prevent the heap from resizing) on the command line every time you start up Elasticsearch:

ES_HEAP_SIZE="10g" ./bin/elasticsearch

In both of the examples shown, we set the heap size to 10 gigabytes. To verify that your update was successful, run:

curl -XGET http://ES_HOST:9200/_cat/nodes?h=heap.max

The output should show you the correctly updated max heap value.

Garbage collection

Elasticsearch relies on garbage collection processes to free up heap memory. As garbage collection uses resources (in order to free up resources!), you should keep an eye on its frequency and duration to see if you need to adjust the heap size. Setting the heap too large can result in long garbage collection times; these excessive pauses are dangerous because they can lead your cluster to mistakenly register your node as having dropped off the grid.

Thus, Elasticsearch heavily relies on the filesystem cache in order to make search fast. In general, you should make sure that at least half the available memory goes to the filesystem cache so that elasticsearch can keep hot regions of the index in physical memory.

Use Faster Hardware

If your search is I/O bound, you should investigate giving more memory to the filesystem cache (see above) or buying faster drives. In particular SSD drives are known to perform better than spinning disks. Always use local storage, remote filesystems such as NFS or SMB should be avoided. Also beware of virtualized storage such as Amazon’s Elastic Block Storage. Virtualized storage works very well with Elasticsearch, and it is appealing since it is so fast and simple to set up, but it is also unfortunately inherently slower on an ongoing basis when compared to dedicated local storage. If you put an index on EBS, be sure to use provisioned IOPS otherwise operations could be quickly throttled.

If your search is CPU-bound, you should investigate buying faster CPUs.

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.

comments powered by Disqus