This post is part 2 of a 3-part series on Elasticsearch search tuning. Part 1 can be found here. The aim of this tutorial is to discuss further search tuning techniques, strategies, and recommendations specific to Elasticsearch 5.0 and onwards.

Kibana, Logstash, Beats, and Elasticsearch are all at version 5.0 now. Elasticsearch 5.0 is the fastest, safest, most resilient, and easiest-to-use version of Elasticsearch ever, and it comes with a boatload of enhancements and new features.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

We have already discussed “The Authoritative Guide to Elasticsearch Performance Tuning” in a three-part tutorial series that introduces general tips and methods for performance tuning, explaining at each step the most relevant system configuration settings and metrics.

Indexing decisions are quite important, and they have a big impact on how you can search your data. If it's a string field, should it be tokenized and normalized? If so, how? If it's a numeric field, what precision is required? There are many more field types, like date-time fields, geospatial shapes, and parent/child relationships, that require special care.

We have already discussed “How to Maximize Elasticsearch Indexing Performance” in a three-part tutorial series that introduces general tips and methods to achieve maximum indexing throughput and reduce monitoring and management load.

Let's move ahead and discuss some further search tuning techniques, strategies, and recommendations specific to Elasticsearch 5.0 and onwards.

Pre-Index Data

We should leverage patterns in our queries to optimize the way data is indexed. For instance, if all your documents have a price field and most queries run range aggregations on a fixed list of ranges, you could make this aggregation faster by pre-indexing the ranges and using a terms aggregation.

For instance, if documents look like:

curl -XPUT 'ES_HOST:ES_PORT/index/type/1?pretty' -H 'Content-Type: application/json' -d '{
 "designation": "bowl",
 "price": 13
}'

and search requests look like:

curl -XGET 'ES_HOST:ES_PORT/index/_search?pretty' -H 'Content-Type: application/json' -d '{
 "aggs": {
   "price_ranges": {
     "range": {
       "field": "price",
       "ranges": [
         { "to": 10 },
         { "from": 10, "to": 100 },
         { "from": 100 }
       ]
     }
   }
 }
}'

Then documents could be enriched with a price_range field at index time, which should be mapped as a keyword:

curl -XPUT 'ES_HOST:ES_PORT/index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "type": {
     "properties": {
       "price_range": {
         "type": "keyword"
       }
     }
   }
 }
}'
curl -XPUT 'ES_HOST:ES_PORT/index/type/1?pretty' -H 'Content-Type: application/json' -d '{
 "designation": "bowl",
 "price": 13,
 "price_range": "10-100"
}'

Search requests could then aggregate on this new field rather than running a range aggregation on the price field:

curl -XGET 'ES_HOST:ES_PORT/index/_search?pretty' -H 'Content-Type: application/json' -d '{
 "aggs": {
   "price_ranges": {
     "terms": {
       "field": "price_range"
     }
   }
 }
}'

Mappings

The fact that some data is numeric does not mean it should always be mapped as a numeric field. Typically, fields storing identifiers, such as an ISBN or any number identifying a record in another database, might benefit from being mapped as keyword rather than integer or long.

The keyword datatype is used to index structured content such as email addresses, hostnames, status codes, zip codes, or tags.

Keyword fields are typically used for filtering (find all blog posts where status is published), for sorting, and for aggregations. They are only searchable by their exact value.

If you need to index full-text content such as email bodies or product descriptions, you should likely use a text field instead.

Below is an example of a mapping for a keyword field:

curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{
 "mappings": {
   "my_type": {
     "properties": {
       "tags": {
         "type":  "keyword"
       }
     }
   }
 }
}'
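
With a mapping like this in place, filtering on the field is an exact-value match. As a quick sketch (the tag value here is just a hypothetical example), a term query against the tags field would look like:

# "elasticsearch" is just an example tag value for illustration
curl -XGET 'ES_HOST:ES_PORT/my_index/_search?pretty' -H 'Content-Type: application/json' -d '{
 "query": {
   "term": {
     "tags": "elasticsearch"
   }
 }
}'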

Indexes imported from 2.x do not support keyword. Instead, they will attempt to downgrade keyword into string. This allows you to merge modern mappings with legacy mappings. Long-lived indexes will have to be recreated before upgrading to 6.x, but the mapping downgrade gives you the opportunity to do the recreation on your own schedule.
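
One way to do that recreation is the Reindex API, which copies documents from an existing index into a newly created one. A minimal sketch, using hypothetical index names old_index and new_index:

# old_index / new_index are placeholder names for illustration
curl -XPOST 'ES_HOST:ES_PORT/_reindex?pretty' -H 'Content-Type: application/json' -d '{
 "source": {
   "index": "old_index"
 },
 "dest": {
   "index": "new_index"
 }
}'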

Avoid Scripts

In general, scripts should be avoided. If they are absolutely needed, you should prefer the Painless and Lucene expressions engines.

Painless is a simple, secure scripting language designed specifically for use with Elasticsearch. It is the default scripting language for Elasticsearch and can safely be used for inline and stored scripts. For a detailed description of the Painless syntax and language features, see the Painless Language Specification.

Please refer to “Painless Scripting in Elasticsearch” for an in-depth guide on the Painless scripting language.
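
As a small illustration (reusing the price field from the pre-indexing example above, with a made-up tax parameter), an inline Painless script can compute a derived value per document via script_fields:

# params.tax (1.2) is a hypothetical parameter used only for illustration
curl -XGET 'ES_HOST:ES_PORT/index/_search?pretty' -H 'Content-Type: application/json' -d '{
 "script_fields": {
   "price_with_tax": {
     "script": {
       "lang": "painless",
       "inline": "doc[\"price\"].value * params.tax",
       "params": { "tax": 1.2 }
     }
   }
 }
}'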

Lucene Expressions Language

Lucene expressions compile a JavaScript expression to bytecode. They are designed for high-performance custom ranking and sorting functions and are enabled for inline and stored scripting by default.

Performance

Expressions were designed to have competitive performance with custom Lucene code. This performance comes from low per-document overhead compared to other scripting engines: expressions do more work up front.

This allows for very fast execution, even faster than if you had written a native script.

Syntax

Expressions support a subset of JavaScript syntax: a single expression. See the expressions module documentation for details on which operators and functions are available.

The following variables are available in expression scripts:

  • Document fields, e.g. doc['myfield'].value

  • Variables and methods that the field supports, e.g. doc['myfield'].empty

  • Parameters passed into the script, e.g. mymodifier

  • The current document’s score, _score (only available when used in a script_score)

We can use expression scripts for script_score, script_fields, sort scripts, and numeric aggregation scripts; simply set the lang parameter to expression.
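
For example (a minimal sketch, again reusing the price field from the earlier examples), an expression can combine _score with a field value inside a function_score query:

# reuses the price field from the pre-indexing example above
curl -XGET 'ES_HOST:ES_PORT/index/_search?pretty' -H 'Content-Type: application/json' -d '{
 "query": {
   "function_score": {
     "query": { "match_all": {} },
     "script_score": {
       "script": {
         "lang": "expression",
         "inline": "_score * doc['\''price'\''].value"
       }
     }
   }
 }
}'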

Force-Merge Read-Only Indices

Indices that are read-only would benefit from being merged down to a single segment. This is typically the case with time-based indices: only the index for the current time frame is getting new documents while older indices are read-only.

The force merge API allows you to force the merging of one or more indices. The merge relates to the number of segments a Lucene index holds within each shard, and the force merge operation reduces the number of segments by merging them.

This call will block until the merge is complete. If the HTTP connection is lost, the request will continue in the background, and any new requests will block until the previous force merge is complete.

curl -XPOST 'ES_HOST:ES_PORT/twitter/_forcemerge?pretty'

The force merge API accepts the following request parameters:

  • max_num_segments - The number of segments to merge to. To fully merge the index, set it to 1 (see the example after this list). Defaults to simply checking whether a merge needs to execute and, if so, executing it.

  • only_expunge_deletes - Whether the merge process should only expunge segments containing deletes. In Lucene, a document is not deleted from a segment, just marked as deleted. During a merge of segments, a new segment is created without those deleted documents. This flag allows you to merge only the segments that contain deletes. Defaults to false. Note that this won’t override the index.merge.policy.expunge_deletes_allowed threshold.

  • flush - Whether a flush should be performed after the forced merge. Defaults to true.
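
For example, to merge the read-only twitter index from above down to a single segment per shard:

curl -XPOST 'ES_HOST:ES_PORT/twitter/_forcemerge?max_num_segments=1&pretty'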

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
