This is a continuation of our long-running blog series on Elasticsearch scripting, which includes tutorials and example scripts for sorting, filtering, and scoring. In this article, we move on to various scripting options that are available for managing ES aggregations.

Default aggregations don't always give a developer the expected results, and the basic aggregation features have limitations. This is the case, for example, if we want to alter the offset values for a histogram. Since Elasticsearch doesn't provide this capability natively, we use scripts to get the results we want. We also cover several other aggregation tasks using scripts.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Modeling the Data

To support the examples below, we provide a document set containing details for employees of a fictitious company. The data for each employee includes the name, age, position, and salary. We create an employee index that holds these documents under the profile type.

Document 1:

curl -XPOST 'http://localhost:9200/employee/profile/1' -d '{
    "name": "Bob",
    "age": 35,
    "about": "Bob joined the company as a full time technology consultant in the year 2012",
    "position": "consultant",
    "salary": 5000,
    "experience": "3-years",
    "married": 1,
    "fullTime": true
}'

Document 2:

curl -XPOST 'http://localhost:9200/employee/profile/2' -d '{
    "name": "Jack",
    "age": 30,
    "about": "Jack joined the company as a part time management consultant in the year 2013",
    "position": "Management consultant",
    "salary": 3000,
    "experience": "3-years",
    "married": 0,
    "fullTime": false
}'

Document 3:

curl -XPOST 'http://localhost:9200/employee/profile/3' -d '{
    "name": "Tom",
    "age": 33,
    "about": "Tom is serving as the operations manager of the firm from the year 2011",
    "position": "Operations manager",
    "salary": 7000,
    "experience": "7-years",
    "married": 1,
    "fullTime": true
}'

For brevity in this tutorial, we only index three documents. Of course, you can change the values in these examples and index more documents.
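If you do index more documents, sending them one curl call at a time gets tedious. As a minimal sketch, the Python snippet below builds a newline-delimited body of the kind the _bulk endpoint accepts, with one action line followed by one source line per document. The helper name build_bulk_body and the trimmed-down employee records are our own illustrations, not part of the original data set, and the body would still need to be sent to the cluster (for example, with curl and --data-binary).

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Hypothetical helper: build a newline-delimited body for the _bulk endpoint."""
    lines = []
    for doc_id, source in docs.items():
        # Action line: tells Elasticsearch where to index the next line.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

# Trimmed-down versions of the three employee documents (illustration only).
employees = {
    "1": {"name": "Bob", "salary": 5000, "experience": "3-years"},
    "2": {"name": "Jack", "salary": 3000, "experience": "3-years"},
    "3": {"name": "Tom", "salary": 7000, "experience": "7-years"},
}

body = build_bulk_body("employee", "profile", employees)
print(body)
```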

Using Scripts to Change Default Histogram Values

Suppose that our supervisor needs a histogram so that she can learn how many employees fall into each specified salary interval (bin). The histogram should divide into intervals of 3,000 dollars. We now have the interval and the data to perform the histogram aggregation. There is a problem with such an aggregation: since the interval is 3,000, a simple breakdown would result in divisions at 3,000, 6,000, 9,000, and so on.

Our supervisor clarifies, and tells us that we need to break down the data such that we learn who is earning a salary that ranges from 0-3,000, then 3,000-6,000, and so on. This is an offset manipulation of histogram values, and it cannot be done using the normal Elasticsearch aggregation feature. Of course, the point of this article is that we can indeed use scripting to accomplish this task.

Here is a query that can help us:

curl -XGET 'http://localhost:9200/employee/profile/_search?&pretty=true&size=3' -d '{
    "query": {
        "match_all": {}
    }, 
    "aggs": {
        "histogramData": {
            "histogram": {
                "field": "salary",
                "interval": 3000,
                "script": "_value + 2000"
            }     
        }   
    }
}'

This script adds 2,000 to each salary value before the value is placed in a bucket. Elasticsearch then assigns the shifted value to a bucket using the interval we have given (3,000). For our three documents, the shifted salaries of 7,000 (Bob), 5,000 (Jack), and 9,000 (Tom) land in the buckets keyed 6,000, 3,000, and 9,000, each of which spans 3,000. Changing the constant in the script moves the bucket boundaries relative to the original salaries, so we can place the divisions where our supervisor needs them.
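To see what adding a constant to _value does to the bucketing, it can help to reproduce the arithmetic outside the cluster. The Python sketch below assumes that a histogram bucket key is the value rounded down to the nearest multiple of the interval. The names bucket_key, INTERVAL, and SHIFT are ours, chosen for illustration.

```python
INTERVAL = 3000  # the "interval" given to the histogram aggregation
SHIFT = 2000     # the constant added by the script "_value + 2000"

def bucket_key(value, interval=INTERVAL):
    """Round a value down to the nearest multiple of the interval."""
    return (value // interval) * interval

# Salaries from the three sample documents.
salaries = {"Bob": 5000, "Jack": 3000, "Tom": 7000}

# Bucket keys without the script and with the script applied.
plain = {name: bucket_key(salary) for name, salary in salaries.items()}
shifted = {name: bucket_key(salary + SHIFT) for name, salary in salaries.items()}

for name in salaries:
    print(name, "plain bucket:", plain[name], "shifted bucket:", shifted[name])
```

Trying other values for SHIFT is a quick way to check where the boundaries fall before running the query against real data.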

Using Scripts to Break Values in Fields

Next we'll extract only a specific piece of data from a specific field for aggregation. The documents in the index contain the field experience, which has values of the form "x-years" (where "x" is a number).

If we attempt to do a normal aggregation, the buckets that we get will have names such as "3-years," "4-years," and "7-years." Let's say, however, that we need the bucket names to be "3," "4," and "7." This can be done by breaking the value at the character "-" and then using only the first element after the break.

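Before we do that, we need to delete the index and recreate it. This is because Elasticsearch maps the data type of each field automatically and, by default, analyzes string fields, which breaks their values into separate tokens. To split the field value ourselves, we need to map that field as not_analyzed so that the script receives the whole value, such as "3-years."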

Delete the index with this command:

curl -XDELETE "http://localhost:9200/employee"

Now, we create a new index with the same name:

curl -XPUT "http://localhost:9200/employee"

Next, specify in the mapping that the field experience is to be not_analyzed:

curl -X PUT "http://localhost:9200/employee/profile/_mapping" -d '{
    "profile":{
        "properties":{
            "experience":{
                "type":"string",
                "index": "not_analyzed"
            }
        }
    }
}'

With the experience field mapped as not_analyzed, index the three employee documents again, since deleting the index removed them. We can then run the script that splits the experience field. The aggregation query is as follows:

curl -XGET 'http://localhost:9200/employee/profile/_search?&pretty=true&size=3' -d '{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "urls": {
            "terms": {
                "field": "experience",
                "script": "_value.split(\"-\")[0]"
            }
        }
    }
}'

After running this query, we'll see that the values from the experience field are split at the "-" character and only the first element (the number before the hyphen) is used as the bucket key.
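The same split can be tried locally to confirm which part of the value ends up as the bucket key. This is only a Python sketch of the logic, not the script that runs inside Elasticsearch. The function name bucket_name is our own, and the counts are computed over the experience values of the three sample documents.

```python
from collections import Counter

def bucket_name(value, separator="-"):
    """Keep only the part of the value that precedes the separator."""
    return value.split(separator)[0]

# Experience values from the three sample documents.
experience_values = ["3-years", "3-years", "7-years"]

# Emulate the terms aggregation: count documents per derived bucket name.
buckets = Counter(bucket_name(value) for value in experience_values)

for key, doc_count in sorted(buckets.items()):
    print(key, doc_count)
```

Note that the split keeps everything before the hyphen, so a value such as "12-years" would produce the key "12" rather than a single character.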

Using Scripts to Perform Terms Aggregation on Multiple Fields

When using a terms aggregation, we may get more benefit by performing the aggregation on multiple fields. Let's say that we want to do a terms aggregation on the field about. A default terms aggregation will give us only the document counts of the top terms in that field. We might also need to perform another terms aggregation on the field position, which would return the document counts of the top terms for that field. Taking this example one step further, we might need a single terms aggregation across both fields, which is important for cases in which we need the results of both aggregations under the same buckets.

There is no such option available to us in the Elasticsearch terms aggregation. So let's give it a try using a script, which is actually rather easy. Here's how we can perform a terms aggregation on the about and position fields.

curl -XGET 'http://localhost:9200/employee/profile/_search?&pretty=true' -d '{
    "aggs": {
        "union_demo": {
            "terms": {
                "size": 30,
                "script": "doc[\"about\"].values + doc[\"position\"].values"
            }
        }
    }
}'

Notice here that we have set the size parameter to 30. We do this because the about field contains many words, so the number of buckets is greater than 10, and by default the Elasticsearch terms aggregation returns only the top 10. This query shows us the union of the terms from both fields, aggregated under a single set of buckets.
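To get a feel for why the default of 10 buckets is too small here, we can approximate the union of terms locally. The Python sketch below rests on two assumptions that may not match your cluster exactly: it uses lowercasing plus whitespace splitting as a crude stand-in for the analyzer, and it counts each document once per distinct term. The names tokens and union_terms are ours.

```python
from collections import Counter

# Only the two fields used by the aggregation are reproduced here.
docs = [
    {"about": "Bob joined the company as a full time technology consultant in the year 2012",
     "position": "consultant"},
    {"about": "Jack joined the company as a part time management consultant in the year 2013",
     "position": "Management consultant"},
    {"about": "Tom is serving as the operations manager of the firm from the year 2011",
     "position": "Operations manager"},
]

def tokens(text):
    """Crude stand-in for an analyzer (assumption): lowercase, split on whitespace."""
    return text.lower().split()

def union_terms(doc):
    """Distinct terms drawn from both fields of one document."""
    return set(tokens(doc["about"])) | set(tokens(doc["position"]))

# For every term, count the documents that contain it in either field.
counts = Counter(term for doc in docs for term in union_terms(doc))

for term, doc_count in counts.most_common(5):
    print(term, doc_count)
print("distinct terms:", len(counts))
```

Under these assumptions the three sample documents yield 26 distinct terms, which is comfortably covered by a size of 30 but would be cut off at the default.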

Conclusion

In this scripting tutorial, we've seen how to achieve several types of aggregations that are not possible with native Elasticsearch features.

We've gone through manipulating offset values in histogram aggregations, splitting the value in a specific field, and performing a terms aggregation on multiple fields. All of this was done with scripting.

In the next article, we shall proceed into advanced scripting with more elaborate types of aggregations.

We trust that this information has been helpful to you, and we welcome your feedback in the comments section below.
