This is a continuation of our extensive blog series on Elasticsearch scripting, which includes tutorials and example scripts for sorting, filtering, and scoring. In our previous article, we went through a basic tutorial on performing aggregations in Elasticsearch using scripts.

In this tutorial we move on to more advanced operations: computing term frequencies, reshaping the results of extended_stats aggregations, and implementing scripted_metric aggregations.

Modeling the Data

The examples in this article use the documents given below, which contain details of a single product type sold by different companies. The data includes the company name, product type, product price, the markets in which the product is sold, units available for sale, and a shipment indicator. We begin by indexing these documents under the index named sales and a type named profile.

Document 1

curl -XPOST 'http://localhost:9200/sales/profile/1' -d '{
    "companyName": "Nestle",
    "productType": "milk",
    "markets": ["US","India","China"],
    "price": 20,
    "units": 1400,
    "shipped": "yes"
}'

Document 2

curl -XPOST 'http://localhost:9200/sales/profile/2' -d '{
    "companyName": "Knor",
    "productType": "milk",
    "markets": ["US","Korea","France"],
    "price": 15,
    "units": 1200,
    "shipped": "yes"
}'

Document 3

curl -XPOST 'http://localhost:9200/sales/profile/3' -d '{
    "companyName": "Britannia",
    "productType": "milk",
    "markets": ["Portugal","India","Spain"],
    "price": 15,
    "units": 1000,
    "shipped": "yes"
}'

Modifying “extended_stats” Metrics Values using Scripts

The Elasticsearch extended_stats aggregation is an extension of its stats aggregation. In addition to the count, minimum, maximum, average, and sum that stats reports for a field, it provides further metrics including the sum of squares, variance, and standard deviation.

For simplicity, our index contains just three documents that represent only three companies. However, you can easily imagine a real-world scenario in which we have numerous companies and many types of products for each company. There would be many documents having the same value for companyName but with a wide range of products.

In such cases, it might be necessary to modify one or more values that bubble up in the aggregation. Let’s say that we need to focus on a specific price value that is found within the extended_stats metrics. How do we accomplish this?

Here’s a query that will classify the data according to companyName while changing the price field in the extended_stats metrics:

curl -XGET 'http://localhost:9200/sales/profile/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "company_aggs": {
      "terms": {
        "field": "companyName",
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "price_modify": {
          "extended_stats": {
            "field": "price",
            "script": "_value == 20 ? 1 : 0"
          }
        }
      }
    }
  }
}'

Did you notice that this is a nested aggregation? The company_aggs aggregation will aggregate the documents according to the field companyName, in descending order of doc_count. The price_modify aggregation, which is inside company_aggs, will perform an extended_stats aggregation on the price field.

The script that is part of the price_modify aggregation checks whether the price field has a value of 20. If so, the value 1 is fed into the extended_stats metrics in place of the price; otherwise the value 0 is used. The stored documents themselves are not changed; only the values seen by the aggregation are.
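To make the effect of the value script concrete, here is a minimal local sketch in Python that mimics what the nested aggregation computes for our three documents. It does not talk to Elasticsearch; the function names (value_script, extended_stats, company_price_stats) are our own, and the use of population variance is an assumption about how the metric is computed.

```python
import math
from collections import defaultdict

# Only the fields this aggregation touches, copied from the three documents.
DOCS = [
    {"companyName": "Nestle", "price": 20},
    {"companyName": "Knor", "price": 15},
    {"companyName": "Britannia", "price": 15},
]


def value_script(price):
    """Local stand-in for the script `_value == 20 ? 1 : 0`."""
    return 1 if price == 20 else 0


def extended_stats(values):
    """Compute extended_stats-style metrics (population variance assumed)."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    # Clamp at zero to guard against tiny negative rounding errors.
    variance = max(0.0, sum_of_squares / count - avg * avg)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }


def company_price_stats(docs):
    """Bucket by companyName, then run the stats over the scripted values."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc["companyName"]].append(value_script(doc["price"]))
    return {name: extended_stats(vals) for name, vals in buckets.items()}


if __name__ == "__main__":
    for name, stats in company_price_stats(DOCS).items():
        print(name, stats)
```

With one document per company, each bucket sees a single value, so the Nestle bucket reports a sum of 1 and the other two report a sum of 0. The variance only becomes interesting once a company has several documents with different prices.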

Calculating the Term Frequency for a Specific Value within a Field

As we explain in our first article on aggregations with scripts, we can use terms aggregation to get the count of the number of documents that contain a specific field. If we want to know the total number of occurrences for a particular field within an entire index, we can use the sum aggregation.

Let’s say that we need to perform an aggregation on the data in our index according to the markets field. More specifically, we need to know the number of cases in which this field contains the string “India” along with at least one other market. The aim is to learn how many companies are selling in at least one other market in addition to India.

curl -XGET 'http://localhost:9200/sales/profile/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "markets_agg": {
      "terms": {
        "field": "markets"
      },
      "aggs": {
        "paired_agg": {
          "sum": {
            "script": "_index[\"markets\"][\"india\"].tf()"
          }
        }
      }
    }
  }
}'

After running this query, we see that the top-level of this nested aggregation is markets_agg, which gives us the document counts according to the occurrences of distinct values in the markets field. The inside-level aggregation is paired_agg, in which we employ a script to give us the term frequency of the word “india.” That term frequency value is then used in the buckets of the top-level aggregation. If it returns a value of 0 for any of the other bucketed countries, then the specific bucketed country and India have not been found in the value of the same markets field.

We find in our results that the “US” key within the markets_agg buckets has a value of 1 for its paired_agg. This tells us that at least one company sells in at least one other market in addition to India. If we need to get the details for any of the companies having this combination, we can simply employ the top hits aggregation to fetch that info for us.
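The bucket arithmetic can be reproduced locally with a short Python sketch. It is only an approximation of what Elasticsearch does: we assume the markets field is analyzed so that each value becomes a single lowercase term (which is why the script looks up "india"), and the helper names analyze and paired_tf are hypothetical.

```python
from collections import defaultdict

# The markets arrays from the three documents, in indexing order.
MARKETS_BY_DOC = [
    ["US", "India", "China"],
    ["US", "Korea", "France"],
    ["Portugal", "India", "Spain"],
]


def analyze(values):
    """Rough stand-in for analysis: lowercase each single-word value."""
    return [v.lower() for v in values]


def paired_tf(docs, term):
    """For each market bucket, count documents and sum the tf of `term`."""
    buckets = defaultdict(lambda: {"doc_count": 0, "paired_agg": 0})
    for markets in docs:
        tokens = analyze(markets)
        tf = tokens.count(term)  # term frequency of `term` in this document
        for key in set(tokens):  # each document falls into one bucket per market
            buckets[key]["doc_count"] += 1
            buckets[key]["paired_agg"] += tf
    return dict(buckets)


if __name__ == "__main__":
    for key, bucket in sorted(paired_tf(MARKETS_BY_DOC, "india").items()):
        print(key, bucket)
```

In this sketch the "us" bucket holds two documents but only one of them mentions India, so its paired_agg is 1, while the "korea" and "france" buckets come out at 0 because no document pairs them with India.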

Scripted_metric Aggregations

Scripted_metric is a metric aggregation that uses scripts to provide metric output. Most importantly, this gives us the freedom to define our own aggregations. How might we use it in our context here?

Looking again at our documents above, we now shift our focus to three fields in these documents: productType, price, and units. We also notice that there is only one type of product, “milk.” Suppose we need to calculate the total revenue for this product type. What would you do?

Our approach is to multiply the per-unit price by the value of the units field in each document, and then sum all of those results. Here’s how we do that using scripts:

curl -XGET 'http://localhost:9200/sales/profile/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "expected_revenue": {
      "scripted_metric": {
        "init_script": "_agg[\"tempArray\"] = [];",
        "map_script": "if (doc.productType.value == \"milk\") { _agg.tempArray.add(doc.price.value*doc.units.value); } ",
        "combine_script": "exRevenue = 0; for (i in _agg.tempArray) { exRevenue += i }; return exRevenue;",
        "reduce_script": "exRevenue = 0; for (j in _aggs) { exRevenue += j }; return exRevenue;"
      }
    }
  }
}'

Walking through the query above, we see that the name of the aggregation is expected_revenue, and we tell Elasticsearch that this is a scripted_metric aggregation. Notice that there are four script parameters, although only init_script and map_script are needed to make this query work. We’ll explain more about that after defining each of the parameters:

  • init_script: As indicated by the _agg construct, we initialize an array named tempArray in our aggregation object.
  • map_script: Here we check for a specific condition or match. This is where we check whether the productType field value is “milk,” and if this condition is met, we push the product of the price and units fields into tempArray.
  • combine_script: After map_script has run on a shard, that shard holds a tempArray of per-document revenues. Here we iterate over the array and add its values into a single exRevenue number, which is returned as the result for that shard.
  • reduce_script: In this second iteration, we loop over the values returned by combine_script on each shard (available as _aggs) and add them up to get a single total.

In the results, we find that the total revenue (expected_revenue) from sales of the milk product type is 61,000.
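To check the arithmetic behind that figure and to see how the four scripts hand data to each other, here is a small Python sketch that imitates the init, map, combine, and reduce phases. It runs entirely locally; the way documents are split into shards is a made-up example, and the function names simply mirror the script parameters.

```python
# Only the fields the scripts read, copied from the three documents.
DOCS = [
    {"productType": "milk", "price": 20, "units": 1400},
    {"productType": "milk", "price": 15, "units": 1200},
    {"productType": "milk", "price": 15, "units": 1000},
]


def init_script():
    """Create the per-shard aggregation state, like _agg["tempArray"] = []."""
    return {"tempArray": []}


def map_script(agg, doc):
    """Runs once per document: collect price * units for milk products."""
    if doc["productType"] == "milk":
        agg["tempArray"].append(doc["price"] * doc["units"])


def combine_script(agg):
    """Runs once per shard: collapse that shard's array into one number."""
    return sum(agg["tempArray"])


def reduce_script(shard_results):
    """Runs once on the coordinating node: add up the per-shard numbers."""
    return sum(shard_results)


def expected_revenue(shards):
    """Drive the four phases over a list of shards (each a list of documents)."""
    shard_results = []
    for shard_docs in shards:
        agg = init_script()
        for doc in shard_docs:
            map_script(agg, doc)
        shard_results.append(combine_script(agg))
    return reduce_script(shard_results)


if __name__ == "__main__":
    # Hypothetical layout: two documents on one shard, one on another.
    print(expected_revenue([DOCS[:2], DOCS[2:]]))
```

The per-document revenues are 20 × 1400 = 28,000, 15 × 1200 = 18,000, and 15 × 1000 = 15,000, and the total is the same however the documents are spread across shards.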

We encourage you to experiment, omitting the combine_script and reduce_script parameters and examining those results. Then add them back (one by one) to the query and examine how those results differ yet again. This will help you to get a solid grasp of the scripted_metric aggregation type.

Conclusion

This article demonstrates several more illustrative applications of aggregations with scripting, using mockups of sales data for a few companies. However, you can multiply and modify the documents given here to more closely simulate and solve your own real-world development problems.

We’re always happy to hear how this information has been helpful, so we welcome your feedback in the comments section below.