In this blog post, we will cover an important feature: filtering values with partitions in the terms aggregation, which can also be used to navigate through the unique terms in the buckets of a terms aggregation.

Prior to 5.0, there was an option to set the size in a terms aggregation to zero and fetch all terms. This approach was problematic because downloading the entire set of terms could put heavy pressure on memory.

Note: In version 5.0, Elasticsearch removed the ability to set the size of a terms aggregation to zero and download the entire set of unique terms. The type of size was also changed to int, which means its maximum value was further reduced.

Let us have a quick look into using this feature:

Data

For this experiment, I am indexing 1000 documents using the following format (you can find the sample data here):

{
 "id": 1,
 "first_name": "Wayne",
 "last_name": "Simmons",
 "email": "wsimmons0@aol.com",
 "country": "Russia"
}

As you can see, each document contains basic personal information such as an email address and a country.

Mapping

Since this experiment is simple, we will create a single-shard index as in the example below:

curl -XPUT 'http://localhost:9200/test-index-01' -d '{
 "settings": {
   "index": {
     "number_of_replicas": "1",
     "number_of_shards": "1"
   }
 }
}'

In this tutorial, we are going to aggregate on the field "country". Unlike in previous versions of Elasticsearch, we don't have to explicitly index the field "country" as "not_analyzed". Elasticsearch 5.x and higher takes care of this internally: when a string field is dynamically mapped, a not_analyzed "keyword" sub-field is created alongside the analyzed one. All you have to do is index your data normally and, when running the aggregation, use the field name "country.keyword" so that the not_analyzed values are used for that field. This is a convenient change, as it makes keyword indexing simple. You can read more about it here.
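As a quick sketch, indexing one of the sample documents could look like the request below; the type name "mytype" is the one used in the search requests later in this post, and the "country.keyword" sub-field is created automatically:

# Index a single sample document into test-index-01
curl -XPUT 'http://localhost:9200/test-index-01/mytype/1' -d '{
 "id": 1,
 "first_name": "Wayne",
 "last_name": "Simmons",
 "email": "wsimmons0@aol.com",
 "country": "Russia"
}'

You can confirm the generated mapping, including the "country.keyword" sub-field, with curl -XGET 'http://localhost:9200/test-index-01/_mapping'.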

Terms Aggregation and Navigation through Buckets Prior to 5.2 Release

Prior to the 5.2 Elasticsearch release, there was no way to navigate through the buckets of a terms aggregation. To understand why, we need to look in detail at how the terms aggregation results are calculated.

When a terms aggregation query is sent to Elasticsearch, it is run on every shard of the index (or indices) involved, and each shard calculates its results independently. The top x buckets (where x is the size we specify, which defaults to 10) are then taken from each shard and combined to produce the final x buckets.
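To make this concrete, a plain terms aggregation on our data looks like the sketch below; the shard_size parameter is a standard terms aggregation option that controls how many candidate buckets each shard returns for merging, and the values here are chosen purely for illustration:

# Plain terms aggregation: each shard returns its top shard_size candidates
curl -XPOST 'http://localhost:9200/test-index-01/mytype/_search' -d '{
 "size": 0,
 "aggs": {
   "country": {
     "terms": {
       "field": "country.keyword",
       "size": 10,
       "shard_size": 25
     }
   }
 }
}'

Each shard returns its own top shard_size candidates, and the coordinating node merges them and keeps only the top 10.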

This method has a flaw. What if a term ends up in first position on one shard, but at position x+1 or later on another shard? The latter shard's count would be omitted, resulting in an incorrect number for that particular term. The same happens to any nested aggregations that are present.

Also, when the shard size is significantly smaller than the number of unique terms for the field (its cardinality), the margin of error becomes huge. This was the primary reason pagination was omitted from the terms aggregation in previous versions of Elasticsearch.

As a workaround, we could specify the size of the buckets to be zero (in versions prior to 5.0) or make use of exclude filters to achieve something close to pagination through the buckets, as in the sketch below.
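A rough sketch of the exclude-based workaround follows; the country names in the exclude list are just placeholders for whatever terms the previous page of buckets returned:

# Fetch the next "page" of buckets by excluding terms already seen
curl -XPOST 'http://localhost:9200/test-index-01/mytype/_search' -d '{
 "size": 0,
 "aggs": {
   "country": {
     "terms": {
       "field": "country.keyword",
       "size": 10,
       "exclude": ["Russia", "China", "Brazil"]
     }
   }
 }
}'

The exclude list has to grow with every page, which quickly becomes unwieldy for high-cardinality fields.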

The issue of incorrect term numbers is explained in the diagram below:


[Diagram: query-for-terms-agg.png]

Terms Aggregation and Navigation in the Elasticsearch 5.2 Release

Use cases that involve too many unique terms to process in a single request have become common, and the methodologies above, specifying a size of zero, giving higher shard_size values, or using "exclude", all got too cumbersome. This was the primary motivation for the Elasticsearch team to implement a way to browse through the unique terms that is no heavier than the previous methods and, at the same time, accurate.

This involves computing the unique values of the field at query time and splitting them into a specified number of partitions, say "n". Each request then asks for the results of one specific partition out of the "n" partitions, and only the results belonging to that partition are returned in the response.

Let us see how the above partitioning works in practice. After we have finished indexing the provided document set, we will run the following query:

curl -XPOST localhost:9200/test-index-01/mytype/_search -d '{
 "size": 0,
 "aggs": {
   "uniqueValueCount": {
     "cardinality": {
       "field": "country.keyword"
     }
   },
   "country": {
     "terms": {
       "field": "country.keyword",
       "include": {
         "partition": 0,
         "num_partitions": 5
       },
       "size": 35
     }
   }
 }
}'

In the above query we can see two aggregations: "uniqueValueCount" and "country". The "uniqueValueCount" is a cardinality aggregation on the field "country.keyword", which gives us the number of unique terms the terms aggregation has to cover.

As the first step, we get the number of unique terms from the cardinality aggregation, which in our case is 122. Now it is our turn to decide how many partitions we want. Let us go for 5. This puts the average number of terms per partition at around 24 (number of unique terms / number of partitions = 122/5 ≈ 24).

Now that we have an idea of how many terms can land in a single partition, we choose the size parameter. Let us allow a margin of about 10 and set it to 35. Then we run the query with each partition value from 0 to 4, as in the sketch below. The partition value cannot be equal to num_partitions: since partitions are numbered starting from 0, a partition value of 5 would ask for a sixth partition, which does not exist, and the request returns an error.
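A minimal sketch of stepping through all five partitions from the shell, using the same index, field, and aggregation as above, could look like this:

# Run the same terms aggregation once per partition (0 through 4)
for p in 0 1 2 3 4
do
 curl -XPOST 'localhost:9200/test-index-01/mytype/_search' -d '{
   "size": 0,
   "aggs": {
     "country": {
       "terms": {
         "field": "country.keyword",
         "include": { "partition": '"$p"', "num_partitions": 5 },
         "size": 35
       }
     }
   }
 }'
done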

By totaling the number of aggregation buckets returned by each of these requests, we can see that they add up to 122. When it comes to large data sets, a careful choice of the "num_partitions" and "size" parameters can save a lot of time and resources.

Conclusion

In this tutorial we have discussed browsing through unique terms using partitions in the terms aggregation. We have also seen the shortcomings of the traditional terms aggregation that led to the inclusion of this feature.


Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.