In this blog post, we will cover an important feature, the filtering of values with partitions in terms aggregation, which can also made used to navigate through the unique terms in the buckets of terms aggregation.

Prior to 5.2, there was an option to put zero as value for size in terms aggregation and fetch all terms. But this approach was a failure because it could harm the main memory while downloading entire set of terms.

Note: From version 5.0, Elasticsearch removed this feature where you can give zero to size and download entire unique terms set. The type of size was changed to int, which means that its maximum value was further reduced.

Let us have a quick look into how to use this feature:


For this experiment, I am indexing 1000 documents with the following format to demonstrate the feature. 

You can find the sample data here.

 "id": 1,
 "first_name": "Wayne",
 "last_name": "Simmons",
 "email": "",
 "country": "Russia"

As you can see from above, the document gives us basic information regarding a person, like his email and country.


As this is a very simple and plain data indexing experiment, we shall make this as a single sharded index, like below.

curl -XPUT '<a href="http://localhost:9200/test-index-01">http://localhost:9200/test-index-01</a>' -d '{
 "settings": {
   "index": {
     "number_of_replicas": "1",
     "number_of_shards": "1"

In this tutorial we are going to aggregate on the field "country". Unlike the previous versions of Elasticsearch, we need not index the data, with the field "country" as "not_analyzed", as Elasticsearch 5.x would take care of it internally. All you have to do is index the data normally, and when using the aggregation, use the field name "country.keyword", so that the not_analyzed field values would be returned for that field. This is a convenient change for us, as it makes the keyword indexing a simple affair. You can read more about it here.

Terms Aggregation and Navigation through Buckets Prior to 5.2 Release

Prior to the 5.2 elasticsearch release, there was no way to navigate through the buckets of a terms aggregation. In order to understand this, we need to go in detail of how the terms aggregation results are calculated. 

For a terms aggregation query to elasticsearch, the query is run in all the available shards of that particular index/indices. And against each shard, the query is run and the results are calculated individually. Then the top x (the size of the return buckets we specify, defaults to 10), would be taken from each shard and then combined to get x unique results.

This method has a flaw. What if one of the terms ends up in the first position in one shard, and on or after x+1th position in another shard? The later shard's result would be omitted. This results in incorrect numbers for that particular term. If there are nested aggregations present, the same happens. 

Also, when the shard size is significantly less than the number of unique terms occurring for the field (or cardinality), the margin of error becomes huge. This was the primary reason for the omission of pagination from the terms aggregations from the previous versions of elasticsearch.

As a work around when such use cases arrived, we specified the size of buckets to be zero and got all the results and handled that, or make use of exclude filters to achieve close to the pagination through buckets.

The issue of incorrect term numbers is explained in the below diagram:


Terms Aggregation and Navigation in Elasticsearch 5.2 release

The use cases which involve too many unique terms to be processed has become common, the above methodologies of specifying size zero, giving higher shard_size values, using "exclude" all got too cumbersome. This is the primary motivation for the Elasticsearch team to implement a method, so as to browse through the unique terms, without being heavier than the previous methods, and also at the same time accurate. 

This involves the computation of the unique field values for a field at the query time. Then the results are partitioned into a specified number, say "n". With each request coming, there would be request for the results of a specific partition from the "n" partitions, and those particular results would be given back as the response.

Let us see a use case on how the above partitioning is working. After we have finished indexing the document set provided, we will run the following query:

 curl -XPOST localhost:9200/test-index-01/mytype/_search -d '{
 "size": 0,
 "aggs": {
   "uniqueValueCount": {
     "cardinality": {
       "field": "country.keyword"
   "country": {
     "terms": {
       "field": "country.keyword",
       "include": {
         "partition": 0,
         "num_partitions": 5
       "size": 35

In the above query we can see two aggregations, the "uniqueValueCount" and the "country" respectively. The "uniqueValueCount" is a cardinality aggregation done on the field "country" in order for us to obtain the number of unique terms that are occurring in the aggregation.

As the first step, we get the number of unique terms occurring via the cardinality aggregation. This, in our case, is 122. Now it is our turn to decide how many partitions we want. Let us go for 5. This makes the average number of documents in a partition to be around 24 (number of results/number off partitions = 122/5 =24).

Now that we have an idea of how many results can be in a single partition, we opt for the size parameter. Let us give a margin near 10, and make it 35. Then apply the query with various values of partition from 0 to 4. Partition value cannot be equal to that of num_partitions. In this case a partition value of 5 would make the request search for a 6th partition which is obviously not there and hence we receive an error.

By analyzing the results by totaling the number of aggregation buckets for each request, we can see that it totals upto 122. When it comes to large data sets, careful choice of "num_partitions" and "size" parameters can do a lot of time/resource saving.


In this tutorial we have discussed browsing through unique terms. We have also seen what is inadequate with the traditional terms aggregation which led to the inclusion of the above feature.

Other Helpful Tutorials

Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.

comments powered by Disqus