The aggregations feature set is one of the most exciting and beneficial in the entire Elasticsearch offering, largely because it provides a very attractive alternative to facets.

In this tutorial, we explain aggregations in Elasticsearch and step through some examples. We compare metric and bucket aggregations and show how you can exploit aggregation nesting (which is not possible with facets). You're welcome to copy any and all of our example code throughout the article.

A Bit of Background on Facets

If you've ever used Elasticsearch facets, then you understand how useful they can be. After considerable experience, we're here to tell you that Elasticsearch aggregations are even better. Facets enable you to quickly calculate and summarize the data that results from a query, and you can use them for all sorts of tasks such as dynamic counting of result values or creating distribution histograms. Although facets are quite powerful, they have some limitations that relate to their implementation in the Elasticsearch core. Because facets perform their calculations only one level deep, it isn't easy to combine them.

The aggregations API solves these problems, and it also provides an easy way of sculpting very precise multi-level calculations that occur at query time—within a single request. Simply put: Elasticsearch aggregations are facets on afterburner.

Setup

If you've never done an install and basic setup of Elasticsearch, we recommend that you invest 15 minutes to acquaint yourself with our Elasticsearch tutorial. After installing it, you can run any of the code that we provide in the examples below.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

This is the mapping and data that we will be using for the examples:

curl -XPUT "http://localhost:9200/sports/" -d'
{
   "mappings": {
      "athlete": {
         "properties": {
            "birthdate": {
               "type": "date",
               "format": "dateOptionalTime"
            },
            "location": {
               "type": "geo_point"
            },
            "name": {
               "type": "string"
            },
            "rating": {
               "type": "integer"
            },
            "sport": {
               "type": "string"
            }
         }
      }
   }
}'


The data:

curl -XPOST "http://localhost:9200/sports/_bulk" -d'
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Michael", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["5", "4"],  "location":"46.22,-68.45"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bob", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["3", "4"],  "location":"45.21,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Jim", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["3", "2"],  "location":"45.16,-63.58" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Joe", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Tim", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["3", "3"],  "location":"46.22,-68.85"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Alfred", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.12,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Jeff", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["2", "3"], "location":"46.12,-68.55"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Will", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["4", "4"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Mick", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["3", "4"],  "location":"46.22,-68.45"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Pong", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["1", "3"],  "location":"45.21,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Ray", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.16,-63.58" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Ping", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Duke", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["5", "2"],  "location":"46.22,-68.85"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Hal", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["4", "2"],  "location":"45.12,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Charge", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["3", "2"], "location":"46.12,-68.55"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Barry", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["5", "2"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bank", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["6", "4"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bingo", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["10", "7"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"James", "birthdate":"1988-3-1", "sport":"Basketball", "rating": ["10", "8"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Wayne", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Brady", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Lewis", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
'

Now let's get on with our tutorial.

Aggregations

We like to remember what Uri Boness says: "An aggregation is the result of an aggregation."

In many ways, aggregations are similar to facets, and the intention is to eventually replace facets altogether. From the documentation, we read that "facets are and should be considered deprecated and will likely be removed in one of the future major releases."

One of the major limitations of facets is that you can't have facets of facets. Very simply, this means there's no way to nest facets. As we'll learn in this article, the ability to nest aggregations brings a great deal of goodness that is entirely absent from facets.

There are several different types of aggregations. For those of you who use facets, some of this variation may look familiar. Some of the aggregation types behave similarly to their facet predecessors, such as terms aggregation. Others are entirely new, such as value count aggregation.

The two broad families of aggregations are metrics aggregations and bucket aggregations. Metrics aggregations calculate some value (such as an average) over a set of documents; bucket aggregations group documents into buckets. Before we get into the details, let's take a look at the general structure of aggregation requests.

Structure of an Aggregation

Aggregation requests all have the same basic structure, as shown in the example below.

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : { 
            <aggregation_body>
        },
        ["aggregations" : { [<sub_aggregation>]* } ]
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

The aggregations object (you can also use aggs instead) in the request JSON contains the aggregation name, type, and body. <aggregation_name> is a name that the user defines (without the brackets), and it uniquely identifies the aggregation in the response.

The <aggregation_type> is typically the first key within an aggregation, for example terms, stats, or geo_distance. Within the <aggregation_type> we have an <aggregation_body>, where we specify the properties necessary for our aggregation. The available properties depend on the type of the aggregation.

You can optionally provide sub-aggregations to nest one aggregation inside another. In addition, you can include more than one aggregation (aggregation_name_2) in a query to get several separate top-level aggregations. Although there is no limit to the level of nesting, you cannot nest an aggregation inside a metric aggregation, for reasons that will become apparent below. We'll get into the difference between bucket and metric aggregations after we look at the different kinds of values on which we can aggregate.
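To make the nesting structure concrete, here is a minimal Python sketch that assembles a request body in the shape described above. The helper name `aggregation` and the aggregation name `by_sport` are hypothetical and exist only for illustration (they are not part of any Elasticsearch client); the field names come from the sports index used in this post.

```python
import json


def aggregation(name, agg_type, body, sub_aggs=None):
    """Return {name: {agg_type: body[, "aggregations": sub_aggs]}} as a dict.

    Hypothetical helper for illustration only.
    """
    node = {agg_type: body}
    if sub_aggs:
        node["aggregations"] = sub_aggs
    return {name: node}


# A metric aggregation (avg) nested inside a bucket aggregation (terms).
rating_avg = aggregation("rating_avg", "avg", {"field": "rating"})
by_sport = aggregation("by_sport", "terms", {"field": "sport"}, sub_aggs=rating_avg)

# "size": 0 suppresses document hits so only aggregation results come back.
request_body = {"size": 0, "aggregations": by_sport}
print(json.dumps(request_body, indent=2))
```

The printed JSON can be pasted as the body of a search request such as the curl commands in this post.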

Values Source

Some aggregations use values taken from aggregated documents. These values can be taken from either the specified document field or a script that generates values for each document. The first example below gives a terms aggregation on the name field with an order on the sub-aggregation rating_avg value. As you can see, we use a nested metric aggregation to order the results of a bucket aggregation.

Although these examples use the index defined above, we encourage you to run this query (and the others below) yourself and then modify it to match your own datasets.

Also, look closely to see that we include "size": 0, since our focus here is the aggregation results—not document results.

curl -XPOST "http://localhost:9200/sports/athlete/_search" -d'
{
   "size": 0, 
   "aggregations": {
      "the_name": {
         "terms": {
            "field": "name",
            "order": {
               "rating_avg": "desc"
            }
         },
         "aggregations": {
            "rating_avg": {
               "avg": {
                  "field": "rating"
               }
            }
         }
      }
   }
}'

We can also provide a script to generate the values used by the aggregation:

curl -XPOST "http://localhost:9200/sports/athlete/_search" -d'
{
   "size": 0,
   "aggregations": {
      "age_ranges": {
         "range": {
            "script": "DateTime.now().year - doc[\"birthdate\"].date.year",
            "ranges": [
               {
                  "from": 22,
                  "to": 25
               }
            ]
         }
      }
   }
}'

You can read more about value source fields and scripting in aggregations here. Remember that Elasticsearch scripting is an extensive subject area, and you can read more in our series on Elasticsearch Scripting.

Now, let's have a brief look at both metric and bucket aggregations.

Metric Aggregations

Metric aggregation types are for computing metrics for an entire set of documents. There are single-value metrics aggregations, such as avg, and there are multi-value metrics aggregations such as stats. A simple example of a metrics aggregation is the value_count aggregation, which simply returns the total number of values that have been indexed for a given field. To find the number of values in the "sport" field in our athlete data set, we could use the following query:

curl -XPOST "http://localhost:9200/sports/athlete/_search" -d'
{
   "size": 0,
   "aggs": {
      "sport_count": {
         "value_count": {
            "field": "sport"
         }
      }
   }
}'

Note that this will return the total number of values for that field, not the number of unique values. So in this case—since every document has a single-word value in the "sport" field—the result is simply equal to the number of documents in the index.
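The difference between counting values and counting unique values can be sketched in a few lines of Python. This only mirrors the arithmetic on the sample data from the bulk request; it does not query Elasticsearch.

```python
# Sport values as they appear in the bulk request above, one per athlete.
sport_values = (
    ["Baseball"] * 16
    + ["Golf"] * 2
    + ["Basketball"]
    + ["Hockey"]
    + ["Football"] * 2
)

total_values = len(sport_values)        # what value_count reports
unique_values = len(set(sport_values))  # number of distinct sports

print(total_values, unique_values)  # 22 5
```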

It's not possible to nest a metric aggregation inside of another metric aggregation, and it wouldn't make sense anyway: a metric aggregation produces numbers, not buckets of documents, so there is nothing for a nested aggregation to operate on. It can be very useful, however, to nest a metric aggregation inside of a bucket aggregation. We cover nesting in a section below, but we need to understand bucket aggregations before we get there.

Bucket Aggregations

Bucket aggregations are mechanisms for grouping documents. Each type of bucket aggregation has its own method of segmenting the document set. Perhaps the simplest type is the terms aggregation. This one functions very much like a terms facet, returning the unique terms indexed for a given field along with the number of matching documents. If we want to find all of the values in the "sport" field in our data set, we could use the following:

curl -XPOST "http://localhost:9200/sports/athlete/_search" -d'
{
   "size": 0,
   "aggregations": {
      "sport": {
         "terms": {
            "field": "sport"
         }
      }
   }
}'

We would get this response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 22,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "sport": {
         "buckets": [
            {
               "key": "baseball",
               "doc_count": 16
            },
            {
               "key": "football",
               "doc_count": 2
            },
            {
               "key": "golf",
               "doc_count": 2
            },
            {
               "key": "basketball",
               "doc_count": 1
            },
            {
               "key": "hockey",
               "doc_count": 1
            }
         ]
      }
   }
}
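If you consume this response from code, the buckets are easy to flatten into a lookup table. The following Python sketch parses a trimmed copy of the response above (only the aggregations part is kept) and is an illustration, not a client library call.

```python
import json

# Trimmed copy of the terms aggregation response shown above.
response_text = """
{
  "aggregations": {
    "sport": {
      "buckets": [
        {"key": "baseball", "doc_count": 16},
        {"key": "football", "doc_count": 2},
        {"key": "golf", "doc_count": 2},
        {"key": "basketball", "doc_count": 1},
        {"key": "hockey", "doc_count": 1}
      ]
    }
  }
}
"""

response = json.loads(response_text)
buckets = response["aggregations"]["sport"]["buckets"]

# Map each term to its document count.
counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}

for sport, doc_count in counts.items():
    print(f"{sport}: {doc_count}")
```

Note that the keys are lowercase even though the indexed values were capitalized, exactly as in the response above.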

You may find that the geo_distance aggregation is more intriguing. Although it has a number of options, in the simplest case it takes an origin and a distance range and then calculates how many of the documents lie within the circle according to a given geo_point field.

Let's say that we need to know how many of our athletes live within 20 miles of the geo-point "46.12,-68.55". We could use this aggregation:

curl -XPOST "http://localhost:9200/sports/athlete/_search" -d'
{
   "size": 0,
   "aggregations": {
      "baseball_player_ring": {
         "geo_distance": {
            "field": "location",
            "origin": "46.12,-68.55",
            "unit": "mi",
            "ranges": [
               {
                  "from": 0,
                  "to": 20
               }
            ]
         }
      }
   }
}'

We find that the answer is 14:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 22,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "baseball_player_ring": [
         {
            "key": "*-20.0",
            "from": 0,
            "to": 20,
            "doc_count": 14
         }
      ]
   }
}
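You can sanity-check this count outside Elasticsearch. The Python sketch below uses the haversine formula and a mean Earth radius of 3958.8 miles; both are assumptions of this sketch, and Elasticsearch's own distance calculation may differ slightly. The sample points are either well inside or well outside the ring, so the count is not sensitive to that choice.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


ORIGIN = (46.12, -68.55)

# (lat, lon, number of athletes indexed at that location in the bulk request)
LOCATIONS = [
    (46.22, -68.45, 2),
    (45.21, -68.35, 2),
    (45.16, -63.58, 2),
    (45.22, -68.53, 2),
    (46.22, -68.85, 2),
    (45.12, -68.35, 2),
    (46.12, -68.55, 2),
    (46.25, -68.55, 8),
]

# Range semantics: "from" is inclusive, "to" is exclusive.
within_ring = sum(
    count
    for lat, lon, count in LOCATIONS
    if 0 <= haversine_miles(ORIGIN[0], ORIGIN[1], lat, lon) < 20
)
print(within_ring)  # 14
```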

Nesting Bucket Aggregations

Many developers would agree that the most powerful aspect of bucket aggregations is the ability to nest them. You can define a top-level bucket aggregation and, inside of it, define a second-level aggregation that operates on each resulting bucket. This nesting can go as many levels deep as you require.

Continuing with our example, we can further segment the results of our geo_distance aggregation, using a nested range aggregation on age (calculated from "birthdate" with a script). Suppose we want to know how many of the athletes (who live within the circle we define in the previous section) fall within each of two age categories. We can use the following aggregation to get this information:

curl -XPOST "http://localhost:9200/sports/athlete/_search" -d'
{
   "size": 0,
   "aggregations": {
      "baseball_player_ring": {
         "geo_distance": {
            "field": "location",
            "origin": "46.12,-68.55",
            "unit": "mi",
            "ranges": [
               {
                  "from": 0,
                  "to": 20
               }
            ]
         },
         "aggregations": {
            "ring_age_ranges": {
               "range": {
                  "script": "DateTime.now().year - doc[\"birthdate\"].date.year",
                  "ranges": [
                      { "from": 20, "to": 25 },
                      { "from": 25, "to": 30 }
                  ]
               }
            }
         }
      }
   }
}'

The response would be:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 22,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "baseball_player_ring": [
         {
            "key": "*-20.0",
            "from": 0,
            "to": 20,
            "doc_count": 14,
            "ring_age_ranges": [
               {
                  "from": 20,
                  "to": 25,
                  "doc_count": 4
               },
               {
                  "from": 25,
                  "to": 30,
                  "doc_count": 10
               }
            ]
         }
      ]
   }
}
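Because the script calls DateTime.now(), the counts depend on the year in which you run the query. The Python sketch below reproduces the counts above under the assumption that the query ran in 2014 (this matches the article's January 2014 date, but it is inferred, not stated). The birth years are those of the 14 athletes inside the ring.

```python
REFERENCE_YEAR = 2014  # assumption: the year the query above was run

# Birth years of the 14 athletes located within the 20-mile ring.
RING_BIRTH_YEARS = [1989, 1989, 1992, 1992, 1990, 1990] + [1988] * 8

# Same ranges as the request: "from" inclusive, "to" exclusive.
RANGES = [(20, 25), (25, 30)]


def count_by_range(ages, ranges):
    """Count how many ages fall into each [low, high) range."""
    return {
        (low, high): sum(1 for age in ages if low <= age < high)
        for low, high in ranges
    }


# Mirrors the script: current year minus birth year (not an exact age).
ages = [REFERENCE_YEAR - year for year in RING_BIRTH_YEARS]
counts = count_by_range(ages, RANGES)
print(counts)  # {(20, 25): 4, (25, 30): 10}
```

Running the same query in a later year moves athletes into older ranges, or out of both ranges entirely, so expect different numbers today.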

Now let's compute some statistics on our inner-most results using stats—a multi-value metrics aggregator. For the athletes who live within our circle, and for each of the two age groups, we now want to calculate statistics on the "rating" field from the resulting documents:

curl -XPOST "http://localhost:9200/sports/athlete/_search" -d'
{
   "size": 0,
   "aggregations": {
      "baseball_player_ring": {
         "geo_distance": {
            "field": "location",
            "origin": "46.12,-68.55",
            "unit": "mi",
            "ranges": [
               {
                  "from": 0,
                  "to": 20
               }
            ]
         },
         "aggregations": {
            "ring_age_ranges": {
               "range": {
                  "script": "DateTime.now().year - doc[\"birthdate\"].date.year",
                  "ranges": [
                      { "from": 20, "to": 25 },
                      { "from": 25, "to": 30 }
                  ]
               },
               "aggregations": {
                  "rating_stats": {
                     "stats": {
                        "field": "rating"
                     }
                  }
               }
            }
         }
      }
   }
}'

We get a response containing the computed statistics that we're seeking:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 22,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "baseball_player_ring": [
         {
            "key": "*-20.0",
            "from": 0,
            "to": 20,
            "doc_count": 14,
            "ring_age_ranges": [
               {
                  "from": 20,
                  "to": 25,
                  "doc_count": 4,
                  "rating_stats": {
                     "count": 7,
                     "min": 2,
                     "max": 5,
                     "avg": 2.857142857142857,
                     "sum": 20
                  }
               },
               {
                  "from": 25,
                  "to": 30,
                  "doc_count": 10,
                  "rating_stats": {
                     "count": 16,
                     "min": 2,
                     "max": 10,
                     "avg": 6.375,
                     "sum": 102
                  }
               }
            ]
         }
      ]
   }
}
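The statistics for the 20-to-25 bucket can be checked by hand. The Python sketch below uses the rating values of the four athletes who fall in that bucket when ages are computed against the year 2014 (an assumption). The response reports a count of 7 for 4 documents with two ratings each, which is consistent with a repeated value inside one document being counted once; that behavior is inferred from the numbers above, not taken from the documentation.

```python
# Ratings of the four athletes in the 20-to-25 bucket (from the bulk request).
YOUNGER_BUCKET_RATINGS = {
    "Tim": [3, 3],
    "Duke": [5, 2],
    "Jeff": [2, 3],
    "Charge": [3, 2],
}


def stats(values):
    """Compute the same five numbers that the stats aggregation returns."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


# Assumption inferred from the response: duplicates within a single
# document are counted once, so Tim's [3, 3] contributes one value.
values = [
    value
    for ratings in YOUNGER_BUCKET_RATINGS.values()
    for value in sorted(set(ratings))
]

result = stats(values)
print(result)
```

Under that assumption the sketch gives a count of 7, a sum of 20, a minimum of 2, and a maximum of 5, matching the first rating_stats object in the response.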

As you can see, you can create a grand scheme of buckets containing buckets that hold more buckets. You can also get metrics on each of the buckets, and so on, to whatever level of complexity is necessary. From these simple building blocks, you can gain deep and complex insights from your data using nested aggregations.


Editor's note: This is an update to an article that was written in January 2014.



