In this post we cover Elasticsearch aggregations. The main job of the Elasticsearch Aggregations API is to let you summarize, calculate, and group your data in near real time. An important feature is that aggregations can contain sub-aggregations, which in turn can contain further sub-aggregations, nested as deeply as you need. You can use aggregations for a variety of tasks, from building analytical reports to getting real-time analysis of your data and acting on it quickly. This makes for a flexible API.

The aggregation functionality is completely different from search and enables you to ask sophisticated questions of your data. Let’s practice with this tool.
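
Before we dive in, here is a minimal sketch of the overall shape of an aggregation request: a bucket aggregation with a metric sub-aggregation nested inside it. The index, field, and aggregation names below are placeholders, not part of the dataset we use later:

curl -XGET "http://localhost:9200/my_index/_search?pretty" -d'
{
  "size": 0,
  "aggs": {
    "group_by_some_field": {
      "terms": { "field": "some_field" },
      "aggs": {
        "avg_of_another_field": {
          "avg": { "field": "another_field" }
        }
      }
    }
  }
}'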

Note: If you have no experience with aggregations in Elastic, please read this article first.

Dataset

First, we need data for analysis. Here is a set of data on unemployment, income, and population by states and regions of the US. Copy the code from this gist and run it.
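
Once the gist has finished running, you can quickly confirm that the data is in place by counting the documents in the index (the count should be 51, matching the doc_count we will see later in the global bucket):

curl -XGET "http://localhost:9200/states/state/_count?pretty"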

Now, you will have this index:

curl -XGET "http://localhost:9200/states/state/_mapping?pretty"
{
  "states": {
    "mappings": {
      "state": {
        "properties": {
          "abbrev": {
            "type": "string"
          },
          "income": {
            "type": "long"
          },
          "name": {
            "type": "string"
          },
          "population": {
            "type": "long"
          },
          "region": {
            "type": "string"
          },
          "unemprate": {
            "type": "double"
          }
        }
      }
    }
  }
}

Simple Aggregation

Let’s analyze the information that we have.

First, we calculate how many states each US region contains. For this we will use the terms bucket aggregation:

curl -XGET "http://localhost:9200/states/state/_search?pretty" -d'
{
  "size" : 0,
  "aggs" : {
    "states_by_region" : {
      "terms" : {
        "field" : "region"
      }
    }
  }
}'

Elasticsearch will group all the data in our index into four buckets, one for each US region.

Response:

{ …
  "aggregations": {
    "states_by_region": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "south",
          "doc_count": 17
        },
        {
          "key": "west",
          "doc_count": 13
        },
        {
          "key": "midwest",
          "doc_count": 12
        },
        {
          "key": "northeast",
          "doc_count": 9
        }
      ]
    }
  }
}

Sub Aggregation

Let’s complicate the request by counting how many people live in the US as a whole, as well as in each region.

Here we need sub-aggregations. Inside each bucket produced by the states_by_region terms aggregation, we apply the sum metric aggregation. For the total population, we apply the sum aggregation to all documents via the total_population aggregation. We also order the buckets by region_population:

curl -XGET "http://localhost:9200/states/state/_search?pretty" -d'
{
  "size" : 0,
  "aggs" : {
    "states_by_region" : {
      "terms" : {
        "field" : "region",
        "order": {
          "region_population" : "desc"
        }
      },
      "aggs": {
        "region_population": {
          "sum": {
            "field": "population"
          }
        }
      }
    },
    "total_population" : {
      "sum": {
        "field": "population"
      }
    }
  }
}'

Response:

{ …
  "aggregations": {
    "total_population": {
      "value": 313914040
    },
    "states_by_region": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "south",
          "doc_count": 17,
          "region_population": {
            "value": 117257221
          }
        },
        {
          "key": "west",
          "doc_count": 13,
          "region_population": {
            "value": 73579431
          }
        },
        {
          "key": "midwest",
          "doc_count": 12,
          "region_population": {
            "value": 67316297
          }
        },
        {
          "key": "northeast",
          "doc_count": 9,
          "region_population": {
            "value": 55761091
          }
        }
      ]
    }
  }
}

Histogram, Order, and Stats

Let’s check whether there is a relationship between a state’s income and its population.

The histogram aggregation will divide the states into intervals by population, in increments of 5 million, and we will check the average income in each interval:

curl -XGET "http://localhost:9200/states/state/_search?pretty" -d'
{
  "size": 0,
  "aggs": {
    "income_by_population_group": {
      "histogram": {
        "field": "population",
        "interval": 5000000,
        "min_doc_count": 1
      },
      "aggs": {
        "avg_income": {
          "avg": {
            "field": "income"
          }
        }
      }
    }
  }
}'

Response:

{ …
  "aggregations": {
    "income_by_population_group": {
      "buckets": [
        {
          "key": 0,
          "doc_count": 29,
          "avg_income": {
            "value": 49745.34482758621
          }
        },
        {
          "key": 5000000,
          "doc_count": 15,
          "avg_income": {
            "value": 53184.333333333336
          }
        },
        {
          "key": 10000000,
          "doc_count": 3,
          "avg_income": {
            "value": 49737
          }
        },
        {
          "key": 15000000,
          "doc_count": 2,
          "avg_income": {
            "value": 49772.5
          }
        },
        {
          "key": 25000000,
          "doc_count": 1,
          "avg_income": {
            "value": 49392
          }
        },
        {
          "key": 35000000,
          "doc_count": 1,
          "avg_income": {
            "value": 57287
          }
        }
      ]
    }
  }
}

Over half of the documents are concentrated in the ranges 0-5000000 and 5000000-10000000. Let’s take a closer look at the 0-5000000 interval and divide it up using the range aggregation, which allows you to set arbitrary intervals:

curl -XGET "http://localhost:9200/states/state/_search?pretty" -d'
{
  "size": 0,
  "aggs": {
    "income_by_population_group": {
      "range": {
        "field": "population",
        "keyed" : true,
        "ranges" : [
          { "to" : 1000000 },
          { "from" : 1000000, "to" : 2000000 },
          { "from" : 2000000, "to" : 3000000 },
          { "from" : 3000000, "to" : 5000000 }
        ]
      },
      "aggs": {
        "avg_income": {
          "avg": {
            "field": "income"
          }
        }
      }
    }
  }
}'

Note: We set the keyed flag to true. This associates a unique string key with each bucket, and returns the ranges as a hash instead of an array.

Response:

{ …
  "aggregations": {
    "income_by_population_group": {
      "buckets": {
        "*-1000000.0": {
          "to": 1000000,
          "to_as_string": "1000000.0",
          "doc_count": 7,
          "avg_income": {
            "value": 56983.71428571428
          }
        },
        "1000000.0-2000000.0": {
          "from": 1000000,
          "from_as_string": "1000000.0",
          "to": 2000000,
          "to_as_string": "2000000.0",
          "doc_count": 8,
          "avg_income": {
            "value": 50056.375
          }
        },
        "2000000.0-3000000.0": {
          "from": 2000000,
          "from_as_string": "2000000.0",
          "to": 3000000,
          "to_as_string": "3000000.0",
          "doc_count": 6,
          "avg_income": {
            "value": 45233.333333333336
          }
        },
        "3000000.0-5000000.0": {
          "from": 3000000,
          "from_as_string": "3000000.0",
          "to": 5000000,
          "to_as_string": "5000000.0",
          "doc_count": 8,
          "avg_income": {
            "value": 46484.75
          }
        }
      }
    }
  }
}

Filters and Scope

A natural extension to aggregation scoping is filtering. Because aggregations operate in the context of the query scope, any filter applied to the query also applies to the aggregations. Here we filter for all the states with “new” in the name (New York, New Jersey, New Mexico, and New Hampshire). For these documents, we calculate the average population via the avg_population aggregation and compare it to the average population of all states. Elasticsearch gives us access to a global bucket: we only need to add an aggregation of type global, which we name our_global_bucket:

curl -XGET "http://localhost:9200/states/state/_search?pretty" -d'
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "name": "new"
        }
      }
    }
  },
  "aggs": {
    "avg_population": {
      "avg": {
        "field": "population"
      }
    },
    "our_global_bucket": {
      "global": {},
      "aggs": {
        "global_avg_population": {
          "avg": {
            "field": "population"
          }
        }
      }
    }
  }
}'

The query, which happens to include a filter, returns a subset of documents ("hits": {"total": 4}), and the avg_population aggregation operates on those documents. As for our_global_bucket, note that its doc_count is 51, which corresponds to the total number of states in our index:

Response:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "avg_population": {
      "value": 7960276.75
    },
    "our_global_bucket": {
      "doc_count": 51,
      "global_avg_population": {
        "value": 6155177.2549019605
      }
    }
  }
}

Note: Be careful when you work with term filters. When Elasticsearch detects a string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer, which lowercases all terms. For example, if we replace the word “new” with “New”, the query returns an empty match. For details, see Why doesn’t the term query match my document?
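
If you want to see the analysis step for yourself, you can run a value through the _analyze API (the query-string form below is the one accepted by older Elasticsearch releases; newer versions expect a JSON body instead). The standard analyzer splits the text into terms and lowercases them, which is why only the lowercase “new” matches:

curl -XGET "http://localhost:9200/_analyze?analyzer=standard&text=New+York&pretty"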

This can be done a different way. Let’s use the filter bucket aggregation:

curl -XGET "http://localhost:9200/states/state/_search?pretty" -d'
{
  "size": 0,
  "aggs": {
    "avg_population": {
      "filter": {
        "term": {
          "name": "new"
        }
      },
      "aggs": {
        "avg_population": {
          "avg": {
            "field": "population"
          }
        }
      }
    },
    "our_global_bucket": {
      "global": {},
      "aggs": {
        "global_avg_population": {
          "avg": {
            "field": "population"
          }
        }
      }
    }
  }
}'

Note: Global aggregations can only be used as top-level aggregations, not as sub-aggregations.

The result is almost the same, with a few differences: the total hits in the query is now 51, and the doc_count inside the filter aggregation is 4.

Response:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 51,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "avg_population": {
      "doc_count": 4,
      "avg_population": {
        "value": 7960276.75
      }
    },
    "our_global_bucket": {
      "doc_count": 51,
      "global_avg_population": {
        "value": 6155177.2549019605
      }
    }
  }
}

What is the difference between these two methods?

In the first example, the filter affects both the search results and the aggregations. In the second, the filter affects only the aggregation it wraps, leaving the search results untouched. This flexibility lets us get exactly the results we need.

Conclusion

Aggregations are a powerful and flexible tool that lets us produce more informative search results and use them for data analysis and visualization. Combined with Elasticsearch’s speed, this makes it an attractive choice for building analytical reports and getting real-time analysis of your data.

This article does not include examples of aggregation scripting. Scripting extends the standard aggregation features and makes them even more flexible. Here you can see examples of the use of scripting. Before you use it, however, you need to be aware of the security implications for your data. More details.

Like what you see? Subscribe to our newsletter or drop us a comment below.