This blog post continues our overview of Elasticsearch metrics aggregations. Here we will focus on metrics aggregations such as geo bounds, geo centroid, percentiles, percentile ranks, and several other single-value and multi-value aggregations. By the end of this series, you'll have a good understanding of metrics aggregations in Elasticsearch, including some important statistical measures and how to visualize them in Kibana. Let's get started!

Tutorial

Examples in this tutorial were tested in the following environment:

  • Elasticsearch 6.4.0
  • Kibana 6.4.0

Creating a New Index

As we did in Part I of this series, let's first create a new "sports" index storing a collection of "athlete" documents. The index mapping will contain fields such as the athlete's location, name, rating, sport, age, number of scored goals, and field position (e.g., defender). Here is the mapping:

curl -XPUT "http://localhost:9200/sports/" -H "Content-Type: application/json" -d'
{
   "mappings": {
      "athlete": {
         "properties": {
            "birthdate": {
               "type": "date",
               "format": "dateOptionalTime"
            },
            "location": {
               "type": "geo_point"
            },
            "name": {
               "type": "keyword"
            },
            "rating": {
               "type": "integer"
            },
            "sport": {
               "type": "keyword"
            },
             "age": {
                 "type":"integer"
             },
             "goals": {
                 "type": "integer"
             },
             "role": {
                 "type":"keyword"
             },
             "score_weight": {
                 "type": "float"
             }
         }
      }
   }
}'

Let's next use the Elasticsearch Bulk API to save some data to our index. Bulk indexing lets us send multiple documents to the index in a single call:

curl -XPOST "http://localhost:9200/sports/_bulk" -H "Content-Type: application/json" -d'
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Michael", "birthdate":"1989-10-1", "sport":"Football", "rating": ["5", "4"],  "location":"31.22,-97.45", "age":"23","goals": "43","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bob", "birthdate":"1989-11-2", "sport":"Football", "rating": ["3", "4"],  "location":"33.21,-87.35", "age":"33", "goals": "54","score_weight":"2", "role":"forward"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Jim", "birthdate":"1988-10-3", "sport":"Football", "rating": ["3", "2"],  "location":"35.16,-99.58", "age":"28", "goals": "73", "score_weight":"2", "role":"forward" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Joe", "birthdate":"1992-5-20", "sport":"Basketball", "rating": ["4", "3"],  "location":"38.22,-98.53", "age":"18", "goals": "848", "score_weight":"3", "role":"midfielder"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Tim", "birthdate":"1992-2-28", "sport":"Basketball", "rating": ["3", "3"],  "location":"32.22,-100.85", "age":"28","goals": "942", "score_weight":"2","role":"forward"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Alfred", "birthdate":"1990-9-9", "sport":"Football", "rating": ["2", "2"],  "location":"29.12,-98.35", "age":"25", "goals": "53", "score_weight":"4", "role":"defender"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Jeff", "birthdate":"1990-4-1", "sport":"Hockey", "rating": ["2", "3"], "location":"32.12,-95.55", "age":"26","goals": "93","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Will", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["4", "4"], "location":"34.25,-92.25", "age":"27", "goals": "124", "score_weight":"2", "role":"forward" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Mick", "birthdate":"1989-10-1", "sport":"Football", "rating": ["3", "4"],  "location":"35.22,-89.45", "age":"35","goals": "56","score_weight":"3", "role":"midfielder"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Pong", "birthdate":"1989-11-2", "sport":"Basketball", "rating": ["1", "3"],  "location":"37.21,-98.35", "age":"34","goals": "1483","score_weight":"2", "role":"forward"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Ray", "birthdate":"1988-10-3", "sport":"Football", "rating": ["2", "2"],  "location":"38.16,-93.58", "age":"31","goals": "84", "score_weight":"3", "role":"midfielder" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Ping", "birthdate":"1992-5-20", "sport":"Basketball", "rating": ["4", "3"],  "location":"35.22,-98.53", "age":"27","goals": "1328", "score_weight":"2", "role":"forward"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Duke", "birthdate":"1992-2-28", "sport":"Hockey", "rating": ["5", "2"],  "location":"36.22,-92.85", "age":"41","goals": "218", "score_weight":"2", "role":"forward"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Hal", "birthdate":"1990-9-9", "sport":"Hockey", "rating": ["4", "2"],  "location":"38.12,-92.35", "age":"18","goals": "148", "score_weight":"3", "role":"midfielder"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Charge", "birthdate":"1990-4-1", "sport":"Football", "rating": ["3", "2"], "location":"38.19,-94.55", "age":"19","goals": "34", "score_weight":"4", "role":"defender"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Barry", "birthdate":"1988-3-1", "sport":"Football", "rating": ["5", "2"], "location":"36.45,-99.15", "age":"20", "goals": "48", "score_weight":"4", "role":"defender" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bank", "birthdate":"1988-3-1", "sport":"Handball", "rating": ["6", "4"], "location":"36.25,-94.53", "age":"25", "goals": "150", "score_weight":"4", "role":"defender" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bingo", "birthdate":"1988-3-1", "sport":"Handball", "rating": ["10", "7"], "location":"36.25,-98.55", "age":"29", "goals": "143", "score_weight":"3", "role":"midfielder" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"James", "birthdate":"1988-3-1", "sport":"Basketball", "rating": ["10", "8"], "location":"31.25,-94.55", "age":"36", "goals": "1284", "score_weight":"2", "role":"forward" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Wayne", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["10", "10"], "location":"36.21,-98.55", "age":"25", "goals": "113", "score_weight":"3", "role":"midfielder" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Brady", "birthdate":"1988-3-1", "sport":"Handball", "rating": ["10", "10"], "location":"33.24,-94.55", "age":"29", "goals": "443", "score_weight":"2", "role":"forward" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Lewis", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"36.25,-94.55", "age":"24", "goals": "49", "score_weight":"3", "role":"midfielder" }
'

Great! We are ready to work with this data. Let's start with the geo bounds aggregation.

Geo Bounds Aggregation

The geo bounds aggregation is useful when you want to find the geographical boundaries of your geo data. Formally speaking, this aggregation computes the bounding box containing all geo_point values for a given field.
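Conceptually, the bounding box is just the minimum and maximum of the latitudes and longitudes of all points (date-line wrapping aside). Here is a minimal Python sketch of that idea, using a few of the athlete locations from our data set:

```python
def bounding_box(points):
    """Return the top-left and bottom-right corners of the box
    enclosing all (lat, lon) points (ignores date-line wrapping)."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }

# A few of the athlete locations from the bulk request
points = [(31.22, -97.45), (33.21, -87.35), (38.22, -98.53), (29.12, -98.35)]
print(bounding_box(points))
# {'top_left': {'lat': 38.22, 'lon': -98.53}, 'bottom_right': {'lat': 29.12, 'lon': -87.35}}
```

This is only an illustration of the geometry; Elasticsearch computes the box over the indexed geo_point values, with optional handling of boxes that cross the international date line.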

Our "sports" index contains one geo_point field named "location" suitable for this type of aggregation. Let's calculate the geo bounds for all geo_point values of this field:

curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "viewport" : {
            "geo_bounds" : {
                "field" : "location", 
                "wrap_longitude" : true 
            }
        }
    }
}
'

This query specifies the field from which to obtain the latitude and longitude values, as well as the wrap_longitude setting. The latter is an optional parameter that specifies whether the bounding box is allowed to overlap the international date line.

The response to this query should look something like this:

...
"aggregations" : {
    "viewport" : {
      "bounds" : {
        "top_left" : {
          "lat" : 38.21999997831881,
          "lon" : -100.85000007413328
        },
        "bottom_right" : {
          "lat" : 29.11999996751547,
          "lon" : -87.35000004060566
        }
      }
    }
  }
}

To get a feel for this data, we can visualize the bounding box with the Google Maps API or any other API/software of your choice (see the image below).

Elasticsearch: Geo Bounds Aggregation

As you see, top_left.lat becomes the north coordinate; bottom_right.lat, the south; top_left.lon, the west; and bottom_right.lon, the east. Our bounding box contains parts of several states (Oklahoma, Arkansas, Texas, Louisiana, Missouri, and others).

Geo Centroid Aggregation

This metrics aggregation computes the weighted centroid from all coordinate values in a geo_point datatype field across all documents specified in the query.

In geometry, a centroid is the arithmetic mean position of all the points in a figure. Applying this concept to geographical coordinates, the centroid may be thought of as the arithmetic mean of all latitude-longitude pairs in the geo_point field of the aggregated documents. The centroid is a useful measure when the shape of the territory is complex, as in the example below:
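For points that lie reasonably close together on the map, the centroid can be approximated by simply averaging the latitudes and longitudes. A rough Python sketch (a true geographic centroid on the sphere requires converting to Cartesian coordinates first):

```python
def centroid(points):
    """Arithmetic mean of (lat, lon) pairs: a reasonable
    approximation for points that are close together."""
    n = len(points)
    return (sum(lat for lat, _ in points) / n,
            sum(lon for _, lon in points) / n)

# Three of the athlete locations from the index
points = [(31.22, -97.45), (33.21, -87.35), (35.16, -99.58)]
lat, lon = centroid(points)
print(round(lat, 2), round(lon, 2))   # 33.2 -94.79
```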

Source: GIS Stack Exchange

We can use the centroid aggregation to find the central location of all athletes in the "sports" index:

curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "centroid" : {
            "geo_centroid" : {
                "field" : "location" 
            }
        }
    }
}
'

The only field we need for this aggregation is the field containing the geo_point values. The response should look something like this:

...
"aggregations" : {
    "centroid" : {
      "location" : {
        "lat" : 34.98909085375172,
        "lon" : -95.63636379570447
      },
      "count" : 22
    }
  }

The point (34.98909085375172, -95.63636379570447) is the centroid of all athlete locations in our index.

Additionally, you can use the geo_centroid aggregation as a sub-aggregation of other bucket aggregations. For example, we can combine the terms aggregation with the geo_centroid aggregation to find the central location of the athletes in each sport type:

curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "sports" : {
            "terms" : { "field" : "sport" },
            "aggs" : {
                "centroid" : {
                    "geo_centroid" : { "field" : "location" }
                }
            }
        }
    }
}
'

Elasticsearch will construct a bucket for each unique value of the "sport" field and calculate a geo_centroid value for each bucket:

...
"aggregations" : {
    "sports" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "centroid" : {
            "location" : {
              "lat" : 34.77555552031845,
              "lon" : -94.89000006578863
            },
            "count" : 9
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "centroid" : {
            "location" : {
              "lat" : 34.82399996649474,
              "lon" : -98.1620000526309
            },
            "count" : 5
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "centroid" : {
            "location" : {
              "lat" : 35.38399997167289,
              "lon" : -94.31000005826354
            },
            "count" : 5
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "centroid" : {
            "location" : {
              "lat" : 35.24666663724929,
              "lon" : -95.87666675448418
            },
            "count" : 3
          }
        }
      ]
    }
  }

Percentiles Aggregation

A percentile is a useful statistical measure that indicates the value below which a given percentage of observations in a group falls. For example, the 75th percentile is the value below which 75% of observations may be found. Percentiles are often used to find outliers in a data set. For example, in a normally distributed population, the 0.13th and 99.87th percentiles represent three standard deviations from the mean; any data that lies outside these bounds can be considered an anomaly. Along with finding outliers, the percentiles aggregation may be useful for determining whether the data is skewed, bimodal, etc.
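To make the concept concrete, here is a simple nearest-rank percentile calculation in Python over the Football goal counts from our index. Note that Elasticsearch's TDigest interpolates between values, so its results for extreme percentiles will differ somewhat from these exact empirical values:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the observations are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

goals = [43, 54, 73, 53, 56, 84, 34, 48, 49]   # Football goals from the index
print(percentile(goals, 50))   # 53
print(percentile(goals, 75))   # 56
```

These two values match the 50th and 75th percentiles Elasticsearch reports for the Football bucket below, because with so few values TDigest is exact for the central percentiles.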

Because percentiles aggregation can return a range of user-specified percentiles, it's considered to be a multi-value metrics aggregation. By default, Elasticsearch calculates approximate percentiles using the TDigest algorithm (introduced by Ted Dunning in Computing Accurate Quantiles using T-Digests). The algorithm has certain caveats to remember: 

  • The algorithm's error is proportional to q(1-q), so extreme percentiles (e.g., the 1st or 99th) are more accurate than central percentiles such as the median.
  • The algorithm is highly accurate for small sets of values.
  • As the number of values in a bucket grows, the algorithm starts to approximate the percentiles, effectively trading accuracy for memory savings. The exact level of inaccuracy depends on your data distribution (e.g., whether the data is normally distributed) and the volume of data being aggregated.

Elasticsearch offers an alternative percentiles implementation: HDR Histogram (High Dynamic Range Histogram). This algorithm can be faster than the TDigest implementation, at the cost of a larger memory footprint. For more information about HDR Histogram, please consult the official Elasticsearch documentation.

In the example below, we use TDigest to calculate percentiles for the "goals" field:

curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "sport_categories":{
            "terms":{"field":"sport"},
            "aggs": {
                "scoring_percentiles" : {
                   "percentiles" : {
                      "field" : "goals",
                      "tdigest": {
                         "compression" : 200 
                       }
                    }
                }
            }
        } 
    }
}
'

By default, the percentiles metric returns the following set of percentiles: [ 1, 5, 25, 50, 75, 95, 99 ]. We also specified the "compression" parameter, which controls memory usage and approximation error. By increasing the compression value (the default is 100), you can improve the accuracy of your percentiles calculation at the cost of more memory.

However, a larger compression value also makes the algorithm slower. Because our data set is not that large, the compression value will not have a tangible effect; in our case, the percentiles calculation would be accurate even with the default compression, but we included the parameter for demonstration purposes.

The query above will return the following response:

...
"aggregations" : {
    "sport_categories" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 34.72,
              "5.0" : 37.599999999999994,
              "25.0" : 48.0,
              "50.0" : 53.0,
              "75.0" : 56.0,
              "95.0" : 79.6,
              "99.0" : 83.12
            }
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 851.7599999999999,
              "5.0" : 866.8000000000001,
              "25.0" : 942.0,
              "50.0" : 1284.0,
              "75.0" : 1328.0,
              "95.0" : 1452.0,
              "99.0" : 1476.8
            }
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 93.8,
              "5.0" : 97.0,
              "25.0" : 113.0,
              "50.0" : 124.0,
              "75.0" : 148.0,
              "95.0" : 203.99999999999997,
              "99.0" : 215.20000000000002
            }
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 143.14,
              "5.0" : 143.70000000000002,
              "25.0" : 146.5,
              "50.0" : 150.0,
              "75.0" : 296.5,
              "95.0" : 413.7,
              "99.0" : 437.14
            }
          }
        }
      ]
    }
  }

This data is very useful for understanding how the number of goals is distributed across different sports. For example, Handball has a rather dispersed goal distribution, with the 1st percentile at 143.14 goals and the 99th at 437.14 goals.

Let's visualize the aggregation in Kibana to get a better idea of this data's meaning:

If you are interested only in the outliers in your data, you can return specific percentiles using the "percents" parameter of the aggregation. For example, in the query below we request only the most extreme percentiles:

curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "sport_categories":{
            "terms":{"field":"sport"},
            "aggs": {
                "scoring_percentiles" : {
                   "percentiles" : {
                      "field" : "goals",
                      "percents" : [99, 99.9]
                    }
                }
            }
        } 
    }
}
'

This query should produce the following response:

"aggregations" : {
    "sport_categories" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "scoring_percentiles" : {
            "values" : {
              "99.0" : 83.12,
              "99.9" : 83.912
            }
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "scoring_percentiles" : {
            "values" : {
              "99.0" : 1476.8,
              "99.9" : 1482.38
            }
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "scoring_percentiles" : {
            "values" : {
              "99.0" : 215.20000000000002,
              "99.9" : 217.72000000000003
            }
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "scoring_percentiles" : {
            "values" : {
              "99.0" : 437.14,
              "99.9" : 442.41400000000004
            }
          }
        }
      ]
    }
  }


Percentile Ranks Aggregation

As we already know, a percentile indicates the value below which a given percentage of observations falls. For example, if a value is at the 30th percentile, 30% of the values are below it; the "30" is called the percentile rank. The percentile ranks aggregation lets us determine the rank of a particular score (e.g., a number of goals). To understand the difference between this aggregation and the regular percentiles aggregation, let's look at the example below:
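The underlying idea is easy to reproduce by hand: an empirical percentile rank is simply the percentage of observations at or below a given value. A minimal Python sketch using the Hockey goal counts from our index (Elasticsearch's TDigest interpolates between values, so its estimates will differ somewhat from these exact empirical ranks):

```python
def percentile_rank(values, x):
    """Percentage of observations less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)

hockey_goals = [93, 124, 218, 148, 113]   # Hockey goals from the index
print(percentile_rank(hockey_goals, 100))   # 20.0
print(percentile_rank(hockey_goals, 200))   # 80.0
```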

curl -X GET "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "goal_ranks" : {
            "terms": {"field":"sport"},
            "aggs": {
                "percentile_goals":   {
                "percentile_ranks" : {
                   "field" : "goals", 
                   "values" : [100,200]
                  }
               }
            }
        }
    }
}
'

As you see, instead of specifying which percentiles to return, we specify the values for which to calculate the percentile ranks. Thus, the percentile ranks aggregation may be thought of as the reverse of the regular percentiles aggregation. However, both aggregations use the same calculation algorithm and approximation rules. The above query will return the following response:

"aggregations" : {
    "goal_ranks" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "percentile_goals" : {
            "values" : {
              "100.0" : 100.0,
              "200.0" : 100.0
            }
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "percentile_goals" : {
            "values" : {
              "100.0" : 0.0,
              "200.0" : 0.0
            }
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "percentile_goals" : {
            "values" : {
              "100.0" : 17.0,
              "200.0" : 64.85714285714286
            }
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "percentile_goals" : {
            "values" : {
              "100.0" : 0.0,
              "200.0" : 22.35494880546075
            }
          }
        }
      ]
    }
  }

As you see, the ranks of the 100 and 200 values for Basketball are zero because the minimum number of points scored in that sport in our index is above 200.

Kibana supports visualization of both the percentiles and the percentile ranks aggregations. To visualize the percentile ranks, we select the aggregation, apply it to the "goals" field, add the scores (e.g., 100 and 200) for which we want to calculate ranks, and then create a terms sub-aggregation on the "sport" field in the X-Axis:


Sum Aggregation

Sometimes you need to sum up all values extracted from a numeric field. Elasticsearch comes with built-in support for this task in the form of the sum aggregation. For example, we can use this aggregation to sum up all goals/points scored by the players in each sport type:
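In essence, a terms aggregation with a sum sub-aggregation is a grouped sum. A minimal Python equivalent, using the Handball documents from our index plus one Hockey document:

```python
from collections import defaultdict

def sum_by_group(docs, group_field, value_field):
    """Group documents by group_field and sum value_field in each
    group, mimicking a terms bucket with a sum sub-aggregation."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc[group_field]] += doc[value_field]
    return dict(totals)

docs = [
    {"sport": "Handball", "goals": 150},
    {"sport": "Handball", "goals": 143},
    {"sport": "Handball", "goals": 443},
    {"sport": "Hockey",   "goals": 93},
]
print(sum_by_group(docs, "sport", "goals"))
# {'Handball': 736, 'Hockey': 93}
```

The Handball total of 736 matches the sum Elasticsearch reports for that bucket in the response below.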

curl -X GET "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "goal_ranks" : {
            "terms": {"field":"sport"},
            "aggs": {
                "sum_of_goals":   {
                "sum" : {
                   "field" : "goals"
                  }
               }
            }
        }
    }
}
'

This aggregation is very simple: you only need to define a numeric field from which to extract values, and Elasticsearch will sum them up:

...
"aggregations" : {
    "goal_ranks" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "sum_of_goals" : {
            "value" : 494.0
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "sum_of_goals" : {
            "value" : 5885.0
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "sum_of_goals" : {
            "value" : 696.0
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "sum_of_goals" : {
            "value" : 736.0
          }
        }
      ]
    }
  }

Conclusion

That's it! We have completed our overview of some of the most useful metrics aggregations in Elasticsearch, which can benefit not only regular Elasticsearch users but statisticians and data scientists as well. You are now equipped with powerful tools for analyzing your geo data, identifying outliers, and assessing how evenly your data set is distributed.

Elasticsearch is constantly expanding its support for various statistical methods and measures. For example, as you already know, Elasticsearch 6.4.0 introduced a new weighted average aggregation. We are closely following new additions and changes and will cover them in future blog posts.