Three new aggregation types were released in Elasticsearch v1.1.0:

  • Significant Terms
  • Cardinality
  • Percentiles

As we did in our earlier aggregations post, we will explain these new aggregations through examples. If you’ve never used aggregations before, please visit our introduction before you begin this tutorial. To kick things off we’ll start a local Elasticsearch cluster and import our data.

Install and Start Elasticsearch v1.1.0

http://www.elasticsearch.org/download

You’ll need v1.1.0 of Elasticsearch running on your local machine or Qbox instance to use these new aggregations. If you want to learn how to set up and run Elasticsearch locally, watch or read Elasticsearch Tutorial Episode #1. Once you’ve launched your Elasticsearch cluster, index the documents from the GitHub repository below.

GitHub Repo

https://github.com/StackSearchInc/new-elasticsearch-aggregations

Download the GitHub repository linked above and open the new-aggs-data file to see which documents we will search on. The mappings and settings are created dynamically, because the requests in this tutorial do not require any customization; all requests mentioned are included in the requests file.

These examples do require an integer field for the Percentiles Aggregation, but the mapping and index settings otherwise remain close to the defaults. If you customize these example aggregations and have questions about them or run into trouble, create a Sense gist at http://sense.qbox.io/gist/ and place it in a comment at the bottom of this post with your explanation, and I will assist you.

Aggregations

An aggregation request’s structure should be intuitive to anyone experienced with Elasticsearch. If you don’t understand the structure of aggregations or would like a refresher, please read our Introduction to Elasticsearch Aggregations for instructions and explanations.

Significant Terms Aggregation

The Significant Terms Aggregation surfaces terms that are unusually frequent in a set of results (the foreground) relative to the index as a whole (the background). For example, athletes with “height”: “200” make up only about 7% of all athletes in the sports index (591/9073); this is the bg_count (background count) in the significant terms response.

Among documents with “sport”: “basketball”, however, about 11% (300/2781) have “height”: “200”; this is the doc_count (foreground count) in the significant terms response. The frequency of “height”: “200” is therefore roughly 1.7 times higher in the foreground, a significant increase for documents with the sport basketball.

curl -XGET 'localhost:9200/sports/athlete/_search?pretty' -d '{
 "query" : {
   "terms" : {"sport" : [ "basketball" ]}
 },
 "aggregations" : {
   "significantHeight" : {
     "significant_terms" : { "field" : "height" }
   }
 }
}'
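
An abridged sketch of the kind of response to expect follows; the counts match the figures discussed above, while the score value is illustrative rather than exact:

{
  ...
  "aggregations" : {
    "significantHeight" : {
      "doc_count" : 2781,
      "buckets" : [
        {
          "key" : "200",
          "doc_count" : 300,
          "score" : 0.07,
          "bg_count" : 591
        }
      ]
    }
  }
}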

You’ll see when you run this aggregation that “height”: “200” isn’t just the most popular term; it is also the most significant in its increase from background to foreground. In most cases, we want to do this across every sport. Adding more sports to your terms query can accomplish this, but it won’t scale well. Instead, we nest a significant_terms aggregation under a terms aggregation, bucketing the most significant heights within each sport’s bucket.

curl -XGET 'localhost:9200/sports/athlete/_search?pretty' -d '{
  "aggregations": {
    "sports": {
      "terms": {"field": "sport"},
      "aggregations": {
        "significantHeight": {
          "significant_terms": {"field": "height"}
        }
      }
    }
  }
}'

Several parameters tune this aggregation (a sketch using a few of them follows this list):

  • size defines the number of term buckets returned from the full list.
  • shard_size controls how many terms each shard produces; by default, a multiple of the final size is requested from each shard, using a heuristic based on the number of shards.
  • min_doc_count limits the returned terms to those with at least that many hits.
  • include and exclude filter the terms on which buckets are created, and accept regular expressions based on Java’s Pattern class.
  • execution_hint can improve performance if you’re certain which mechanism (map or ordinals) your significant terms aggregation should use.
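
For instance, here is a minimal sketch combining a few of these parameters (the values are illustrative, not recommendations):

curl -XGET 'localhost:9200/sports/athlete/_search?pretty' -d '{
  "query" : {
    "terms" : {"sport" : [ "basketball" ]}
  },
  "aggregations" : {
    "significantHeight" : {
      "significant_terms" : {
        "field" : "height",
        "size" : 5,
        "shard_size" : 20,
        "min_doc_count" : 10
      }
    }
  }
}'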

Note that with a “match_all” query there is no distinct foreground for significant_terms to compare against the background, and at this time there is no way to specify a custom background: it is always the index you draw the results from. Floating point fields are not currently supported as the subject of a significant terms analysis; because they usually represent quantities, they make poor candidates for this kind of analysis. Scripts and doc values are not supported for significant terms aggregations either, because they would be too expensive to use.

Cardinality Aggregation

Cardinality counts the approximate number of unique values in a specified field, letting you answer “how many distinct values?” questions in Elasticsearch. It works on hashes of the field values, which can be taken from the document itself or generated by a script. Cardinality is based on the HyperLogLog++ algorithm, which counts on those hashes and has the properties mentioned below.

curl -XGET 'localhost:9200/sports/athlete/_search?pretty' -d '{
  "aggs" : {
    "author_count" : {
      "cardinality" : {
        "field" : "name",
        "precision_threshold": 100
      }
    }
  }
}'

The precision_threshold option lets you trade memory for accuracy: below this number of unique values, counts are expected to be close to exact; above it, they become increasingly fuzzy while remaining approximately accurate. The maximum precision_threshold is 40000; anything above that has the same effect as the maximum. If you computed and stored a hash in your documents (using the murmur3 field type) and want to compute counts using that hash, you can specify “rehash”: false (it defaults to true). Note that this hash must be indexed as a long for use in this aggregation.
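
As a minimal sketch, assuming your Elasticsearch version supports the murmur3 field type and using a hypothetical name.hash sub-field (existing documents would need to be reindexed to populate the hash):

curl -XPUT 'localhost:9200/sports/athlete/_mapping' -d '{
  "athlete" : {
    "properties" : {
      "name" : {
        "type" : "string",
        "fields" : {
          "hash" : { "type" : "murmur3" }
        }
      }
    }
  }
}'

curl -XGET 'localhost:9200/sports/athlete/_search?pretty' -d '{
  "aggs" : {
    "name_count" : {
      "cardinality" : {
        "field" : "name.hash",
        "rehash" : false
      }
    }
  }
}'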

Elasticsearch recently published “Count on Elasticsearch!”, a post featuring the new cardinality aggregation. It offers great use cases and an explanation of HyperLogLog++; for further detail on cardinality and HyperLogLog++, read the post.

Percentiles Aggregation

Percentiles is a multi-value metrics aggregation that calculates one or more percentiles over numeric values. For instance, a “rating” integer field could show a range of 1-8 as the norm (the 1st, 5th, 25th, 50th, and 75th percentiles), while the 95th and 99th percentiles reveal a much higher “rating” range (8-10). This lets you quickly see where most values fall and where the outliers begin.

Those are the default percentiles (1st, 5th, 25th, 50th, 75th, 95th, and 99th); they can be configured, as shown in the second example below. A script can also be used to produce the values to calculate on, for example to round a field with many decimals into more consistent groupings (see the sketch after the examples below).

curl -XGET 'localhost:9200/sports/athlete/_search?pretty' -d '{
  "aggs" : {
    "rating_overview" : {
      "percentiles" : {
        "field" : "rating"
      }
    }
  }
}'

curl -XGET 'localhost:9200/sports/athlete/_search?pretty' -d '{
  "aggs" : {
    "rating_outlier" : {
      "percentiles" : {
        "field" : "rating",
        "percents" : [95, 99, 99.9]
      } 
    }
  }
}'
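
As mentioned above, a script can also supply the values to aggregate on. Here is a sketch that rounds a hypothetical rating_raw decimal field, assuming the default MVEL scripting of Elasticsearch 1.x:

curl -XGET 'localhost:9200/sports/athlete/_search?pretty' -d '{
  "aggs" : {
    "rating_rounded" : {
      "percentiles" : {
        "script" : "Math.round(doc[\"rating_raw\"].value)"
      }
    }
  }
}'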

The percentiles metric uses the T-Digest algorithm (introduced by Ted Dunning in Computing Accurate Quantiles using T-Digests). To learn how these percentiles are calculated, visit the Elasticsearch documentation or read up on T-Digest; doing so isn’t required to use this aggregation, but it is valuable for understanding it.

You can use the compression parameter to balance memory utilization against accuracy: increasing the compression value allows the algorithm to use more memory, which increases accuracy. Larger compression values also make the algorithm slower, because the underlying T-Digest data structure grows in size, resulting in more expensive operations.
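
For example, a minimal sketch that raises compression above its default of 100 (the value shown is illustrative):

curl -XGET 'localhost:9200/sports/athlete/_search?pretty' -d '{
  "aggs" : {
    "rating_overview" : {
      "percentiles" : {
        "field" : "rating",
        "compression" : 200
      }
    }
  }
}'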

Qbox will continue to release tutorials on new aggregation types as Elasticsearch ships them. If you enjoyed this tutorial or have suggestions, please feel free to add a comment below.

As we mentioned in the percentiles section above, check out the T-Digest algorithm to understand how percentiles work under the hood. If you’re looking for more application-level examples of Elasticsearch, please check out Elasticsearch Tutorial #1 and #2 -- and stay tuned for #3 (to be released today).