In this article, we'll continue our overview of Elasticsearch bucket aggregations, focusing on the significant terms and significant text aggregations. These aggregations are designed to find interesting and/or unusual occurrences of terms in your datasets, which can reveal a lot about the hidden properties of your data. This functionality is especially useful for the following use cases:

  • Identifying documents relevant to a user's query by suggesting related terms such as synonyms and acronyms. For example, the significant terms aggregation could suggest documents mentioning "bird flu" when the user searches for H1N1. 
  • Identifying anomalies and interesting occurrences in your data. For example, by filtering documents based on location, we could identify the most frequent crime types in particular areas. 
  • Identifying the most significant properties of a group of subjects using the significant terms aggregation on integer fields like height, weight, income, etc. 

It should be noted that both significant terms and significant text aggregations perform complex statistical computations on the documents matched by your query (the foreground set) against all other documents in your index (the background set). Therefore, both aggregations are computationally intensive and need to be configured properly to perform well. However, once you master them with the help of this tutorial, you'll acquire a powerful tool for building useful features in your applications and extracting insights from your datasets. Let's get started!

Tutorial

Examples in this tutorial were tested in the following environment:

  • Elasticsearch 6.4.0
  • Kibana 6.4.0

Create Index Mapping

To illustrate how significant terms and significant text work, we'll first need to create a test "news" index storing a collection of news articles. The index mapping will contain fields for the author, publication date, article title, number of views, and topic. Let's create the mapping:

curl -XPUT "http://localhost:9200/news/" -H "Content-Type: application/json" -d'
{
   "mappings": {
      "misc": {
         "properties": {
            "published": {
               "type": "date",
               "format": "dateOptionalTime"
            },
            "author": {
               "type": "keyword"
            },
            "title": {
               "type": "text"
            },
             "topic": {
                 "type":"keyword"
             },
             "views": {
                 "type": "integer"
             }
         }
      }
   }
}'

As you see, we used the keyword datatype for the topic and author fields and the text datatype for the title field. As a reminder, keyword fields are searchable only by their exact value, whereas text fields are analyzed and support full-text search. We'll see this difference in action right after indexing some documents below.

Next, let's add some arbitrary news documents to the index using the Bulk API.

curl -XPOST "http://localhost:9200/news/_bulk" -H "Content-Type: application/json" -d '
{"index":{"_index":"news","_type":"misc"}}
{"author":"John Michael", "published":"2018-07-08", "title":"Tesla is flirting with its lowest close in over 1 1/2 years (TSLA)", "topic":"automobile","views":"431" }
{"index":{"_index":"news","_type":"misc"}}
{"author":"John Michael", "published":"2018-07-22", "title":"Tesla to end up like Lehman Brothers (TSLA)", "topic":"automobile", "views":"1921" }
{"index":{"_index":"news","_type":"misc"}}
{"author":"John Michael", "published":"2018-07-29", "title":"Tesla (TSLA) official says that they are going to release a new self-driving car model in the coming year", "topic":"automobile", "views":"1849" }
{"index":{"_index":"news","_type":"misc"}}
{"author":"John Michael", "published":"2018-08-14", "title":"Five ways Tesla uses AI and Big Data", "topic":"ai", "views":"871" }
{"index":{"_index":"news","_type":"misc"}}
{"author":"John Michael", "published":"2018-08-14", "title":"Toyota partners with Tesla (TSLA) to improve the security of self-driving cars", "topic":"automobile", "views":"871" }
{"index":{"_index":"news","_type":"misc"}}
{"author":"Robert Cann", "published":"2018-08-25", "title":"Is AI dangerous for humanity", "topic":"ai", "views":"981" }
{"index":{"_index":"news","_type":"misc"}}
{"author":"Robert Cann", "published":"2018-09-13", "title":"Is AI dangerous for humanity", "topic":"ai", "views":"871" }
{"index":{"_index":"news","_type":"misc"}}
{"author":"Robert Cann", "published":"2018-09-27", "title":"Introduction to Generative Adversarial Networks (GANs) in self-driving cars", "topic":"automobile", "views":"1183" }
{"index":{"_index":"news","_type":"misc"}}
{"author":"Robert Cann", "published":"2018-10-09", "title":"Introduction to Natural Language Processing", "topic":"ai", "views":"786" }
{"index":{"_index":"news","_type":"misc"}}
{"author":"Robert Cann", "published":"2018-10-15", "title":"New Distant Objects Found in the Fight for Planet X ", "topic":"astronomy", "views":"542" }
'
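
With the documents indexed, we can see the keyword/text distinction in action. As a quick sketch, the first query below matches the topic keyword field only by its exact value, while the second performs a full-text match that finds the token "tesla" anywhere in the analyzed title field:

curl -X GET "localhost:9200/news/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {"term": {"topic": "automobile"}}
}
'

curl -X GET "localhost:9200/news/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {"match": {"title": "Tesla"}}
}
'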

Significant Terms Aggregation

As we've already mentioned, a significant terms aggregation identifies unusual and interesting term occurrences in your data. The aggregation is very powerful for the following use cases:

  • Identifying relevant terms/documents associated with the user's query. For example, when a user queries for "Spain," the aggregation could suggest such terms as "Madrid," "Corrida," or any other terms commonly found in documents about Spain.
  • Building an automated news classifier where documents are classified based on a map of frequently connected terms.
  • Spotting anomalies in your data. For example, with the help of this aggregation, we can identify unusual crime types or diseases in certain geographical areas.

It is important to understand that the terms selected by the significant terms aggregation are not simply the most popular terms in your document set. For example, even if the acronym "MSFT" exists in only 10 documents in a 10-million-document index, it can still be relevant if it is found in 10 out of the 50 documents that match the user's query for "Microsoft." Such a spike in frequency makes the acronym highly relevant to the user's search.

To identify significant terms, the aggregation performs complex statistical analysis on the search results matched by a query and on the index from which the results were collected. The search results directly matching the query represent the foreground set, while the index from which they were retrieved represents the background set. The task of the significant terms aggregation is to compare these sets and find the terms most strongly associated with the user's query.

This still sounds quite abstract, doesn't it?

Let's use a real-world example to demonstrate how the aggregation works. In the example below, we'll try to find the significant topics for each individual writer in our index. To accomplish that, we first use the terms bucket aggregation on the author field. As you remember, the terms aggregation constructs buckets for all unique terms (i.e., authors) found in the index. Next, we use the significant terms aggregation on the topic field to find the most significant topic(s) for each individual writer. Take a look at the query below:

curl -X GET "localhost:9200/news/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "aggregations": {
        "authors": {
            "terms": {"field": "author"},
            "aggregations": {
                "significant_topic_types": {
                    "significant_terms": {"field": "topic"}
                }
            }
        }
    }
}
'

We use a simple configuration for the significant_terms aggregation that specifies only the target field. The response should look something like this:

"aggregations" : {
    "authors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "John Michael",
          "doc_count" : 5,
          "significant_topic_types" : {
            "doc_count" : 5,
            "bg_count" : 7,
            "buckets" : [
              {
                "key" : "automobile",
                "doc_count" : 4,
                "score" : 0.3200000000000001,
                "bg_count" : 4
              }
            ]
          }
        },
        {
          "key" : "Robert Cann",
          "doc_count" : 5,
          "significant_topic_types" : {
            "doc_count" : 5,
            "bg_count" : 8,
            "buckets" : [
              {
                "key" : "ai",
                "doc_count" : 3,
                "score" : 0.11999999999999997,
                "bg_count" : 4
              }
            ]
          }
        }
      ]
    }
  }

The response indicates that "ai" is the most significant topic for Robert Cann and "automobile" is the most significant topic for John Michael. You can verify that both authors have published articles on other topics, but they mostly specialize in AI and automobiles, respectively.

Note on Scoring: In the response above, you can see that each term bucket contains a score for the term. These scores are calculated from the document frequencies in the foreground and background sets. In brief, a term is considered significant if there is a noticeable difference between the frequency with which it appears in the foreground set and its frequency in the background set.
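
By default, scores are computed with the JLH heuristic, but the aggregation lets you opt into other built-in heuristics such as chi_square, gnd (Google normalized distance), or mutual_information. As a sketch, this variant of the earlier query ranks each author's topics with chi-square instead of the default:

curl -X GET "localhost:9200/news/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "aggregations": {
        "authors": {
            "terms": {"field": "author"},
            "aggregations": {
                "significant_topic_types": {
                    "significant_terms": {
                        "field": "topic",
                        "chi_square": {"include_negatives": true, "background_is_superset": true}
                    }
                }
            }
        }
    }
}
'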

The significant terms aggregation can find significant terms in integer fields as well. To illustrate this, let's use the sports dataset from the previous tutorial on metrics aggregations. In this example, we want to find the athlete ages that stand out in each sport:

curl -X GET "localhost:9200/sports/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "aggregations": {
        "authors": {
            "terms": {"field": "sport"},
            "aggregations": {
                "significant_topic_types": {
                    "significant_terms": {
                        "field": "age",
                        "min_doc_count": 2
                    }
                }
            }
        }
    }
}
'

As you see, along with the target field (age), we use the min_doc_count parameter set to 2. As a result, term buckets will be created only if there are 2 or more document hits for a term. Note that on high-cardinality text fields, setting min_doc_count to low values may cause insignificant terms like prepositions and articles to be labeled as significant. Given that our index is small, setting min_doc_count to 2 works fine:

"aggregations" : {
    "authors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "significant_topic_types" : {
            "doc_count" : 9,
            "bg_count" : 21,
            "buckets" : [
              {
                "key" : 20,
                "doc_count" : 4,
                "score" : 0.07407407407407407,
                "bg_count" : 8
              }
            ]
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "significant_topic_types" : {
            "doc_count" : 5,
            "bg_count" : 15,
            "buckets" : [
              {
                "key" : 20,
                "doc_count" : 2,
                "score" : 0.4,
                "bg_count" : 3
              }
            ]
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "significant_topic_types" : {
            "doc_count" : 5,
            "bg_count" : 18,
            "buckets" : [
              {
                "key" : 20,
                "doc_count" : 2,
                "score" : 0.3200000000000001,
                "bg_count" : 4
              }
            ]
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "significant_topic_types" : {
            "doc_count" : 3,
            "bg_count" : 4,
            "buckets" : [
              {
                "key" : 29,
                "doc_count" : 2,
                "score" : 0.22222222222222215,
                "bg_count" : 2
              }
            ]
          }
        }
      ]
    }
  }

As you see, 20 is the most significant age in Football, Basketball, and Hockey, while 29 stands out in Handball.

Note that you can use the significant terms aggregation only on integer fields; floating point fields are not currently supported. That's because integer and long fields can represent discrete concepts that are interesting for this analysis (e.g., age, bank account number), whereas floating point fields usually represent quantities of something. As such, individual floating point terms are not very useful for this kind of frequency analysis.

Using Significant Terms with Free-Text Fields

The significant_terms aggregation can also be used on tokenized free-text fields to refine end-user searches and suggest keywords for use in percolator queries. Doing so, however, is not generally recommended: picking a free-text field for significant terms analysis can be very computationally expensive, because the aggregation attempts to load every unique word into RAM. That's not a big deal for a small index. Otherwise, if you want significant-terms functionality for text fields, the significant text aggregation is a better option.

Our index is quite small, so we don't risk running into memory problems. Let's go ahead and use significant terms on a free-text field in our index. In this example, we'll try to find interesting terms associated with queries that contain the term "self-driving." As in the previous example, we set min_doc_count to 2 to exclude insignificant terms like prepositions and articles. One caveat: fielddata is disabled on text fields by default, so running significant_terms on the title field first requires enabling it (which is exactly what makes this approach memory-hungry).
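
A minimal sketch of that mapping update, assuming the "news" index and "misc" type created above:

curl -X PUT "localhost:9200/news/_mapping/misc" -H 'Content-Type: application/json' -d'
{
    "properties": {
        "title": {
            "type": "text",
            "fielddata": true
        }
    }
}
'

With fielddata enabled, we can run the query: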

curl -X GET "localhost:9200/news/misc/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match" : {"title" : "self-driving"}
    },
    "aggregations": {
        "keywords" : {
           "significant_text" : { "field" : "title", "min_doc_count":2 }
              }
       }
}
'

And here is the response to the query:

"aggregations" : {
    "keywords" : {
      "doc_count" : 4,
      "bg_count" : 10,
      "buckets" : [
        {
          "key" : "cars",
          "doc_count" : 2,
          "score" : 1.5555555555555554,
          "bg_count" : 2
        }
      ]
    }
  }

As you see, the aggregation found one significant term for the "self-driving" query: the most relevant term associated with this query in our document set is "cars." However, as we've mentioned, if you want to find significant words in free-text fields, you should normally opt for the significant text aggregation, which we discuss below.

Custom Background Sets

By default, the foreground set is compared against a background set of all the documents in your index. However, in certain scenarios you may want to construct a narrower background set as the basis for the comparison. For example, when a user queries for "Paris" in an index with content from all over the world, the query would likely reveal that "French" is a significant term. Although this is true, you may want more focused terms. In that case, you can apply a background_filter on the term "france" to construct a narrower set of documents as the context; against this background, "French" becomes a commonplace, insignificant term. Keep in mind that using background filters slows down the query, since term frequencies must be recomputed against the filtered set. The query below is schematic and uses city, tag, and text fields that are not part of our news index:

curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match" : {
            "city" : "paris"
        }
    },
    "aggs" : {
        "tags" : {
            "significant_terms" : {
                "field" : "tag",
                "background_filter": {
                        "term" : { "text" : "france"}
                }
            }
        }
    }
}
'

Size and Shard Size

The significant terms aggregation returns the top-matching term buckets produced by each shard of the index, which are then merged and refined into the final result. You can use the size parameter to control how many term buckets are returned out of the full terms list. However, if the number of unique terms is greater than size, the returned list can become less accurate.

To make the returned list more accurate, use the size parameter in tandem with shard_size. The shard_size parameter controls the number of candidate terms produced by each shard. If you are interested in low-frequency terms, consider setting shard_size significantly higher than size; this ensures a large pool of promising candidate terms. If shard_size is set to -1 (the default), it is estimated automatically based on the number of shards and the size parameter.
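
As a sketch (the "topics" aggregation name here is arbitrary), the query below asks each shard for up to 60 candidate terms while returning only the top 10 buckets:

curl -X GET "localhost:9200/news/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "query": {"match": {"title": "tesla"}},
    "aggregations": {
        "topics": {
            "significant_terms": {
                "field": "topic",
                "size": 10,
                "shard_size": 60
            }
        }
    }
}
'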

Significant Text Aggregation (experimental feature)

The significant text aggregation is specifically designed for finding significant terms in free-text fields. It is an experimental feature as of Elasticsearch 6.4.0, and running it on large datasets may require a lot of time and memory. Therefore, it is recommended to use the significant_text aggregation as a child of either the sampler or diversified sampler aggregation, limiting the analysis to a small selection of top-matching documents (e.g., 100).

The main differences between the significant text aggregation and significant terms aggregation are the following:

  • It is designed for use on text fields.
  • It does not require doc-values or field data.
  • It can re-analyze text on-the-fly, which means the aggregation can also filter duplicate sections of noisy text that otherwise tend to skew statistics.

Similar to the significant terms aggregation, significant text uses foreground and background sets to perform statistical analysis of word frequencies. Let's look at the example below:

curl -X GET "localhost:9200/news/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match" : {"title" : "TSLA"}
    },
    "aggregations" : {
        "my_sample" : {
            "sampler" : {
                "shard_size" : 100
            },
            "aggregations": {
                "keywords" : {
                    "significant_text" : { "field" : "title","min_doc_count":4 }
                }
            }
        }
    }
}
'

As you see, we use significant_text inside the sampler aggregation to limit the analysis to a selection of top-matching documents. Our index is small, so we could do without the sampler, but remember to include it if your index is large. As a child of the sampler, the significant text aggregation searches for terms significant to the "TSLA" query in the "title" field of our document set.

The response should look something like this:

"aggregations" : {
    "my_sample" : {
      "doc_count" : 4,
      "keywords" : {
        "doc_count" : 4,
        "bg_count" : 10,
        "buckets" : [
          {
            "key" : "tesla",
            "doc_count" : 4,
            "score" : 1.5,
            "bg_count" : 4
          },
          {
            "key" : "tsla",
            "doc_count" : 4,
            "score" : 1.5,
            "bg_count" : 4
          }
        ]
      }
    }
  }

The results show that the term "tesla" is strongly associated with the stock ticker TSLA.

Excluding Noisy Data

Noisy data in free-text fields may include cut-and-paste paragraphs, boilerplate headers/footers, sidebar news, addresses, etc. This often happens when a lengthy text is copied and pasted across a number of sources; as a result, any rare names or numbers contained in the pasted fragments, though unrelated to the query, become statistically correlated with it.

Fortunately, the significant_text aggregation can apply a filter that removes sequences of any 6 or more tokens that have already been seen. If your field contains full HTML pages, for example, there may be many duplicates associated with such noisy data. Elasticsearch can clean that data on-the-fly using the filter_duplicate_text setting. The query below is schematic (our news index has no content field):

curl -X GET "localhost:9200/news/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 100
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  }
}
'

Significant text aggregation has the following limitations:

  • Does not support child aggregations. This limitation is intentional, because supporting child aggregations would come at a high memory cost. The suggested workaround is to make a follow-up query using a terms aggregation with an include clause and child aggregations, which lets you analyze the selected keywords in a more efficient fashion (see the sketch after this list).
  • Does not support nested objects. The significant_text aggregation currently cannot work on text fields in nested objects, because it works with the document JSON source.
  • The counts of documents containing a term are based on summing the samples returned from each shard. These counts can be low if some shards did not include a given term in their top sample, or, when considering the background frequency, high because occurrences found in deleted documents may be counted. This is the result of a trade-off in which the Elasticsearch developers have chosen fast performance at the cost of some (typically small) inaccuracies.
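
For instance, having discovered "tesla" and "tsla" as significant keywords above, a follow-up query could restrict a regular terms aggregation to exactly those terms and attach a child aggregation to them. This sketch assumes fielddata was enabled on the title field, as shown earlier:

curl -X GET "localhost:9200/news/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "query": {"match": {"title": "TSLA"}},
    "aggregations": {
        "keywords": {
            "terms": {
                "field": "title",
                "include": ["tesla", "tsla"]
            },
            "aggregations": {
                "avg_views": {"avg": {"field": "views"}}
            }
        }
    }
}
'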

Like the significant terms aggregation, the significant text aggregation supports custom background contexts and the same ranking parameters.
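
For example, here is a sketch that narrows the background set for significant_text to automobile articles only, using our news index:

curl -X GET "localhost:9200/news/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
    "query": {"match": {"title": "self-driving"}},
    "aggregations": {
        "keywords": {
            "significant_text": {
                "field": "title",
                "background_filter": {
                    "term": {"topic": "automobile"}
                }
            }
        }
    }
}
'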

Conclusion

In this article, we've covered two important bucket aggregations in Elasticsearch: the significant terms and significant text aggregations. Both are effective at identifying unusual or interesting occurrences of terms in your Elasticsearch indices, which makes them useful for anomaly detection and for suggesting relevant documents. Use the significant terms aggregation with keyword and integer fields, since that is what it is optimized for; the significant text aggregation, in contrast, is an experimental feature designed specifically for free-text fields. Both aggregations are memory- and time-consuming, so design your queries carefully to avoid a high memory footprint. We've covered the basic techniques for doing so in this article.