For an Elasticsearch developer, getting your aggregations right can sometimes make you feel like a novice sausage maker. Lots of trial and error. Quite squishy. Certainly messy. But it is possible to get precise results with tokenization, exact mapping, and a custom analyzer. Quite possible, and rather easy. But it helps to understand some basic principles. The reward might be better than your momma’s cooking.

In this article, we take a look at an example of unexpected results that ES users commonly encounter when running terms aggregations. Then we walk through two methods for improving the results and achieving full accuracy.

If necessary, you can review our extensive articles on aggregations, each of which contains many examples to help you learn and gain proficiency.

Example Data

To provide examples for this tutorial, we begin by indexing four documents into the authors index under the type famousbooks. Each document contains the details of a book including its author, publishing year, and genre:

Document 1

curl -XPOST 'http://localhost:9200/authors/famousbooks/1' -d '{
  "Book": "The Mysterious Affair at Styles",
  "Year": 1920,
  "Price": 5.92,
  "Genre": "Crime Novel",
  "Author": "Agatha Christie"
}'

Document 2

curl -XPOST 'http://localhost:9200/authors/famousbooks/2' -d '{
  "Book": "And Then There Were None",
  "Year": 1939,
  "Price": 6.99,
  "Genre": "Mystery Novel",
  "Author": "Agatha christie"
}'

Document 3

curl -XPOST 'http://localhost:9200/authors/famousbooks/3' -d '{
  "Book": "The Corrections",
  "Year": 2001,
  "Price": 6,
  "Genre": "Fiction",
  "Author": "Jonathan Franzen"
}'

Document 4

curl -XPOST 'http://localhost:9200/authors/famousbooks/4' -d '{
  "Book": "The Barricade",
  "Year": 1987,
  "Price": 7,
  "Genre": "Horror",
  "Author": "Ainsley Christie "
}'

In documents 1 and 2, notice that there is a small case variation in the Author field (“Christie” versus “christie”). This is quite intentional, as it will illustrate a subtlety that can cause headaches as you strive to get your aggregations right.

Simple Terms Aggregation

Let’s perform an aggregation on the documents in our index, classifying them by author (the value in the Author field of each document). We can achieve this with a simple terms aggregation:

curl -XGET 'http://localhost:9200/authors/famousbooks/_search?pretty=true' -d '{
  "size": 0,
  "aggs": {
    "authors-aggs": {
      "terms": {
        "field": "Author"
      }
    }
  }
}'

When we run this aggregation, we obtain the following result:

"aggregations": {
  "authors-aggs": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "christie",
        "doc_count": 3
      },
      {
        "key": "agatha",
        "doc_count": 2
      },
      {
        "key": "ainsley",
        "doc_count": 1
      },
      {
        "key": "franzen",
        "doc_count": 1
      },
      {
        "key": "jonathan",
        "doc_count": 1
      }
    ]
  }
}

But wait! These are not the results we expect! With respect to the keys, we should get:

  • Agatha Christie : 2 documents
  • Ainsley Christie : 1 document
  • Jonathan Franzen : 1 document

We explain the remedy in the following sections.

Understanding the Logic that Gives Unexpected Results

Let’s explore why we got erroneous results.

First, we need to learn just a little bit about the indexing process of Elasticsearch. Because much of the Elasticsearch foundation is built on the Lucene library, the indexing processes are largely governed by the Lucene design. A brief illustration will help clarify.

Let’s say that we have two documents with the following data:

Document1 = "Mapping is important"
Document2 = "Mapping is very important"

Internally, Lucene tokenizes each string into distinct terms (with the default standard analyzer, the terms are also lowercased), and then builds an inverted index that maps each term back to the documents in which it occurs. We can depict the inverted index like this:

"mapping" → Document1, Document2
"is" → Document1, Document2
"important" → Document1, Document2
"very" → Document2

We have to realize this: when we aggregate on an analyzed field, the aggregation occurs at the level of these individual terms rather than the original plain text. So we get the aggregation keys as lowercased single words, not the original text strings.

In our particular case, the author names are split into individual terms and the aggregation is done on those terms. The results are in accordance with the default Elasticsearch/Lucene analysis process, but they are not the results we want, so we need a better approach.
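
We can watch this happen with the _analyze API. (This is just a quick check; the exact response format varies across Elasticsearch versions, so treat the output as illustrative.)

curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'Agatha Christie'

The response contains two separate lowercased tokens, agatha and christie, which is exactly what our terms aggregation bucketed on.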

We outline below two methods for bypassing the default Elasticsearch analysis.

Partial Results

The first method is to constrain Elasticsearch to index the content of the field as one exact value, which is done quite easily by setting the index option on the Author field to not_analyzed.

It’s necessary, however, to delete the original index, apply the not_analyzed mapping, and then index these documents again.

This command will delete the index:

curl -X DELETE "http://localhost:9200/authors"
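
The mapping API requires the target index to exist, so we recreate the (empty) authors index with default settings:

curl -X PUT "http://localhost:9200/authors"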

Next, we apply the following mapping to the index. By setting the index value to not_analyzed, we constrain Elasticsearch to index the Author field as a single exact value:

curl -X PUT "http://localhost:9200/authors1/famousbooks/_mapping" -d '{
  "famousbooks": {
    "properties": {
      "Author": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}'
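
To confirm that the mapping took effect, we can fetch it back; the Author field should now show "index": "not_analyzed" in the response:

curl -XGET 'http://localhost:9200/authors/famousbooks/_mapping?pretty=true'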

Now we need to index all of the documents again. But that’s easy: simply repeat the steps in the “Example Data” section above.

We apply the previous aggregation query and get the following results under the aggregations key in the response:

"aggregations": {
  "authors": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "Agatha Christie",
        "doc_count": 1
      },
      {
        "key": "Agatha christie",
        "doc_count": 1
      },
      {
        "key": "Ainsley Christie ",
        "doc_count": 1
      },
      {
        "key": "Jonathan Franzen",
        "doc_count": 1
      }
    ]
  }
}

These new results are a significant improvement over the previous ones, but they are still not accurate. Our aggregation is now case sensitive: that’s why “Agatha Christie” and “Agatha christie” get different treatment.

We need another method, one that accounts for case sensitivity.

More Accurate Results

Now we’ll make use of the keyword tokenizer, which emits the entire field value as a single token instead of splitting it into words. After the keyword tokenizer, we’ll apply the lowercase token filter, which normalizes the token text to lowercase.

After that, we’ll map the analyzer to the Author field, which constrains Elasticsearch to treat “Agatha Christie” and “Agatha christie” as equivalent (“agatha christie”).
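
Before wiring these into the index, we can preview the tokenizer-plus-filter combination directly with the _analyze API. (The query parameters below follow the 1.x-era API used throughout this article; newer versions expect a JSON body instead.)

curl -XGET 'http://localhost:9200/_analyze?tokenizer=keyword&filters=lowercase&pretty=true' -d 'Agatha Christie'

This returns one single token, agatha christie, rather than two separate words.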

Again, we need to delete the index:

curl -X DELETE "http://localhost:9200/authors"

Now, let’s create the index again. This time, we define our custom analyzer (of type custom) in the index settings, giving it the name final:

 curl -X PUT "http://localhost:9200/authors" -d '{
  "analysis": {
    "analyzer": {
      "final": {
        "type": "custom",
        "tokenizer": "keyword",
        "filter": "lowercase"
      }
    }
  }
}'
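
As a sanity check, we can run the new analyzer by name against the index we just created; both “Agatha Christie” and “Agatha christie” should come back as the single token agatha christie:

curl -XGET 'http://localhost:9200/authors/_analyze?analyzer=final&pretty=true' -d 'Agatha Christie'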

The next step is to map our custom analyzer to the field we are interested in (in this case, Author):

 curl -X PUT "http://localhost:9200/authors/famousbooks/_mapping" -d '{
  "famousbooks": {
    "properties": {
      "Author": {
        "type": "string",
        "analyzer": "final"
      }
    }
  }
}'

Now, as done previously, re-index our documents into the authors index under the type famousbooks (repeat the steps in the “Example Data” section).

After setting up the mapping and indexing the documents, apply the previous terms aggregation query to the index. This leads to the following response:

 "aggregations": {
  "authors": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "agatha christie",
        "doc_count": 2
      },
      {
        "key": "ainsley christie ",
        "doc_count": 1
      },
      {
        "key": "jonathan franzen",
        "doc_count": 1
      }
    ]
  }
}

Since these are the aggregation results we expect, this is a good place to conclude this article.

Conclusion

Here in this article, we’ve seen a demonstration of the default indexing mechanism in Elasticsearch and its limitations. We’ve also seen how to define custom analyzers and map them to specific fields to obtain precise results.