In this post, we discuss Elasticsearch analyzers. Creating and configuring analyzers is one of the main steps toward better search quality: analyzers are applied both when documents are indexed and when queries are run.

The main job of any analyzer is to process a stream of characters that is cluttered with unnecessary detail, squeeze out the needed information, and produce a list of tokens that reflect it. Let’s look at an analyzer’s structure.

Analyzers

Analyzers consist of the following components (a combined example follows this list):

  1. Character filters. These run first, before tokenization, and remove or transform unwanted characters. They are optional; if you don’t specify any, the text reaches the tokenizer unchanged.
  2. Tokenizer. It splits the text into words, phrases, or character groups at a given character or set of characters. Exactly one tokenizer must always be specified.
  3. Token filters. These can modify tokens (e.g., converting them to lowercase), delete some of them (a “stopwords” list), or add new ones (a list of synonyms). They are optional.
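To make this structure concrete, here is a minimal sketch of a custom analyzer that combines all three stages. The index name my_index and analyzer name my_custom_analyzer are only illustrative; the html_strip character filter, standard tokenizer, and lowercase and stop token filters are all built into Elasticsearch:

curl -XPUT 'localhost:9200/my_index?pretty' -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"]
                }
            }
        }
    }
}'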

Elasticsearch also provides a range of standard analyzers and many filters. Here we cover a few types of analyzers; the others are described in the documentation. A quick _analyze comparison follows the list.

  1. Standard analyzer. It is built from the standard tokenizer together with the standard, lowercase, and stop token filters. It splits text into words, removes most punctuation, and converts all tokens to lowercase.
  2. Simple analyzer. It splits the text into words at any non-letter character and converts tokens to lowercase.
  3. Whitespace analyzer. It simply splits the text into words at whitespace characters.
  4. Language analyzers. They split the text into words, filter out “junk,” and trim word endings according to the morphology of the given language. Analyzers are available for many languages, which are listed in the documentation.
  5. Snowball analyzer. It uses the standard tokenizer with the standard, lowercase, stop, and snowball filters. It reduces each word to its stem and then uses the stem for the search.
  6. Custom analyzers. You can build your own analyzer from the tokenizer and filters you require.
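To compare these analyzers quickly, you can feed the same text through the _analyze API (the parameter syntax below matches the 1.x-era releases this tutorial targets; newer versions take a JSON body instead):

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Fresh Orange-Pie!'
# tokens: "fresh", "orange", "pie"
curl -XGET 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'Fresh Orange-Pie!'
# tokens: "Fresh", "Orange-Pie!"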

All of these analyzers support custom stop word lists, specified either inline in the index configuration or in an external file via the "stopwords_path" setting.
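For example, a custom stop word list can be set inline when defining an analyzer (the index and analyzer names here are illustrative):

curl -XPUT 'localhost:9200/my_index?pretty' -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_stop_analyzer": {
                    "type": "standard",
                    "stopwords": ["a", "an", "the"]
                }
            }
        }
    }
}'

To load the list from a file instead, replace "stopwords" with "stopwords_path" pointing to a file in the Elasticsearch config directory.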


Tutorial

Let’s apply this knowledge and create an index. To create an index, we need to describe its analyzers in the index settings. First, we use the standard analyzer.

curl -XPUT 'localhost:9200/recipes?pretty' -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "standard": {
                    "type": "standard"
                }
            }
        }
    }
}'

Let’s fill our index with recipe documents. You can use our example below, or write your own recipes.

curl -XPUT 'localhost:9200/recipes/recipe/1?pretty' -d '{
    "title" : "Fresh Orange Pie",
    "ingredients" : "pastry dough, navel orange, sugar, vanilla, coconut",
    "description" : "Meanwhile, peel orange and section into segments...",
    "categories" : ["pie", "orange"]
}'
curl -XPUT 'localhost:9200/recipes/recipe/2?pretty' -d '{
    "title" : "Perfect Apple Pie",
    "ingredients" : "refrigerated pie crusts, apple, sugar, all-purpose flour, cinnamon, salt, lemon juice",
    "description" : "Heat oven to 425°F. Place 1 pie crust in ungreased 9-inch glass pie plate. Press firmly against side and bottom...",
    "categories" : ["pie", "apple"]
}'
curl -XPUT 'localhost:9200/recipes/recipe/3?pretty' -d '{
    "title" : "Apples Jam",
    "ingredients" : "apples, sugar, vanilla",
    "description" : "Bring sugar and 3 Tbsp. water to a boil in a large pot over medium-high heat, stirring to dissolve sugar...",
    "categories" : ["jam", "apples"]
}'

Now let’s test our analyzer. Suppose we have a lot of oranges and want to cook something with them. Let’s find all the recipes containing oranges.

curl -XGET 'localhost:9200/recipes/_search?q=oranges'
{{"hits":{"total":0,"max_score":null ...}}

Why didn’t we get anything, even though we have a recipe with oranges? Go back to the document with id = 1 and read the recipe: it doesn’t contain the word "oranges", only "orange". The standard analyzer could not find this recipe. Let’s search another way:

curl -XGET 'localhost:9200/recipes/_search?q=orange' 
{{ hits":[{"_index":"recipes","_type":"recipe","_id":"1","_score":0.11626227,"_source":{ 
       "title" :"Fresh Orange Pie",
        "ingredients" : "pastry dough,   navel orange, sugar, vanilla, coconut",
        "description" : "Meanwhile, peel orange and section into segments...",
        "categories" : ["pie", "orange"]  }}

Now we get the expected result. Why did it work this time? Returning to the description of the standard analyzer, we see that it matches exact tokens only. What should we do if we need a search that understands word forms? Create our own analyzer.
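You can verify this with the _analyze API: the standard analyzer keeps "orange" and "oranges" as two distinct tokens, so a query for one never matches the other (1.x-style parameter syntax):

curl -XGET 'localhost:9200/recipes/_analyze?analyzer=standard&pretty' -d 'navel orange, sugar'
# tokens: "navel", "orange", "sugar" -- the index contains no token "oranges"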

Singular and Plural Forms of Words

Let’s write an analyzer that can find a recipe regardless of whether we enter "orange" or "oranges". The easiest way to solve this problem is to use an existing analyzer. Let’s try the "snowball" analyzer that we considered above.
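Note that re-running a PUT against an existing index fails, so first remove the "recipes" index created in the previous section:

curl -XDELETE 'localhost:9200/recipes?pretty'

Then create the index anew with the snowball analyzer: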

curl -XPUT 'localhost:9200/recipes?pretty' -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "snowball"
                }
            }
        }
    }
}'
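As a quick sanity check, ask the new index to analyze a sample (1.x-style syntax; the stem shown is what the snowball stemmer typically produces):

curl -XGET 'localhost:9200/recipes/_analyze?analyzer=my_analyzer&pretty' -d 'orange oranges'
# both words are reduced to the same stem: "orang"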

We should also specify the document type and its fields. This is done with a mapping. We did not do that for the standard analyzer because Elasticsearch can generate a mapping dynamically; however, a dynamic mapping may contain errors and is inappropriate for us.


Creating a custom mapping allows you to specify which analyzer is used for each field and to describe all document fields with their priority. Below, we double the “weight” of the "categories" field with a boost and exclude the "id" field from analysis.

curl -XPUT 'localhost:9200/recipes/recipe/_mapping' -d '{
    "recipe": {
        "_all": {"enabled": true, "analyzer": "my_analyzer", "search_analyzer": "my_analyzer"},
        "properties": {
            "id": {
                "type": "string",
                "index": "not_analyzed"
            },
            "title": {
                "type": "string",
                "index": "analyzed",
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer"
            },
            "ingredients": {
                "type": "string",
                "index": "analyzed",
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer"
            },
            "description": {
                "type": "string",
                "index": "analyzed",
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer"
            },
            "categories": {
                "type": "string",
                "boost": 2.0,
                "index": "analyzed",
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}'
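You can confirm that the mapping was applied by reading it back:

curl -XGET 'localhost:9200/recipes/_mapping?pretty'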

Now let’s test our analyzer with several queries. Since the index was recreated, re-add the three recipe documents from above first; the same recipes will be used.

curl -XGET 'localhost:9200/recipes/_search?q=oranges'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"1","_score":0.16781014, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=categories:oranges'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"1","_score":0.76713204, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=orange'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"1","_score":0.16781014, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=apples'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"3","_score":0.13287118, ...
{"_index":"recipes","_type":"recipe","_id":"2","_score":0.11072597, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=categories:apples'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"2","_score":0.76713204, ...
{"_index":"recipes","_type":"recipe","_id":"3","_score":0.76713204, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=apple'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"3","_score":0.13287118, ...
{"_index":"recipes","_type":"recipe","_id":"2","_score":0.11072597, ...}}

As you can see, we’ve found all the recipes regardless of the word form we entered. Try writing your own queries to make sure.
Problems can still occur with this analyzer: if you search for recipes containing the word "milk", for example, it may also find recipes with the word "milky", which is not quite accurate.

A second solution to the problem is to create an analyzer using the "porter_stem" filter. It applies the Porter algorithm to strip word endings and suffixes and behaves much like the "snowball" filter. Please note that you have to apply the lowercase filter before it!
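You can see the effect of the filter order with the _analyze API (a sketch in 1.x-style syntax, where the query parameter is named "filters"; newer versions use "filter" or a JSON body):

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=lowercase,porter_stem&pretty' -d 'Oranges'
# lowercase runs first, so "Oranges" becomes "oranges" and then stems to "orang";
# without lowercase, the capitalized token may not be stemmed correctly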

Compared to the snowball example, we change only the part describing the analyzer; the mapping and data remain the same (delete and recreate the index, reapply the mapping, and re-add the documents as before).

curl -XPUT 'localhost:9200/recipes?pretty' -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase", "stop", "porter_stem"]
                }
            }
        }
    }
}'

Let’s test it.

curl -XGET 'localhost:9200/recipes/_search?q=oranges'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"1","_score":0.16781014, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=categories:oranges'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"1","_score":0.76713204, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=orange'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"1","_score":0.16781014, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=apples'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"3","_score":0.13287118, ...
{"_index":"recipes","_type":"recipe","_id":"2","_score":0.11072597, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=categories:apples'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"2","_score":0.76713204, ...
{"_index":"recipes","_type":"recipe","_id":"3","_score":0.76713204, ...}}
curl -XGET 'localhost:9200/recipes/_search?q=apple'
{"hits":[{"_index":"recipes","_type":"recipe","_id":"3","_score":0.13287118, ...
{"_index":"recipes","_type":"recipe","_id":"2","_score":0.11072597, ...}}

As you can see, the results are the same as those we obtained earlier, which means this filter also suits our needs.
This filter has a few disadvantages, though. Because it relies on a fixed set of suffix-stripping rules rather than a dictionary of word forms, irregular and unusual words can be stemmed incorrectly. For languages with simple morphology, such as English, the rule set stays small and works well; however, highly inflected languages, where one root takes many different forms, can be a problem.

Conclusion

We have covered only a small portion of the topics related to analyzers in Elasticsearch, and only the simplest methods for handling singular and plural forms of a word in search queries. Read, explore, and invent your own analyzers! Questions/Comments? Drop us a line below.