In the previous blog in our analyzer series we learned, in detail, how an inverted index is created and what the components of an analyzer are, and we walked through a simple example of using those components together as a single entity to analyze input text.

In this blog we move on to applying analyzers by creating a custom analyzer from multiple components. We will look at analyzer design and at additional components that are crucial for more accurate search results, with examples along the way.

Analyzer Design

We have seen how the three components “character filter,” “tokenizer” and “token filter” form the essential building blocks of an analyzer. Each of these elements comes in multiple variants, which we can combine in different ways to yield the desired results.

For example, the “character filter” comes in several types, such as the “html_strip” filter (to strip out HTML elements) and the “mapping” filter (to replace characters with text of our choice). There are likewise different types of “tokenizers” and “token filters.”
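
For instance, the built-in “html_strip” character filter can be tried out on its own through the analyze API. On Elasticsearch 5.x and later the API accepts a JSON body; the snippet of HTML below is just a sample of our own:

curl -X GET "http://localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d '{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "text": "<p>I love <b>Elasticsearch</b></p>"
}'

The HTML tags are stripped before tokenization, leaving only the tokens “I”, “love” and “Elasticsearch”.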

A common practice in real-world applications is to define custom analyzers that combine our own filters and tokenizers with those provided by Elasticsearch. The general format for such custom analyzers is shown below:

curl -X PUT "http://localhost:9200/testindex" -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        custom character filters
      },
      "tokenizer": {
        custom tokenizers
      },
      "filter": {
        custom token filters
      },
      "analyzer": {
        custom analyzers
      }
    }
  }
}'

Using multiple filters to create a custom analyzer

In this section we will combine some of the interesting and useful filters provided by Elasticsearch with our own custom filters to create a custom analyzer.

We will analyze the given text and:

  1. replace “&” occurrences in the text with the word “and” – using a custom “mapping” character filter
  2. split the words at punctuation – using the “standard” tokenizer
  3. make the search case insensitive – using the “lowercase” token filter
  4. provide better search results by reducing words to their root form (the words “running” and “runs” share the root word “run”) – using the “stemmer” token filter
  5. stop the words “the,” “an,” “a,” and “is” from being indexed into our inverted index – using the “stop” token filter

So let’s have a look at the filters used in each step first and then proceed to the complete index settings.

Custom mapping character filter

We know the input text arrives at the character filter as a single string. What we need to do here is replace every occurrence of the “&” character in this string with “and.” So we define a filter with the custom name “replace-&” of type “mapping” under “char_filter”, which maps “&” to “and” as shown below:

"char_filter": {
  "replace-&": {
    "type": "mapping",
    "mappings": [
      "&=> and "
    ]
  }
}
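
Once the index has been created with this filter (the full settings are shown further below), the body form of the analyze API (Elasticsearch 5.x and later) can be used to verify the mapping; “Tom & Jerry” is just a throwaway sample text:

curl -X GET "http://localhost:9200/testindex/_analyze?pretty" -H 'Content-Type: application/json' -d '{
  "char_filter": ["replace-&"],
  "tokenizer": "standard",
  "text": "Tom & Jerry"
}'

The response should contain the tokens “Tom”, “and” and “Jerry”, confirming that “&” was rewritten before tokenization.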

Standard tokenizer

As mentioned in the previous blog, the standard tokenizer breaks the string into words whenever it encounters punctuation or whitespace. The setting for this can be given as
tokenizer : “standard”
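
To see this tokenizer in isolation, it can be called through the body form of the analyze API without any index, since “standard” is built in; the sentence below is only a sample:

curl -X GET "http://localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d '{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped!"
}'

This should produce the tokens “The”, “2”, “QUICK”, “Brown”, “Foxes” and “jumped”, split at whitespace and punctuation, with the case still untouched.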

Lowercase token filter

After tokenizing, the terms are lowercased in order to make our search case insensitive.
"filter" : ["lowercase"]

We have given the filter field an array structure because we have multiple token filters to apply here.

Stemmer Token Filter

Stemming is a very useful concept when it comes to populating the inverted index. Many words, like “learn,” “learned,” and “learning,” mean the same thing but appear in different forms. Indexing all of them separately would be a waste of space in our inverted index, because indexing the root word is enough to match any of its forms. This is exactly what the stemmer token filter does.

By employing this, if we search for “learn,” documents containing any of the three words “learn,” “learned,” and “learning” will be shown in the results.

To employ stemming, we have two main approaches:

  • Algorithmic approach: use a generic, language-based algorithm to convert words to their stems or base forms.
  • Dictionary-based approach: use a look-up mechanism to map each word to its stem.

Here we are going to use “snowball,” an algorithmic stemmer that finds the stem of each word. This means that it intelligently removes trailing characters such as “ed,” “ing,” and so on.

This filter is provided by Elasticsearch, so we can reference it directly under the filter field in the settings. Our filter field in the settings now becomes:
"filter" : ["lowercase", "snowball"]

Stopword token filter

In most text searches there is no place for common words like “the,” “an,” “is,” etc. Indexing them would populate our inverted index unnecessarily, much like the unstemmed word forms in the previous case. To solve this we have another token filter, of type “stop”, which prevents such common words (or stopwords) from being indexed.

"filter": {
  "custom_stopwords": {
    "type": "stop",
    "stopwords": [
      "the",
      "an",
      "a",
      "is",
      "to"
    ]
  }
}

As you can see, we have defined a custom filter named "custom_stopwords" and, in its stopwords array, specified which words should be removed from the token stream.
If we don’t specify the stopwords array, Elasticsearch falls back to its predefined default list of English stopwords.
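
If you would rather rely on that built-in list than maintain your own, the “stop” filter also accepts the predefined "_english_" value; the filter name "english_stopwords" below is just a label of our own:

"filter": {
  "english_stopwords": {
    "type": "stop",
    "stopwords": "_english_"
  }
}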

Custom Analyzer Set-up

We have seen how to define the tokenizer and all four filters in the settings. Now let’s see how to combine them into a custom analyzer. Note that because the “standard” tokenizer is built in, it does not need its own entry in the settings; we simply reference it inside the analyzer definition.

curl -X PUT "<a href="http://localhost:9200/testindex">http://localhost:9200/testindex</a>" -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "replace-&": {
          "type": "mapping",
          "mappings": [
            "&=> and "
          ]
        }
      },
      "tokenizer": "standard",
      "filter": {
        "custom-stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "an",
            "a",
            "is",
            "to"
          ]
        }
      },
      "analyzer": {
        "first-custom-analyzer": {
          "type": "custom",
          "char_filter": [
            "replace-&"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "snowball",
            "custom-stopwords"
          ]
        }
      }
    }
  }
}'

From the above settings we can see that we have created a custom analyzer named "first-custom-analyzer." It uses four filters (one character filter and three token filters) and the built-in “standard” tokenizer.
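
To actually apply the analyzer to documents, it has to be referenced from a field mapping. Here is a minimal sketch, in which the mapping type "doc" and the field "description" are placeholders of our own (on Elasticsearch 5.x and later the field type would be "text" rather than "string"):

curl -X PUT "http://localhost:9200/testindex/_mapping/doc" -d '{
  "properties": {
    "description": {
      "type": "string",
      "analyzer": "first-custom-analyzer"
    }
  }
}'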

Now let’s check how the analyzer works by giving it a sample text. Since "first-custom-analyzer" is not a global analyzer, we need to specify the index when we use the analyze API for testing.

Let our sample text be: “John & Smith, went for running. John runs faster than Smith, which is going to make him the winner”

curl -XGET "localhost:9200/testindex-11112/_analyze?analyzer=first-custom-analyzer&pretty=true" -d 'John & Smith, went for running. John runs faster than Smith,which is going to make him the winner'

Looking at the response in the terminal, we can see the tokens that were generated. The following table shows the position numbers and token values taken from the response:

[Table: position numbers and tokens generated by “first-custom-analyzer” for the sample text]

From the above token table we can see that

  1. & is replaced by “and” (position number 2 in the table)
  2. tokenization is done at whitespace and at punctuation
  3. every token is lowercased
  4. the words “running” and “runs” are reduced to their stem “run” (position numbers 6 and 8)
  5. the words specified in the stopwords list do not appear in the results. Our input text contains “is” and “the” at position numbers 13 and 18, but they are missing from the generated token list and their position numbers are simply skipped.
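
For reference, the raw response behind this table is a JSON array of token objects. An abbreviated, purely illustrative excerpt is shown below; the exact offsets depend on the input text, and recent Elasticsearch versions number positions from 0 rather than 1:

{
  "tokens": [
    { "token": "john",  "start_offset": 0, "end_offset": 4,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "and",   "start_offset": 5, "end_offset": 6,  "type": "<ALPHANUM>", "position": 2 },
    { "token": "smith", "start_offset": 7, "end_offset": 12, "type": "<ALPHANUM>", "position": 3 }
  ]
}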

Conclusion

In this part of the “analyzer” series we have seen how to set up and use custom analyzers. We have also become familiar with concepts like stemming and with filters such as “snowball” and “stop”. In the next blog we will look at slightly more advanced analyzers, such as “edge-n-gram” and “phonetic”, which empower us to do much more interesting searches.