Elasticsearch plugins are designed to extend the core ES functionality in custom ways. In particular, plugins allow creating custom mappings, language tokenizers, and analyzers, and they enable native scripting and integrations with diverse third-party software and services.

Qbox-hosted Elasticsearch ships with a wide variety of core and community-contributed Elasticsearch plugins. You can choose which plugins to add to your cluster during Elasticsearch installation by checking the corresponding boxes in the plugin list (see the image below). As a user of Qbox services, you might be interested in the purposes and features of these plugins and how they can add value to your Qbox cluster. That's precisely what we are after.

In Part I of this article, we'll review the plugins for morphological and phonetic analysis, tokenization and concatenation, native scripting, and several others. By the end of this review, you'll have a better understanding of which plugins you might wish to install on your Qbox-hosted cluster.

Elasticsearch plugins in Qbox


International Components for Unicode (ICU) Analysis Plugin

This plugin integrates the Lucene ICU module, adding extended Unicode support, including Unicode normalization, collation, transliteration, Unicode-aware case folding, and better analysis of Asian languages.

To illustrate how the plugin works, we'll look at the ICU folding token filter, which is a core part of its functionality. The filter implements case folding, a technique invented to solve the problem of case-insensitive comparison for non-ASCII characters. A trivial mapping from [A-Z] to [a-z] works only for simple ASCII-only texts, but it fails for languages with additional non-ASCII characters. For example, converting the German letter ß to uppercase yields SS, which, if converted back to lowercase, yields ss -- not what we would expect. Case folding avoids this inconsistency of conventional lowercasing and uppercasing of some Unicode characters: it converts characters to a lowercase form that might not be the correct spelling but that enables case-insensitive comparison.
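
A quick way to see folding in action is the _analyze API with the plugin's icu_folding filter (a minimal sketch, assuming the plugin is installed; the sample word is arbitrary):

GET _analyze
{
  "tokenizer": "icu_tokenizer",
  "filter": [ "icu_folding" ],
  "text": "Straße"
}

The response should contain a single folded token, "strasse", ready for case-insensitive comparison.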

In the example below, we define an ICU folding token filter with an exception for the Swedish language. The filter is configured with the unicodeSetFilter parameter, which specifies the Swedish letters that should not be folded (note that we must specify both the uppercase and lowercase forms of these letters). Because the folding filter no longer lowercases these exempted characters, we also add a lowercase filter.

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
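
To sanity-check the analyzer, we can run a quick _analyze request against the new index (a minimal sketch; the sample text is arbitrary):

GET icu_sample/_analyze
{
  "analyzer": "swedish_analyzer",
  "text": "Åkerblad Straße"
}

The exempted Swedish letters should come through merely lowercased (åkerblad), while other special characters are still folded (ß becomes ss).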

To study other features of the ICU plugin, such as collation, character normalization, and tokenization, please refer to the plugin's official documentation.

Phonetic Analysis Plugin

This plugin ships with filters that convert tokens to their phonetic representation using such algorithms as Metaphone (default), Soundex, Caverphone, Cologne phonetics, and more.

For example, the Metaphone algorithm, invented by Lawrence Philips in 1990, allows indexing words by their English pronunciation. It converts tokens to their phonetic representation using simple conversion rules: "TH" is converted to "0", "CK" to "K", "PH" to "F", "Q" to "K", and so on, following English phonetic rules. As a result, similar-sounding words share the same keys/tokens, which simplifies their comparison.

Let's look at an example implementation of the Phonetic Analysis plugin using the built-in Metaphone algorithm. In the code below, we define a custom my_metaphone filter with the encoder parameter specifying the algorithm to use ("metaphone" in our case) and the replace parameter defining whether the original token should be replaced by its phonetic version.

PUT phonetic_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "my_metaphone"
            ]
          }
        },
        "filter": {
          "my_metaphone": {
            "type": "phonetic",
            "encoder": "metaphone",
            "replace": true
          }
        }
      }
    }
  }
}
GET phonetic_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Russell Crowe"
}

Given the definition above, the response value for "Russell Crowe" returned by the plugin will be the following:

{
  "tokens": [
    {
      "token": "RSL",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "KRW",
      "start_offset": 8,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

In this way, we can obtain a working phonetic representation for the fields stored in our Elasticsearch indices.
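
If you would rather keep the original tokens alongside their phonetic forms, so that exact matches still work, set replace to false in the my_metaphone filter definition above:

"my_metaphone": {
  "type": "phonetic",
  "encoder": "metaphone",
  "replace": false
}

With this setting, the _analyze request should emit both the original and the phonetic token at each position, e.g. "russell" alongside "RSL".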

Combo Analysis Plugin

The plugin allows combining the output of multiple analyzers into one. This may be useful when you can't reliably detect the language, when the stemming procedure deforms indexed terms, or when you want to search in other languages while still being able to match the original words stored in your index. In these cases, the plugin helps you store the original and stemmed words together by merging the terms produced by multiple analyzers. Combo Analysis ships with a combo analyzer type and lets you list the analyzers to combine under the sub_analyzers property.

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "type" : "custom",
                    "tokenizer" : "icu_tokenizer",
                    "filter" : [ "snowball", "icu_folding" ]
                },
                "combo" : {
                    "type" : "combo",
                    "sub_analyzers" : [ "standard", "default" ]
                }
            },
            "filter" : {
                "snowball" : {
                    "type" : "snowball",
                    "language" : "German2"
                }
            }
        }
    }
}
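
Assuming the settings above are applied to an index (say, combo_sample), a quick _analyze call shows the merged output of both sub-analyzers -- the original token from standard next to the stemmed, folded token from default:

curl 'http://QBOX-HOST/combo_sample/_analyze?analyzer=combo&pretty' -d 'Jahresfeier'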

Decompound Plugin

Compounding several words into one word is characteristic of German, Finnish, and the Scandinavian languages, as well as Korean. However, it is often useful to break compounds into their individual pieces for better analysis and processing. This is where the Decompound plugin shines. Unlike Lucene, which requires loading word lists into memory before decompounding, the Decompound plugin can process compound terms out of the box. In Qbox, the plugin is compatible with Elasticsearch 2.3.4 and earlier versions.

The plugin ships with a decompound token filter that you wire into your index analysis settings. A minimal index definition might look like the following sketch (the decomp_filter name is illustrative, and we attach the filter to the index's default analyzer so that the text field below is decompounded at index time):
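
PUT /decompound
{
    "settings": {
        "analysis": {
            "filter": {
                "decomp_filter": {
                    "type": "decompound"
                }
            },
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [ "decomp_filter", "lowercase" ]
                }
            }
        }
    }
}

With this definition in place, we can index a German sentence and then search for the individual parts of its compound words: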

PUT /decompound/ex/1
{
    "text" : "Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet"
}
POST /decompound/ex/_search?explain
{
    "query": {
        "match": {
           "text": "dampf schiff"
        }
    }
} 

The compound words above will be broken into the smaller tokens identified within them, such as "Recht", "Donau", and "dampf", which is why the query for "dampf schiff" matches the document.

Concatenate Token Filter

This plugin, supported in Elasticsearch 2.x, allows merging tokenized strings into a single token. For example, by setting the filter's token_separator to "-", we can concatenate the phrase "Elasticsearch is fun" into the single string "Elasticsearch-is-fun". We can also chain the filter with a stop filter and its stopwords parameter to exclude articles, conjunctions, and other parts of speech we don't want to concatenate.

In the example below, we've created a custom analyzer based on the concatenate filter provided by the plugin. We define "_" as the separator and use "and", "is", and "the" as stopwords.

{
  "analysis" : {
    "filter" : {
      "concatenate" : {
        "type" : "concatenate",
        "token_separator" : "_"
      },
      "custom_stop" : {
        "type": "stop",
        "stopwords": ["and", "is", "the"]
      }
    },
    "analyzer" : {
      "stop_concatenate" : {
        "filter" : [
          "custom_stop",
          "concatenate"
        ],
        "tokenizer" : "standard"
      }
    }
  }
}
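
Assuming the settings above are saved to an index named index_name (a placeholder), we can exercise the analyzer through the _analyze endpoint:

curl 'http://QBOX-HOST/index_name/_analyze?analyzer=stop_concatenate&pretty' -d 'now you know how to use the concatenation filter'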

The request returns a single concatenated token formatted according to the rules we've specified: "now_you_know_how_to_use_concatenation_filter". This functionality might be useful for designing attractive URLs for web-based applications.

URL Tokenizer/Token Filter

The URL tokenizer splits URLs into meaningful parts according to the specified rules. You can select the part of the URL to be tokenized -- host, port, path, query, or ref -- or combine several of them. The plugin also supports URL decoding, which may be enabled by setting url_decode to true in the plugin's configuration.

We can define the URL tokenizer as a custom analyzer like this:

{
    "settings": {
        "analysis": {
            "tokenizer": {
                "url_host": {
                    "type": "url",
                    "part": "host"
                }
            },
            "analyzer": {
                "url_host": {
                    "tokenizer": "url_host"
                }
            }
        }
    }
}

Here, we set the part of the URL to be tokenized to host and point a new analyzer at the custom tokenizer. After saving this definition, the analyzer can be accessed at the _analyze endpoint of our index like this:

curl 'http://QBOX-HOST/index_name/_analyze?analyzer=url_host&pretty' -d 'https://qbox.tutorial.com/qbox-plugins.html'

The above request should return the following response:

{ "tokens":[{
  "token":"qbox.tutorial.com",
  "start_offset":8,
  "end_offset":25,
  "type":"host",
  "position":0
  },
  {
  "token":"tutorial.com",
  "start_offset":13,
  "end_offset":25,
  "type":"host",
  "position":1
   },
   {
  "token":"com",
  "start_offset":22,
  "end_offset":25,
  "type":"host",
  "position":2}
   ]}

We see that the URL tokenizer has extracted three tokens -- qbox.tutorial.com, tutorial.com, and com -- all derived from the host part of the URL.
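
If your URLs contain percent-encoded characters, you can also switch on the url_decode option mentioned above. Here is a sketch of a tokenizer for the query part with decoding enabled (the url_query names are illustrative):

{
    "settings": {
        "analysis": {
            "tokenizer": {
                "url_query": {
                    "type": "url",
                    "part": "query",
                    "url_decode": true
                }
            },
            "analyzer": {
                "url_query": {
                    "tokenizer": "url_query"
                }
            }
        }
    }
}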

Ingest Attachment Processor Plugin

This plugin, which replaced the Mapper Attachments plugin as of Elasticsearch 5.0.0, allows extracting file attachments in such popular formats as PPT and PDF using the Apache Tika library. The library can detect and extract metadata and text from numerous file types. The ability to parse metadata such as content type, language, and content length makes this plugin useful for search engine indexing, translation, and content analysis.

In the example below, we define an ingest pipeline with an attachment processor whose field parameter specifies the field to extract base64-encoded data from (note that the plugin accepts attachments only in this encoding).

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment metadata",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT ingest/example/1?pipeline=attachment
{
  "data": "VGhpcyB0ZXh0IGlzIHRvIHRlc3QgdGhlIEluZ2VzdCBBdHRhY2htZW50IFByb2Nlc3NvciBQbHVnLWlu"
}
GET ingest/example/1
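
For reference, the base64 payload in the PUT request above is just an encoded plain-text sentence. On a Unix-like system, you can produce such a payload from a file (sample.txt is a hypothetical name) like this:

cat sample.txt | base64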

To test the plugin, we've created a plain text file with dummy text, converted it to base64, and indexed it into our ingest index as an attachment. Because the pipeline runs at index time, the plugin extracts the file's metadata and content as the document is saved, and the subsequent GET returns the following response:

{
  "_index": "ingest",
  "_type": "example",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "data": "VGhpcyB0ZXh0IGlzIHRvIHRlc3QgdGhlIEluZ2VzdCBBdHRhY2htZW50IFByb2Nlc3NvciBQbHVnLWlu",
    "attachment": {
      "content_type": "text/plain; charset=ISO-8859-1",
      "language": "en",
      "content": "This text is to test the Ingest Attachment Processor Plug-in",
      "content_length": 61
    }
  }
}

You can see the full list of properties that can be extracted by the plugin in the ingest index mapping in Kibana.

Ingest index mapping

JavaScript and Python Plugins

The JavaScript plugin enables native scripting with JavaScript in Elasticsearch queries using Mozilla's Rhino JavaScript engine. One of the use cases for JavaScript scripting is performing computations and transformations on the data returned by a query.

In the example below, we take the square root of the returned num field using the built-in JavaScript function Math.sqrt(). The example uses inline scripting, which must be enabled in elasticsearch.yml (script.inline: true) for it to work.

PUT js/ex/1
{
  "num": 16
}
GET js/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "inline": "Math.sqrt(doc[\"num\"].value)",
          "lang": "javascript"
        }
      }
    }
  }
}

The script acts as middleware that processes each matching document and outputs the computed result in the _score field:

! Deprecation: [javascript] scripts are deprecated, use [painless] scripts instead
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 4,
    "hits": [
      {
        "_index": "js",
        "_type": "ex",
        "_id": "1",
        "_score": 4,
        "_source": {
          "num": 16
        }
      }
    ]
  }
}

As you can see, the computed value 4 is returned in the _score field of the Elasticsearch response.

Aside from JavaScript, Qbox also supports the Python plugin, which allows scripting in the popular Python language. In the example below, we use the Python plugin to multiply the returned number by a factor of 2 and display the result in the _score field.

GET js/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "inline": "doc[\"num\"].value * factor",
          "lang": "python",
          "params": {
            "factor": 2
          }
        }
      }
    }
  }
}

The query above returns the following response, with the computed value 32 in the _score field.

"hits": [
      {
        "_index": "js",
        "_type": "ex",
        "_id": "1",
        "_score": 32,
        "_source": {
          "num": 16
        }
      }
    ]

Note that since Elasticsearch 5.0.0, both plugins are deprecated, and third-party scripting languages have been replaced by Painless, the default Elasticsearch scripting language.
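
For reference, the first script above translates almost verbatim to Painless -- only the lang value changes (a minimal sketch against the same js index):

GET js/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "inline": "Math.sqrt(doc['num'].value)",
          "lang": "painless"
        }
      }
    }
  }
}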

Conclusion

As we have seen, the plugins provided by Qbox enable a wide variety of features, like phonetic and morphological text analysis, native scripting in JavaScript and Python, extraction of file metadata, tokenization, concatenation, and case folding, all of which dramatically improve search, analysis, and transformation of your Elasticsearch data.

In Part II of this article, we'll continue this overview by focusing on Qbox-hosted language plugins and various solutions providing integrations with third-party libraries, languages, and software, such as Couchbase, SQL, and Neo4j, among others.