Computers, fundamentally, just deal with numbers. They store letters and other characters by assigning a number for each one.

Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers, and no single encoding could contain enough characters. For example, the European Union alone required several different encodings to cover all its languages. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode is the single universal character set for text that enables the interchange, processing, storage, and display of text in many languages. The Unicode standard serves as a foundation for the globalization of modern software. Using Unicode-capable software to develop or run business applications reduces development and deployment time and costs, allowing us to expand into new markets more quickly.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”

ICU is a mature, widely used set of C/C++ and Java libraries that provide Unicode and globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

A few highlights of the services provided by ICU:

  • Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding.
  • Collation: Compare strings according to the conventions and standards of a particular language, region, or country.
  • Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc.
  • Time Calculations: Multiple types of calendars are provided beyond the traditional Gregorian calendar.
  • Unicode Support: ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding etc.
  • Regular Expressions: ICU’s regular expressions fully support Unicode while providing very competitive performance.
  • Bidi: Support for handling text containing a mixture of left-to-right (e.g., English) and right-to-left (e.g., Arabic or Hebrew) data.
  • Text Boundaries: Locate the positions of words, sentences, paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.

The Elasticsearch ICU Plugin

The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.
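
On Qbox hosted clusters, the ICU Analysis plugin comes built in. On a self-managed cluster, the plugin must be installed on every node, and each node must be restarted afterwards; the command below assumes a 5.x-style install script (2.x releases use bin/plugin install analysis-icu instead):

bin/elasticsearch-plugin install analysis-icu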

The analyzer anatomy of the ICU plugin consists of:

  • Character Filter: Normalization Character Filter
  • Tokenizer: ICU Tokenizer
  • Token Filter: Normalization Token Filter, Folding Token Filter, Collation Token Filter, Transform Token Filter

ICU Normalization Character Filter

Normalization is used to convert text to a unique, equivalent form. ICU can normalize equivalent strings to one particular sequence, such as normalizing composite character sequences into precomposed characters. For example, é can be represented either as the single precomposed code point U+00E9 or as e (U+0065) followed by a combining acute accent (U+0301); normalization maps both representations to the same form so they compare as equal.

It registers itself as the icu_normalizer character filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default). Set the mode parameter to decompose to convert nfc to nfd or nfkc to nfkd respectively.

curl -XPUT 'localhost:9200/icu_analyzer?pretty' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { 
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { 
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          },
          "nfkc_normalized": { 
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfkc_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}'
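
To see the effect of a particular normalization form, we can run text through one of these analyzers with the _analyze API. As a quick sketch, compatibility characters should come back normalized (and, with nfkc_cf, case-folded as well):

curl -XGET 'localhost:9200/icu_analyzer/_analyze?analyzer=nfkc_cf_normalized&pretty' -d '{
  "text": "Ⅻ ﬃ"
}'

With the default nfkc_cf normalization, the roman numeral Ⅻ and the ﬃ ligature should be returned as the tokens xii and ffi.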

ICU Tokenizer

This service tokenizes text into words on word boundaries, as defined in Unicode Text Segmentation. A string of Unicode-encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries.

It behaves much like the standard tokenizer but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and by using custom rules to break Myanmar and Khmer text into syllables.

curl -XPUT 'localhost:9200/icu_tokenizer_analyzer?pretty' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "custom_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}'
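
As a quick illustration, the dictionary-based word breaking can be seen by analyzing Thai text, which contains no spaces between words; the ICU tokenizer should split it into separate word tokens rather than returning a single undivided token:

curl -XGET 'localhost:9200/icu_tokenizer_analyzer/_analyze?analyzer=custom_icu_analyzer&pretty' -d '{
  "text": "สวัสดีครับ"
}'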

ICU Token Filters

ICU Normalization Token Filter

This filter performs the same normalization as the ICU normalization character filter, and the character filter is generally the one to prefer. The only difference stems from the anatomy of an analyzer: the normalization character filter comes into play before tokenization, whereas the token filter acts after tokenization.
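
For completeness, here is a sketch of an analyzer that applies the same normalization as a token filter instead of a character filter (the index and analyzer names are illustrative):

curl -XPUT 'localhost:9200/icu_normalizer_filter_analyzer?pretty' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_token_normalized": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          }
        }
      }
    }
  }
}'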

ICU Folding Token Filter

Case folding maps strings to a canonical form where case differences are erased. Using the case folding API, ICU supports fast matches without regard to case in lookups because only binary comparison is required.

The ICU folding token filter already performs Unicode normalization, so there is no need to use the normalization character filter or token filter as well.

curl -XPUT 'localhost:9200/icu_folding_analyzer?pretty' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}'
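
Running accented text through this analyzer should produce lowercased, unaccented tokens. For example (a sketch):

curl -XGET 'localhost:9200/icu_folding_analyzer/_analyze?analyzer=folded&pretty' -d '{
  "text": "Él está loco"
}'

This should yield the tokens el, esta, and loco.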

Which letters are folded can be controlled by specifying a UnicodeSet with the unicodeSetFilter parameter. A UnicodeSet is a set of Unicode characters and multicharacter strings, analogous to the character classes used in regular expressions; it specifies a subset of the legal Unicode code points, U+0000 to U+10FFFF inclusive.
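
For example, Swedish treats å, ä, and ö as distinct letters that should not be folded into a and o. Below is a sketch of such a setup, with the illustrative names swedish_folding and swedish_analyzer; because the uppercase forms are excluded from folding, an explicit lowercase filter is added after it:

curl -XPUT 'localhost:9200/swedish_folding_analyzer?pretty' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "swedish_folding",
            "lowercase"
          ]
        }
      }
    }
  }
}'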

ICU Collation Token Filter

Information is displayed in sorted order to enable users to easily find the items for which they are looking. However, users of different languages might have very different expectations of what a “sorted” list should look like. Not only does the alphabetical order vary from one language to another, but it also can vary from document to document within the same language. For example, phone book ordering might be different than dictionary ordering. String comparison is one of the basic functions most applications require, and yet implementations often do not match local conventions. The ICU Collation Service provides string comparison capability with support for appropriate sort orderings for each of the locales.

Collations are used for sorting documents in a language-specific word order. The icu_collation token filter is available to all indices and defaults to DUCET collation, a best-effort attempt at language-neutral sorting. DUCET, the Default Unicode Collation Element Table, defines the default sort order for all Unicode characters, regardless of language.

Below is an example of how to set up a field for sorting German names in “phone book” order:

curl -XPUT 'localhost:9200/icu_collation_analyzer?pretty' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "german_phone_book": {
          "type":     "icu_collation",
          "language": "de",
          "country":  "DE",
          "variant":  "@collation=phonebook"
        }
      },
      "analyzer": {
        "german_phone_book": {
          "tokenizer": "keyword",
          "filter":  [ "german_phone_book" ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "title": { 
          "type": "string",
          "fields": {
            "sort": { 
              "type":"string",
              "analyzer": "german_phone_book"
            }
          }
        }
      }
    }
  }
}'

Let’s index a few documents and try a search query using collated sort:

curl -XPUT localhost:9200/icu_collation_analyzer/user/_bulk -d '
{ "index": { "_id": 1 }}
{ "title": "Green" }
{ "index": { "_id": 2 }}
{ "title": "Goffey" }
{ "index": { "_id": 3 }}
{ "title": "gailey" }
{ "index": { "_id": 4 }}
{ "title": "göhm" }'
curl -XGET localhost:9200/icu_collation_analyzer/user/_search?sort=title.sort

The response is as follows:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "icu_collation_index",
        "_type": "user",
        "_id": "3",
        "_score": null,
        "_source": {
          "title": "gailey"
        },
        "sort": [
          "᪔乏昫တ倈⠀\u0001"
        ]
      },
      {
        "_index": "icu_collation_index",
        "_type": "user",
        "_id": "2",
        "_score": null,
        "_source": {
          "title": "Goffey"
        },
        "sort": [
          "᪢䳌昫တ倎瀤\u0000\u0000"
        ]
      },
      {
        "_index": "icu_collation_index",
        "_type": "user",
        "_id": "1",
        "_score": null,
        "_source": {
          "title": "Green"
        },
        "sort": [
          "᪥䱌⡠႐໠ \u0001"
        ]
      }
    ]
  }
}

In German phone-book collation, göhm is treated as the equivalent of goehm, so the expected sort order is gailey, göhm, Goffey, Green. Note that göhm is missing from the hits above: the bulk API requires the request body to end with a newline character, and without one the final document is silently ignored.

ICU Transform Token Filter

General transforms provide a general purpose package for processing Unicode text. They are a powerful and flexible mechanism for handling a variety of different tasks, including:

  1. Uppercase, Lowercase, Titlecase, Full/Halfwidth conversions
  2. Normalization
  3. Hex and Character Name conversions
  4. Script to Script conversion

Transforms were designed to convert characters from one script to another (for example, from Greek to Latin, or Japanese Katakana to Latin). However, ICU Transforms now include prebuilt transformations for case conversions, normalization conversions, removal of given characters, and also for a variety of language and script transliterations. Transforms can be chained together to perform a series of operations, and each step of the process can use a UnicodeSet to restrict the characters that are affected.

For example, to remove accents from characters, use the following transform:

NFD; [:Nonspacing Mark:] Remove; NFC.

Let’s build an analyzer that transliterates characters to Latin, separates accents from their base characters, removes the accents, and then recomposes the remaining text into normalized, unaccented form.

curl -XPUT 'localhost:9200/icu_transform_analyzer?pretty' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "latin": {
            "tokenizer": "keyword",
            "filter": [
              "custom_latin_transform"
            ]
          }
        },
        "filter": {
          "custom_latin_transform": {
            "type": "icu_transform",
            "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" 
          }
        }
      }
    }
  }
}'

Let’s try our custom_latin_transform on a few Japanese, Greek, and Russian keywords:

curl -XGET 'localhost:9200/icu_transform_analyzer/_analyze?analyzer=latin&pretty' -d '{
  "text": "キャンパス" 
}'
Returns kyanpasu
curl -XGET 'localhost:9200/icu_transform_analyzer/_analyze?analyzer=latin&pretty' -d '{
  "text": "Αλφαβητικός Κατάλογος" 
}'
Returns Alphabetikos Katalogos
curl -XGET 'localhost:9200/icu_transform_analyzer/_analyze?analyzer=latin&pretty' -d '{
  "text": "биологическом" 
}'
Returns biologiceskom

Conclusion

The Qbox hosted Elasticsearch service provides built-in integration with the ICU Analysis plugin. We can use the ICU analyzer in all cases where a language-specific custom tokenizer would not do significantly better. Japanese, Chinese, and Korean benefit from smarter, dedicated tokenization, but the ICU tokenizer at least ensures consistent treatment of these languages.

Individual terms are passed through the ICU folding and normalization filters to keep them consistent. By indexing with these ICU tools, we can be fairly sure that searching across all documents, regardless of language, with just a default analyzer will return results for most queries.

There are a few similar plugins available for other languages: the Kuromoji plugin for Japanese, built on top of the excellent library by Atilika; the Smart Chinese Analysis plugin for Chinese; the Stempel Analysis plugin for Polish; and the Ukrainian Analysis plugin for Ukrainian. Unfortunately, there is currently no official Korean analyzer for Elasticsearch; however, a Korean analyzer is being ported into Lucene and should eventually become an ES plugin.


Give It a Whirl!

It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we’ll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.