We discussed the “Langdetect Ingest Plugin” in a previous post. In this post we focus on “elasticsearch-langdetect”, also known as “Nakatani Shuyo’s language detector”.

The “elasticsearch-langdetect” plugin offers a mapping type for marking the fields on which we want to enable language detection. The detected languages are indexed into a subfield named ‘lang’ under each such field, and that subfield can be queried for language codes.

We can use the multi_field mapping type to combine this plugin with the attachment mapper plugin, which enables language detection in base64-encoded binary data. Currently, only UTF-8 text is supported.

The plugin also offers a REST endpoint to which a short UTF-8 text can be posted; the plugin responds with a list of recognized languages.
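To make the mapping type, the ‘lang’ subfield query, and the REST endpoint a little more concrete, here is a minimal Python sketch that only builds the request bodies and never contacts a cluster. Several details are assumptions rather than facts taken from this post: the mapping type name `langdetect`, the endpoint path `/_langdetect`, the plain-text content type, the typeless mapping layout, the example field name `text`, and the host `localhost:9200`. The helper function names are hypothetical. Check the plugin's README for the version you run before relying on any of them.

```python
import json
import urllib.request


def langdetect_mapping(field):
    """Index body that enables language detection on one field.

    Assumption: the plugin registers a mapping type called "langdetect".
    """
    return {"mappings": {"properties": {field: {"type": "langdetect"}}}}


def language_query(field, code):
    """Term query against the 'lang' subfield described in the post."""
    return {"query": {"term": {field + ".lang": code}}}


def detect_request(text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST carrying a short UTF-8 text.

    Assumption: the REST endpoint lives at /_langdetect.
    """
    return urllib.request.Request(
        url=base_url + "/_langdetect",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    print(json.dumps(langdetect_mapping("text"), indent=2))
    print(json.dumps(language_query("text", "en"), indent=2))
    req = detect_request("This is a test")
    print(req.get_method(), req.full_url)
```

Passing the request built by `detect_request` to `urllib.request.urlopen` would send it to a running cluster; the shape of the response depends on the plugin version, so it is not shown here.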


Text segmentation is critical for search. It is the process of dividing written text into meaningful units such as words, sentences, or topics. The term applies both to the mental processes humans use when reading text and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial: some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial, and final letter shapes of Arabic, but such signals are sometimes ambiguous and are not present in all written languages.

Word segmentation is the task of dividing a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). The space character alone is not always sufficient, however: a contraction such as won’t stands for the two words will not.

However, an equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages without a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
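The limits of the space as a word divider are easy to see in a few lines of Python. The sketch below is only an illustration, not a real segmenter: the function name `whitespace_tokens` and both sample sentences are made up for this example.

```python
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Naive segmentation: treat every run of whitespace as a word boundary."""
    return text.split()


# English: spaces work, but the contraction stays one token
# even though it stands for the two words "will not".
english = "I won't go"
print(whitespace_tokens(english))   # ['I', "won't", 'go']

# Chinese: the sentence contains no spaces at all,
# so the naive approach returns the whole sentence as one "word".
chinese = "我喜欢搜索引擎"
print(whitespace_tokens(chinese))   # ['我喜欢搜索引擎']
```

For scripts like the second one, a search engine needs a language-aware analyzer instead of a whitespace split.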
