We discussed the “Langdetect Ingest Plugin” in a previous post. This post focuses on the “elasticsearch-langdetect” plugin, which is built on Nakatani Shuyo's language detector.

The “elasticsearch-langdetect” plugin offers a mapping type for marking the fields on which we want to enable language detection. Detected languages are indexed into a subfield named 'lang' under that field, and this subfield can be queried for language codes.

We can use the multi_field mapping type to combine this plugin with the attachment mapper plugin, which enables language detection in base64-encoded binary data. Currently, only UTF-8 text is supported.

The plugin also offers a REST endpoint to which a short UTF-8 text can be posted; the plugin responds with a list of recognized languages.
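As a rough illustration of consuming that list of recognized languages, here is a minimal Python sketch. It assumes the response is JSON of the form `{"languages": [{"language": "en", "probability": 0.99}]}`; that shape, the endpoint path in the comment, and the helper name `top_language` are assumptions for illustration, so check them against the plugin version you have installed.

```python
import json

# Assumption: the plugin is reached with something like
#   POST /_langdetect   (body: the UTF-8 text to classify)
# and answers with a JSON object holding a "languages" list.
# The sample below is hand-written, not captured from a real cluster.
SAMPLE_RESPONSE = json.dumps({
    "languages": [
        {"language": "de", "probability": 0.85},
        {"language": "nl", "probability": 0.15},
    ]
})


def top_language(response_text):
    """Return the most probable language code, or None if nothing was detected."""
    languages = json.loads(response_text).get("languages", [])
    if not languages:
        return None
    best = max(languages, key=lambda entry: entry.get("probability", 0.0))
    return best["language"]


if __name__ == "__main__":
    print(top_language(SAMPLE_RESPONSE))
```

Returning `None` for an empty list keeps the caller in charge of what to do with texts that are too short to classify.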

Keep reading

Text segmentation is critical to search. It is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to the mental processes humans use when reading text and to the artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial: some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial, and final letter shapes of Arabic, but such signals are sometimes ambiguous and are not present in all written languages.

Word segmentation is the task of dividing a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). Cases where the space character alone may not be sufficient include contractions such as won't for will not.

However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
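To make the contrast concrete, the sketch below compares plain whitespace splitting with a greedy forward maximum-matching segmenter. Maximum matching is just one simple dictionary-based technique chosen here for illustration, not the method any particular analyzer uses, and the tiny vocabulary is hypothetical.

```python
def whitespace_tokens(text):
    """Split on whitespace; adequate for many space-delimited languages."""
    return text.split()


def max_match(text, vocab):
    """Greedy forward maximum matching against a known vocabulary.

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    longest = max(len(word) for word in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary for the demonstration.
VOCAB = {"我", "喜欢", "搜索", "引擎", "搜索引擎"}
SENTENCE = "我喜欢搜索引擎"  # "I like search engines"

if __name__ == "__main__":
    print(whitespace_tokens("I won't go"))  # contraction stays one token
    print(whitespace_tokens(SENTENCE))      # whole sentence stays one token
    print(max_match(SENTENCE, VOCAB))       # dictionary-driven split
```

Whitespace splitting leaves the Chinese sentence as a single token, while the dictionary-driven pass recovers word-like units. The greedy choice can still pick the wrong boundary when entries overlap, which is part of why the problem is hard.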

Keep reading

Elasticsearch is generally used to index data of types such as string, number, and date. But what if you want to index a file such as a .pdf or a .doc directly and make it searchable? This is a real-world use case in applications such as HCM, ERP, and ecommerce.
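One possible way to approach this is sketched below, assuming the ingest attachment processor is installed: the file is base64-encoded into a document field, and a pipeline with an `attachment` processor extracts the text at index time. The field name `data` and the helper names are illustrative choices, and the sketch only builds the JSON bodies without sending them to a cluster.

```python
import base64
import json


def attachment_pipeline(field="data"):
    """Body for an ingest pipeline that runs the attachment processor on `field`."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_document(raw_bytes, field="data"):
    """Body for a document carrying the file content as a base64 string."""
    return {field: base64.b64encode(raw_bytes).decode("ascii")}


if __name__ == "__main__":
    # In practice raw_bytes would come from open("file.pdf", "rb").read().
    print(json.dumps(attachment_pipeline(), indent=2))
    print(json.dumps(attachment_document(b"hello"), indent=2))
```

The pipeline body would be registered once, and each document would then be indexed with that pipeline named in the request.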

Keep reading

Painless uses a Java-style syntax that is similar to Groovy. In fact, most Painless scripts are also valid Groovy, and simple Groovy scripts are typically valid Painless. (This specification assumes you have at least a passing familiarity with Java and related languages.)

Painless is essentially a subset of Java with some additional scripting language features that make scripts easier to write. However, there are some important differences, particularly with the casting model. For more detailed conceptual information about the basic constructs that Java and Painless share, refer to the corresponding topics in the Java Language Specification.

Painless scripts are parsed and compiled using the ANTLR4 and ASM libraries. They are compiled directly into Java bytecode and executed against a standard Java Virtual Machine. This specification uses ANTLR4 grammar notation to describe the allowed syntax. However, the actual Painless grammar is more compact than that shown here. Painless is a simple and secure scripting language designed specifically for use with Elasticsearch. It is the default scripting language for Elasticsearch and can safely be used for inline and stored scripts.
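As a small illustration of an inline Painless script, the Python sketch below builds the body of an update request that increments a numeric field. The helper name `painless_increment` is hypothetical, and the script key is `source` in Elasticsearch 6.0 and later (5.x releases used `inline`), so adjust it to your version.

```python
import json


def painless_increment(field, amount):
    """Build an update-request body that adds `amount` to `field` via Painless.

    The value is passed through `params` rather than spliced into the script
    text, so the same compiled script can be reused with different amounts.
    """
    return {
        "script": {
            "lang": "painless",
            "source": f"ctx._source.{field} += params.amount",
            "params": {"amount": amount},
        }
    }


if __name__ == "__main__":
    print(json.dumps(painless_increment("counter", 4), indent=2))
```

The one-line script, `ctx._source.counter += params.amount`, reads the same as it would in Java or Groovy, which is the similarity the excerpt describes.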

Keep reading